Summer 2004 Beowulf Benchmarking Project

Results

LinPack

The maximum matrix size, as limited by memory capacity:

√(32 × 512 MB ÷ 8 bytes per double-precision element) ≈ 46341

Using 80% of the cluster's memory (leaving enough for X and other system functions), the matrix size is 41449.
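
As a quick check of these two figures, here is a minimal C sketch (not part of any benchmark; the 32 × 512 MB total is taken from the hardware table in the Procedure section):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Total cluster memory: 32 DIMMs x 512 MB. */
        double mem_bytes = 32.0 * 512.0 * 1024.0 * 1024.0;
        double elem_size = 8.0;   /* bytes per double-precision matrix element */

        /* An N x N matrix of doubles needs 8*N*N bytes, so N = sqrt(mem / 8). */
        double n_max = sqrt(mem_bytes / elem_size);
        double n_80  = sqrt(0.80 * mem_bytes / elem_size);

        printf("Maximum matrix size:        %.0f\n", n_max);  /* ~46341 */
        printf("Matrix size at 80%% memory: %.0f\n", n_80);   /* ~41449 */
        return 0;
    }

(Compile with -lm.)
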
GOTO -- 16 Processor Network Test
Network          GFlops   Peak %
One network      25.42    56.7
Both networks    29.09    64.9


For the final (whole-network) test with matrix size 41449, a different machine file was used, one that listed every processor under its own IP address instead of assigning two processors to each IP. This was done because an earlier test indicated that the resulting GFlops rate was higher when each processor used its own memory and was addressed through its own Ethernet connection.
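
For illustration, an MPICH 1.2.5 machine file of this kind simply lists one entry per process; the host names below are hypothetical, the point being that each processor is reached through its own IP address rather than two processors sharing one entry:

    # Hypothetical machine file sketch: one line per processor, each processor
    # addressed by its own IP (host name), rather than "node1:2"-style entries.
    node1-a
    node1-b
    node2-a
    node2-b
    # ... continuing through node8-a and node8-b (16 entries in all)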


Using sixteen 1.4 GHz processors, Rpeak can be calculated as:

2 flops/cycle × 1.4 × 10⁹ Hz × 16 processors = 44.8 GFlops

The cluster's maximum benchmark result was 29.09 GFlops, giving approximately 64.9% efficiency.
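
The same calculation as a minimal C sketch, purely a sanity check (2 flops per clock is the Opteron's double-precision peak, as used above):

    #include <stdio.h>

    int main(void)
    {
        double flops_per_cycle = 2.0;    /* double-precision flops per clock (Opteron) */
        double clock_hz        = 1.4e9;  /* 1.4 GHz */
        int    processors      = 16;

        double rpeak = flops_per_cycle * clock_hz * processors / 1e9;  /* GFlops */
        double rmax  = 29.09;            /* best measured LinPack result, GFlops */

        printf("Rpeak      = %.1f GFlops\n", rpeak);             /* 44.8 */
        printf("Efficiency = %.1f%%\n", 100.0 * rmax / rpeak);   /* ~64.9 */
        return 0;
    }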

STREAM

This measures memory bandwidth for one processor.
Array size = 64 × 10⁶ elements per array
Memory required = 1464.8 MB (three arrays of 64 × 10⁶ doubles)
Function   Rate (MB/s)   RMS time (s)   Min time (s)   Max time (s)
Copy       1493.9593     0.7164         0.6854         0.7868
Scale      1410.6373     0.7675         0.7259         0.8387
Add        1476.3721     1.0785         1.0404         1.1851
Triad      1446.4082     1.1640         1.0619         1.3404


NetPerf 2.2

This measures the point-to-point communication bandwidth of the network connections between nodes.

Transfer rate (Mbit/s)   Receive: First   Receive: Second
Send: First              938.25           94.11
Send: Second             934.33           94.15


Procedure


The following benchmarks were administered on an 8-node Beowulf cluster with the following specifications:
Item                                       Quantity
MSI K8D motherboard                        8
AMD Opteron 240 CPU                        16
512 MB ATP ECC PC2700 memory               32
D-Link DGS-1008 Gigabit Ethernet switch    2
Seagate Barracuda hard drive               4

Seven of the nodes are diskless; the eighth acts as a file server for the other seven. The cluster runs GinGin64, a 64-bit implementation of Red Hat Linux 8.
More information about the cluster design is available on the Cluster Design page.

High Performance LinPack

  1. Obtained LinPack Source code: http://www.netlib.org/benchmark/hpl/
  2. Created a Linux_Opteron makefile reflecting the cluster's configuration (see the makefile sketch after this list):
    • mpich 1.2.5
    • acml 2.0 / ATLAS 3.5.6 / GOTO
    • gcc 3.3
  3. Recorded the benchmark output produced by command-line execution of the following statements:
    • <terminal> mpirun -np <number of processors> xhpl &
    • <terminal> do-lab -o renice +19 -u <login_name>
      *Please note do-lab is not a standard Red Hat command*
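
The Make.Linux_Opteron file in step 2 follows HPL's standard Make.<arch> template. The excerpt below is only a sketch of the lines that select the MPI installation, the math library, and the compiler; the paths are illustrative, not the exact ones used on this cluster:

    # Sketch of part of a Make.Linux_Opteron file (illustrative paths)
    ARCH    = Linux_Opteron
    TOPdir  = $(HOME)/hpl
    # MPI: mpich 1.2.5
    MPdir   = /usr/local/mpich-1.2.5
    MPinc   = -I$(MPdir)/include
    MPlib   = $(MPdir)/lib/libmpich.a
    # BLAS: point LAdir/LAlib at ACML 2.0, ATLAS 3.5.6, or the GOTO library
    LAdir   = /opt/acml2.0/gnu64
    LAinc   =
    LAlib   = $(LAdir)/lib/libacml.a
    # Compiler: gcc 3.3
    CC      = gcc
    CCFLAGS = $(HPL_DEFS) -O3 -funroll-loops

Switching math libraries for the comparisons below amounts to changing the LAdir/LAlib lines and rebuilding xhpl.
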
The majority of the benchmark testing was done using the ACML 2.0 math library; the performance of the other libraries can then be extrapolated from the ACML results. The matrix sizes 16384 and 32768 consume 12.5% and 50% of the cluster's memory, respectively. The somewhat unsatisfactory results may be due in part to the problem sizes being too small, which agrees with the later finding that better results are obtained when the problem size is large enough to fill more than 80% of the available memory.
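
The matrix size (N), process grid (P × Q), and block size (NB) that vary in the table below are all set in HPL's input file, HPL.dat. For reference, here is a sketch of the top of that file with one illustrative combination of values from the table; the remaining algorithmic tuning lines of the stock HPL.dat are not shown:

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    6            device out (6=stdout,7=stderr,file)
    1            # of problem sizes (N)
    32768        Ns
    1            # of NBs
    256          NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    4            Ps
    4            Qs
    16.0         threshold
    ...          (remaining algorithmic tuning lines not shown)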


ACML Driver Benchmark Results
Matrix Size   P x Q   Block Size   GFlops
16384         2x8     32           7.836
                      64           4.573
                      128          4.565
                      256          5.535
                      512          4.397
                      1024         8.293
              4x4     32           5.883
                      64           5.035
                      128          8.067
                      256          7.719
                      512          7.057
                      1024         5.568
32768         2x8     32           8.164
                      64           9.495
                      128          10.41
                      256          10.68
                      512          10.41
                      1024         ---
              4x4     32           9.705
                      64           11.27
                      128          12.58
                      256          12.95
                      512          11.69
                      1024         ---


The following graphs were constructed using the data collected from the previous table.


[Five graphs omitted.]


The next step is to test the three math libraries that we have available. Using a matrix problem size of 15,400 (roughly 95% of one node's memory capacity), the following data was compiled.


Single Processor Results
Math Library   GFlops   Peak %
ACML 2.0       2.271    81.1
ATLAS 3.5.6    2.435    87.0
GOTO           2.489    88.9


Next, having seen the superiority of the GOTO library, one last set of tests was administered. The cluster has two network switches: one is used for NFS and tftpboot as well as the mpirun executions, and one that, until now, had not been used at all. After configuring the second network into the cluster (as an independent connection, not bonded), tests were run on a 4-processor section of the cluster with a matrix problem size of 20,000 (roughly 80% of two nodes' memory capacity).


GOTO -- 4 Processor Network Test
Network     GFlops   Peak %
Primary     7.785    69.5
Secondary   8.453    75.5


STREAM

  1. Obtained the STREAM source code: http://www.cs.virginia.edu/stream/
  2. Compiled the source code
  3. Recorded the benchmark results produced by command-line execution of the following statement:
    • <terminal> stream_d
The STREAM benchmark measures how quickly a single node can move data between main memory and the processor. The Triad rate from our benchmark was 1446.4082 MB/s; this is the transfer bandwidth between memory and processor.
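
For reference, the Triad figure comes from a loop of the following form. The sketch below is a simplified paraphrase of the STREAM kernel (the real benchmark repeats each kernel several times, uses wall-clock timing, and verifies the results), using the 64 × 10⁶ array size from this run:

    #include <stdio.h>
    #include <time.h>

    #define N 64000000                   /* 64 x 10^6 doubles per array */

    static double a[N], b[N], c[N];      /* three arrays, roughly 1.5 GB in total */

    int main(void)
    {
        double scalar = 3.0;
        double secs;
        long j;
        clock_t t0 = clock();

        /* STREAM "Triad": a[j] = b[j] + scalar * c[j]
           Each iteration reads two doubles and writes one: 24 bytes of traffic. */
        for (j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];

        secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        /* STREAM reports rates in 10^6 bytes per second. */
        printf("Triad rate: %.1f MB/s\n", 24.0 * (double)N / secs / 1.0e6);
        return 0;
    }

The reported rate counts 24 bytes per iteration (versus 16 for Copy and Scale), which is why Triad and Add take longer per pass in the table above while still reporting comparable MB/s.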


NetPerf 2.2



The NetPerf test is designed to measure the transfer speed of a connection between two computers; it only tests the speed from one NIC to another. While executing this benchmark, a limitation was encountered: NetPerf could not be directed to route its test traffic through the secondary network cards (the secondary network structure), so the results obtained for tests sent over the second network are grossly erroneous. The primary network results, however, are impressive, yielding 93.82% efficiency (938.25 Mbit/s on a gigabit link).
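
For reference, a NetPerf measurement of this kind is a two-step, point-to-point test: the netserver daemon runs on the receiving node and netperf is pointed at it from the sending node (the host name below is a placeholder):

    • <terminal> netserver
    • <terminal> netperf -H <receiving_node>

By default this runs a TCP stream test and reports throughput in 10⁶ bits per second, which is the unit used in the results table above.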


Conclusion

The last thing that I did this summer was to test the running times of the code that Dr. van de Sande and Justin Lambright had written during their summer work. I ran the code and recorded the times that the program reported. It was interesting to see that the best results were not obtained with the same configuration that produced the optimum benchmark results: in this case, the best times came from using all sixteen processors but only one network (not both, as was the case with LinPack).

[Timing graph omitted; the legend below identifies the plotted configurations.]

Color        Processors   Networks
Red          1            1
Pink         2            1
Blue         4            1
Periwinkle   8            1
Yellow       16           1
Green        16           2



Brett van de Sande,
Jeremy Pace
Homepage: http://www.geneva.edu/~bvds
E-mail: bvds@geneva.edu
