Summer 2004 Beowulf Benchmarking Project

Results

LinPack

The maximum matrix size, as limited by memory capacity:

√(32 × 512 MB ÷ 8 bytes per double-precision element) ≈ 46341

Using 80% of the cluster's memory (leaving enough for X and other system functions), the matrix size is 41449.
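
As a quick check of these two figures, here is a minimal C sketch (not part of any benchmark; the 32 × 512 MB total is taken from the hardware table in the Procedure section):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Total cluster memory: 32 DIMMs x 512 MB. */
        double mem_bytes = 32.0 * 512.0 * 1024.0 * 1024.0;
        double elem_size = 8.0;   /* bytes per double-precision matrix element */

        /* An N x N matrix of doubles needs 8*N*N bytes, so N = sqrt(mem / 8). */
        double n_max = sqrt(mem_bytes / elem_size);
        double n_80  = sqrt(0.80 * mem_bytes / elem_size);

        printf("Maximum matrix size:        %.0f\n", n_max);  /* ~46341 */
        printf("Matrix size at 80%% memory: %.0f\n", n_80);   /* ~41449 */
        return 0;
    }

(Compile with -lm.)
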
GOTO -- 16 Processor Network Test
Network          GFlops   Peak %
One network      25.42    56.7
Both networks    29.09    64.9


For the final (whole-network) test with matrix size 41449, a different machine file was used, one that listed every processor under its own IP address instead of assigning two processors to each IP. This was done because an earlier test indicated that the resulting GFlops rate was higher when each processor used its own memory and was addressed through its own Ethernet connection.
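
For illustration, an MPICH 1.2.5 machine file of this kind simply lists one entry per process; the host names below are hypothetical, the point being that each processor is reached through its own IP address rather than two processors sharing one entry:

    # Hypothetical machine file sketch: one line per processor, each processor
    # addressed by its own IP (host name), rather than "node1:2"-style entries.
    node1-a
    node1-b
    node2-a
    node2-b
    # ... continuing through node8-a and node8-b (16 entries in all)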


Using sixteen 1.4 GHz processors, Rpeak can be calculated as:

2 flops/cycle × 1.4 × 10⁹ Hz × 16 processors = 44.8 GFlops

The cluster's maximum benchmark result was 29.09 GFlops, giving approximately 64.9% efficiency.
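
The same calculation as a minimal C sketch, purely a sanity check (2 flops per clock is the Opteron's double-precision peak, as used above):

    #include <stdio.h>

    int main(void)
    {
        double flops_per_cycle = 2.0;    /* double-precision flops per clock (Opteron) */
        double clock_hz        = 1.4e9;  /* 1.4 GHz */
        int    processors      = 16;

        double rpeak = flops_per_cycle * clock_hz * processors / 1e9;  /* GFlops */
        double rmax  = 29.09;            /* best measured LinPack result, GFlops */

        printf("Rpeak      = %.1f GFlops\n", rpeak);             /* 44.8 */
        printf("Efficiency = %.1f%%\n", 100.0 * rmax / rpeak);   /* ~64.9 */
        return 0;
    }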

STREAM

This measures memory bandwidth for one processor.
Array size = 64 × 10⁶ elements per array
Memory required = 1464.8 MB (three arrays of 64 × 10⁶ doubles)
Function   Rate (MB/s)   RMS time (s)   Min time (s)   Max time (s)
Copy       1493.9593     0.7164         0.6854         0.7868
Scale      1410.6373     0.7675         0.7259         0.8387
Add        1476.3721     1.0785         1.0404         1.1851
Triad      1446.4082     1.1640         1.0619         1.3404


NetPerf 2.2

This measures the point-to-point communication bandwidth of the network connections between nodes.

Transfer rate (Mbit/s)   Receive: First   Receive: Second
Send: First              938.25           94.11
Send: Second             934.33           94.15


Procedure


The following benchmarks were administered on an 8-node Beowulf cluster with the following specifications:
Item                                       Quantity
MSI K8D motherboard                        8
AMD Opteron 240 CPU                        16
512 MB ATP ECC PC2700 memory               32
D-Link DGS-1008 Gigabit Ethernet switch    2
Seagate Barracuda hard drive               4

Seven of the nodes are diskless; the eighth acts as a file server for the other seven. The cluster runs GinGin64, a 64-bit implementation of Red Hat Linux 8.
More information about the cluster design is available on the Cluster Design page.

High Performance LinPack

  1. Obtained LinPack Source code: http://www.netlib.org/benchmark/hpl/
  2. Created a Linux_Opteron makefile reflecting the cluster's configuration (see the makefile sketch after this list):
    • mpich 1.2.5
    • acml 2.0 / ATLAS 3.5.6 / GOTO
    • gcc 3.3
  3. Recorded the benchmark output produced by command-line execution of the following statements:
    • <terminal> mpirun -np <number of processors> xhpl &
    • <terminal> do-lab -o renice +19 -u <login_name>
      *Please note do-lab is not a standard Red Hat command*
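
The Make.Linux_Opteron file in step 2 follows HPL's standard Make.<arch> template. The excerpt below is only a sketch of the lines that select the MPI installation, the math library, and the compiler; the paths are illustrative, not the exact ones used on this cluster:

    # Sketch of part of a Make.Linux_Opteron file (illustrative paths)
    ARCH    = Linux_Opteron
    TOPdir  = $(HOME)/hpl
    # MPI: mpich 1.2.5
    MPdir   = /usr/local/mpich-1.2.5
    MPinc   = -I$(MPdir)/include
    MPlib   = $(MPdir)/lib/libmpich.a
    # BLAS: point LAdir/LAlib at ACML 2.0, ATLAS 3.5.6, or the GOTO library
    LAdir   = /opt/acml2.0/gnu64
    LAinc   =
    LAlib   = $(LAdir)/lib/libacml.a
    # Compiler: gcc 3.3
    CC      = gcc
    CCFLAGS = $(HPL_DEFS) -O3 -funroll-loops

Switching math libraries for the comparisons below amounts to changing the LAdir/LAlib lines and rebuilding xhpl.
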
The majority of the benchmark testing was done using the ACML 2.0 math library; the performance of the other libraries can then be extrapolated from the ACML results. The matrix sizes 16384 and 32768 consume 12.5% and 50% of the cluster's memory, respectively. The somewhat unsatisfactory results may be due in part to the problem sizes being too small, which agrees with the later finding that better results are obtained when the problem size is large enough to fill more than 80% of the available memory.
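
The matrix size (N), process grid (P × Q), and block size (NB) that vary in the table below are all set in HPL's input file, HPL.dat. For reference, here is a sketch of the top of that file with one illustrative combination of values from the table; the remaining algorithmic tuning lines of the stock HPL.dat are not shown:

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    6            device out (6=stdout,7=stderr,file)
    1            # of problem sizes (N)
    32768        Ns
    1            # of NBs
    256          NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    4            Ps
    4            Qs
    16.0         threshold
    ...          (remaining algorithmic tuning lines not shown)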


ACML Driver Benchmark Results
Matrix Size   P x Q   Block Size   GFlops
16384         2x8     32           7.836
                      64           4.573
                      128          4.565
                      256          5.535
                      512          4.397
                      1024         8.293
              4x4     32           5.883
                      64           5.035
                      128          8.067
                      256          7.719
                      512          7.057
                      1024         5.568
32768         2x8     32           8.164
                      64           9.495
                      128          10.41
                      256          10.68
                      512          10.41
                      1024         ---
              4x4     32           9.705
                      64           11.27
                      128          12.58
                      256          12.95
                      512          11.69
                      1024         ---


The following graphs were constructed using the data collected from the previous table.


[Five graphs omitted.]


The next step is to test the three math libraries that we have available. Using a matrix problem size of 15,400 (roughly 95% of one node's memory capacity), the following data was compiled.


Single Processor Results
Math Library   GFlops   Peak %
ACML 2.0       2.271    81.1
ATLAS 3.5.6    2.435    87.0
GOTO           2.489    88.9


Next, having seen the superiority of the GOTO library, one last set of tests was administered. The cluster has two network switches: one is used for NFS and tftpboot as well as the mpirun executions, and one that, until now, had not been used at all. After configuring the second network into the cluster (as an independent connection, not bonded), tests were run on a 4-processor section of the cluster with a matrix problem size of 20,000 (roughly 80% of two nodes' memory capacity).


GOTO -- 4 Processor Network Test
Network     GFlops   Peak %
Primary     7.785    69.5
Secondary   8.453    75.5


STREAM

  1. Obtained the STREAM source code: http://www.cs.virginia.edu/stream/
  2. Compiled the source code
  3. Recorded the benchmark results produced by command-line execution of the following statement:
    • <terminal> stream_d
The STREAM benchmark measures how quickly a single node can move data between main memory and the processor. The Triad rate from our benchmark was 1446.4082 MB/s; this is the transfer bandwidth between memory and processor.
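
For reference, the Triad figure comes from a loop of the following form. The sketch below is a simplified paraphrase of the STREAM kernel (the real benchmark repeats each kernel several times, uses wall-clock timing, and verifies the results), using the 64 × 10⁶ array size from this run:

    #include <stdio.h>
    #include <time.h>

    #define N 64000000                   /* 64 x 10^6 doubles per array */

    static double a[N], b[N], c[N];      /* three arrays, roughly 1.5 GB in total */

    int main(void)
    {
        double scalar = 3.0;
        double secs;
        long j;
        clock_t t0 = clock();

        /* STREAM "Triad": a[j] = b[j] + scalar * c[j]
           Each iteration reads two doubles and writes one: 24 bytes of traffic. */
        for (j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];

        secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        /* STREAM reports rates in 10^6 bytes per second. */
        printf("Triad rate: %.1f MB/s\n", 24.0 * (double)N / secs / 1.0e6);
        return 0;
    }

The reported rate counts 24 bytes per iteration (versus 16 for Copy and Scale), which is why Triad and Add take longer per pass in the table above while still reporting comparable MB/s.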


NetPerf 2.2



The NetPerf test is designed to measure the transfer speed of a connection between two computers; it only tests the speed from one NIC to another. While executing this benchmark, a limitation was encountered: NetPerf could not be directed to route its test traffic through the secondary network cards (the secondary network structure), so the results obtained for tests sent over the second network are grossly erroneous. The primary network results, however, are impressive, yielding 93.82% efficiency (938.25 Mbit/s on a gigabit link).
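
For reference, a NetPerf measurement of this kind is a two-step, point-to-point test: the netserver daemon runs on the receiving node and netperf is pointed at it from the sending node (the host name below is a placeholder):

    • <terminal> netserver
    • <terminal> netperf -H <receiving_node>

By default this runs a TCP stream test and reports throughput in 10⁶ bits per second, which is the unit used in the results table above.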


Conclusion

The last thing that I did this summer was to test the running times of the code that Dr. van de Sande and Justin Lambright had written during their summer work. I ran the code and recorded the times that the program reported. It was interesting to see that the best results were not obtained with the same configuration that produced the optimum benchmark results: in this case, the best times came from using all sixteen processors but only one network (not both, as was the case with LinPack).

[Timing graph omitted; the legend below identifies the plotted configurations.]

Color        Processors   Networks
Red          1            1
Pink         2            1
Blue         4            1
Periwinkle   8            1
Yellow       16           1
Green        16           2



Brett van de Sande,
Jeremy Pace
Homepage: http://www.geneva.edu/~bvds
E-mail: bvds@geneva.edu
