Diagonalization of sparse symmetric matrices on parallel computers

This project focuses on the more computational aspects of the transverse lattice. From a computational point of view, we must calculate the lowest eigenvalues and eigenvectors of very large matrices. We are developing computer code that performs these computations efficiently on a large parallel computer. Our application for computing resources at PSC describes the status of this work as of 1999.

- Message passing is accomplished using MPI (a minimal example follows this list).
- Local machines will run the MPICH implementation of the protocol.
- We might also look into using the LAM implementation.
- A local copy of the MPI 1.1 standard and errata is kept here.
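
The packages used in this project are Fortran libraries; purely as a hedged illustration of the message-passing model, here is a minimal point-to-point example using the Python `mpi4py` bindings (this script, and the choice of Python, are ours and not part of the project code):

```python
# Minimal MPI point-to-point example (mpi4py bindings).
# Run with, e.g.:  mpiexec -n 2 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

if rank == 0:
    # Rank 0 sends a small message to rank 1.
    comm.send("hello from rank 0", dest=1, tag=0)
    print(f"rank 0 of {size}: message sent")
elif rank == 1:
    msg = comm.recv(source=0, tag=0)
    print(f"rank 1 of {size}: received {msg!r}")
```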

- We have two computers for local code development. Each machine has:
  - Motherboard: Tyan Tiger 100, S1832DL. (The newer Tiger 133, S1834, was not available in time.)
  - CPU: dual Intel Pentium II, 400 MHz, 512 KB cache.
  - Memory: 256 MB PC100 ECC SDRAM.
  - Network: 100 Mbit/s Ethernet card.
  - Other junk: monitors, disk drives, *et cetera*.

- Large computations are performed at the Pittsburgh Supercomputing Center (PSC) on a Cray T3E.
- Lanczos Methods (Symmetric/Hermitian); see the Lanczos iteration sketch after this list:
  - **LANCZOS**: Cullum and Willoughby's symmetric Lanczos package
  - **LANZ**: Jones and Patrick's symmetric Lanczos package
  - **LASO**: D. Scott's symmetric Lanczos package
  - LANSO: Parlett and co-workers' symmetric Lanczos package
  - PLANSO: parallel version of LANSO by Wu and Simon
  - TRLAN: Thick-Restart Lanczos by Wu and Simon
  - IRLM in ARPACK
  - PARPACK: a parallel version of ARPACK, designed to solve large sparse eigensystems on a parallel machine. Unfortunately, the Arnoldi method is less efficient than Lanczos for symmetric eigenvalue problems.
  - BLZPACK: Block Lanczos; not parallel, but does have a complex version.

- Some packages handle parallel sparse matrices; these are mostly linear-equation solvers. The ScaLAPACK Project contains several packages of interest, including PARPACK and ScaLAPACK itself. These are discussed in a review by Victor Eijkhout.
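
All of these packages are variations on the same basic iteration. Purely as a sketch (plain NumPy, not any of the packages listed above), the Lanczos method builds an orthonormal basis in which the symmetric matrix becomes tridiagonal; the eigenvalues of the small tridiagonal matrix, the Ritz values, converge quickly to the extreme eigenvalues of the original matrix. That is why the benchmarks below find 5 converged Ritz values in only 120-160 steps, even for matrices of order a million.

```python
# A bare-bones Lanczos iteration with full reorthogonalization (NumPy sketch,
# not taken from any of the packages listed above).
import numpy as np

def lanczos_ritz_values(A, m, seed=0):
    """Return the Ritz values after m Lanczos steps on a symmetric matrix A.

    A only needs to support the product A @ v, which is why Lanczos works
    well for very large sparse matrices.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m))                 # Lanczos vectors
    alpha = np.zeros(m)                  # diagonal of the tridiagonal matrix T
    beta = np.zeros(m)                   # off-diagonal of T

    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    for j in range(m):
        Q[:, j] = q
        w = A @ q
        alpha[j] = q @ w
        # Full reorthogonalization keeps this toy version numerically clean.
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-12:              # invariant subspace found; stop early
            m = j + 1
            break
        q = w / beta[j]

    T = np.diag(alpha[:m]) + np.diag(beta[: m - 1], 1) + np.diag(beta[: m - 1], -1)
    return np.linalg.eigvalsh(T)         # the Ritz values

# Toy check: after 30 steps the lowest Ritz values approach the lowest
# eigenvalues (1, 2, 3, ...) of this 200 x 200 diagonal test matrix.
A = np.diag(np.arange(1.0, 201.0))
print(lanczos_ritz_values(A, 30)[:5])
```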

We have adapted the `PLANDR2()` function of the PLANSO package (Kesheng Wu and Horst Simon) to run on both a Cray T3E and on a cluster of PCs. The T3E machine is a Cray T3E-900 located at PSC in Pittsburgh, with 512 application PEs; each PE is a 450 MHz Alpha processor with 128 MB of RAM. The PC cluster consists of Intel-based PCs: two dual-processor 400 MHz Pentium II machines and one single-processor 400 MHz Pentium II machine, each with 256 MB of RAM.

**PC cluster**

On the PC cluster, using the 5 available PEs, we have successfully diagonalized a
sparse symmetric matrix of order 282,274 with 13,194,799 non-zero elements.

**Info we care about:**

| Quantity | Value |
| --- | --- |
| Average CPU usage | 60-80% |
| Memory usage | approx. 100 MB |
| Approx. avg. diagonalization time/PE | 209 seconds |
| Lanczos steps taken | 124 |
| # of Ritz values found | 5 |
| Error bound | less than ±1.0E-09 |

Also on the PC cluster, we have diagonalized a slightly smaller matrix of order 206,622, but with more non-zero elements: 15,474,058.

**Info we care about:**

| Quantity | Value |
| --- | --- |
| Average CPU usage | 60-80% |
| Memory usage | approx. 97 MB |
| Approx. avg. diagonalization time/PE | 189 seconds |
| Lanczos steps taken | 134 |
| # of Ritz values found | 5 |
| Error bound | less than ±1.0E-09 |

**T3E**

On the Cray T3E at Pittsburgh, we have so far successfully diagonalized a matrix of order 1,037,921 containing 57,694,268 non-zero elements, using half of the available PEs (256).

**Info we care about:**

| Quantity | Value |
| --- | --- |
| Average CPU usage | unknown |
| Memory usage | approx. 85 MB |
| Approx. avg. diagonalization time/PE | 132 seconds |
| Lanczos steps taken | 127 |
| # of Ritz values found | 5 |
| Error bound | less than ±1.0E-09 |

**A comparison of `DNLASO` and `PLANDR2`**

We also tested the LASO package, using the `DNLASO` function of that package. This is a uniprocessor package, so we compared it with the PLANSO package run on a single PE. Our test matrix had order 110,488 and contained 7,313,136 non-zero elements; it was the largest matrix we could diagonalize with both this particular package and the PLANSO package using only a single PE. We also compared these results with multi-PE runs of PLANSO on both the PC cluster and the T3E, using 5 PEs on each, with the same test matrix.

**A comparison on the PC cluster**

| | DNLASO (1 PE) | PLANDR2 (1 PC cluster PE) | PLANDR2 (5 PC cluster PEs) |
| --- | --- | --- | --- |
| Diagonalization time/PE | 569 sec. | 158 sec. | 86 sec. |
| % speedup from previous | ---- | 350% | 180% |
| Memory usage for Lanczos vectors/PE | 5 MB | 60 MB | 19 MB |
| CPU usage | 97-99% | 97-99% | 60-80% |
| Lanczos steps | 162 | 133 | 122 |

**A comparison between the PC cluster and the T3E**

| | PLANDR2 (5 PC cluster PEs) | PLANDR2 (5 T3E PEs) |
| --- | --- | --- |
| Diagonalization time | 86 sec. | 37 sec. |
| % speedup from previous | 180% (from 1 PC cluster PE) | 230% |
| CPU usage | 60-80% | unknown |
| Memory usage for Lanczos vectors/PE | 19 MB | approx. 20 MB |
| Lanczos steps | 122 | 131 |

A comparison of the multi-PE runs on the T3E and on the PC cluster shows the PC cluster's weaker inter-PE communication. CPU usage on the T3E could not be measured; however, from the diagonalization times it is easy to deduce that the T3E sustains higher CPU usage than our local PC cluster, mostly because communication between its processors is faster. (In these tables, the speedup percentages are ratios of diagonalization times: for example, going from one to five PEs on the PC cluster cut the time from 158 seconds to 86 seconds, a factor of about 1.8, i.e. roughly 180%.)

Note that the `store` function in the file "store.f" was slightly modified when creating the lanso1 and plan libraries: the array `A` was changed from 500,000 elements to 5,000,000 elements to accommodate the large Lanczos vectors.

The purpose of this internship was to modify Dr. van de Sande's code so that it diagonalizes certain Hamiltonian matrices in a truly parallel manner. His code used the `DNLASO` subroutine of the LASO package, as well as some LAPACK routines for smaller matrices. That code was parallel only in the sense that a "master" process controlled "slave" processes that each worked on a separate problem, after which the master gathered the answers. Our goal was to modify the code so that all processes work on the same problem and then leave the answer with the master process; a sketch of that structure follows.
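
As a rough, hedged sketch of that target structure (again in Python with `mpi4py`, not the actual Fortran code): every process holds one block of a distributed vector, all processes contribute to the same global operation, and the result is reduced onto the master.

```python
# Sketch: all processes cooperate on one problem (a global dot product of two
# distributed vectors, a basic building block of a parallel Lanczos step),
# and the answer is left with the master process (rank 0).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_global = 1_000_000
# Split the global index range into one contiguous block per process.
counts = [n_global // size + (1 if r < n_global % size else 0) for r in range(size)]
start = sum(counts[:rank])
local_n = counts[rank]

# Local blocks of two global vectors (stand-ins for Lanczos vectors).
x = np.arange(start, start + local_n, dtype=float)
y = np.ones(local_n)

local_dot = float(np.dot(x, y))                       # each rank's partial sum
global_dot = comm.reduce(local_dot, op=MPI.SUM, root=0)

if rank == 0:
    # Only the master holds the final answer: 0 + 1 + ... + (n_global - 1).
    print("global dot product:", global_dot)
```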

The summer internship started with learning Linux and the other fundamentals the research would rely on throughout the summer. The next task was diagonalizing a test matrix with LAPACK routines, which also taught the basics of the GNU compilers. Next, the `DNLASO` subroutine of the LASO package was used to diagonalize the same test matrix. Then a matrix-vector multiply was written that uses Dr. van de Sande's format for storing the sparse symmetric Hamiltonian matrix; a sketch of this kind of kernel follows this paragraph. This brought the new code to a stage comparable to Dr. van de Sande's existing diagonalization code, and the two were then benchmarked against each other.
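
The storage format in the actual code is Dr. van de Sande's own and is not reproduced here; the sketch below instead assumes a generic CSR-style layout holding only the upper triangle (diagonal included) of the symmetric matrix. It shows the one kernel a Lanczos routine calls repeatedly, y = A x.

```python
# Hedged sketch of a symmetric sparse matrix-vector product.  The storage
# format here (CSR of the upper triangle) is assumed for illustration only;
# it is not Dr. van de Sande's actual format.
import numpy as np

def sym_upper_csr_matvec(indptr, indices, data, x):
    """Compute y = A @ x where only the upper triangle of the symmetric
    matrix A (diagonal included) is stored in CSR arrays."""
    y = np.zeros_like(x)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            a = data[k]
            y[i] += a * x[j]          # contribution of A[i, j]
            if j != i:
                y[j] += a * x[i]      # mirrored contribution of A[j, i]
    return y

# 3 x 3 example:  A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
indptr  = np.array([0, 2, 4, 5])
indices = np.array([0, 1, 1, 2, 2])
data    = np.array([2.0, 1.0, 2.0, 1.0, 2.0])
print(sym_upper_csr_matvec(indptr, indices, data, np.ones(3)))   # -> [3. 4. 3.]
```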

The next step was researching the Cray T3E at PSC and its environment, since that is where the code would eventually be fully tested and run. After that, research began on packages that could diagonalize the Hamiltonians completely in parallel; this search turned up the PLANSO package. After extensive testing and debugging, the code was run on both the PC cluster and the Cray; the benchmarks are listed above. During the final debugging and testing, another similar package, TRLan, was discovered and investigated. It reportedly produces the same results as PLANSO, but we could not evaluate it fully for lack of a Fortran 90 compiler. After the benchmarking and final testing, the PLANSO code was integrated with Dr. van de Sande's code, and some final modifications were made to the new code, his code, and the PLANSO package. This is where the internship left off.

Brett van de Sande

Homepage: http://www.geneva.edu/~bvds

E-mail: bvds@pitt.edu