VASP

The Vienna Ab initio Simulation Package (VASP) [1] is a package for performing electronic structure calculations from first principles, based on density functional theory [2]. In VASP, central quantities such as the one-electron wave functions, the electronic charge density, and the local potential are expressed in plane-wave basis sets, and the interactions between ions and electrons are described using the projector-augmented-wave method [3]. The atomic structures studied with VASP are specified by a unit cell subject to periodic boundary conditions. Figure 1 illustrates this with a contour plot of the self-consistent charge density in a simple cubic unit cell of Si.

To determine the electronic ground state, VASP uses efficient iterative matrix diagonalisation techniques, such as the residual minimisation method with direct inversion of the iterative subspace (RMM-DIIS) used in the benchmarks presented below. These are coupled to highly efficient Broyden and Pulay density mixing schemes to speed up the self-consistency cycle (see [1] for a detailed description of VASP).

Figure 1. Contour plot of the charge density in a simple cubic unit cell of Si.

Scaling

The computational cost of the RMM-DIIS iterative diagonalisation of the Hamiltonian scales as
N_b N_pw ln N_pw,
where N_b is the number of occupied electronic orbitals in the system, N_pw is the number of plane waves in the basis set, and N_pw ln N_pw is the cost of a Fast Fourier Transform. Since N_b and N_pw scale linearly with increasing system size N, the fundamental scaling behaviour of the RMM-DIIS is N² ln N.

To obtain a robust algorithm, the one-electron wave functions resulting from several iterations of the RMM-DIIS diagonalisation have to be explicitly orthonormalised. This is done by Cholesky (LU) decomposition, which unfortunately scales as N_b² N_pw (i.e. N³).
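The orthonormalisation step described above can be sketched in a few lines. This is a minimal illustration, not VASP's implementation: it assumes real-valued vectors stored as plain Python lists (VASP works with complex plane-wave coefficients and uses ScaLAPACK), and the function names `overlap`, `cholesky`, and `orthonormalise` are hypothetical. The overlap matrix S is factorised as S = L Lᵀ, and the vectors are then transformed with the inverse of L.

```python
import math

def overlap(vectors):
    """Overlap matrix S[i][j] = <v_i|v_j> for real vectors.

    Building S costs about N_b^2 * N_pw operations for N_b vectors of
    length N_pw, which is the N^3 scaling discussed in the text.
    """
    return [[sum(a * b for a, b in zip(vi, vj)) for vj in vectors]
            for vi in vectors]

def cholesky(s):
    """Lower-triangular L with S = L L^T (S symmetric positive definite)."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def orthonormalise(vectors):
    """Return L^{-1} V, computed row by row with forward substitution."""
    low = cholesky(overlap(vectors))
    out = []
    for i, v in enumerate(vectors):
        w = list(v)
        # Remove the components along the rows that are already orthonormal.
        for k in range(i):
            w = [wx - low[i][k] * ox for wx, ox in zip(w, out[k])]
        out.append([wx / low[i][i] for wx in w])
    return out

if __name__ == "__main__":
    # Three linearly independent, non-orthogonal example vectors.
    vecs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    for row in overlap(orthonormalise(vecs)):
        print([round(x, 12) for x in row])
```

After the transformation, the overlap matrix of the returned vectors is the identity up to rounding errors. For complex wave functions the inner product would use the complex conjugate of the first argument, which this sketch leaves out.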

For very large systems, the orthonormalisation will become the dominant step. The following benchmarks, however, were still largely determined by the cost of the RMM-DIIS.
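To get a feeling for how quickly the two contributions diverge, the following sketch evaluates both model scalings, N² ln N for the RMM-DIIS and N³ for the orthonormalisation, relative to a reference system. It assumes the idealised scaling laws quoted above with unknown prefactors, so it only shows relative growth and says nothing about absolute timings or about where the crossover actually occurs. The function names and the reference size of 256 are illustrative choices.

```python
import math

def rmm_diis_cost(n):
    """Model cost of the iterative diagonalisation, proportional to N^2 ln N."""
    return n * n * math.log(n)

def ortho_cost(n):
    """Model cost of the orthonormalisation, proportional to N^3."""
    return float(n) ** 3

def relative_growth(cost, n, n_ref=256):
    """Model cost at system size n divided by the cost at size n_ref."""
    return cost(n) / cost(n_ref)

if __name__ == "__main__":
    for n in (256, 512, 1024, 2048, 4096):
        diag = relative_growth(rmm_diis_cost, n)
        orth = relative_growth(ortho_cost, n)
        print(f"N={n:5d}  N^2 ln N: {diag:7.1f}x  N^3: {orth:7.1f}x  "
              f"ratio: {orth / diag:5.2f}")
```

In this model, going from N=256 to N=4096 increases the N² ln N term by a factor of 384 and the N³ term by a factor of 4096, so the N³ contribution gains roughly one order of magnitude on the diagonalisation over this range.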

Figure 2 shows the overall scaling of the self-consistency cycle (red line) and cost of the orthonormalisation (blue line) with increasing system size (diamond, with N=256, 512, 1024, 2048, and 4096 atoms in the unit cell; 2 valence states per atom) on 32 cores of the VSC. The green line shows the ratio between the time per iteration in the self-consistency cycle and the time per orthonormalisation step. Clearly, as the system size increases, the cost of orthonormalisation makes up an ever larger part of the total effort. Note that in the examples in Figure 2, the orthonormalisation is not the only part of the complete algorithm that scales as N³. To analyse this further, however, is beyond the scope of the present contribution.

Figure 2. Dependence of the computational cost on the system size N: the total time per iteration in the self-consistency cycle (SCC: red line) and the time per orthonormalisation step (blue line), both relative to the corresponding contributions for N=256 atoms. The black line denotes the theoretical N² ln N scaling of the RMM-DIIS iterative matrix diagonalisation, and the green line represents the ratio between the total time per SCC iteration and the time per orthonormalisation step.

The scaling behaviour of VASP with respect to the number of compute cores is illustrated in Figure 3, for diamond with 1024 (red line), 2048 (blue line), and 4096 (green line) atoms in the unit cell. The black line represents the nominal speedup (linear w.r.t. the number of cores). For the benchmark systems presented here, VASP scales close to the nominal speedup up to 64 cores. The two largest systems, with N=2048 and 4096 atoms in the unit cell, show a satisfactory speedup up to 128 cores. For all systems under consideration, the speedup deteriorates beyond 128 cores. A more detailed analysis (not shown here) reveals that the part that scales worst w.r.t. the number of compute cores is the orthonormalisation (Cholesky decomposition from Intel MKL’s ScaLAPACK).
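The speedup and the deviation from nominal scaling discussed above can be computed from measured wall-clock times as in the following sketch. The helper names are hypothetical, and the timings in the usage example are placeholder numbers chosen only to demonstrate the arithmetic; they are not taken from the benchmarks in this document. The reference core count of 32 is likewise an assumption.

```python
def speedup(t_ref, t_p):
    """Speedup of a run taking t_p seconds relative to a reference run of t_ref."""
    return t_ref / t_p

def parallel_efficiency(t_ref, p_ref, t_p, p):
    """Measured speedup divided by the nominal (linear) speedup p / p_ref.

    A value of 1.0 means ideal linear scaling; smaller values mean that
    part of the added cores is not used effectively.
    """
    return speedup(t_ref, t_p) / (p / p_ref)

if __name__ == "__main__":
    # Placeholder timings in seconds per SCC iteration, keyed by core count.
    # These values are invented for illustration and are NOT benchmark results.
    timings = {32: 100.0, 64: 52.0, 128: 30.0, 256: 24.0}
    p_ref = 32
    for p, t in sorted(timings.items()):
        s = speedup(timings[p_ref], t)
        e = parallel_efficiency(timings[p_ref], p_ref, t, p)
        print(f"{p:4d} cores: speedup {s:5.2f}, efficiency {e:5.2f}")
```

Plotting the speedup against the number of cores, together with the line p / p_ref, gives a figure of the same kind as Figure 3.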

Figure 3. Computational speedup with respect to the number of compute cores, for a diamond unit cell containing 1024 (red line), 2048 (blue line), and 4096 (green line) atoms. The nominal scaling is represented by the dashed black line.

Software

Compiler: Intel Fortran 11.1
Libraries: FFTW, Intel MKL (BLAS, LAPACK, and ScaLAPACK)
Parallelisation: QLogic MPI

References

[1] G. Kresse and J. Furthmüller, Comput. Mater. Sci. 6, 15-50 (1996); G. Kresse and J. Furthmüller, Phys. Rev. B 54, 11169 (1996).

[2] W. Kohn, Rev. Mod. Phys. 71, 1253 (1999).

[3] P. E. Blöchl, Phys. Rev. B 50, 17953 (1994); G. Kresse and D. Joubert, Phys. Rev. B 59, 1758 (1999).