Performance issues (PWscf)

CPU time requirements

The following holds for the pw.x code and for non-US PPs. For US PPs there are additional terms to be calculated. For phonon calculations, each of the 3Nat modes requires a CPU time of the same order as that required by a self-consistent calculation on the same system.

The computer time required for the self-consistent solution at fixed ionic positions, Tscf , is:

Tscf = Niter . Titer + Tinit

where Niter = niter = number of self-consistency iterations, Titer = CPU time for a single iteration, Tinit = initialization time. Usually Tinit << Niter . Titer.

The time required for a single self-consistency iteration Titer is:

Titer = Nk . Tdiag + Trho + Tscf

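where Nk = number of k-points, Tdiag = CPU time per Hamiltonian iterative diagonalization, Trho = CPU time for charge density calculation, Tscf = CPU time for Hartree and exchange-correlation potential calculation. (In this formula Tscf denotes only the potential calculation inside one iteration, not the total self-consistency time defined above.)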

The time for a Hamiltonian iterative diagonalization Tdiag is:

Tdiag = Nh . Th + Torth + Tsub

where Nh = number of H products needed by iterative diagonalization, Th = CPU time per H product, Torth = CPU time for orthonormalization, Tsub = CPU time for subspace diagonalization.
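The three nested formulas above can be composed into a single estimate. The following is only a minimal sketch: the function name, the argument names and the sample timings are invented for illustration, and `t_pot` stands for the potential-calculation time that the text calls Tscf inside the Titer formula.

```python
def scf_time(n_iter, n_k, n_h, t_h, t_orth, t_sub, t_rho, t_pot, t_init=0.0):
    """Compose the nested timing formulas into one estimate of the total Tscf.

    All t_* arguments are CPU times in the same (arbitrary) unit.
    t_pot is the Hartree + exchange-correlation potential time, renamed
    here only to avoid the clash with the total Tscf.
    """
    t_diag = n_h * t_h + t_orth + t_sub    # Tdiag = Nh . Th + Torth + Tsub
    t_iter = n_k * t_diag + t_rho + t_pot  # Titer = Nk . Tdiag + Trho + Tscf(potential)
    return n_iter * t_iter + t_init        # Tscf  = Niter . Titer + Tinit


# Made-up timings (seconds), chosen only to exercise the formula.
total = scf_time(n_iter=10, n_k=4, n_h=3, t_h=2.0, t_orth=1.0, t_sub=0.5,
                 t_rho=3.0, t_pot=1.0, t_init=5.0)
print(total)
```

With these numbers the diagonalization term Nk . Tdiag accounts for 30 of the 34 time units per iteration, which is the typical reason to look at Th, Torth and Tsub first.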

The time Th required for a H product is

Th = a1 . M . N + a2 . M . N1 . N2 . N3 . log(N1 . N2 . N3) + a3 . M . P . N.

The first term comes from the kinetic term and is usually much smaller than the others. The second and third terms come from the local and nonlocal potential, respectively. a1, a2, a3 are prefactors, M = number of valence bands, N = number of plane waves (basis set dimension), N1, N2, N3 = dimensions of the FFT grid for wavefunctions (N1 . N2 . N3 ~ 8N), P = number of projectors for PPs (summed over all atoms, over all values of the angular momentum l, and m = 1,..., 2l + 1).
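To see how the three terms of Th compare, they can be evaluated separately. This is an illustrative sketch only: the prefactors are set to 1 (the real values depend on machine and libraries), the natural logarithm is assumed (a different base only rescales a2), and the system sizes are hypothetical.

```python
import math


def h_product_terms(M, N, fft_dims, P, a1=1.0, a2=1.0, a3=1.0):
    """Return the kinetic, local and nonlocal terms of Th in arbitrary units."""
    n1, n2, n3 = fft_dims
    n_fft = n1 * n2 * n3
    kinetic = a1 * M * N                      # a1 . M . N
    local = a2 * M * n_fft * math.log(n_fft)  # a2 . M . N1.N2.N3 . log(N1.N2.N3)
    nonloc = a3 * M * P * N                   # a3 . M . P . N
    return kinetic, local, nonloc


# Hypothetical system: 100 bands, 8000 plane waves, 200 projectors,
# and a 40x40x40 FFT grid, so that N1 . N2 . N3 = 8N.
kin, loc, nl = h_product_terms(M=100, N=8000, fft_dims=(40, 40, 40), P=200)
print(f"kinetic={kin:.3g}  local={loc:.3g}  nonlocal={nl:.3g}")
```

With unit prefactors the kinetic term is smaller than the other two by roughly two orders of magnitude, in line with the remark above; the balance between the local and nonlocal terms depends on P and on the actual prefactors.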

The time Torth required by orthonormalization is

Torth = b1 . Mx^2 . N

and the time Tsub required by subspace diagonalization is

Tsub = b2 . Mx^3

where b1 and b2 are prefactors, Mx = number of trial wavefunctions (this will vary between M and a few times M, depending on the algorithm).
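Because Mx can range from M to a few times M, the quadratic and cubic dependence matters in practice. The sketch below reads the two formulas as Mx squared and Mx cubed, uses unit prefactors, and takes hypothetical sizes; the function names are not part of the package.

```python
def orth_time(Mx, N, b1=1.0):
    """Orthonormalization cost, Torth = b1 . Mx^2 . N (arbitrary units)."""
    return b1 * Mx**2 * N


def subspace_time(Mx, b2=1.0):
    """Subspace diagonalization cost, Tsub = b2 . Mx^3 (arbitrary units)."""
    return b2 * Mx**3


M, N = 100, 8000  # hypothetical number of bands and plane waves
for factor in (1, 2, 4):
    Mx = factor * M
    orth_ratio = orth_time(Mx, N) / orth_time(M, N)
    sub_ratio = subspace_time(Mx) / subspace_time(M)
    print(f"Mx = {factor} M: Torth x{orth_ratio:g}, Tsub x{sub_ratio:g}")
```

Doubling the number of trial wavefunctions multiplies Torth by 4 and Tsub by 8, independently of the prefactors.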

The time Trho for the calculation of charge density from wavefunctions is

Trho = c1 . M . Nr1 . Nr2 . Nr3 . log(Nr1 . Nr2 . Nr3) + c2 . M . Nr1 . Nr2 . Nr3 + Tus

where c1 and c2 are prefactors, Nr1, Nr2, Nr3 = dimensions of the FFT grid for charge density (Nr1 . Nr2 . Nr3 ~ 8Ng, where Ng = number of G-vectors for the charge density), and Tus = CPU time required by the ultrasoft contribution (if any).

The time Tscf for calculation of potential from charge density is

Tscf = d2 . Nr1 . Nr2 . Nr3 + d3 . Nr1 . Nr2 . Nr3 . log(Nr1 . Nr2 . Nr3)

where d2 and d3 are prefactors.

Memory requirements

A typical self-consistency or molecular-dynamics run requires a maximum memory on the order of O double precision complex numbers, where

O = m . M . N + P . N + p . N1 . N2 . N3 + q . Nr1 . Nr2 . Nr3

with m, p, q = small factors; all other variables have the same meaning as above. Note that if only the Gamma point (k = 0) is used to sample the Brillouin Zone, the value of N is halved.
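The count O can be turned into bytes by assuming 16 bytes per double precision complex number. The following is a rough sketch: the values chosen for m, p and q are placeholders (the text only says they are small), and the system sizes are hypothetical.

```python
BYTES_PER_COMPLEX = 16  # double precision complex = two 8-byte reals


def memory_bytes(M, N, P, fft_dims, rho_dims, m=2, p=1, q=4):
    """Estimate peak memory from O = m.M.N + P.N + p.N1.N2.N3 + q.Nr1.Nr2.Nr3.

    m, p, q are placeholder values for the "small factors" in the text.
    """
    n1, n2, n3 = fft_dims
    nr1, nr2, nr3 = rho_dims
    count = m * M * N + P * N + p * n1 * n2 * n3 + q * nr1 * nr2 * nr3
    return count * BYTES_PER_COMPLEX


# Hypothetical system: 100 bands, 8000 plane waves, 200 projectors,
# 40^3 wavefunction grid and 64^3 charge-density grid.
est = memory_bytes(M=100, N=8000, P=200,
                   fft_dims=(40, 40, 40), rho_dims=(64, 64, 64))
print(f"{est / 2**20:.1f} MiB")
```

Such a number is only an order-of-magnitude guide; the memory.x code mentioned below gives an estimate based on the actual input.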

The code memory.x yields a rough estimate of the memory required by pw.x and also checks the validity of the input data file. Use it exactly as pw.x.

The memory required by the phonon code follows the same patterns, with somewhat larger factors m , p , q .

File space requirements

A typical pw.x run will require an amount of temporary disk space on the order of O double precision complex numbers:

O = Nk . M . N + q . Nr1 . Nr2 . Nr3

where q = 2 . mixing_ndim (mixing_ndim = number of iterations used in self-consistency mixing, default value = 8) if disk_io is set to 'high' or not specified; q = 0 if disk_io = 'low' or 'minimal'.
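The same kind of estimate applies to scratch space. In this sketch the function and its argument names are invented for illustration; `n_mix` stands for the mixing count that enters q (default 8, as stated above), and 16 bytes per complex number is assumed.

```python
def scratch_bytes(Nk, M, N, rho_dims, n_mix=8, disk_io="high"):
    """Estimate temporary disk space from O = Nk.M.N + q.Nr1.Nr2.Nr3.

    q = 2 * n_mix for disk_io 'high' (also the case when it is not
    specified), and q = 0 for 'low' or 'minimal'.
    """
    nr1, nr2, nr3 = rho_dims
    q = 0 if disk_io in ("low", "minimal") else 2 * n_mix
    count = Nk * M * N + q * nr1 * nr2 * nr3
    return count * 16  # bytes per double precision complex number


# Hypothetical run: 4 k-points, 100 bands, 8000 plane waves, 64^3 density grid.
high = scratch_bytes(4, 100, 8000, (64, 64, 64))
low = scratch_bytes(4, 100, 8000, (64, 64, 64), disk_io="low")
print(f"high: {high / 2**20:.1f} MiB, low: {low / 2**20:.1f} MiB")
```

For this example the mixing files more than double the scratch space, so lowering disk_io is a simple way to save disk when space is tight.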


Parallelization issues

pw.x can run in principle on any number of processors (up to maxproc, presently fixed at 128 in PW/para.f90). The Np processors can be divided into Npk pools of Npr processors, Np = Npk . Npr. The k-points are divided across the Npk pools ("k-point parallelization"), while both R- and G-space grids are divided across the Npr processors of each pool ("PW parallelization"). A third level of parallelization, on the number of bands, is currently confined to the calculation of a few quantities that would not be parallelized at all otherwise. A fourth level of parallelization, on the number of NEB images, is available for NEB calculations only.
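The constraint Np = Npk . Npr limits the possible layouts to the divisors of Np. The sketch below enumerates them and shows one plausible way of spreading k-points over pools; the even, remainder-first split is an assumption made for illustration and is not taken from the pw.x source.

```python
def pool_layouts(n_proc):
    """List all (Npk, Npr) pairs with Npk * Npr == n_proc."""
    return [(npk, n_proc // npk)
            for npk in range(1, n_proc + 1)
            if n_proc % npk == 0]


def kpoints_per_pool(n_k, n_pools):
    """Split n_k k-points as evenly as possible over n_pools pools.

    Assumed scheme: the first (n_k mod n_pools) pools get one extra k-point.
    """
    base, extra = divmod(n_k, n_pools)
    return [base + 1 if i < extra else base for i in range(n_pools)]


for npk, npr in pool_layouts(8):
    print(f"Npk={npk} Npr={npr} k-points per pool: {kpoints_per_pool(10, npk)}")
```

With 10 k-points on 8 processors, the layouts with 4 or 8 pools leave the pools unevenly loaded (some pools get one k-point more than others, or none more), whereas 1 or 2 pools balance exactly; this is one reason the choice of Npk and Npr affects efficiency.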

The effectiveness of parallelization depends on the size and type of the system and on a judicious choice of Npk and Npr.

Note that for each system there is an optimal range for the number of processors on which to run the job. Too many processors will degrade performance, or may cause the parallelization algorithm to fail to distribute the R- and G-space grids properly.

Note also that Beowulf-style machines (PC clusters) may have disappointing parallel performance unless they have decent communication hardware (at least Gigabit ethernet). Do not expect good scaling with cheap hardware: plane-wave calculations are not at all an "embarrassingly parallel" problem. Note that multiprocessor motherboards for Intel Pentium CPUs typically have just one memory bus for all processors. This dramatically slows down any code that accesses memory heavily (as most codes in the Quantum-ESPRESSO package do) when it runs on several processors of the same motherboard.

The PWSCF Group - 2005-11-18