PWscf Performance Issues


  CPU time requirements

The following estimates apply to the pw.x code with non-ultrasoft (non-US) pseudopotentials (PPs). For US PPs there are additional terms to be calculated. For phonon calculations, each of the 3*Nat modes requires CPU time of the same order as that required by a self-consistent calculation on the same system.

The computer time required for the self-consistent solution at fixed ionic positions, Tscf, is:

Tscf = Niter*Titer + Tinit

where Niter=niter=number of self-consistency iterations, Titer=CPU time for a single iteration, Tinit=initialization time. Usually Tinit << Niter*Titer.

The time required for a single self-consistency iteration Titer is:

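Titer = Nk*Tdiag + Trho + Tscf    (1)

where Nk=number of k-points, Tdiag=CPU time per Hamiltonian iterative diagonalization, Trho=CPU time for the charge density calculation, and Tscf=CPU time for the Hartree and exchange-correlation potential calculation. Note that in Eq. (1) Tscf denotes only this per-iteration potential term, not the total self-consistency time defined above.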

The time for a hamiltonian iterative diagonalization Tdiag is:

Tdiag = Nh*Th + Torth + Tsub    (2)

where Nh=number of Hψ products needed by iterative diagonalization, Th=CPU time per Hψ product, Torth=CPU time for orthonormalization, Tsub=CPU time for subspace diagonalization.

The time Th required for a Hψ product is

Th = a1*M*N + a2*M*N1*N2*N3*log(N1*N2*N3) + a3*M*P*N    (3)

The first term comes from the kinetic term and is usually much smaller than the others. The second and third terms come from the local and nonlocal potential, respectively. a1, a2, a3 are prefactors, M=number of valence bands, N=number of plane waves (basis set dimension), N1, N2, N3=dimensions of the FFT grid for wavefunctions (N1*N2*N3 ~ 8N), P=number of projectors for PPs (summed over all atoms, over all values of the angular momentum l, and over m=1,...,2l+1).
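As a rough illustration of how the three terms of Eq. (3) grow with system size, the sketch below evaluates each term for a hypothetical cell and for the same cell doubled. It assumes that doubling the cell at fixed cutoff doubles M, N and P, sets all prefactors to 1 (so only ratios are meaningful), and uses N1*N2*N3 = 8N; the function name and the sample sizes are illustrative, not taken from the code.

```python
import math


def t_hpsi_terms(M, N, P, a1=1.0, a2=1.0, a3=1.0):
    """Return the three terms of Eq. (3) in arbitrary units.

    The prefactors a1, a2, a3 default to 1.0 (an assumption), and the
    wavefunction FFT grid size is taken as N1*N2*N3 = 8*N.
    """
    n_fft = 8 * N
    kinetic = a1 * M * N
    local = a2 * M * n_fft * math.log(n_fft)
    nonlocal_ = a3 * M * P * N
    return kinetic, local, nonlocal_


# Hypothetical sizes: a cell and the same cell doubled.
small = t_hpsi_terms(M=100, N=10_000, P=200)
large = t_hpsi_terms(M=200, N=20_000, P=400)

ratios = [big / ref for big, ref in zip(large, small)]
for name, ratio in zip(("kinetic", "local", "nonlocal"), ratios):
    print(f"{name:9s} term grows by a factor of {ratio:.2f}")
```

Under these assumptions the kinetic term grows by a factor of 4, the local (FFT) term by slightly more than 4, and the nonlocal term by 8, so the nonlocal term tends to dominate for large systems unless a3 is very small.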

The time Torth required by orthonormalization is

Torth = b1*Mx^2*N    (4)

and the time Tsub required by subspace diagonalization is

Tsub = b2*Mx^3    (5)

where b1 and b2 are prefactors, Mx=number of trial wavefunctions (this will vary between M and a few times M, depending on the algorithm).

The time Trho for the calculation of charge density from wavefunctions is

Trho = c1*M*Nr1*Nr2*Nr3*log(Nr1*Nr2*Nr3) + c2*M*Nr1*Nr2*Nr3 + Tus    (6)

where c1 and c2 are prefactors, Nr1, Nr2, Nr3=dimensions of the FFT grid for charge density (Nr1*Nr2*Nr3 ~ 8Ng, where Ng=number of G-vectors for the charge density), and Tus=CPU time required by the ultrasoft contribution (if any).

The time Tscf for calculation of potential from charge density is

Tscf = d2*Nr1*Nr2*Nr3 + d3*Nr1*Nr2*Nr3*log(Nr1*Nr2*Nr3)    (7)

where d2 and d3 are prefactors.
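The formulas of this section can be chained into a single estimate. The following is a minimal sketch, not part of pw.x: all prefactors default to 1 (so results are in arbitrary units and useful only for comparing system sizes), Tus is taken as zero (non-US PPs), and the class, function and field names are hypothetical. The potential term of Eq. (7) is called t_pot here to keep it distinct from the total time.

```python
import math
from dataclasses import dataclass


@dataclass
class Sizes:
    M: int      # number of valence bands
    Mx: int     # number of trial wavefunctions (M to a few times M)
    N: int      # number of plane waves
    P: int      # number of PP projectors
    Nk: int     # number of k-points
    Nh: int     # H*psi products per iterative diagonalization
    Niter: int  # number of self-consistency iterations
    nfft: int   # N1*N2*N3, wavefunction FFT grid points
    nrfft: int  # Nr1*Nr2*Nr3, charge-density FFT grid points


def t_h(s, a1=1.0, a2=1.0, a3=1.0):
    """Eq. (3): one H*psi product."""
    return (a1 * s.M * s.N
            + a2 * s.M * s.nfft * math.log(s.nfft)
            + a3 * s.M * s.P * s.N)


def t_diag(s, b1=1.0, b2=1.0):
    """Eq. (2), with Torth from Eq. (4) and Tsub from Eq. (5)."""
    return s.Nh * t_h(s) + b1 * s.Mx**2 * s.N + b2 * s.Mx**3


def t_rho(s, c1=1.0, c2=1.0, t_us=0.0):
    """Eq. (6): charge density; t_us=0 assumes non-US PPs."""
    return c1 * s.M * s.nrfft * math.log(s.nrfft) + c2 * s.M * s.nrfft + t_us


def t_pot(s, d2=1.0, d3=1.0):
    """Eq. (7): potential from charge density."""
    return d2 * s.nrfft + d3 * s.nrfft * math.log(s.nrfft)


def t_iter(s):
    """Eq. (1): one self-consistency iteration."""
    return s.Nk * t_diag(s) + t_rho(s) + t_pot(s)


def t_total(s, t_init=0.0):
    """Total time: Niter*Titer + Tinit."""
    return s.Niter * t_iter(s) + t_init


# Hypothetical sizes, for illustration only.
example = Sizes(M=100, Mx=200, N=10_000, P=200, Nk=10, Nh=20,
                Niter=15, nfft=80_000, nrfft=640_000)
print(f"diagonalization share of one iteration: "
      f"{example.Nk * t_diag(example) / t_iter(example):.3f}")
```

With measured prefactors for a given machine, the same structure would give times in seconds; with unit prefactors it only indicates which term dominates.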

  Memory requirements

A typical pw.x run will require a maximum memory on the order of O double-precision complex numbers, where

O = m*M*N + M*P + p*N1*N2*N3 + q*Nr1*Nr2*Nr3    (8)

with m, p, q=small factors; all other variables have the same meaning as above.

This holds for the phonon code as well, with larger factors m, p, q.
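To turn Eq. (8) into a memory figure, recall that a double-precision complex number occupies 16 bytes. The sketch below does this conversion; the factors m, p, q are set to hypothetical values, since the text only states that they are small, and the function name and sample sizes are illustrative.

```python
BYTES_PER_COMPLEX = 16  # two 8-byte double-precision reals


def memory_bytes(M, N, P, nfft, nrfft, m=1.0, p=1.0, q=1.0):
    """Estimate peak memory from Eq. (8).

    nfft = N1*N2*N3 and nrfft = Nr1*Nr2*Nr3. The factors m, p, q
    default to 1.0, which is an assumption made for illustration.
    """
    count = m * M * N + M * P + p * nfft + q * nrfft
    return count * BYTES_PER_COMPLEX


# Hypothetical sizes, with guessed factors m=4, p=2, q=4.
estimate = memory_bytes(M=100, N=10_000, P=200,
                        nfft=80_000, nrfft=640_000,
                        m=4.0, p=2.0, q=4.0)
print(f"estimated peak memory: {estimate / 2**20:.1f} MiB")
```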

  Parallelization issues

The program can run in principle on any number of processors (up to maxproc, presently fixed at 128 in para/para.inc). The Np processors can be divided into Npk pools of Npr processors, Np = Npk*Npr. The k-points are divided across Npk pools ("k-point parallelization"), while both R- and G-space grids are divided across the Npr processors of each pool ("PW parallelization"). At present there is no parallelization on the number of bands.

Note that if you restart or read data from a preceding run, you must restart with exactly the same number of processors and the same number of pools. The only exception is when an SCF calculation is followed by a band-structure calculation. In this case the only link between the two is the potential file, which does not depend on the number of pools and processors.

The effectiveness of parallelization depends on the size and type of the system and on a judicious choice of Npk and Npr:

  • k-point parallelization is very effective if Npk is a divisor of the number of k-points (linear speedup guaranteed), BUT it does not reduce the amount of memory per processor taken by the calculation. As a consequence, large systems may not fit into memory.
  • PW parallelization works well if Npr is a divisor of both dimensions along the z axis of the FFT grids, N3 and Nr3 (which may coincide). It does not scale as well as k-point parallelization, but it reduces both CPU time AND memory (the latter almost linearly).
  • Optimal scalar performance is achieved when the data are kept in the cache as much as possible. This is very important for SGI Origin, less so for IBM machines. As a side effect, one can achieve better than linear scaling with the number of processors, thanks to the increase in scalar speed coming from the reduction of data size (making it easier for the machine to keep data in the cache).
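The two divisibility conditions above can be checked mechanically before submitting a job. The following sketch enumerates every factorization Np = Npk*Npr and flags which ones satisfy them; it is a planning aid under the stated rules only, the function name is hypothetical, and it says nothing about cache effects or memory limits.

```python
def pool_layouts(n_procs, n_kpoints, n3, nr3):
    """List all (Npk, Npr) with Npk*Npr == n_procs and flag the good ones.

    k_balanced:   Npk divides the number of k-points.
    fft_balanced: Npr divides both N3 and Nr3 (z dimensions of the FFT grids).
    """
    layouts = []
    for npk in range(1, n_procs + 1):
        if n_procs % npk != 0:
            continue
        npr = n_procs // npk
        layouts.append({
            "Npk": npk,
            "Npr": npr,
            "k_balanced": n_kpoints % npk == 0,
            "fft_balanced": n3 % npr == 0 and nr3 % npr == 0,
        })
    return layouts


# Hypothetical job: 8 processors, 4 k-points, N3=24, Nr3=36.
layouts = pool_layouts(n_procs=8, n_kpoints=4, n3=24, nr3=36)
good = [(l["Npk"], l["Npr"]) for l in layouts
        if l["k_balanced"] and l["fft_balanced"]]
print("layouts meeting both conditions:", good)
```

Among the layouts that pass, a larger Npr lowers memory per processor, while a larger Npk favors speedup, so the final choice depends on whether the system fits into memory.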

 
