Benchmark Test of MPI programs by PRIMEPOWER HPC2500

Benchmark Test of New Domain Decomposition 3-Dimensional MHD Code

Results of the Benchmark Test of the Domain Decomposition 3-Dimensional Fortran MHD Code

                     November 9, 2004
                     STEL, Nagoya University
                     Tatsuki Ogino

This report summarizes the results of benchmark tests of the domain 
decomposition MHD simulation codes. The 3-dimensional domain decomposition MHD 
code (D), which was newly developed for scalar-parallel machines, demonstrates 
high performance, as expected. On the other hand, the 2-dimensional domain 
decomposition MHD code (B) is likely to give the best performance on 
vector-parallel machines with a large number of cpus. 

(A) mearthd1dc2n.f  1D Domain Decomposition for Vector-Parallel Machine  f(nx2,ny2,nz2,nb)
(B) mearthd2dc2n.f  2D Domain Decomposition for Vector-Parallel Machine  f(nx2,ny2,nz2,nb)
(C) mearthd3dc2n.f  3D Domain Decomposition for Vector-Parallel Machine  f(nx2,ny2,nz2,nb)
(D) mearthd3dd2n.f  3D Domain Decomposition for Scalar-Parallel Machine  f(nb,nx2,ny2,nz2)

Table 1 summarizes the computation speed per cpu (GFlops/cpu) obtained when 
the four MHD codes were executed on four kinds of supercomputers. As expected, 
codes (A), (B) and (C) run with high efficiency on the vector-parallel 
machines; in contrast, they do not run efficiently on the scalar-parallel 
machine, HPC2500. In particular, code (A) shows the worst efficiency there. 
This is because the cache hit rate goes down due to the large array. On the 
other hand, the best efficiency of parallel computation (over 1.4 GF/cpu on 
HPC2500) is achieved by (D), owing to an increased cache hit rate. 
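
The difference between the array layouts f(nx2,ny2,nz2,nb) and 
f(nb,nx2,ny2,nz2) can be illustrated with a short Python sketch of Fortran 
column-major addressing, in which the first index varies fastest. It is only 
an illustration, not taken from the MHD codes: nx2=nx+2 and nb=8 are 
assumptions, and whole-grid sizes are used although each process in the real 
codes holds only its own subdomain.

```python
# Sketch: element offsets under Fortran column-major order
# (the FIRST index varies fastest in memory).

def fortran_offset(index, shape):
    """0-based linear offset of a 0-based multi-index in a column-major array."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset

# Table 1 grid (nx,ny,nz)=(510,254,254); nx2 = nx+2 etc. is an assumption.
nx2, ny2, nz2 = 512, 256, 256
nb = 8  # assumed number of variables stored per grid point

# Layout used by (A), (B), (C): f(nx2,ny2,nz2,nb)
shape_abc = (nx2, ny2, nz2, nb)
stride_nb_abc = fortran_offset((0, 0, 0, 1), shape_abc)  # next variable, same point
stride_x_abc = fortran_offset((1, 0, 0, 0), shape_abc)   # next point in x

# Layout used by (D): f(nb,nx2,ny2,nz2)
shape_d = (nb, nx2, ny2, nz2)
stride_nb_d = fortran_offset((1, 0, 0, 0), shape_d)      # next variable, same point
stride_x_d = fortran_offset((0, 1, 0, 0), shape_d)       # next point in x

print("f(nx2,ny2,nz2,nb): variable stride", stride_nb_abc, "x stride", stride_x_abc)
print("f(nb,nx2,ny2,nz2): variable stride", stride_nb_d, "x stride", stride_x_d)
```

With the first layout, consecutive x points are adjacent in memory, which 
suits long vector loops, while the variables of one grid point lie far 
apart. With the second layout, all variables of one grid point are adjacent, 
which is the kind of locality a cache can exploit.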

Table 2 shows the computation speeds obtained when the four MHD codes were 
run with an enlarged array, (nx,ny,nz)=(510,510,510), on the NEC ES (Earth 
Simulator). The MHD codes could not run on fewer than 64 cpus because of the 
limit of computer memory size. For 512 cpus the 2D domain decomposition 
(B) shows the best efficiency; moreover, it also shows the best scalability 
up to 512 cpus for a grid size of (nx,ny,nz)=(510,254,254) in Table 1. As 
expected, the 2D domain decomposition MHD code (B) can be vectorized 
in the x direction and parallelized in two dimensions, the y and z directions. 
It therefore becomes the most efficient code for vector-parallel 
machines with a large number of cpus, owing to the decrease in communication. 
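
The trade-off between vector length and communication can be sketched in 
Python as follows. This is a rough estimate under stated assumptions, not a 
measurement: the process grids (1,1,64), (1,8,8) and (4,4,4), the choice of 
z as the 1D decomposition direction, and a halo of one grid layer per face 
are all hypothetical, and the count is per variable for an interior process.

```python
from math import ceil

def local_shape(grid, procs):
    """Subdomain size per process (ceiling division in each direction)."""
    return tuple(ceil(n / p) for n, p in zip(grid, procs))

def halo_cells(grid, procs):
    """Cells exchanged per variable by an interior process, one layer per face."""
    lx, ly, lz = local_shape(grid, procs)
    px, py, pz = procs
    cells = 0
    if px > 1:
        cells += 2 * ly * lz   # two faces normal to x
    if py > 1:
        cells += 2 * lx * lz   # two faces normal to y
    if pz > 1:
        cells += 2 * lx * ly   # two faces normal to z
    return cells

grid = (510, 254, 254)         # grid size of Table 1
layouts = {                    # hypothetical process grids for 64 cpus
    "1D (z)":     (1, 1, 64),
    "2D (y,z)":   (1, 8, 8),
    "3D (x,y,z)": (4, 4, 4),
}
for name, procs in layouts.items():
    lx = local_shape(grid, procs)[0]
    print(name, "x-loop length:", lx, "halo cells:", halo_cells(grid, procs))
```

Under these assumptions the 2D decomposition keeps the full x length of 510 
for the vectorized inner loop while exchanging far fewer cells than the 1D 
decomposition. The 3D decomposition exchanges fewer cells still, but it 
shortens the x loop, which matters on a vector machine.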

Table 3 summarizes the computation speeds for a simple linear code that 
solves the 3-dimensional wave equation instead of the MHD equations. The code 
is very simple, but the program structure for the 3D wave equation is the same 
as that for the MHD equations. The value in ratio-1 is the computation speed 
relative to the speed for 8 cpus at the same grid size; a larger value means 
higher efficiency. The value in ratio-2 is another relative computation speed, 
taken relative to the speed for 8 cpus with grid number 
(nx,ny,nz)=(126,126,126) on HPC2500. It should be noted that ratio-2 shows no 
decrease in parallel computation performance as the number of cpus increases 
up to 256 and the array size increases up to (1022,1022,1022). If anything, 
the parallel computation performance appears to improve for a large number of 
cpus. This suggests the favorable possibility that the amount of communication 
does not grow in parallel computation but saturates even when the number of 
cpus increases greatly, because communication is required only between 
neighboring cpus. 
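
The two ratios can be reproduced from the CPU times of mwave611.f in Table 3 
with the short Python sketch below. The formulas are not stated in the 
report; they were inferred by checking them against the tabulated values, 
and the scaling of the work by (n+2)^3 is an assumption that happens to 
match those values. The function names are hypothetical.

```python
# Reference run: HPC2500, 8 cpus, (126,126,126), mwave611.f (Table 3).
T_REF, CPU_REF, N_REF = 0.41756, 8, 126

def ratio1(t8, t, ncpu):
    """Speed per cpu relative to the 8-cpu run at the SAME grid size.

    t8 is the CPU time with 8 cpus, t the CPU time with ncpu cpus.
    """
    return (t8 * 8) / (t * ncpu)

def ratio2(t, ncpu, n):
    """Speed per cpu relative to the reference run, scaled by grid size.

    The work is assumed to scale with (n+2)**3 for an (n,n,n) grid.
    """
    work = ((n + 2) / (N_REF + 2)) ** 3
    return (T_REF * CPU_REF * work) / (t * ncpu)

# A few rows of Table 3 (HPC2500, mwave611.f)
print(round(ratio1(0.41756, 0.21870, 16), 4))    # 16 cpus, (126,126,126)
print(round(ratio1(22.13068, 10.79360, 16), 4))  # 16 cpus, (510,510,510)
print(round(ratio2(2.85172, 8, 254), 4))         # 8 cpus, (254,254,254)
print(round(ratio2(5.40574, 256, 1022), 4))      # 256 cpus, (1022,1022,1022)
```

A value of 1 means the same speed per cpu as the respective reference run; 
values above 1 mean that the larger run uses each cpu more efficiently.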

The 3D MHD code (D), mearthd3dd2n.f, has the same structure as the 3D 
wave code mwave611.f. It is therefore strongly expected that code (D) will 
show better performance when executed with a large number of cpus on 
HPC2500. In fact, Table 1 shows that the 3D MHD code achieves high efficiency 
up to 16 cpus. The parallel computation speed of 1.416 GF/cpu for (D) is 
better than that (about 1.2-1.3 GF/cpu) for (A). Therefore the 3D domain 
decomposition MHD code (D), mearthd3dd2n.f, is very likely to show the best 
performance on the scalar-parallel machine, HPC2500. 

November 9, 2004
Solar-Terrestrial Environment Laboratory, Nagoya University
Tatsuki Ogino

Table 1. Computer Processing Capability by 3D MHD Code
         for (nx,ny,nz)=(510,254,254)     (Gflops/cpu)
                              CPU     VPP5000  PRIMEPOWER  NEC SX6  NEC ES
                              Number           HPC2500
                                      (GF/PE)  (ngrd1)    (GF/PE)  (GF/PE)
(A) mearthd1dc2n.f              2      7.08     -----      6.36      6.66
 1D Domain Decomposition        4      7.02     0.031      5.83      6.60
 by f(nx2,ny2,nz2,nb)           8      6.45     0.030      5.51      6.50
                               16      6.18     0.028      5.44      6.49
                               32                                    6.39
                               64                                    6.37

(B) mearthd2dc2n.f              4      7.51     0.199      6.34      6.63
 2D Domain Decomposition        8      6.88     0.191      6.28      6.47  
 by f(nx2,ny2,nz2,nb)          16      6.49     0.200      6.23      6.45
                               32                                    6.47
                               64                                    6.32
                              512                                    5.62

(C) mearthd3dc2n.f              8      7.14     0.207      6.24      6.38
 3D Domain Decomposition       16      6.77     0.202      6.34      6.33
 by f(nx2,ny2,nz2,nb)          32                                    6.25
                               64                                    5.61
                              512                                    3.94

(D) mearthd3dd2n.f              8      2.91     1.438      1.13      4.11
 3D Domain Decomposition       16      2.63     1.416      4.53      4.06
 by f(nb,nx2,ny2,nz2)          32                                    4.11
                               64                                    4.17
                              512                                    3.70

Table 2. Computer Processing Capability by 3D MHD Code
         for (nx,ny,nz)=(510,510,510)     (Gflops/cpu)
                              CPU     VPP5000  PRIMEPOWER  NEC SX6  NEC ES
                              Number           HPC2500
                                      (GF/PE)  (ngrd1)    (GF/PE)  (GF/PE)
(A) 1D Domain Decomposition    64      -----    -----     -----      6.30
    by f(nx2,ny2,nz2,nb) 

(B) 2D Domain Decomposition    64      -----    -----     -----      6.46
    by f(nx2,ny2,nz2,nb)      512                                    6.22

(C) 3D Domain Decomposition    64      -----    -----     -----      5.58
    by f(nx2,ny2,nz2,nb)      512                                    4.12

(D) 3D Domain Decomposition    64      -----    -----     -----      4.21
    by f(nb,nx2,ny2,nz2)      512                                    3.78

Table 3. CPU time of 8 step advances in simulation code for 3D wave equation, 
  mwave601.f (126,126,126) 1D Domain Decomposition
  mwave611.f (126,126,126) 3D Domain Decomposition
  PRIMEPOWER HPC2500 (SPARC64 V 1.56GHz x 128) x 4 units
  Solaris Ver.8

  Compiler Options:
  -Kfast_GP2=3,largepage=2 -O5 -Kprefetch=4,prefetch_cache_level=2,
  prefetch_strong -Cpp -KV9 for grid number of (1022,1022,1022)
  Set up of Nodes:
   8-64 cpus for 1 node  (available up to 127 cpus)
   128 cpus for  2 nodes (64 cpus * 2)
   256 cpus for  8 nodes (32 cpus * 8)

computer  node    number     mwave601.f    mwave611.f   grid number       ratio-1 ratio-2
ngrd1              8cpu      2.40222(s)    0.44976(s)   ( 126, 126, 126)  0.9284
ngrd1             16cpu      1.23224(s)    0.22249(s)   ( 126, 126, 126)  0.9384
ngrd1             32cpu      0.67273(s)    0.11144(s)   ( 126, 126, 126)  0.9367

HPC2500      1     8cpu      1.93585       0.41756      ( 126, 126, 126)  1.0000  1.0000
HPC2500      1    16cpu      1.03095       0.21870      ( 126, 126, 126)  0.9546
HPC2500      1    32cpu      0.61046       0.11134      ( 126, 126, 126)  0.9376
HPC2500      1    64cpu                    0.04028      ( 126, 126, 126)  1.2958
HPC2500      2   128cpu                    0.02490      ( 126, 126, 126)  1.0481
HPC2500      8   256cpu                    0.01176      ( 126, 126, 126)  1.1096

HPC2500      1     8cpu    191.28384       2.85172      ( 254, 254, 254)  1.0000  1.1714
HPC2500      1    16cpu     98.17854       1.64795      ( 254, 254, 254)  0.8652  1.0135
HPC2500      1    32cpu     51.53204       0.85513      ( 254, 254, 254)  0.8337  0.9766
HPC2500      1    64cpu     29.37091       0.40170      ( 254, 254, 254)  0.8874  1.0395
HPC2500      2   128cpu                    0.23528      ( 254, 254, 254)  0.7575  0.8873
HPC2500      8   256cpu                    0.10433      ( 254, 254, 254)  0.8542  1.0006

HPC2500      1     8cpu   1656.47640      22.13068      ( 510, 510, 510)  1.0000  1.2075
HPC2500      1    16cpu    853.39174      10.79360      ( 510, 510, 510)  1.0252  1.2380
HPC2500      1    32cpu    456.50143       5.71031      ( 510, 510, 510)  0.9689  1.1700
HPC2500      1    64cpu    273.57107       3.14025      ( 510, 510, 510)  0.8809  1.0637
HPC2500      2   128cpu                    1.77276      ( 510, 510, 510)  0.7802  0.9421
HPC2500      8   256cpu                    0.78852      ( 510, 510, 510)  0.8771  1.0591

HPC2500      1     8cpu                  186.24975      (1022,1022,1022)  1.0000  1.1479
HPC2500      1    16cpu                   82.38303      (1022,1022,1022)  1.1304  1.2976
HPC2500      1    32cpu                   44.09687      (1022,1022,1022)  1.0559  1.2120
HPC2500      1    64cpu   2495.20300      27.04602      (1022,1022,1022)  0.7608  0.8733
HPC2500      2   128cpu                   12.20572      (1022,1022,1022)  0.9537  1.0947
HPC2500      8   256cpu                    5.40574      (1022,1022,1022)  1.0767  1.2359

VPP5000            4cpu      0.30021(s)                 ( 126, 126, 126)
VPP5000            8cpu      0.21667(s)    0.14309(s)   ( 126, 126, 126)          2.9195
VPP5000           16cpu      0.08808(s)    0.11995(s)   ( 126, 126, 126)          1.7406
