Benchmark Test of MPI Programs on PRIMEPOWER HPC2500
Benchmark Test of the New Domain Decomposition 3-Dimensional MHD Code
Results of the Benchmark Test of the Domain Decomposition 3-Dimensional Fortran MHD Code
November 9, 2004
STEL, Nagoya University
Tatsuki Ogino
A summary of the execution results of the benchmark test of the domain decomposition
MHD simulation codes is presented. The 3-dimensional domain decomposition MHD code
(D), which was newly developed for the scalar-parallel machine, demonstrates high
performance, as was expected. On the other hand, the 2-dimensional domain
decomposition MHD code (B) can be expected to give the best performance on a
vector-parallel machine with a large number of CPUs.
(A) mearthd1dc2n.f 1D Domain Decomposition for Vector-Parallel Machine f(nx2,ny2,nz2,nb)
(B) mearthd2dc2n.f 2D Domain Decomposition for Vector-Parallel Machine f(nx2,ny2,nz2,nb)
(C) mearthd3dc2n.f 3D Domain Decomposition for Vector-Parallel Machine f(nx2,ny2,nz2,nb)
(D) mearthd3dd2n.f 3D Domain Decomposition for Scalar-Parallel Machine f(nb,nx2,ny2,nz2)
Table 1 summarizes the computation speeds per CPU (Gflops/cpu) obtained when the
four MHD codes were executed on four kinds of supercomputers. The computation
speeds for (A), (B), and (C) show high efficiency on the vector-parallel machines,
as was expected; on the contrary, they do not show high efficiency on the
scalar-parallel machine, HPC2500. In particular, the computation speed for (A)
shows the worst efficiency. This is because the cache hit rate goes down due to
the large array. On the other hand, the best parallel efficiency (over 1.4 GF/cpu
on HPC2500) is realized for (D) owing to the increased cache hit rate.
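To illustrate why the ordering of the array indices matters, the following is a
minimal free-form Fortran sketch (with hypothetical array and parameter names, not
taken from the actual codes) of the two layouts and of the loop order that suits
each machine type: the long unit-stride x-loop of (A)-(C) favors a vector
processor, while placing the component index nb innermost as in (D) keeps all
components of one grid point in the same cache line of a scalar processor.

      program layout_sketch
      implicit none
      integer, parameter :: nx2=16, ny2=16, nz2=16, nb=8
      real(8) :: f(nx2,ny2,nz2,nb)   ! layout of (A)-(C) for vector-parallel machines
      real(8) :: g(nb,nx2,ny2,nz2)   ! layout of (D) for scalar-parallel machines
      real(8) :: dt
      integer :: i, j, k, m
      dt = 0.01d0
      f = 1.0d0
      g = 1.0d0
!     (A)-(C): x is the fastest index, giving long unit-stride vector loops.
      do m = 1, nb
        do k = 1, nz2
          do j = 1, ny2
            do i = 1, nx2
              f(i,j,k,m) = f(i,j,k,m) + dt*f(i,j,k,m)
            end do
          end do
        end do
      end do
!     (D): nb is the fastest index, so the components of one grid point
!     are updated together out of a single cache line.
      do k = 1, nz2
        do j = 1, ny2
          do i = 1, nx2
            do m = 1, nb
              g(m,i,j,k) = g(m,i,j,k) + dt*g(m,i,j,k)
            end do
          end do
        end do
      end do
      print *, f(1,1,1,1), g(1,1,1,1)
      end program layout_sketch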
Table 2 shows the computation speeds obtained when the four MHD codes with an
enlarged array, (nx,ny,nz)=(510,510,510), were executed on the NEC ES (Earth
Simulator). The MHD codes could not be run with fewer than 64 CPUs because of
the limit of computer memory size. For 512 CPUs the 2D domain decomposition
code (B) shows the best efficiency; moreover, it also shows the best
scalability up to 512 CPUs for the grid size of (nx,ny,nz)=(510,254,254) in
Table 1. As was expected, the 2D domain decomposition MHD code (B) can be
vectorized in the x-direction and parallelized in two dimensions, in the y- and
z-directions. Therefore, it becomes the most efficient code for a
vector-parallel machine with a large number of CPUs, owing to the reduced
amount of communication.
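As an illustration of this decomposition, the following is a minimal free-form
Fortran/MPI sketch (hypothetical names and sizes; it is not taken from
mearthd2dc2n.f) that builds a 2D process grid over y and z while the x-direction
is kept whole on every process, so that the innermost x-loop remains long and
fully vectorizable.

      program decomp2d_sketch
      use mpi
      implicit none
      integer, parameter :: nx = 510, ny = 254, nz = 254
      integer :: ierr, nprocs, myrank, comm2d, nyl, nzl
      integer :: dims(2), coords(2)
      logical :: periods(2)
      call mpi_init(ierr)
      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
!     Build a 2D process grid over (y,z); x is not decomposed.
      dims = 0
      periods = .false.
      call mpi_dims_create(nprocs, 2, dims, ierr)
      call mpi_cart_create(mpi_comm_world, 2, dims, periods, .true., comm2d, ierr)
      call mpi_comm_rank(comm2d, myrank, ierr)
      call mpi_cart_coords(comm2d, myrank, 2, coords, ierr)
!     Each process owns the full x-range but only a slab of y and z,
!     so halo exchange is needed only across the y- and z-faces.
      nyl = ny / dims(1)
      nzl = nz / dims(2)
      if (myrank == 0) print *, 'process grid:', dims, ' local block:', nx, nyl, nzl
      call mpi_finalize(ierr)
      end program decomp2d_sketch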
Table 3 summarizes the computation speeds for a simple linear code that solves
the 3-dimensional wave equation instead of the MHD equations. The code is very
simple, but the program structure of the 3D wave equation code is the same as
that of the MHD code. The value in ratio-1 is a relative computation speed
normalized by the speed for 8 CPUs; a larger value means higher efficiency. The
value in ratio-2 is another relative computation speed, normalized by the speed
for 8 CPUs with a grid number of (nx,ny,nz)=(126,126,126) on HPC2500. It is
noted that no decrease of parallel computation performance is found in ratio-2
as the number of CPUs increases up to 512 and the array size increases up to
(1022,1022,1022). The parallel computation performance rather appears to
increase for a large number of CPUs. This implies the favorable possibility
that the amount of communication in the parallel computation does not keep
increasing but saturates even for a great increase in the number of CPUs,
because communication is required only among neighboring CPUs.
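The following minimal free-form Fortran/MPI sketch (hypothetical names; it is not
the actual mwave611.f) shows this neighbor-only communication pattern of a 3D
domain decomposition: each process exchanges one ghost plane with its nearest
neighbors in the process grid, written out here only for the z-direction, so the
communication per CPU depends on the local block size and not on the total
number of CPUs.

      program wave3d_halo_sketch
      use mpi
      implicit none
!     Local block per cpu, with one ghost layer on each face.
      integer, parameter :: nxl = 32, nyl = 32, nzl = 32
      real(8) :: u(0:nxl+1, 0:nyl+1, 0:nzl+1)
      integer :: ierr, nprocs, myrank, comm3d, zlow, zhigh, nplane
      integer :: dims(3), coords(3), status(mpi_status_size)
      logical :: periods(3)
      call mpi_init(ierr)
      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      dims = 0
      periods = .false.
      call mpi_dims_create(nprocs, 3, dims, ierr)
      call mpi_cart_create(mpi_comm_world, 3, dims, periods, .true., comm3d, ierr)
      call mpi_comm_rank(comm3d, myrank, ierr)
      call mpi_cart_coords(comm3d, myrank, 3, coords, ierr)
!     Nearest neighbors in the z-direction of the process grid
!     (mpi_proc_null at the outer boundaries).
      call mpi_cart_shift(comm3d, 2, 1, zlow, zhigh, ierr)
      u = real(myrank, 8)
      nplane = (nxl+2)*(nyl+2)
!     Send the top interior plane up and receive the bottom ghost plane;
!     then the reverse.  The x- and y-faces are handled the same way,
!     using derived datatypes for their non-contiguous planes.
      call mpi_sendrecv(u(0,0,nzl), nplane, mpi_real8, zhigh, 1, &
                        u(0,0,0),   nplane, mpi_real8, zlow,  1, &
                        comm3d, status, ierr)
      call mpi_sendrecv(u(0,0,1),     nplane, mpi_real8, zlow,  2, &
                        u(0,0,nzl+1), nplane, mpi_real8, zhigh, 2, &
                        comm3d, status, ierr)
      if (myrank == 0) print *, 'halo exchange done on process grid', dims
      call mpi_finalize(ierr)
      end program wave3d_halo_sketch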
The 3D MHD code (D), mearthd3dd2n.f, has the same structure as the 3D wave code
mwave611.f. Therefore, it is strongly expected that the MHD code (D),
mearthd3dd2n.f, will also show good performance when executed with a large
number of CPUs on HPC2500. In fact, high efficiency is obtained with the 3D MHD
code up to 16 CPUs in Table 1. The parallel computation speed of 1.416 GF/cpu
for (D) is better than that for (A) (about 1.2-1.3 GF/cpu). Therefore, the 3D
domain decomposition MHD code (D), mearthd3dd2n.f, is highly likely to show the
best performance when executed on the scalar-parallel machine, HPC2500.
November 9, 2004
Solar-Terrestrial Environment Laboratory, Nagoya University
Tatsuki Ogino
ogino@stelab.nagoya-u.ac.jp
----------------------------------------------------------------------
Table 1. Computer Processing Capability by 3D MHD Code
for (nx,ny,nz)=(510,254,254) (Gflops/cpu)
----------------------------------------------------------------------
CPU VPP5000 PRIMEPOWER NEC SX6 NEC ES
Number HPC2500
(GF/PE) (ngrd1) (GF/PE) (GF/PE)
----------------------------------------------------------------------
(A) mearthd1dc2n.f 2 7.08 ----- 6.36 6.66
1D Domain Decomposition 4 7.02 0.031 5.83 6.60
by f(nx2,ny2,nz2,nb) 8 6.45 0.030 5.51 6.50
16 6.18 0.028 5.44 6.49
32 6.39
64 6.37
(B) mearthd2dc2n.f 4 7.51 0.199 6.34 6.63
2D Domain Decomposition 8 6.88 0.191 6.28 6.47
by f(nx2,ny2,nz2,nb) 16 6.49 0.200 6.23 6.45
32 6.47
64 6.32
512 5.62
(C) mearthd3dc2n.f 8 7.14 0.207 6.24 6.38
3D Domain Decomposition 16 6.77 0.202 6.34 6.33
by f(nx2,ny2,nz2,nb) 32 6.25
64 5.61
512 3.94
(D) mearthd3dd2n.f 8 2.91 1.438 1.13 4.11
3D Domain Decomposition 16 2.63 1.416 4.53 4.06
by f(nb,nx2,ny2,nz2) 32 4.11
64 4.17
512 3.70
----------------------------------------------------------------------
----------------------------------------------------------------------
Table 2. Computer Processing Capability by 3D MHD Code
for (nx,ny,nz)=(510,510,510) (Gflops/cpu)
----------------------------------------------------------------------
CPU VPP5000 PRIMEPOWER NEC SX6 NEC ES
Number HPC2500
(GF/PE) (ngrd1) (GF/PE) (GF/PE)
----------------------------------------------------------------------
(A) 1D Domain Decomposition 64 ----- ----- ----- 6.30
by f(nx2,ny2,nz2,nb)
(B) 2D Domain Decomposition 64 ----- ----- ----- 6.46
by f(nx2,ny2,nz2,nb) 512 6.22
(C) 3D Domain Decomposition 64 ----- ----- ----- 5.58
by f(nx2,ny2,nz2,nb) 512 4.12
(D) 3D Domain Decomposition 64 ----- ----- ----- 4.21
by f(nb,nx2,ny2,nz2) 512 3.78
----------------------------------------------------------------------
------------------------------------------------------------------------------------
Table 3. CPU time for 8 time-step advances of the simulation codes for the 3D wave equation,
mwave601.f (126,126,126) 1D Domain Decomposition
mwave611.f (126,126,126) 3D Domain Decomposition
PRIMEPOWER HPC2500 (SPARC64 V 1.56GHz x 128) x 4 units
Parallelnavi2.3
Solaris Ver.8
Compiler Options:
-Kfast_GP2=3,largepage=2 -O5 -Kprefetch=4,prefetch_cache_level=2,
prefetch_strong -Cpp -KV9 for grid number of (1022,1022,1022)
Set up of Nodes:
8-64 cpus for 1 node (available up to 127 cpus)
128 cpus for 2 nodes (64 cpus * 2)
256 cpus for 8 nodes (32 cpus * 8)
computer node number mwave601.f mwave611.f grid number ratio-1 ratio-2
ngrd1 8cpu 2.40222(s) 0.44976(s) ( 126, 126, 126) 0.9284
ngrd1 16cpu 1.23224(s) 0.22249(s) ( 126, 126, 126) 0.9384
ngrd1 32cpu 0.67273(s) 0.11144(s) ( 126, 126, 126) 0.9367
HPC2500 1 8cpu 1.93585 0.41756 ( 126, 126, 126) 1.0000 1.0000
HPC2500 1 16cpu 1.03095 0.21870 ( 126, 126, 126) 0.9546
HPC2500 1 32cpu 0.61046 0.11134 ( 126, 126, 126) 0.9376
HPC2500 1 64cpu 0.04028 ( 126, 126, 126) 1.2958
HPC2500 2 128cpu 0.02490 ( 126, 126, 126) 1.0481
HPC2500 8 256cpu 0.01176 ( 126, 126, 126) 1.1096
HPC2500 1 8cpu 191.28384 2.85172 ( 254, 254, 254) 1.0000 1.1714
HPC2500 1 16cpu 98.17854 1.64795 ( 254, 254, 254) 0.8652 1.0135
HPC2500 1 32cpu 51.53204 0.85513 ( 254, 254, 254) 0.8337 0.9766
HPC2500 1 64cpu 29.37091 0.40170 ( 254, 254, 254) 0.8874 1.0395
HPC2500 2 128cpu 0.23528 ( 254, 254, 254) 0.7575 0.8873
HPC2500 8 256cpu 0.10433 ( 254, 254, 254) 0.8542 1.0006
HPC2500 1 8cpu 1656.47640 22.13068 ( 510, 510, 510) 1.0000 1.2075
HPC2500 1 16cpu 853.39174 10.79360 ( 510, 510, 510) 1.0252 1.2380
HPC2500 1 32cpu 456.50143 5.71031 ( 510, 510, 510) 0.9689 1.1700
HPC2500 1 64cpu 273.57107 3.14025 ( 510, 510, 510) 0.8809 1.0637
HPC2500 2 128cpu 1.77276 ( 510, 510, 510) 0.7802 0.9421
HPC2500 8 256cpu 0.78852 ( 510, 510, 510) 0.8771 1.0591
HPC2500 1 8cpu 186.24975 (1022,1022,1022) 1.0000 1.1479
HPC2500 1 16cpu 82.38303 (1022,1022,1022) 1.1304 1.2976
HPC2500 1 32cpu 44.09687 (1022,1022,1022) 1.0559 1.2120
HPC2500 1 64cpu 2495.20300 27.04602 (1022,1022,1022) 0.7608 0.8733
HPC2500 2 128cpu 12.20572 (1022,1022,1022) 0.9537 1.0947
HPC2500 8 256cpu 5.40574 (1022,1022,1022) 1.0767 1.2359
VPP5000 4cpu 0.30021(s) ( 126, 126, 126)
VPP5000 8cpu 0.21667(s) 0.14309(s) ( 126, 126, 126) 2.9195
VPP5000 16cpu 0.08808(s) 0.11995(s) ( 126, 126, 126) 1.7406
------------------------------------------------------------------------------------