Parallel programming in MC#: an experience of implementation of ALCMD molecular dynamics code
Technical Report Program Systems Institute of Pereslavl-Zalessky, July 9, 2005
Table of contents
3. The computational molecular dynamics package ALCMD (Ames Lab Classical Molecular Dynamics)
4. Implementation of the ALCMD package in the MC# language
4.1. Movable methods as virtual processors
4.3. Blackboard: a method of implementing collective operations in the MC# language
5. Comparison of the ALCMD implementation in the MC# language with the original ALCMD code
5.1. Issues of result comparability
5.2. Further work on improving the Mono implementation of the .NET platform and the MC# programming system

1. Introduction
This report describes the implementation of the ALCMD package (Ames Lab Classical Molecular Dynamics, http://cmp.ameslab.gov/cmp/CMP_Theory/cmd/alcmd_source.html) in the parallel programming language MC# (http://u.pereslavl.ru/~vadim/MCSharp). The main purpose of the project is to assess whether complex computational programs, which are usually written in C or FORTRAN with the help of communication libraries such as MPI, can be implemented in byte-code languages such as C# running in managed runtime environments. For this project we used Mono, a free implementation of the .NET platform for Unix-like systems (http://www.mono-project.com/).

2. The basics of the MC# language
MC# is a universal high-level programming language based on C#, intended primarily for computations on cluster architectures. Its distinguishing feature is that it adopts the asynchronous parallel programming model of Polyphonic C#. Asynchronous methods of Polyphonic C# (now known as Cω) can, in MC#, be scheduled for execution on a remote machine (for example, a cluster node); in MC# they are called movable methods. Movable methods executing on different nodes interact through channels, a special syntactic class in the MC# language. For synchronization MC# uses bounds, which are the analogue of the chords of Polyphonic C#. Although channels are essentially one-way entities (as in the join-calculus), MC# also introduces so-called “bi-directional channels”, which exist in the system as a separate class, BDChannel. Values can be both sent to and received from the same channel of this kind. More information about the MC# language, including articles and examples of implemented applications, is available on the site of the MC# project: http://u.pereslavl.ru/~vadim/MCSharp/

3. The computational molecular dynamics package ALCMD (Ames Lab Classical Molecular Dynamics)
    using System;

    public class Blackboard
    {
        private BDChannel[] inChannels;   // one incoming channel per participant
        private BDChannel[] outChannels;  // channels used to send a value to every participant
        private int size;

        public Blackboard( int size )
        {
            inChannels = new BDChannel [size];
            for ( int i = 0; i < size; i++ )
                inChannels [i] = new BDChannel();
            this.size = size;
        }

        public void setOutChannels( BDChannel[] outChannels )
        {
            this.outChannels = outChannels;
        }

        // Sends the same object to every outgoing channel.
        public void setValue( object o )
        {
            for ( int i = 0; i < size; i++ )
                outChannels [i].Send( o );
        }

        // Receives one double[] from each incoming channel and
        // returns their element-wise sum.
        public double[] getdArraySum()
        {
            double[] darray, result;
            result = (double[]) inChannels [0].Receive() [0];
            for ( int i = 1; i < size; i++ )
            {
                darray = (double[]) inChannels [ i ].Receive() [0];
                for ( int j = 0; j < darray.Length; j++ )
                    result [ j ] += darray [j];
            }
            return result;
        }

        public BDChannel[] getInChannels()
        {
            return inChannels;
        }
    }
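
The summation performed by getdArraySum() can be tried without an MC# runtime. The sketch below is an illustration in plain C#, not part of the ALCMD port: it assumes .NET 4 or later and uses BlockingCollection<double[]> as a hypothetical stand-in for BDChannel. The names BlackboardSketch and SumFromChannels are invented for this example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class BlackboardSketch
{
    // Element-wise sum of one array taken from each channel, in channel
    // order, following the loop structure of getdArraySum().
    // ASSUMPTION: BlockingCollection<double[]> replaces MC#'s BDChannel.
    public static double[] SumFromChannels( BlockingCollection<double[]>[] channels )
    {
        // Take() blocks until a contribution has been added to the channel.
        // The first array is cloned so the sender's data is not modified.
        double[] result = (double[]) channels[0].Take().Clone();
        for ( int i = 1; i < channels.Length; i++ )
        {
            double[] darray = channels[i].Take();
            for ( int j = 0; j < darray.Length; j++ )
                result[j] += darray[j];
        }
        return result;
    }

    public static void Main()
    {
        const int workers = 4;
        var channels = new BlockingCollection<double[]>[workers];
        for ( int i = 0; i < workers; i++ )
            channels[i] = new BlockingCollection<double[]>();

        // Worker number id sends the partial result { id, 2*id, 3*id }
        // on its own channel.
        var tasks = new Task[workers];
        for ( int w = 0; w < workers; w++ )
        {
            int id = w + 1;   // local copy captured by the lambda
            BlockingCollection<double[]> ch = channels[w];
            tasks[w] = Task.Run( () => ch.Add( new double[] { id, 2.0 * id, 3.0 * id } ) );
        }

        double[] sum = SumFromChannels( channels );
        Task.WaitAll( tasks );
        Console.WriteLine( string.Join( " ", sum ) );
    }
}
```

Receiving in a fixed channel order keeps the floating-point summation order independent of which worker finishes first, which matters when results from different runs are compared.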
As the implementation of the .NET platform we used Mono 1.1.8.2 (http://www.mono-project.com/Downloads).
All tests were run on the SKIF cluster (16 SMP nodes with Athlon MP 1800+ processors) and the K-1000 cluster (up to 288 nodes with Opteron processors).
The test results that allow the MC# implementation to be compared with the original ALCMD code are shown below.
Cluster “SKIF FIRST-BORN M” (Fast Ethernet, LAM/MPI with GNU FORTRAN for original ALCMD)
Parameters of ALCMD:
1) Number of atoms: 384000,
2) Number of iterations: 100.

| Number of processors | MC# (min:sec) | Original ALCMD (min:sec) |
|---|---|---|
| 1 | 15:27.2 | 7:41.8 |
| 2 | 8:17.5 | 4:41.0 |
| 4 | 4:42.6 | 2:40.0 |
| 8 | 2:33.3 | 1:25.0 |
| 16 | 1:28.2 | 0:48.1 |
| 24 | 1:23.9 | 0:46.6 |
| 32 | 1:38.1 | 0:40.4 |
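
As a worked reading of the 384000-atom measurements (derived here from the reported times, not stated in the original report), the speedup S(p) = T(1)/T(p) and the parallel efficiency E(p) = S(p)/p at 16 processors, with times converted to seconds, come out roughly as follows:

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}

\text{MC\#:}\quad S(16) = \frac{927.2\ \text{s}}{88.2\ \text{s}} \approx 10.5,
\qquad E(16) \approx \frac{10.5}{16} \approx 0.66

\text{Original ALCMD:}\quad S(16) = \frac{461.8\ \text{s}}{48.1\ \text{s}} \approx 9.6,
\qquad E(16) \approx \frac{9.6}{16} \approx 0.60
```

By the same figures, the MC# version is about 2.0 times slower than the original on one processor (927.2 s against 461.8 s) and about 1.8 times slower on 16 processors (88.2 s against 48.1 s).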

Cluster “SKIF K-
Parameters of ALCMD:
1) Number of atoms: 768000,
2) Number of iterations: 100.

| Number of processors | MC# (min:sec) | Original ALCMD (min:sec) |
|---|---|---|
| 1 | 19:07.2 | 8:56.6 |
| 2 | 9:56.2 | 4:26.4 |
| 4 | 5:09.9 | 2:16.2 |
| 8 | 2:37.4 | 1:09.2 |
| 16 | 1:29.8 | 0:35.6 |
| 32 | 2:12.7 | 0:58.6 |
| 48 | 3:22.8 | 0:54.5 |
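
For the 768000-atom measurements, both versions reach their shortest time at 16 processors and slow down beyond that point. A worked check of the speedup S(p) = T(1)/T(p), computed here from the reported times (converted to seconds) and not given in the original report:

```latex
S(p) = \frac{T(1)}{T(p)}

\text{MC\#:}\quad S(16) = \frac{1147.2\ \text{s}}{89.8\ \text{s}} \approx 12.8,
\qquad S(48) = \frac{1147.2\ \text{s}}{202.8\ \text{s}} \approx 5.7

\text{Original ALCMD:}\quad S(16) = \frac{536.6\ \text{s}}{35.6\ \text{s}} \approx 15.1,
\qquad S(48) = \frac{536.6\ \text{s}}{54.5\ \text{s}} \approx 9.8
```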

To compare the results computed by the MC# version and the original version, we fixed the initial positions of the atoms in both versions (without this change, every atom receives a random displacement from its ideal position before the calculation starts).
In most cases the MC# version gives results similar to those of the original version. Below is sample output of both versions for a system of 384000 atoms.
Original ALCMD:
    mpirun -np 8 mdlj
    CMD started on 8 processors
    384000 atoms total
    8 nodes arranged in 4 columns and 2 rows
    Using 1204 grids arranged 43 by 28 ( 2.30A by 2.36A)
    0.00% load imbalance before Redistribute( 48000 48000)
    Potential cutoff = 2.500000 2.875000
    Done calculating passing sequence.
    Node 0 has 48000 atoms and passes 4 times 4 3 4 1
    Starting the main loop.
    86.000 nbors/atom
    --> passing 11281 of 48000 atoms (max stray =
    Step   Atoms    Temp  Kinet En  Poten En  Total En   Change
      10  384000     0.0    0.0000   -7.1189   -7.1189   0.0000  0.0000E+00
      20  384000     0.0    0.0000   -7.8888   -7.8888  -0.7699  0.0000E+00
      30  384000     9.7    0.0013   -7.8911   -7.8899  -0.7709  0.0000E+00
      40  384000   172.4    0.0223   -7.8454   -7.8231  -0.7042  0.0000E+00
      50  384000   258.6    0.0334   -7.8582   -7.8248  -0.7059  0.0000E+00
      60  384000   186.8    0.0241   -7.8639   -7.8397  -0.7208  0.0000E+00
      70  384000   201.3    0.0260   -7.8657   -7.8397  -0.7208  0.0000E+00
      80  384000   236.0    0.0305   -7.8702   -7.8397  -0.7208  0.0000E+00
      90  384000   219.5    0.0284   -7.8681   -7.8398  -0.7208  0.0000E+00
     100  384000   193.4    0.0250   -7.8647   -7.8398  -0.7208  0.0000E+00
MC# ALCMD:
    *** ALCMD MC# version 1.81 20-July-2005 started ***
    Output file name = md.out
    384000 atoms total
    8 nodes arranged in 4 columns and 2 rows
    Using 1204 grids arranged 43 by 28 ( 2.30232558139535A by 2.35714285714286A )
    0.00% load imbalance before Redistribute ( 48000 48000 )
    Potential cutoff = 2.5 2.875
    Done calculating passing sequence
    Node 0 has 48000 atoms and passes 4 times
    86 nborns/atom
    --> passing 11281 of 48000 atoms ( max stray =0 A)
    Step   Atoms    Temp  Kinet En  Poten En  Total En   Change
      10  384000    0.00      0.00     -7.12     -7.12     0.00  0
      20  384000    0.00      0.00     -7.89     -7.89    -0.77  0
      30  384000    3.18      0.00     -7.89     -7.89    -0.77  0
      40  384000  207.09      0.03     -7.85     -7.82    -0.70  0
      50  384000  175.36      0.02     -7.85     -7.82    -0.70  0
      60  384000  219.33      0.03     -7.87     -7.84    -0.72  0
      70  384000  185.83      0.02     -7.87     -7.84    -0.72  0
      80  384000  211.46      0.03     -7.87     -7.84    -0.72  0
      90  384000  203.29      0.03     -7.87     -7.84    -0.72  0
     100  384000  234.07      0.03     -7.87     -7.84    -0.72  0
The discrepancies between the two outputs can be explained by the high complexity of the original FORTRAN source code, whose precise translation requires considerable effort and time. This work is not yet finished.
In particular, the “Redistribute” mode, in which processors are assigned different numbers of atoms and a special procedure must redistribute them, has not yet been implemented correctly in the MC# version.
It is worth noting that the original FORTRAN code of ALCMD has correctness problems of its own. With a fixed distribution of unit cells in the X, Y and Z directions, the results differ depending on the number of processors used. The developers of the original ALCMD package confirmed this effect.
While testing the program on different numbers of nodes, we identified several problems whose resolution could make the MC# implementation considerably more efficient.
Most of these problems are related to the Mono implementation of the .NET platform. The most important ones are listed below.
• Efficiency of nested loops.
The current version of Mono uses only 3 registers to implement nested loops, whereas Microsoft .NET uses more; in particular, it additionally uses the registers eax, ecx and edx.
The effect shows in the following test program:
    using System;

    public class NestedLoops
    {
        public static void Main()
        {
            int i = 0;
            int t1 = Environment.TickCount;   // milliseconds since system start
            for ( int i1 = 0; i1 < 1000; i1++ )
                for ( int i2 = 0; i2 < 1000; i2++ )
                    for ( int i3 = 0; i3 < 1000; i3++ )
                    {
                        i++;
                    }
            int t2 = Environment.TickCount;
            Console.WriteLine( ( t2 - t1) + " ticks" );
        }
    }
Under Mono on an Athlon MP 2000+ processor, the execution time (in ticks, that is, the milliseconds reported by Environment.TickCount) is:

    $ mono NestedLoops.exe
    2049 ticks

Under Microsoft .NET on an Athlon XP 1600+ processor, the execution time is:

    >NestedLoops
    1453 ticks

For more information, see http://bugzilla.ximian.com/show_bug.cgi?id=75451
• General improvement of pre-compilation and optimization of the code.
At present, the code emission and pre-compilation facilities that Mono provides for C# programs are not as efficient as those of Microsoft .NET.
The following program illustrates this:
    using System;

    public class Calc_Model
    {
        public static void Main()
        {
            int mij = 0;
            int i, i3;
            int j, j3;
            double[] x0 = new double [3] { 1.0, 2.0, 3.0 };
            DateTime dt1 = DateTime.Now;
            for ( int ii = 0; ii < 700000; ii++ )
            {
                mij++;
                i = ii / 10;
                i3 = 3 * i;
                for ( int jj = 0; jj < 10; jj++ )
                {
                    j = jj / 10;
                    j3 = 3 * j;
                    x0 [0] = x0 [0] - x0 [1];
                    x0 [1] = x0 [1] + x0 [2];
                    x0 [2] = x0 [2] - x0 [0];
                    if ( x0 [0] >= 0.5 ) x0 [0] = x0 [0] - x0 [1];
                    if ( x0 [1] >= 0.5 ) x0 [1] = x0 [1] - x0 [2];
                    if ( x0 [2] >= 0.5 ) x0 [2] = x0 [2] - x0 [0];
                }
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine( "Elapsed time = " + dt2.Subtract(dt1).TotalSeconds );
        }
    }
Under Mono on an Athlon MP 2000+ processor, the execution time of this program (in seconds) is:

    $ mono Calc_Model.exe
    Elapsed time = 0.408045

Under Microsoft .NET on an Athlon XP 1600+ processor, the execution time is:

    >Calc_Model
    Elapsed time = 0.109375

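
The two tests above measure elapsed time with Environment.TickCount and DateTime.Now, whose resolution is limited by the system timer (the .NET figure 0.109375 s is exactly 7/64 s, which is consistent with a clock step of 1/64 s). A possible variant, offered only as a sketch and not taken from the report, times the same triple loop with System.Diagnostics.Stopwatch (available since .NET 2.0) after a warm-up call. TimingSketch and RunNestedLoops are names invented for this example.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs the triple loop of the NestedLoops test with n iterations
    // per level and returns the counter (n * n * n).
    public static int RunNestedLoops( int n )
    {
        int i = 0;
        for ( int i1 = 0; i1 < n; i1++ )
            for ( int i2 = 0; i2 < n; i2++ )
                for ( int i3 = 0; i3 < n; i3++ )
                    i++;
        return i;
    }

    public static void Main()
    {
        // Warm-up call so that JIT compilation of the method is not timed.
        RunNestedLoops( 100 );

        Stopwatch sw = Stopwatch.StartNew();
        int count = RunNestedLoops( 1000 );
        sw.Stop();

        Console.WriteLine( count + " iterations, "
                           + sw.Elapsed.TotalMilliseconds + " ms" );
    }
}
```

Returning the counter from the timed method gives the loop an observable result, which makes it less likely that an optimizing compiler removes the loop altogether.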
• Channel implementation improvements
The channel mechanisms that MC# uses for message transfer need further improvement, both in their theoretical basis and in the efficiency of their implementation.
We are currently developing a new version of the language, MC# 2.0, which will contain more sophisticated mechanisms for the interaction of parallel processes, based on channels and handlers of channel messages. We also plan to extend the set of channel methods with asynchronous and buffered methods and with operations on groups of channels, such as SendToAll.
Our experience of porting a computational application as large as ALCMD to the MC# programming language has shown that:
[1] MC# Official Site – http://u.pereslavl.ru/~vadim/MCSharp/
[2] ALCMD Official Site – http://cmp.ameslab.gov/cmp/CMP_Theory/
[3] Cω Official Site – http://research.microsoft.com/comega/
[4] Mono Official Site – http://www.mono-project.com/