Welcome to the homepage of MC# language!
MC# Project
Home page 
MC# language 
Documentation 
Publications 
 Code examples 
FAQ 

 
Downloads
MC# Programming 
System
 

 
Related links
Parallel C# 
Polyphonic C# 
SKIF Project 

 
Contacts
 Contacts 


Mono powered

Microsoft .Net powered


 
   Parallel programming on MC#: an experience of implementation of ALCMD molecular dynamics code

Parallel programming in MC#: an experience of implementation

of ALCMD molecular dynamics code

 

Technical Report

 

Program Systems Institute of

Russian Academy of Sciences,

Pereslavl-Zalessky,

July 9, 2005

 

 

 

Multiprocessor

C#

.net Platform based

 

Table of contents

 

Parallel programming on MC#: an experience of implementation of ALCMD molecular dynamics code  1

1. Introduction.. 2

2. The basics of MC# language. 2

3. The computational molecular dynamics package ALCMD (Ames Lab Classical Molecular Dynamics) 3

4. Implementation of ALCMD package on MC# language. 3

4.1. Movable methods as virtual processors. 4

4.2. Bi-directional channels as a core of the interactions between movable methods (virtual processors). 4

4.3. Blackboard – method of implementation of collective operations in MC# language. 5

5. Comparison of ALCMD implementation on MC# language with original ALCMD code. 7

5.1. Tests results. 8

5.1. Issues of result comparability. 9

5.2. Further work on improving Mono implementation of .Net platform and MC# programming system.. 11

6. Conclusions. 13

References. 13

 

 

1. Introduction

 

This report contains the details about the implementation of the ALCMD package (Ames Lab Classical Molecular Dynamics, http://cmp.ameslab.gov/cmp/CMP_Theory/cmd/alcmd_source.html) on the parallel programming language MC# (http://u.pereslavl.ru/~vadim/MCSharp)

 

The main purpose of this project is to estimate whether it is possible to implement complex computational programs (that are usually implemented on C and FORTRAN languages with the help of some communication libraries like MPI) on byte-code languages like C# in managed runtime environments.

 

In this project we used a Mono platform – a free implementation of .Net platform for Unix-like platforms (http://www.mono-project.com/).

 

2. The basics of MC# language

 

MC# language is the universal high-level programming language based on C# language the main purpose of which is the calculations on cluster architectures.

 

Specific features of the MC# language consist in the transferring of asynchronous parallel programming model of Polyphonic C# (Benton, Cardelli, Fournet, Microsoft Research Laboratory, Cambridge, UK) to distributed case.

 

Asynchronous methods of Polyphonic C# (nowadays known as Cw) in MC# language can be planned for execution on the remote machine (for example it could be a node of the cluster) and are known in MC# as movable methods.

 

Interaction between movable methods that are executed on different nodes can be realized through the channels – a special syntactic class in MC# language. For synchronization purposes MC# uses the bounds that are analogue of the chords of Polyphonic C#.

 

Though the channels essentially are one-way entities (same as in join-calculus) in MC# there were introduced so called “bi-directional channels” that exist in the system as a separate class BDChannel. Using this kind of channels you can send and receive values at the same time from the same channel.

 

More information about MC# language, including articles and examples of implemented applications is available on the site of the MC# project: http://u.pereslavl.ru/~vadim/MCSharp/

 

3. The computational molecular dynamics package ALCMD (Ames Lab Classical Molecular Dynamics)

 

The computational molecular dynamics package ALCMD is intended for modeling and carrying out of numerical experiments on studying and revealing of properties of complex many-partial systems with various types of intermolecular interaction.

This package was developed at Condensed Matter Theory group at Ames Laboratory (Indiana State University) and nowadays is not supported anymore.

 

In this report we describe the implementation of the ALCMD package on MC# language that supports atoms interaction based on Leonard-Jones potentials.

 

In original ALCMD package there was implemented an algorithm based on spatial decomposition. This algorithm allows one to use it on a wide range of massive parallel computers.

 

As a communication library ALCMD uses MP_Lite library that can be used over any implementation of MPI.

 

It worth to note that original implementation of ALCMD package has rough errors and problems that interfere with effective scaling using the supercomputer systems with big number of processors. See section 5 in this document for more details.

 

4. Implementation of ALCMD package on MC# language

 

Original ALCMD package was implemented on FORTRAN language and from the beginning was oriented at the MPI communication library. This has imposed extra constraints on MC# implementation as MC# uses more abstract model of parallel calculations then MPI.

Actually our implementation is a porting of original MPI version of ALCMD package to the MC# language. In particular this has defined the general computing scheme of implementation based on MC# language.

 

ALCMD package was made for modeling systems that consist of big number of particles (atoms) with the help of iterated calculations of the coordinates of particles in 3D space and some integral properties (like temperature, kinetic and potential energy and etc.) of the whole system. To parallel such kind of computational problem the following scheme is used:

1)      each processor calculates his own set of particles;

2)      on each iteration processors-“neighbors” exchange the borderland particles (in terms of some spatial grid);

3)      Set of collective (global) operations performed on each iteration, like Min, Max, Sum and etc., based on the values stored on “virtual processors”.

 

To solve each of these tasks specific features of MC# language can be used.

 

4.1. Movable methods as virtual processors

 

Unlike MPI implementation where on each working processor (node) own copy of the main program is launched in programs that were written in MC# language will be applied dynamic creation of parallel processes.

More specifically, application on MC# language is launched on the main node of the cluster. During the calculation session this application usually launches methods that can be scheduled for execution to other processors (nodes). These methods in turn can launch new movable methods. MC# Runtime system is responsible for balancing and distributing the movable methods across the cluster.

 

In the implementation of ALCMD package on the MC# language main program ControlALCMD launches a lot of movable methods called Mdsd. These methods can be considered as virtual processors where each movable method computes its own set of particles:

 

 

Pic.1. Interaction of the main program with movable methods.

 

Main program passes the following parameters to movable methods:

 

1)      The number of current movable method (“virtual processor”)

2)      Total number of launched movable methods

3)      Name of the file with input data

4)      Channels which will be used for interactions between main program and movable methods.

During the calculations launched movable methods exchange the data with main program as well as with other movable methods.

 


4.2. Bi-directional channels as a core of the interactions between movable methods (virtual processors)

 

As according to the algorithm of ALCMD it is unknown which pairs of movable-methods (“virtual processors”) will interact among themselves, so in MC# implementation we suspect that every movable method can interact with each other.

Mechanism of messages selection that is implemented in MPI with the help of tags can be modeled if we create group of bi-directional channels for each of movable methods. Each channel in such group can receive messages from particular movable method (“virtual processor”):

 

Pic. 2. Input channels of Virtual processor

 

When creating of such groups “virtual processors” can exchange these channels among themselves (in current implementation this is done through the main program at the beginning of the calculations). That’s how on each virtual processor there appear a group of output channels. Using this output channels processors can send data to other virtual processors:

 

 

Pic. 3. Communication structure of Virtual processor

 

These groups of channels are used for exchanging the borderland particles between neighbors.

Also in addition to such techniques in ALCMD program there do exist some collective (global) operations that do require values for aggregation from all virtual processors.

 

4.3. Blackboard – method of implementation of collective operations in MC# language

 

For implementation of collective operations like Min, Max, Sum and etc. for scalar values and arrays in MC# implementation of ALCMD package we use the idea of “Blackboard” – global structure that is accessible for writing and reading for each virtual processor.

For effective work with such structure we implemented its distributed variant where each virtual processor has its own local copy of BlackBoard. When some value is written to such structure it is copied to corresponding structures of all other copies of BlackBoard structure through the sending of channel messages. That is why the execution of global operation can be generalized to the reading of local values that were received by virtual processor and making some calculations with these values.

Here is the implementation of BlackBoard class that implements the Sum operation for numbers of type Double:

 

using System;

 

public class Blackboard  

{

 private BDChannel[] inChannels;

 private BDChannel[] outChannels;

 private int size;

 

 public Blackboard( int size )   

 {

  inChannels = new BDChannel [size];

  for ( int i = 0; i < size; i++ )

   inChannels [i] = new BDChannel();

  this.size   = size; 

 }

 

 public void setOutChannels( BDChannel[] outChannels )  

 {

  this.outChannels = outChannels;

 }

 

 public void setValue( object o )  

 {

  for ( int i = 0; i < size; i++ ) outChannels [i].Send( o );

 }

 

 public double[] getdArraySum() {

  double[] darray, result;

  result = (double[]) inChannels [0].Receive() [0];

  for ( int i = 1; i < size; i++ )  

  {

   darray = (double[]) inChannels [ i ].Receive() [0];

   for ( int j = 0; j < darray.Length; j++ )

    result [ j ] += darray [j];

  }

  return result;

 }

 

 public BDChannel[] getInChannels() {

  return inChannels;

 }

}

 

5. Comparison of ALCMD implementation on MC# language with original ALCMD code

 

As an implementation of .Net platform we used Mono 1.1.8.2

 (http://www.mono-project.com/Downloads).

All tests were made on cluster SKIF (16 SMP nodes with processors Athlon MP 1800+) and cluster K-1000 (maximum number of nodes – 288 with Opteron processors).

The results of tests that can be used to compare MC# implementation and original ALCMD code are shown below.

5.1. Tests results

 

Cluster “SKIF FIRST-BORN M” (Fast Ethernet, LAM/MPI with GNU FORTRAN for original ALCMD)

 

Parameters of ALCMD:

1) Number of atoms – 384000,

2) Number of iterations – 100.

 

Number of processors

MC#

Original ALCMD

1

15:27.2

7:41.8

2

8:17.5

4:41.0

4

4:42.6

2:40.0

8

2:33.3

1:25.0

16

1:28.2

0:48.1

24

1:23.9

0:46.6

32

1:38.1

0:40.4

 

 

 

Cluster “SKIF K-1000” (Gigabit Ethernet, LAM/MPI with PGI FORTRAN for original ALCMD)

 

Parameters of ALCMD:

1) Number of atoms – 768000,

2) Number of iterations – 100.

 

Number of processors

MC#

Original ALCMD

1

19:07.2

8:56.6

2

9:56.2

4:26.4

4

5:09.9

2:16.2

8

2:37.4

1:09.2

16

1:29.8

0:35.6

32

2:12.7

0:58.6

48

3:22.8

0:54.5

 

 

5.1. Issues of result comparability

 

In order to compare the results of calculations of MC# version and original version we fixed in both versions the original positions of atoms (without these changes all atoms before the calculations get the random movements from ideal positions).

In most cases MC#-version gives the similar results when comparing with original version. Here is the example of output for the system that contains 384000 atoms for both versions.

 

Original ALCMD:

 

mpirun -np 8 mdlj

 CMD started on    8 processors

 384000 atoms total

 8 nodes arranged in   4 columns and   2 rows

 Using   1204 grids arranged  43 by  28  ( 2.30A by  2.36A)

 0.00% load imbalance before Redistribute(      48000      48000)

 Potential cutoff =   2.500000  2.875000

 Done calculating passing sequence.

 

 Node    0 has   48000 atoms and passes    4 times

    4    3    4    1

 Starting the main loop.

 86.000 nbors/atom --> passing   11281 of   48000 atoms (max stray = 0.000 A)

 

      Step     Atoms     Temp     Kinet En   Poten En   Total En     Change

        10    384000        0.0     0.0000    -7.1189    -7.1189     0.0000 0.0000E+00

        20    384000        0.0     0.0000    -7.8888    -7.8888    -0.7699 0.0000E+00

        30    384000        9.7     0.0013    -7.8911    -7.8899    -0.7709 0.0000E+00

        40    384000      172.4     0.0223    -7.8454    -7.8231    -0.7042 0.0000E+00

        50    384000      258.6     0.0334    -7.8582    -7.8248    -0.7059 0.0000E+00

        60    384000      186.8     0.0241    -7.8639    -7.8397    -0.7208 0.0000E+00

        70    384000      201.3     0.0260    -7.8657    -7.8397    -0.7208 0.0000E+00

        80    384000      236.0     0.0305    -7.8702    -7.8397    -0.7208 0.0000E+00

        90    384000      219.5     0.0284    -7.8681    -7.8398    -0.7208 0.0000E+00

       100    384000      193.4     0.0250    -7.8647    -7.8398    -0.7208 0.0000E+00

 

MC# ALCMD:

 

***   ALCMD MC# version 1.81 20-July-2005 started   ***

 

Output file name = md.out

384000 atoms total

8 nodes arranged in 4 columns and 2 rows

Using 1204 grids arranged 43 by 28 ( 2.30232558139535A by 2.35714285714286A )

 

0.00% load imbalance before Redistribute ( 48000   48000 )

Potential cutoff = 2.5   2.875

Done calculating passing sequence

 

Node 0 has 48000 atoms and passes 4 times

 

86 nborns/atom --> passing 11281 of 48000 atoms ( max stray =0 A)

  Step     Atoms      Temp     Kinet En      Poten En     Total En     Change

   10     384000     0.00         0.00               -7.12            -7.12         0.00   0

   20     384000     0.00         0.00               -7.89            -7.89        -0.77   0

   30     384000     3.18         0.00               -7.89            -7.89        -0.77   0

   40     384000     207.09     0.03               -7.85            -7.82        -0.70   0

   50     384000     175.36     0.02               -7.85            -7.82        -0.70   0

   60     384000     219.33     0.03               -7.87            -7.84         -0.72   0

   70     384000     185.83     0.02               -7.87            -7.84         -0.72   0

   80     384000     211.46     0.03               -7.87            -7.84         -0.72   0

   90     384000     203.29     0.03               -7.87            -7.84         -0.72   0

   100     384000   234.07     0.03               -7.87            -7.84         -0.72   0

 

Such discrepancy of results can be explained by the high computational difficulty of original FORTRAN source code, the precise translation of which requires considerable efforts and more time. This work is still not finished.

In particular “Redistribute” mode hasn’t been implemented correctly yet in MC# version – when different number of atoms is assigned to processors and a special procedure must be executed for redistribution of these atoms.

It worth to note that problem with correctness of the source code also exists in original FORTRAN code of ALCMD. When using some fixed distribution of unit cells in X, Y, Z- directions the results are different depending on the numbers of processors that are in use. This effect was confirmed by the developers of the original ALCMD package.

5.2. Further work on improving Mono implementation of .Net platform and MC# programming system

 

While we were testing the program on different number of nodes we have recognized some problems with fixing of which the efficiency of MC# implementation could be much higher.

Most of these problems are connected with Mono implementation of .Net platform. Below we will mention the most important of these problems.

 

þ     Efficiency of nested loops.

 

Current version of Mono uses only 3 registers for implementation of nested loops while Microsoft .NET uses more registers, in particular it uses additional registers eax, ecx, edx.

As a result, for the following test program:

 

using System;

 

public class NestedLoops

{

 public static void Main( string[] args )

 {

  int i = 0; 

  int t1 = Environment.TickCount;

  for ( int i1 = 0; i1 < 1000; i1++ )

   for ( int i2 = 0; i2 < 1000; i2++ ) 

    for ( int i3 = 0; i3 < 1000; i3++ )

    {

     i++;  

    }

  int t2 = Environment.TickCount;

  Console.WriteLine( ( t2 - t1) + " ticks" );

 }

}

 

Execution time (in ticks) under Mono on processor Athlon MP 2000+ is the following:

 

$ mono NestedLoops.exe

2049 ticks

 

While the execution time under Microsoft .NET on processor Athlon XP 1600+ is the following:

 

>NestedLoops

1453 ticks

 

For more information, see http://bugzilla.ximian.com/show_bug.cgi?id=75451

 

þ     General improvement of pre-compiling and optimization of the code.

 

Nowadays emitting and pre-compilation facilities for C# programs of Mono system are not as efficient as implemented in Microsoft.NET.

Here is the program that illustrates this:

 

 

using System;

 

public class Calc_Model

{

 public static void Main( string[] args )

 {

  int mij = 0;

  int i, i3;

  int j, j3;

  double[] x0 = new double [3] { 1.0, 2.0, 3.0 };

  DateTime dt1 = DateTime.Now;

  for ( int ii = 0; ii < 700000; ii++ )

  {

   mij++;

   i = ii / 10;

   i3 = 3 * i;

  

   for ( int jj = 0; jj < 10; jj++ )

   {

    j  = jj / 10;

    j3 = 3 * j;

    x0 [0] = x0 [0] - x0 [1];

    x0 [1] = x0 [1] + x0 [2];

    x0 [2] = x0 [2] - x0 [0];

    if ( x0 [0] >= 0.5 )

     x0 [0] = x0 [0] - x0 [1];

    if ( x0 [1] >= 0.5 )

     x0 [1] = x0 [1] - x0 [2];

    if ( x0 [2] >= 0.5 )

     x0 [2] = x0 [2] - x0 [0];

   }

  }

  DateTime dt2 = DateTime.Now;

  Console.WriteLine( "Elapsed time = " + dt2.Subtract(dt1).TotalSeconds );

 }

}

 

Execution time of this program under Mono on processor Athlon MP 2000+ is the following:

 

$ mono Calc_Model.exe

Elapsed time = 0.408045

 

While the execution time under Microsoft .NET on processor Athlon XP 1600+ is the following:

 

>Calc_Model

Elapsed time = 0.109375

 

 

þ     Channel implementation improvements

 

Mechanisms of channels of MC# language that are being used for transferring messages require the further improving both from theoretical point of view and from the point of view of more effective implementation.

Nowadays we are developing new modification of the language – MC# 2.0 that will contain more sophisticated mechanisms of interactions of parallel processes based on the channels and handlers of channel messages. Also we are going to enlarge the set of methods for working with channels by adding the methods that will use asynchronous model and buffered methods as well as operations for groups of channels like SendToAll and etc.

 

6. Conclusions

 

Our experience of porting such big computational application to MC# programming language as ALCMD has shown that:

  1. This language dramatically increases the speed of developing and debugging parallel programs;
  2. Using this language one can get a sufficient efficiency in case when all mentioned above defects will be fixed in the current implementation of .Net for OS Linux (Mono system).

 

References

 

[1] MC# Official Site – http://u.pereslavl.ru/~vadim/MCSharp/  

[2] ALCMD Official Site – http://cmp.ameslab.gov/cmp/CMP_Theory/

[3] Cw Official Site – http://research.microsoft.com/comega/ 

[4] Mono Official Site – http://www.mono-project.com/

 

 

Multiprocessor

C#

.net Platform based

 

 


Âåñü Ïåðåñëàâëü