Friday, March 14, 2008

the name's cluster, embarrassed cluster

google! we finally got mpd running without any issues whatsoever on 2 comps!
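for the record, the ring bring-up boils down to something like this (a rough sketch.. the hostnames and secretword are placeholders, not our actual ones):

    # mpd.hosts lists one node per line (placeholder names)
    echo -e "nodeA\nnodeB" > ~/mpd.hosts
    # every node needs ~/.mpd.conf with the same secretword, mode 600
    echo "secretword=fortytwo" > ~/.mpd.conf
    chmod 600 ~/.mpd.conf
    mpdboot -n 2 -f ~/mpd.hosts    # start the ring on both comps
    mpdtrace -l                    # list the nodes in the ring
    mpiexec -n 10 hostname         # quick sanity check across the ring
    mpdallexit                     # tear the ring down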

experimentation continued till about 2 yesterday, when we configured a new linux user (and group) on each node and (re)installed mpi on each.. we also configured nfs (the network file system) so that we could write code on one computer and simply execute it from a common folder

the issues we sorted out yesterday were:
  1. the pwd problem: earlier, when we used mpi with different usernames on each node, an mpiexec -n 10 pwd returned the correct location only on the computer the command was executed on and defaulted to '/' on the other nodes. we figured this was because the absolute home path was different on each node.. hence we added a linux user account with the same name and home directory location (/home/cluster/) on each node, so that any relative path given in an mpiexec command mapped to the same absolute path everywhere..
  2. the nfs problem: after setting /home/cluster/nfs to be shared from pv's computer (as server) and allowing the other nodes access to it via /etc/exports and /etc/hosts.allow, we tried mounting this share on the other nodes (my comp only for the time being). however, read-write permissions proved a bit elusive at the start. in fact, on mounting, the owner and group of the shared directory ended up as nobody, which prevented recursive write permissions to the folder. after a bit of googling, we found that the user id (uid) and group id (gid) of the folder's owner and group must be the same on the server and all nfs clients. to sort this out we deleted the user cluster and created it (yes, again) on each node with uid=1042 and gid=1042 (yes, yes, we like 42 very much, thank you), then remounted the nfs folder.. and there! we had owner=cluster and group=cluster. then we reinstalled mpi under cluster on each node, reset ssh-keygen, etc etc, and tried mpiexec -l -n 10 mpich2-1.0.6p1/examples/cpi. all sorted, thanks to 42 and a lot of simple brainwork (sketch of the fix after this list)
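for anyone retracing the uid/gid fix from item 2, it boils down to something like this (the server name is a placeholder; run the user bits on every node, server included):

    # on every node: recreate the user with identical name, home, uid and gid
    sudo userdel -r cluster                        # yes, again
    sudo groupadd -g 1042 cluster
    sudo useradd -u 1042 -g 1042 -d /home/cluster -m cluster
    # on the clients: remount and confirm ownership maps correctly
    sudo mount -t nfs server:/home/cluster/nfs /home/cluster/nfs
    ls -ld /home/cluster/nfs                       # should now show cluster cluster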
this should be very simply scalable to all new nodes.. (vinayakzark, vai.sinh, kk).. the ssh problem with kk's sshd still remains to be figured out.. so we're keeping it out of the ring for the moment.. now it's a simple matter of running some custom applications on the mpi platform.. maybe we could try AMBER9 or something else that already uses MPI as a parallel computing framework. so i guess our immediate objectives are the following
  1. get_new_nodes(void)
  2. get_a_software_to_run_on_them(void)
foobar to pv: we are green to go. i repeat, we are green to go. do you copy?

Friday, March 7, 2008

Third weak week

With the problems faced earlier, we decided to start all over again. And this time we had 5 nodes (tgwtt's scribbler's proliferous in the ring, :P). The problem persists with one of them, and we blame it on the sshd on that comp. For the time being, it has been isolated from the ring.
So, after setting up ssh for MPI, the very next step was installing MPICH2 on the two nodes it wasn't already on; we got it done successfully on proliferous, whereas we'll have to wait till the next day for cluster to be ready with it.
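The ssh part itself is the usual passwordless-login routine; roughly (usernames and node names here are placeholders):

    # generate a key with an empty passphrase so mpd can log in unattended
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    # append the public key to the other node's authorized_keys
    cat ~/.ssh/id_rsa.pub | ssh user@nodeB \
        'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
    ssh user@nodeB hostname        # should now log in without a password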

failed to handshake with mpd on recvd output={}

Finally our focus shifted from the ssh-ing problem to a new one. The very first command, mpdboot, gave an error. We figured this out to be a hostname resolution problem, so we modified the /etc/hosts files on the comps we had su permissions on. And so we had to say goodbye to proliferous too for the time being. With the sshd problem not yet resolved on krishna and MPICH2 not yet installed on cluster, we were left with only two nodes.
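The fix was simply making sure every comp can resolve every other comp's hostname, with entries along these lines (the addresses here are made up) in each /etc/hosts:

    # /etc/hosts on every node (placeholder addresses)
    192.168.0.11    deepcyan
    192.168.0.12    hollow
    192.168.0.13    proliferous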
With this problem fixed, we proceeded to the next command. It all looked pretty simple until we realized we were hitting a problem on every command we gave, and we had to search exhaustively to get them solved... hmm, or are they solved!?

problem with execution of [Errno 2] No such file or directory

Next, with mpdboot working, we proceeded to give the ring something to execute. mpiexec worked well when we executed files in /bin or any other path in the $PATH of all the nodes. Where we met the next obstacle was in executing a file on some path not already in $PATH, for instance the home directory of the node user accounts! We tried to fix this as follows:
copied the file onto every node's home dir -> ran mpiexec, but... -> got the same error. Obviously this thing wasn't looking for the file where we'd expected it to. Our doubts were confirmed on giving mpiexec -n 2 pwd. This displayed / and /home/hollow (hollow being my node), indicating that on the other nodes it looks for the file in / itself!
To deal with this, we added /home/deepcyan to the $PATH of node deepcyan. This still didn't work. In hindsight we can see this as the problem of running two different programs on two different nodes with the same mpiexec, and we weren't even using the ":" syntax meant for that purpose (illustration below).
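The behaviour makes sense once you see that mpiexec hands each process a path to resolve locally; a quick illustration (a.out, master and worker are stand-ins, not our actual programs):

    mpiexec -n 2 pwd                # printed / and /home/hollow for us
    mpiexec -n 2 /bin/hostname      # anything on every node's $PATH is fine
    mpiexec -n 2 ./a.out            # a relative path is resolved per node,
                                    # so it breaks wherever the file isn't there
    # the ":" syntax (which we weren't using) is for running genuinely
    # different programs in one go:
    mpiexec -n 1 ./master : -n 1 ./worker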
While searching for a solution we came across NFS and how it can be used for this purpose. That's when it struck us. We had to run the same program on different nodes in parallel, right! Thanks to the links [1] and [2], we set up and configured an NFS server and a single client for the time being (rough sketch below).
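The basic setup from [1] and [2] came down to something like this (paths, subnet and server name are placeholders for ours):

    # on the server: one line in /etc/exports for the shared folder
    #   /home/shared 192.168.0.0/255.255.255.0(rw,sync)
    # allow the clients in /etc/hosts.allow, e.g.
    #   portmap mountd nfsd statd lockd rquotad : 192.168.0.
    sudo exportfs -ra               # (re)export the folders
    # on the client: create the mount point and mount the share
    sudo mkdir -p /home/shared
    sudo mount -t nfs server:/home/shared /home/shared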

mpiexec: failed to obtain sock from manager

Hmm, this is what we are currently facing, most probably some NFS configuration problem. It's like a video game: you need to fight a monster to go on to the next level and fight a bigger one. Right now the game's saved at this level. I do hope we complete all the levels someday.
