Friday, March 7, 2008

Third weak week

With the problems faced earlier, we decided to start all over again. And this time we had 5 nodes (tgwtt's, scribbler's and proliferous in the ring, :P). The problem persists on one of them and we blame it on the sshd on that comp. For the time being, it has been isolated from the ring.
So, after setting up ssh for MPI, the very next step was installing MPICH2 on the two nodes it wasn't already on; we did it successfully on proliferous, whereas we'll have to wait till the next day for cluster to be ready with it.
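For the record, the ssh part boils down to passwordless logins between the nodes, and the MPICH2 install is the usual configure/make routine. The commands below are a sketch from memory; the node name is one of ours, but the version number and install prefix are just illustrative.

ssh-keygen -t rsa                                                          # on each node, empty passphrase
cat ~/.ssh/id_rsa.pub | ssh proliferous 'cat >> ~/.ssh/authorized_keys'    # repeat for every other node in the ring

tar xzf mpich2-1.0.6.tar.gz && cd mpich2-1.0.6                             # version illustrative
./configure --prefix=/usr/local/mpich2
make && make install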

failed to handshake with mpd on recvd output={}

Finally our focus shifted from the ssh-ing problem to a new one. The very first command, mpdboot, gave an error. We figured this out to be a hostname resolution problem, and so we modified the /etc/hosts files on the comps we had su permissions on. And so we had to say goodbye to proliferous too for the time being. With the sshd problem not yet resolved on krishna and MPICH2 not yet installed on cluster, we were now left with only two nodes.
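Concretely, the fix is to make sure every hostname in the ring maps to its LAN address, and not to 127.0.0.1, in /etc/hosts on every machine. The IPs below are made up; the node names are the real ones:

192.168.1.2    hollow
192.168.1.3    deepcyan

After that, with the hostnames listed one per line in mpd.hosts and a ~/.mpd.conf holding a secretword line on each node, booting and checking the ring goes roughly like this:

echo 'secretword=clustering' > ~/.mpd.conf && chmod 600 ~/.mpd.conf    # the secretword itself is arbitrary
mpdboot -n 2 -f mpd.hosts
mpdtrace                                                               # should list both nodes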
With this problem fixed we proceeded to the next command. All this looked pretty simple until we encountered problems with every command we gave. It took us a while to realize this was happening, and we had to search exhaustively to get the problems solved... hmm, or are they solved!?

problem with execution of [Errno 2] No such file or directory

Next, with mpdboot working, we proceeded to give the ring something to execute. mpiexec worked well when we executed files in /bin or any other path in the $PATH of all the nodes. Where we met the next obstacle was in executing a file on some path not already in $PATH, for instance the home directory of the node user accounts! We tried to fix this as follows:
copied the file onto every node's home dir -> ran mpiexec, but... -> got the same error. Obviously mpiexec wasn't looking for the file where we'd expected it would. Our doubts were confirmed on giving mpiexec -n 2 pwd. This displayed / and /home/hollow (hollow being my node), indicating that on the other nodes it looks for the file in / itself!
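To make that concrete, the contrast looked something like this; a.out is just a stand-in name and the output is reconstructed rather than a verbatim paste:

mpiexec -n 2 hostname        # fine, hostname lives in /bin on every node
mpiexec -n 2 ./a.out         # fails: the remote process resolves the path against its own cwd, which was /
mpiexec -n 2 pwd             # printed /home/hollow on my node and / on the other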
To deal with this, we added /home/deepcyan to the $PATH of node deepcyan. This still didn't work. We can now identify this problem as one where we want to run two different programs on two different nodes using the same mpiexec, and we weren't even using ":" for that purpose.
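Had we gone down that road, mpiexec's ":" syntax is meant exactly for naming a different executable, or the same one at a different absolute path, for each set of processes. Roughly, with hello as a made-up binary:

mpiexec -n 1 /home/hollow/hello : -n 1 /home/deepcyan/hello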
In searching for a solution we came across NFS and how it can be used for this purpose. That's when it struck us: we had to run the same program on different nodes in parallel, right! Thanks to the links [1] and [2] we set up and configured an NFS server and a single client for the time being.
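The NFS side of it is short: export a directory from one node and mount it at the same path on the others, so that a single absolute path is valid everywhere in the ring. A sketch follows, with the subnet and export options being illustrative rather than our exact config.

On the server (hollow), add the export to /etc/exports and re-export:

/home/hollow    192.168.1.0/24(rw,sync)

exportfs -ra

On the client (deepcyan):

mkdir -p /home/hollow
mount -t nfs hollow:/home/hollow /home/hollow

With that in place, one copy of the program under /home/hollow should be visible to every node at the same path.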

mpiexec: failed to obtain sock from manager

Hmm, this is what we are currently facing, most probably some NFS configuration problem. It's like a video game: you need to fight a monster to get to the next level, where you fight a bigger one. Right now the game's saved at this level. I do hope we complete all the levels someday.
