It's been a snail's pace thanks to random errors and the debugging involved. We have adopted a modular approach, making separate files for all the functions involved and keeping related ones together. The cost: the 3-D Haar wavelet transform (3dhwt) alone runs at around 0.25x of real time, i.e. about 15 seconds for a 100-frame chunk. That's at least better than the 38 minutes it once took! I seriously hope this gives good compression over the usual stuff, for all the time it's taking... and this was just for order-1 wavelets! Well, we do have some optimizations already in mind, apart from the init-2 boot (no X, no gdm) that we have already come down to.
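Purely as an illustration of what an order-1 step involves (this is a sketch, not our actual code), the 1-D Haar transform boils down to pairwise averages and differences; the 3-D version simply repeats it along the rows, columns, and time axis of a frame chunk.

#include <stddef.h>

/* single-level (order-1) 1-D Haar step: n must be even; writes n/2 averages
   followed by n/2 differences into out -- an illustrative sketch only */
void haar_step_1d(const float *in, float *out, size_t n)
{
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {
        out[i]        = (in[2 * i] + in[2 * i + 1]) / 2.0f; /* approximation */
        out[half + i] = (in[2 * i] - in[2 * i + 1]) / 2.0f; /* detail */
    }
}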
We have a few major hurdles to overcome, though. One is the time factor, this time meaning the deadline for the BTP, after which it is as good as dead; a second, staring us in the face, is the disappearance of nodes from the cluster owing to the lab upgrade that's going on... No, we still won't be given the new ThinkSmarts to work on, and no, we (at least I) don't intend to appeal either, for it'll mean wasting another weekend over installation of the components.
Tuesday, November 4, 2008
Friday, October 31, 2008
Installing OpenCV with ffmpeg
Making it work took some time, effort, and repetition, owing to the 4 nodes being individually separate entities. Here's what we and the ffmpeg and opencv tarballs had to go through (a quick sanity-check snippet follows the steps):
- untar ffmpeg
- ./configure --enable-shared --enable-swscale --enable-gpl
- make
- sudo make install
- untar opencv
- sudo apt-get install patch (if not already installed)
- patch otherlibs/highgui/cvcap_ffmpeg.cpp ../nfs/opencv-1.0.0-cvcapffmpegundefinedsymbols.patch
- (the patch is item #4 on the page http://www.rainsoft.de/projects/ffmpeg_opencv.html)
- su
cd /usr/local/include/
mkdir ffmpeg
cp libavcodec/* ffmpeg/
cp libavdevice/* ffmpeg/
cp libavformat/* ffmpeg/
cp libavutil/* ffmpeg/
cp libswscale/* ffmpeg/
exit
- change FFMPEGLIBS="-lavcodec -lavformat" to FFMPEGLIBS="-lavcodec -lavformat -lswscale" in opencv's configure
- ./configure --enable-shared
- make
- sudo make install
- sudo ldconfig
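To check whether the build actually picked up ffmpeg, a tiny C program along these lines should be able to open and decode a video through the highgui capture interface (this is just a sketch; test.avi is a placeholder filename):

#include <stdio.h>
#include "cv.h"
#include "highgui.h"

int main(void)
{
    /* cvCaptureFromFile decodes through ffmpeg for most container formats */
    CvCapture *capture = cvCaptureFromFile("test.avi");
    if (capture == NULL) {
        printf("capture failed - the ffmpeg backend is probably not linked in\n");
        return 1;
    }
    int frames = 0;
    while (cvQueryFrame(capture) != NULL)  /* pull frames until the end */
        frames++;
    printf("decoded %d frames\n", frames);
    cvReleaseCapture(&capture);
    return 0;
}

If it links against highgui and prints a non-zero frame count, the patch and the swscale flag did their job.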
Wednesday, June 4, 2008
From the Horse's mouth: a copy of the MPI reference manual
I'm diving head first into MPI and related parallel programming since:
1. it's fun and I always wanted to do it!
2. I need to do it for my summer project at IISc, which will otherwise take months on the single/dual-processor machines.
I was surprised when Hollow told me that you did not have the full manual for MPI (MIT press).
I am a bit aware of some of the problems you ran into while setting up the Beowulf cluster. I will try to get solutions for these since there are people here who are proficient at this sort of stuff.
In any case, I will be adding some notes which I feel are important from the reference manual under this tag. Also, I'll post some bioinformatics problems that can be done when the cluster is up and running on all its feet.
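Just to set the scene for those notes, here is the sort of minimal point-to-point exchange MPI is built around (a sketch of my own, not an excerpt from the manual):

#include <mpi.h>
#include <stdio.h>

/* run with at least 2 processes, e.g. mpiexec -n 2 ./a.out */
int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* send one int from rank 0 to rank 1, tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}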
Friday, March 14, 2008
the name's cluster, embarrassed cluster
google! we finally got mpd running without any issues whatsoever on 2 comps.. !
experimentation continued till about 2 yesterday, when we configured a new linux user (and group) on each node.. and (re)installed mpi on each.. we also configured nfs (network file system) so that we could write code on one computer and simply execute it from a common folder
the issues we sorted out yesterday were:
- the pwd problem: earlier, when we used mpi with different usernames on each node, an mpiexec -n 10 pwd returned the correct location only on the computer the command was executed on and defaulted to '/' on the other nodes. we figured this was due to the absolute location being different on each node.. hence we added a linux user account with the same name and home directory location on each node (/home/cluster/), so that all relative paths given in any mpiexec command mapped to the same absolute path on each node..
- the nfs problem: after setting /home/cluster/nfs to be shared from pv's computer (as server) and allowing other nodes access to this folder via /etc/exports and /etc/hosts.allow, we tried mounting this share on the other nodes (my comp only for the time being). however, read-write permissions seemed a bit elusive at the start. in fact, on mounting, the owner and group of the mounted shared directory were assigned to up and nobody, which prevented recursive write permissions to the folder. after a bit of googling, we found that the user id (uid) and group id (gid) of the owner and group of the shared folder should be the same on the server and all nfs clients. to sort this out.. we deleted the user cluster and created (yes, again) cluster on each node with uid=1042 and gid=1042 (yes, yes, we like 42 very much, thank you). then we remounted the nfs folder.. and there!.. we had owner=cluster and group=cluster. then we reinstalled mpi on the cluster @ each node.. reset ssh-keygen, etc. etc., and tried mpiexec -l -n 10 mpich2-1.0.6p1/examples/cpi. all sorted, thanks to 42 and a lot of simple brainwork (a tiny verification sketch follows the next-steps list below)
- get_new_nodes(void)
- get_a_software_to_run_on_them(void)
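For the record, here is a sketch (not the exact program we ran) of the kind of check that makes the pwd/uid situation visible across the ring: every rank prints its hostname, working directory, and uid, and with the common cluster user in place the directory and uid should come out identical on every node.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
    int rank, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];
    char cwd[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &namelen);

    if (getcwd(cwd, sizeof(cwd)) == NULL)
        cwd[0] = '\0';

    /* with the common 'cluster' user, cwd and uid should match on every node */
    printf("rank %d on %s: cwd=%s uid=%d\n", rank, host, cwd, (int)getuid());

    MPI_Finalize();
    return 0;
}

Compile with mpicc and launch it the same way as cpi, e.g. mpiexec -l -n 10 ./pathcheck (the name pathcheck is ours, just for illustration).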
Friday, March 7, 2008
Third weak week
Given the problems faced earlier, we decided to start all over again. And this time we had 5 nodes (tgwtt's scribbler's proliferous in the ring, :P). The problem persists with one of them, and we blame it on the sshd on that comp. For the time being, it has been isolated from the ring.
So, after setting up ssh for MPI, the very next step was installing MPICH2 on the two nodes it wasn't already on; we did it successfully on proliferous, whereas we'll have to wait till the next day for cluster to be ready with it.
failed to handshake with mpd on recvd output={}
Finally our focus shifted from the ssh-ing problem to a new one: the very first command, mpdboot, gave an error. We figured this out to be a hostname resolution problem, so we modified the /etc/hosts files on the comps we had su permissions on. And so we had to say goodbye to proliferous too for the time being. With the sshd problem not yet resolved on krishna and MPICH2 not yet installed on cluster, we were now left with only two nodes.
With this problem fixed we proceeded to the next command. All this looked pretty simple until we encountered problems in every command we gave. It was a late realization that this was happening, and we had to search exhaustively to get the problems solved... hmm, or are they solved!?
problem with execution of [Errno 2] No such file or directory
Next, with mpdboot working, we proceeded to give the ring something to execute. mpiexec worked well when we executed files in /bin or any other path in $PATH of all the nodes. Where we met the next obstacle was in executing a file on some path not already in $PATH, for instance the home directory of the node user accounts! We tried to fix this as follows:
copied the file onto every node's home dir -> ran mpiexec, but... -> got the same error. Obviously this thing wasn't looking for the file where we'd expected it would. Our doubts were confirmed on giving mpiexec -n 2 pwd: this displayed / and /home/hollow (hollow being my node), indicating that on the other nodes it looks for the file in / itself!
To deal with this, we added /home/deepcyan to $PATH on node deepcyan. This still didn't work. We can now identify this as a problem of wanting to run two different programs on two different nodes using the same mpiexec, and we weren't even using ":" (mpiexec's separator for exactly that) for our purpose.
In searching for a solution we came across NFS and how it can be used for this purpose. That's when it struck us: we had to run the same program on different nodes in parallel, right! Thanks to the links [1] and [2], we set up and configured an NFS server and a single client for the time being.
mpiexec: failed to obtain sock from manager
Hmm, this is what we are currently facing, most probably some NFS configuration problem. It's like a video game: you need to fight a monster to get to the next level and fight a bigger one. Right now the game's saved at this level. I do hope we complete all the levels someday.
Wednesday, February 20, 2008
so we want to do a beowulf?
ya right.. considering the extremely boring notion of running a simulation for a week and then finding out that the simulation parameters were wrong in the first place.. and then having to re-run it again and again.. (we're iterative learners, but you know that, right?). i've been through that before and don't intend to go through it again..
hence the beowulf.. now time for some terminology clarification..
- beowulf, as you must have guessed, is not beowulf the movie. seriously.. no one can do a movie.. a group of linux/unix machines running the same code is more like it.. something like a 400-core processor.. hah! we beat core2duo big time.. anyways.. most supercomputers are something like large beowulfs..
- next comes dear MPI, the message passing interface: this dude-ic C/C++ library lets the machines we talked about communicate and run the code we talked about without over- or under-doing it (a minimal example follows this list).
- SSH: the backbone.. MPI executes commands through secure shell access
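To make the MPI bullet concrete, here's a bare-bones sketch (nothing more) of what "running the same code" on every machine looks like: each process finds out its own rank and the total number of processes in the ring.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* join the parallel job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am i? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there? */
    MPI_Get_processor_name(name, &namelen); /* which machine am i on? */

    printf("hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}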
as for what we plan to do with this monster of a cluster, we haven't a final idea.. what i'd propose is some kind of dna simulation for a start, since i'm already familiar with the software and procedures.. other things that can be done would be doing the mersenne thing (http://www.mersenne.org/), as suggested by vinayakzark, who shall be generously contributing to the cluster soon. more ideas are awaited..
Saturday, February 16, 2008
Sharing our sharing experience
MPI was successfully installed on 3 comps today (thanks to Krishna for giving his comp for this project). We wrote the same password into the .mpd.conf files on the 3 nodes (TMI, I know). We changed the ssh and firewall settings too. Running the client-server pairs on two computers worked successfully. However, we ran into problems trying to form a ring by running mpdboot over ssh to the 2 other nodes from the same comp. Right now we attribute this to the ssh sessions requiring passwords. After checking out the links [1] and [2], we hope to get the thing fixed tomorrow using passwordless (silent) logins.