
Monday, December 13, 2021

The riveting tale of "why it doesn't work yet"

This semester, we have been working on running a software pipeline—a series of programs that each rely on the output of the previous program—which was written to analyze videos of ants moving in an artificial tree and track the movements of the ants. The main characters in this story are me, my partner, the pipeline, and the computer that the code runs on, whose name is Purves. 

Chapter 1: Permission to Execute?

Purves is an HMC Bio department computer shared across multiple groups of people, so it makes sense that not everyone can access every file. After all, some random person doing Bio research should not be able to edit the code for someone else’s CS research, and a student doing homework for a class shouldn’t be able to see someone else’s solution to the same problem.

To see what the permissions for each file in the current directory are, type ls -l in a command terminal. ls means "list," and -l is an optional flag that asks for the long listing format, which includes the permissions. The permissions can be different for the user who owns the file, other users in the file's group, and everyone else. For example, you might see something like this:


1 % ls -l
total 4
drwxrwxrwx 2 elucasfoley students 4096 Oct 12 10:10 foo
-rwx------ 1 elucasfoley students    0 Dec  3 09:25 myprogram.py
-rw-r--r-- 1 elucasfoley students    0 Nov 30 12:34 README.txt
(base) 9:33 [elucasfoley@purves:~/Desktop/test]

[1] A sample terminal interaction

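The permissions are the first column. Its first character gives the file type (d for a directory, - for a regular file), and the remaining nine characters are three groups of three: one for the owner, one for the group, and one for everyone else, in that order. The available permissions are read (denoted r, which is the ability to look at the contents of a file), write (w, which is editing capability), and execute (x, which is to run the file). A dash indicates that such an action is prohibited.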


[2] A guide to file permissions

This indicates that foo, a directory, can be read, written to, or executed by anybody (for a directory, "execute" means being allowed to enter it). I can read, write, and execute myprogram.py, but nobody else can, and README.txt can be read by anyone but only edited by me.
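To make the column layout concrete, here is a minimal Python sketch that splits one of these permission strings into its parts. The function name describe_permissions is made up for this post, and the sketch ignores special bits such as setuid or the sticky bit.

```python
def describe_permissions(mode_string):
    """Split an ls -l permission string such as '-rwx------' into its parts."""
    if len(mode_string) != 10:
        raise ValueError("expected 10 characters, e.g. '-rw-r--r--'")
    kinds = {"-": "file", "d": "directory", "l": "symbolic link"}
    names = {"r": "read", "w": "write", "x": "execute"}
    result = {"type": kinds.get(mode_string[0], "other")}
    # Characters 1-3 belong to the user, 4-6 to the group, 7-9 to others.
    for who, start in (("user", 1), ("group", 4), ("other", 7)):
        triplet = mode_string[start:start + 3]
        # Dashes (and any special-bit letters) are simply skipped.
        result[who] = [names[c] for c in triplet if c in names]
    return result


# The three entries from the listing above.
for listing in ("drwxrwxrwx", "-rwx------", "-rw-r--r--"):
    print(listing, describe_permissions(listing))
```

For README.txt, for example, this reports read and write for the user and read only for the group and for others.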

By default, Purves makes new files accessible only to the user who created them. For us, this meant that new files (including the scripts of the project itself, which we were trying to run) created by me were not accessible to my partner, and vice versa, and files that were worked on last semester by previous students on the project were accessible to neither my partner nor me. 

The solution is a handy command, chmod. chmod, which stands for “change mode,” is given two arguments, one for how to change the mode (that is, the desired set of permissions) and another for the file whose permissions are changing. Just like with ls, you can specify flags for certain extra options. I often use something like chmod g+rwx myprogram.py, which means to add all permissions for everyone in the file's group (for us, the lab group), and leave the other settings alone. You can also say a+rwx to add read, write, and execute permissions to all categories, or o-w to remove write permissions from “other” (people not in the group), or u+wx to add write and execute permissions for “user” (the owner of the file), or go+w to add write permissions for group members and others, and so forth.
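If you ever need to do the same thing from inside a Python script, the standard library can do a rough equivalent of chmod g+rwx. This is only a sketch: the helper name add_group_rwx is made up here, and the demo runs on a throwaway temporary file rather than on anything in the real project.

```python
import os
import stat
import tempfile


def add_group_rwx(path):
    """Rough equivalent of `chmod g+rwx path`: add the group bits, keep the rest."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | stat.S_IRWXG)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a temporary file that starts out as -rwx------.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "myprogram.py")
    with open(script, "w"):
        pass
    os.chmod(script, 0o700)
    new_mode = add_group_rwx(script)
    # On Linux this prints -rwxrwx---
    print(stat.filemode(os.stat(script).st_mode))
```

Reading the current mode first and OR-ing in the new bits is what makes this behave like g+rwx (add to what is there) rather than replacing the whole mode.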

Fancy flag alert! This chmod thing is useful, but it can get tedious if you have to change the modes of many files (which I found myself needing to do sometimes). To edit the permissions of all files in the current location on Purves, you can say * instead of specifying the name of the file, e.g. chmod g+rwx *. The * is a wildcard that the shell expands to every non-hidden name in the current directory. And, to recursively change the permissions of everything inside those directories as well, you can use the flag -R, like chmod -R g+rwx * (the flag conventionally goes before the mode, although GNU chmod on Linux also accepts it at the end, as in chmod g+rwx * -R). In my example above, this would add all group permissions to all the files, including ones stored in the foo directory, or in directories in the foo directory, which is really useful!
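For the curious, the recursive version can be sketched in Python too, by walking the directory tree. The helper name add_group_rwx_recursive is made up, the demo builds its own small temporary tree, and, unlike chmod with *, this version also changes the top directory itself and does not skip hidden files. It does not treat symbolic links specially.

```python
import os
import stat
import tempfile


def add_group_rwx_recursive(top):
    """Add group rwx to `top` and everything beneath it; return how many paths changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        # dirpath visits every directory once (including top); filenames are its files.
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, mode | stat.S_IRWXG)
            changed += 1
    return changed


with tempfile.TemporaryDirectory() as tmp:
    # Build tmp/README.txt and tmp/foo/a/nested.txt, all private to the owner.
    os.makedirs(os.path.join(tmp, "foo", "a"))
    deep = os.path.join(tmp, "foo", "a", "nested.txt")
    for path in (os.path.join(tmp, "README.txt"), deep):
        with open(path, "w"):
            pass
        os.chmod(path, 0o600)
    changed = add_group_rwx_recursive(tmp)
    deep_mode = stat.S_IMODE(os.stat(deep).st_mode)
    print(changed, oct(deep_mode))
```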

At this point, my partner and I were able to look at all the code, and the pipeline had permission to access, create, and edit all its relevant files. 

Chapter 2: Where o Where?

As stated before, the end goal of this adventure is to analyze videos and track the movements of ants in them. So, we needed to tell the pipeline where to find the starting input videos. Since the project is shared among a large group of people, it makes the most sense generally for the ant videos to be stored in some central location. And since the pipeline creates a bunch of intermediate files and outputs, it makes the most sense for those generated files to be stored in the same directory as the code. 

To find the inputs and intermediates while running the pipeline, we use Snakemake. Snakemake is a workflow tool: you describe the pipeline as a chain of “rules,” each with expected inputs and desired outputs. To figure out where the files are found, what Snakemake does, at least in our current system, is start from the output we ask for, find the rule that produces a file with that name, and then check which inputs that rule requires (in our rules, a file with the same name but a different file extension). It seems convoluted, but this actually saves Purves from redoing work when some intermediates have already been created. The issue is that our Snakemake setup expects the code, inputs, intermediates, and outputs all to be in the same directory (or in appropriate input, intermediate, and output subdirectories of the current directory).
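The "work backward from the output" idea can be sketched in a few lines of plain Python. To be clear, this is not Snakemake's code or our actual rules: the RULES table, the file extensions, and the plan function are all invented for illustration, and the sketch only checks whether files exist, whereas the real tool considers more than that.

```python
import tempfile
from pathlib import Path

# Hypothetical rules: each output extension maps to the input extension it needs.
RULES = {".csv": ".crop", ".crop": ".mp4"}


def plan(target, workdir):
    """Return the files that would have to be built, in order, to get `target`."""
    target_path = workdir / target
    if target_path.exists():
        return []  # already there, so nothing to redo
    needed_ext = RULES.get(target_path.suffix)
    if needed_ext is None:
        raise FileNotFoundError(f"no rule produces {target} and it does not exist")
    # Same name, different extension: that is the input this rule needs.
    source = target_path.with_suffix(needed_ext).name
    return plan(source, workdir) + [target]


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "ants.mp4").touch()       # only the raw video exists
    first_plan = plan("ants.csv", workdir)
    (workdir / "ants.crop").touch()      # pretend the intermediate was made earlier
    second_plan = plan("ants.csv", workdir)
    print(first_plan, second_plan)
```

With only the video present, both the intermediate and the final output are planned; once the intermediate exists, only the final step remains, which is the work-saving behavior described above.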

Ordinarily, to move up out of the current directory into its parent, the command cd .. is used (cd stands for change directory). For example, if the folder c is contained in the folder b which is contained in the folder a, we could do

(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo]
3 % cd a
(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo/a]
4 % cd b
(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo/a/b]
5 % cd c
(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo/a/b/c]
6 % cd ..
(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo/a/b]
7 % cd ..
(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo/a]
8 % cd b/c
(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo/a/b/c]
9 % cd ../..
(base) 9:35 [elucasfoley@purves:~/Desktop/test/foo]


[3] A sample terminal interaction

Navigating the directories in this way through the terminal works fine, but in our setup, when Snakemake sees a path including .. in one of its rules, it doesn’t know where to look. And not being able to locate the inputs, or even know what inputs to look for, makes it literally impossible to run the code.

Our simple workaround here was to copy a few videos over into the same directory as the pipeline itself. Then, when the program goes looking for the files, it finds them in the expected subdirectory, and exhibits the expected behavior. This isn’t ideal, since the files are very large—copying everything over would be a waste of space and time—but for testing purposes, until we get the program to function satisfactorily, this is sufficient. 
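We did the copying by hand, but the same workaround could be scripted. The sketch below is an assumption-laden illustration: the function name copy_sample_videos, the .mp4 pattern, and the directory names are all made up, and the demo uses empty stand-in files in a temporary folder instead of real ant videos.

```python
import shutil
import tempfile
from pathlib import Path


def copy_sample_videos(source_dir, dest_dir, limit=3, pattern="*.mp4"):
    """Copy up to `limit` videos into dest_dir, skipping any already copied."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for video in sorted(source_dir.glob(pattern))[:limit]:
        target = dest_dir / video.name
        if not target.exists():
            shutil.copy2(video, target)
            copied.append(target.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp) / "shared_videos"         # stand-in for the central location
    local = Path(tmp) / "pipeline" / "input"     # stand-in for the pipeline's input folder
    shared.mkdir()
    for i in range(5):
        (shared / f"ant_{i}.mp4").touch()
    first_copy = copy_sample_videos(shared, local)
    second_copy = copy_sample_videos(shared, local)  # nothing left to copy
    print(first_copy, second_copy)
```

Capping the number of files keeps the test set small, which matters when each real video is very large.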

Finally, we had our directory set up so that it looked how it needed to. But the code still did not execute...

Chapter 3: So much depends

Unfortunately for us, the pipeline is much more than just the code and the input files. It also has lots of dependencies: other Python modules/plugins which provide extra functionality that the original authors of the code, genius students of past semesters, were too lazy to code themselves. Every time we tried to run the code, a different module would be marked as missing, despite the fact that I had supposedly downloaded all the dependencies, according to the documentation. So, install that dependency, run the code again, see that another dependency was missing, install that, rinse and repeat.

This sounds easy, but some of the dependencies (in particular OpenCV, the primary module used for computer vision/looking at pictures with a computer) are quite large, and needed time to download. Furthermore, some modules were not installed correctly, or only worked in certain versions of Python, so the task became a race around the Google and the Stack Exchanges to figure out why and fix it.
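In hindsight, one way to shorten the rinse-and-repeat loop would be to check for every expected module in one pass before running anything. Here is a minimal sketch using only the standard library; the helper name find_missing and the list of modules are assumptions for illustration (cv2 is the name OpenCV is imported under), not the pipeline's real dependency list.

```python
import importlib.util
import sys


def find_missing(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Illustrative list only; the real pipeline's dependencies may differ.
expected = ["json", "cv2", "snakemake"]

print("Python version:", sys.version_info.major, sys.version_info.minor)
print("Missing modules:", find_missing(expected))
```

This only tells you whether a module can be found, not whether it is the right version or was installed correctly, so it would not have saved us from every problem.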

Eventually, though, with some help from our ancestors—that is, students who worked on the project in past semesters—we did manage to get everything installed properly. Now, the code should run, right?

Chapter 4: Blank Bounding Boxes, oh Boy!

Well, yes and no. Indeed, the code runs without throwing any errors, at last. We have permissions to access and edit all the requisite files, the program can find all the inputs that it uses, and the code manages to call upon all its dependencies. But its output is incorrect. Instead of detecting useful regions of interest (the forks in the path of the tree) and cropping the video to them, like this:

[4] A properly cropped region of interest  

most of our cropped regions of interest look like this: 

[5] An incorrectly detected region of interest

It’s likely that there is an issue with one of the dependencies or versions. The mystery is still afoot...

Chapter 5: Documentation and Docker?

Obviously, improving the documentation would help us, and future students, in the quest to understand how the code works and how to run it. Heading into the last few weeks of the semester, we intend to update the README file and other general documents with everything we have learned, so that the pipeline is easier to set up and run successfully.

Another way to simplify setup, or at least prevent setup confusion, could be to use Docker. Docker packages an environment, dependencies and all, into an image that runs as a container. This means that, once a Docker image is built, anyone on the team could use it with very little effort. Furthermore, the pipeline itself could eventually be included in the image, making it possible to run the program without downloading it separately. Going forward, it could be a good idea to use Docker for all our projects in the Bee Lab.

Further Reading

Media Credits

[1-3] Figures created by the author

[4] Computer generated output, courtesy of Catherine Wu

[5] Computer generated output

