
Tuesday, December 7, 2021

Snake in a Knapsack

  At the core of a data analysis program is the successful partitioning of its code into distinct steps. If we had only one massive program, it would be difficult to debug, since any error could shut down the entire pipeline. That's where workflow management systems come in: they ensure that the many lower-level programs run correctly with respect to one another, and they give us a flexible way to restart the pipeline from different points in the overall process. In the Ant Research lab, we use Snakemake! Snakemake is a workflow management system designed to make data analysis reproducible and scalable.

    One of Snakemake's biggest boons to computational analysis is its scalability. Pipelines written with Snakemake can run both on smaller hardware like laptops and on large-scale compute clusters. It's important to be able to write our programs without a specific hardware configuration in mind, but how does Snakemake optimize across such large differences between systems?

    At the core of Snakemake, from a language standpoint, is a set of "rules." A rule in Snakemake (at its most basic) is a structure describing how one program in the pipeline should run. It details what the input files should look like, where the output files should end up, and what dependencies the program has (what other files must exist before it can run, like the input). The order in which the various rules run can be very important for efficiency: we don't want the entire pipeline waiting on one slow program, so independent processes should run in parallel. A minimal sketch of what rules look like follows below.
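To make that structure concrete, here is a minimal sketch of a Snakefile with two chained rules. The file names and the commands are purely illustrative (they are not from our actual pipeline); the point is that the second rule's input is the first rule's output, which is how Snakemake learns the dependency between them.

# Hypothetical Snakefile sketch: two rules chained by their files.
# "plot_measurements" depends on "sort_measurements" because its input
# is the other rule's output.

rule sort_measurements:
    input:
        "data/{sample}.raw.csv"
    output:
        "results/{sample}.sorted.csv"
    shell:
        "sort {input} > {output}"

rule plot_measurements:
    input:
        "results/{sample}.sorted.csv"
    output:
        "results/{sample}.plot.png"
    script:
        "scripts/plot.py"

Asking Snakemake for results/antA.plot.png would make it run sort_measurements for sample antA first and then plot_measurements, filling in the {sample} wildcard along the way.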

    Upon invocation, Snakemake builds a directed graph of dependencies based on the rules in the Snakefile. It does not run any programs at this stage; it only looks at inputs, outputs, and dependencies. In the end, it has an (upside-down) tree that looks like this:





Figure 1: Example dependency tree that could be created by Snakemake for a possible image-processing pipeline. Each block is a rule/job that requires computation.


    The pipeline can recognize which steps have already been run successfully: if the correct output file is already present, Snakemake skips the rule that constructs it and moves on to the next block in the tree. That means an error partway through the pipeline won't require all the previously successful processes to be rerun, which reduces the computational work and time needed on subsequent runs of the pipeline.

    The key insight to optimizing the pipeline above is that we can run the individual branches in parallel without violating the flow/dependencies of the programs, which increases the efficiency of running the pipeline. But while we know the necessary ordering of the jobs, there is still the problem of choosing which jobs to run when. If we can properly allocate resources to our parallel processes, we can minimize idle resources.

    The main computational limit when running this tree of jobs is the number of CPU cores available to the process, and this restriction exists on the largest and smallest hardware alike. A CPU is built of multiple cores attached to a single socket on the computer's motherboard. Within each core there can be multiple "logical" processors, called threads. A thread is not physically another processor, but it can execute work independently within the core, so threads can operate roughly in parallel. Normally, switching a core over to a new job is slow, since the new job's data has to be fetched, often through I/O, which is very slow to access. This slow job switch is called "context switching" and is something we want to avoid. Because threads live inside the CPU core, we can keep multiple threads on "standby" so they run in parallel with other jobs without requiring a context switch.

    According to one of the authors of Snakemake, Johannes Köster, "Since individual jobs can use multiple threads themselves, Snakemake can be instructed to solve a 0/1-knapsack problem to optimize the usage of CPUs, given a threshold of available cores." That is, if we can optimally assign threads to the CPU cores we have (prioritizing processing bottlenecks based on the Snakemake tree), we can significantly increase the number of computations run in parallel. If the allocation and prioritization of processes can be solved, then the tree tells us exactly what resources we need and which jobs can run in parallel threads; the required context switching is reduced and the pipeline is sped up.
But what is a "0/1-knapsack problem"? And how does it help here?

The 0/1-Knapsack Problem

    The 0/1 knapsack problem is a "filling the knapsack" problem. We have a knapsack that can carry up to a certain weight without breaking, and we have many items (let's say jewels), each with a weight and a value. The "0/1" in the name refers to the fact that we cannot break the jewels up: each item goes entirely into the knapsack or not at all, hence the zero-or-one (0/1) property. The problem itself is how to pack the maximum value into the knapsack without exceeding its weight capacity.



  
 
 
Figure 2: Visual aid for the jewel and knapsack metaphor. The values are listed on the jewels, with red being worth 60, blue 100, and green 120.


    In the case above we want to pack the most value we can without exceeding the weight capacity. With only three jewels, we can simply check every possible combination: the 30- and 20-weight jewels (green and blue) are the most valuable pair that fits in the bag, and since we can't fit all three, they are the best choice.




Figure 3: Example showing 4 possible combinations of jewels and their resulting value. The solution to the puzzle is one green and one blue jewel.
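
As a sanity check on the figure, here is a small Python sketch that brute-forces the combinations. The jewel values (60, 100, 120) and the weights of 20 and 30 come from the example above; the red jewel's weight of 10 and the capacity of 50 are assumptions chosen to match the figure.

from itertools import combinations

# Hypothetical numbers matching the jewel figure:
# (name, weight, value) -- the red weight and the capacity are assumed.
jewels = [("red", 10, 60), ("blue", 20, 100), ("green", 30, 120)]
capacity = 50

best_value, best_combo = 0, ()
# Try every subset of jewels (2^3 = 8 possibilities).
for r in range(len(jewels) + 1):
    for combo in combinations(jewels, r):
        weight = sum(w for _, w, _ in combo)
        value = sum(v for _, _, v in combo)
        if weight <= capacity and value > best_value:
            best_value, best_combo = value, combo

print(best_value, [name for name, _, _ in best_combo])
# Prints: 220 ['blue', 'green']

Checking every subset works fine with three jewels, but the number of subsets doubles with each jewel we add, which is why smarter approaches matter as the problem grows.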


    The problem gets harder as you add more jewels, and we want to check all the jewel combinations while avoiding redundancy (checking the same combination twice). As an example, we could solve it recursively: for each jewel, compare the value of including it (plus the solution to the smaller knapsack problem over the remaining jewels and remaining capacity) against the value of leaving it out. Whichever choice gives the highest value without exceeding the weight limit is the solution. However, the more jewels there are, the more times we recursively call the function, and since the same subproblem gets recomputed over and over (the green + red combination would be evaluated at least twice), the naive recursion is very redundant. Other solutions seek to reduce the number of those recalculations. There are many methods of solving the knapsack problem, which go beyond the scope of this blog post (see GeeksforGeeks for a writeup of various solutions).
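Here is a rough sketch of that recursion in Python, with a memoization cache added so that each (jewel index, remaining capacity) subproblem is computed only once. This is the textbook approach from the GeeksforGeeks tutorial, not Snakemake's own scheduler code, and the numbers are the same assumed ones as above.

from functools import lru_cache

weights = (10, 20, 30)   # assumed weights, matching the figure
values = (60, 100, 120)  # red, blue, green

@lru_cache(maxsize=None)
def best(i, capacity):
    """Best value using jewels i, i+1, ... given the remaining capacity."""
    if i == len(weights) or capacity == 0:
        return 0
    # Option 1: skip jewel i entirely (the "0" choice).
    skip = best(i + 1, capacity)
    # Option 2: take jewel i whole (the "1" choice), if it still fits.
    take = 0
    if weights[i] <= capacity:
        take = values[i] + best(i + 1, capacity - weights[i])
    return max(skip, take)

print(best(0, 50))  # 220: the blue and green jewels

Without the cache, the same subproblems would be recomputed many times; with it, the work grows roughly with the number of jewels times the capacity, which is the dynamic-programming solution described in the tutorial.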


    As for Snakemake, the weight capacity of the knapsack is the number of cores available for Snakemake to use. Each gem is a job (an instance of a rule), and its weight corresponds to its computational requirements, such as the number of threads it needs. Lastly, the value is how important it is for the job to be completed so the pipeline can continue up the tree (i.e. how many rules depend on its completion).
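In a Snakefile, that weight and value map onto per-rule directives. The rule below is hypothetical (the alignment command is just an illustration), but threads: and priority: are real Snakemake directives: threads is essentially the job's weight, and priority marks jobs the scheduler should prefer. The knapsack's capacity is the core budget given on the command line, for example by running snakemake with --cores 8.

# Hypothetical rule sketch showing a job's "weight" and "value".
rule align_sequences:
    input:
        "data/{sample}.fasta"
    output:
        "results/{sample}.aligned.fasta"
    threads: 8        # weight: how many cores this job wants at once
    priority: 50      # value: prefer this job over lower-priority ones
    shell:
        "mafft --thread {threads} {input} > {output}"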

    The knapsack problem's method of solution does not change with the capacity of the knapsack; that is why Snakemake can scale from single-core personal computers to massive multi-core servers. Given the high computational demand of bioinformatics (such as DNA multiple sequence alignment or analyzing long videos), the small amount of computation required to correctly allocate resources is completely worth it.
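Because only the capacity changes between machines, the same selection logic works on a laptop and on a server. Below is a deliberately naive Python sketch of that selection step (Snakemake's real scheduler is more sophisticated, and the job names and numbers here are made up): among the jobs that are currently ready, pick the subset whose total thread demand fits the core budget and whose total priority is highest.

from itertools import combinations

def pick_jobs(ready_jobs, cores):
    """Brute-force sketch: choose the highest-priority set of ready jobs
    whose combined thread demand fits within the available cores.
    ready_jobs is a list of (name, threads, priority) tuples."""
    best_set, best_priority = (), 0
    for r in range(len(ready_jobs) + 1):
        for subset in combinations(ready_jobs, r):
            threads = sum(t for _, t, _ in subset)
            priority = sum(p for _, _, p in subset)
            if threads <= cores and priority > best_priority:
                best_set, best_priority = subset, priority
    return [name for name, _, _ in best_set]

# The same logic scales just by changing the `cores` budget.
jobs = [("align", 8, 50), ("plot", 1, 10), ("sort", 2, 20), ("stats", 4, 30)]
print(pick_jobs(jobs, cores=8))   # laptop-sized budget
print(pick_jobs(jobs, cores=64))  # server-sized budget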


    Snakemake is a powerful tool for data analysis, but it is far from the only one out there. There are other options like Cromwell, Nextflow, and cwltoil (here is an in-depth comparison of them). In addition, there is a movement to standardize the language used to describe workflows, to make them more portable and scalable (cwltoil is part of that).

    Depending on your programming language of choice and your needs, there are many workflow management options. Snakemake specializes in reproducibility and in optimizing the computational resources a pipeline needs. We need workflow management systems to make our processes not only more efficient but also easier to understand and work with.



Further reading:

Sustainable data analysis with Snakemake

An in-depth scientific paper on sustainability in research data analysis.


0-1 Knapsack Problem (DP-10), GeeksforGeeks
A tutorial-style article that goes over possible solutions to the knapsack problem. Part of a larger series of computer science tutorials on how to solve these kinds of problems.


Evaluating Workflow Management Systems: A Bioinformatics Use Case

A scientific investigation into various workflow management systems, comparing mainly those used in bioinformatics.
 
 
 
Media Credits 
All images created by the author.
 
 
 
