Monday, December 5, 2022

Reproducibility! The Foundation of a Long-lasting Scientific Environment

Oxford Languages defines the verb reproduce as “create something very similar to (something else), especially in a different medium or context”. At first glance, the concept of reproducibility feels a bit odd, especially in an academic setting where we are often expected to demonstrate originality. But a deeper dive into the motivations for reproducible research quickly illustrates the significance of documenting a research project in this way.


Communication has always played a vital role in the sharing of knowledge with others and subsequently allowed for the evolution of various technological and social advancements. It clearly follows then that the effective communication of scientific ideas and experiments is critical to ensure that science can be verified and interpreted by as many people as possible.


In the modern world, with the rapid growth of social media, communicating with each other has never been easier or faster. Historically, reproducibility meant the manual validation and repeated evaluation of some experimental result. During the 20th century, we saw the birth of the internet, and with it came a rise in digital communication and data collection. Now, almost all scientific investigations and research utilize complex software and computational tools. It follows then that the art of writing a reproducible research paper in the 21st century requires the author to successfully summarize their broader findings while also presenting their data and code in an accessible, comprehensible manner. As a consequence of this requirement, using and creating open source software and open access libraries and packages has become the norm in the industry. This evolution comes hand-in-hand with a new dependence on other scientists and technologies.


The Techno-Human Tower of Scientific Research[1]

Now, as a person who has been coding in an academic setting for the past few years, I have been constantly told by teachers and professors that proper commenting, utilizing version control, and following common code conventions are of utmost importance. It wasn’t until this project, though, that I truly recognized the snowball effect that these seemingly small choices have.


For some background, this semester I have been working with the HMC Bee Lab on a computational Flower Mapping project. Our research team is working with machine learning models to determine the density of flowers, given an image of a plant. This project has been worked on by multiple students over multiple years. It has progressed from the UAV collection of raw flower data to segmented and annotated binary masks of the original images. Ultimately, this project is one part of a larger goal: to better understand the spatial distribution of flowering plants in a given landscape.
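To make "density of flowers" a little more concrete, here is a minimal sketch. It assumes density simply means the fraction of pixels in a binary mask that are marked as flower. That definition, the function name `flower_density`, and the toy mask are all assumptions for illustration, not the lab's actual pipeline or the output of our model.

```python
def flower_density(mask):
    """Return the fraction of pixels in a binary mask that are marked as flower.

    `mask` is a list of rows, where each pixel is 1 (flower) or 0 (background).
    This pixel-fraction definition is an illustrative assumption.
    """
    total_pixels = sum(len(row) for row in mask)
    if total_pixels == 0:
        raise ValueError("mask must contain at least one pixel")
    flower_pixels = sum(1 for row in mask for pixel in row if pixel)
    return flower_pixels / total_pixels


# A made-up 3x4 mask: 4 of its 12 pixels are flower pixels.
toy_mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

density = flower_density(toy_mask)
print(f"Flower density: {density:.2%}")
```

A real mask would come from an image file rather than a hand-written list, but the counting logic would stay the same.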


Evolution of a Flower Mapping Image in our Lab from 2016 to 2022[2]
Top Left: Drone-Collected Image of California Buckwheat Plants in Field; Top Right: Segmented Image with Boundary Mask around a Singular Plant; Bottom Left: Binary Mask of Plant Image with Flowers Annotated Using Computational Methods; Bottom Right: Visualization of Bounding Boxes Surrounding Flowers


My experience working in the Bee Lab this semester has been greatly enhanced by the effort and detail that past and present Bee Lab members have put into organizing and reporting their results and data. Our Google Drive has detailed notes of work from almost 7 years ago! Through the drive and repositories we have, you can see the growth and progress of the project, and it is incredibly inspiring to see how so many different minds have come together to work on different aspects of one bigger goal.


More than that, the organization of the work is an effective tool. While my partner and I were beginning our search for a suitable machine learning model for our datasets, we used the lab documents and journals of other students to see which GitHub repositories they had used and what hurdles and limitations they had to overcome. This ensured that we weren’t just foraging out into the world of computer vision blind, but were equipped with information about how the images and masks in our dataset were collected and processed, what models had been tested in the past, and some parameters that might streamline the model training pipeline. Additionally, the theses and research journals that other students had created were a treasure trove of citations to external research articles that guided us in directions we might not have considered ourselves. When we were first beginning our project, my partner and I wished to recreate some of the previous annotated images. We were able to do so using the scripts and instructions defined in the research theses of previous Bee Lab members. This allowed us to skip several steps in the coding process and feel confident that our code was performing as we expected.


By this point in the semester, my partner and I were excited and eager to begin implementing our machine learning model. We had found a specific research paper that we felt was the best starting point for our research. Using the GitHub repo provided in the paper, we began to code but quickly realized that everything was not as seamless as it had initially seemed. It was my first time working with the object detection library Detectron2, and so the unexplained hard-coded values and uncommented code blocks were difficult to comprehend. Working with this code helped me understand firsthand the difference that good commenting and repository management can make to the reproducibility of a project.


The Ultimate Crime to a Programmer[3]


As we got more familiar with the network and model, we still had to spend a lot of extra time deciphering the starting code, and we ultimately rewrote many scripts ourselves. Even now, as we move towards testing our trained model, we continuously run into bugs and coding discrepancies that take hours, and sometimes even days, to debug. Because the code was uncommented, we often felt unsure and apprehensive about some of the coding decisions made in the scripts. While the code ultimately performed the tasks the researchers had described in their paper, it could have been made more reproducible by providing example datasets and JSON files, properly documented version control, and more information about any specific choices they made.
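As a small illustration of what "more information about any specific choices" could look like, here is a minimal sketch that gathers hard-coded values into one documented configuration and saves it as JSON next to the results. The class name `TrainingConfig`, every field, and every value below are hypothetical placeholders, not settings from the paper's repository or from Detectron2.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """All tunable values in one place, each with a short note on why it was chosen.

    Every field and value here is a hypothetical placeholder for illustration.
    """

    learning_rate: float = 0.001  # placeholder: note the reason for this value here
    batch_size: int = 2           # placeholder: e.g. limited by available GPU memory
    max_iterations: int = 1000    # placeholder: e.g. where validation loss flattened
    random_seed: int = 42         # fixed so that reruns start from the same state


def config_to_json(config):
    """Serialize the config so it can be committed alongside the results."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)


def config_from_json(text):
    """Rebuild the exact config that produced a saved run."""
    return TrainingConfig(**json.loads(text))


config = TrainingConfig()
saved = config_to_json(config)
restored = config_from_json(saved)
print(saved)
```

Keeping the saved JSON under version control with the code means a future lab member can see which values produced which results, instead of hunting for unexplained numbers inside the scripts.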


These experiences have truly demonstrated the importance of reproducible coding and research to me. More than just making information accessible and understandable, reproducibility allows for a level of collaboration that surpasses almost any boundary of time or location, so we can build upon each other’s work. The power that communicating and transferring scientific information holds continues to astound me, and I now put considerable effort into ensuring that my code will be able to outlast my time here at the Bee Lab. In a year or two or even five, I hope that the next batches of Bee Lab researchers will be able to build off of the work that we are doing and trust that they can utilize our code in the most efficient and accessible way possible.





Further Reading:

Baker, Monya. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533, 452–454, 25 May 2016. https://doi.org/10.1038/533452a.


Cooper et al. “Reproducible Code: Guides to Better Science.” British Ecological Society, 2019.

https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf.


Moura, Gustavo. “The Art of Reproducible Research.” Medium, The Startup, 3 June 2020.

https://medium.com/swlh/the-art-of-reproducible-research-d4df0fb0331f.


Media credits:

[1]: This illustration is created by Scriberia with The Turing Way community. Used under a CC-BY 4.0 license. DOI: 10.5281/zenodo.3332807.

[2]: Top Left: Photo by Cassandra Burgess in HMC Bee Lab. 2016.

Top Right: Photo from Tom Fu in HMC Bee Lab. 2021.

Bottom Left: Photo by Alex Hadley, Thuy-Linh Le. Accessed via GitHub repo:

https://github.com/alexhad6/cs153-flower-mapping, 2022.

Bottom Right: Photo created by the author for HMC Bee Lab. November 2022.

[3]: Photo created by the author using Imgflip: https://imgflip.com/memegenerator/22566289/I-killed-a-man-and-you.
