Reproducibility with Python - Notes and tips
I understand reproducibility as two situations:
- When you (yes, you) run, on your own machine, code written by “someone else” and obtain exactly the same results that “someone else” obtained on their machine.
- When “someone else” runs, on their own machine, code written by you (yes, you) and obtains exactly the same results that you obtained on your machine.
How can we achieve (or get closer to) reproducibility for our Python code? At a minimum, we must ensure that “someone else” runs our code with the same Python version and the same package versions that we used.
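To make "same Python version, same package versions" concrete, here is a minimal sketch that records both for the interpreter it runs in. It uses only the standard library; the function name `snapshot_environment` is my own choice for illustration, not part of any tool.

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Return the running Python version and a {package: version} mapping."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that do not declare a name
            packages[name] = dist.version
    return {"python": platform.python_version(), "packages": packages}


if __name__ == "__main__":
    snap = snapshot_environment()
    print(f"Python {snap['python']}")
    # Print one "name==version" line per installed package
    for name in sorted(snap["packages"], key=str.lower):
        print(f"{name}=={snap['packages'][name]}")
```

Saving this output next to your results gives “someone else” a record of what you actually ran, even before you adopt a dedicated tool.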
How can we help “someone else” with this? I recommend using one of these two solutions in your projects:
- Poetry
- Conda
Both are built around the concept of a virtual environment: a “sandbox” in which you can specify, use and “freeze” the exact versions of the packages your code depends on (and, in the case of Conda, even the Python interpreter itself).
If you work only with Python libraries, all of them “pure Python”, and your own code is “pure Python”, Poetry is the way to go. If you work in data science or in academia, I suggest using Conda. With Conda you can also install other programming languages and complete software packages (for example, many bioinformatics tools such as samtools, BLAST and bowtie are available through Conda).
With Poetry, you create a file called pyproject.toml, which specifies the packages and their versions (and the supported Python version); Poetry also records the exact resolved versions in a poetry.lock file, which you should commit alongside it. With Conda, you create a file called environment.yml for the same purpose.
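To give an idea of what an environment.yml looks like, the sketch below builds the text of a minimal one from a dictionary of pins. Treat it as an illustration of the file layout only: in practice you would usually let `conda env export` write the file for you. The function name `render_environment_yml`, the environment name, the `conda-forge` channel and the example pins are all assumptions I made for the example, and package names on Conda do not always match the names on PyPI.

```python
import platform


def render_environment_yml(env_name, pinned, channel="conda-forge"):
    """Build the text of a minimal Conda environment.yml with exact pins."""
    lines = [
        f"name: {env_name}",
        "channels:",
        f"  - {channel}",
        "dependencies:",
        # Pin the Python version this script is running on
        f"  - python={platform.python_version()}",
    ]
    for package, version in sorted(pinned.items()):
        lines.append(f"  - {package}={version}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical project name and package pins, for illustration only
    print(render_environment_yml("my-analysis", {"numpy": "1.26.4", "pandas": "2.2.2"}))
```

With such a file in the repository, “someone else” can recreate the environment with `conda env create -f environment.yml`. The Poetry counterpart is running `poetry install` in a project that contains pyproject.toml and poetry.lock.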
This way, when you push your code to your preferred Git hosting service and include these files, “someone else” will be able to recreate the “environment state” you used to build and run the code.
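Once the environment has been recreated, it can be worth checking that it really matches. The following is a small sketch, assuming you keep the expected versions in a plain dictionary; the name `find_mismatches` and the example pin are hypothetical.

```python
from importlib import metadata


def find_mismatches(pinned):
    """Compare installed package versions with the expected pins.

    Returns {package: (expected, installed)} for every mismatch;
    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    pins = {"numpy": "1.26.4"}  # hypothetical pin, for illustration only
    for package, (expected, installed) in find_mismatches(pins).items():
        print(f"{package}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version. It says nothing about packages you did not list.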
IMPORTANT: when you upload code to a Git repository and make it public, write a good README.md and documentation guiding “someone else” through preparing the environment to run the code. This helps a lot in ensuring reproducibility.