The “hurting” Bioinformatics message I have seen

- January 19, 2024

Is this one:

It's 2024. Most of #bioinformatics is yet to come and the best we can offer future generations of scientists is idiosyncratic R-scripts, massive Nextflow pipelines and file formats such as FastQ? We have to do better. (Fabian Klötzl)

https://genomic.social/@kloetzl/111772016843202526

And at the end of the day… it kinda makes sense.

Why does it hurt?

Roughly speaking, R is the “default language” of bioinformatics. Even I, a pythonic headbanger, use R in bioinformatics research.
Nextflow is gaining this “defaultness” too. For those who don’t know it, it is a workflow technology that you can use to build (among other things) data processing pipelines, and thus automate processes. Bioinformatics people have been using it a lot.
“File formats such as FastQ” have been (again) the “default file formats” for years, maybe decades.

Personal interpretation (I may be wrong, please correct me in the comments section): the message is that bioinformatics is a field running on legacy/outdated technology, and it needs an urgent “update” with modern and efficient stuff.

Why does it make sense?

Because there are tools available today that can help us do better.

FastQ and other files

FastQ, Fasta, BED, GTF, GFF… these are all standard bioinformatics file formats that the whole community uses. The point is that they’re all text files, and bioinformatics is a “(really) big data science”, so storing data in text formats brings issues with file size and processing efficiency. Data engineering has been taking care of file formats in recent years, so we now have, for example, formats like HDF5 and Zarr, which can store and process data much more efficiently than plain text files. At the same time, we also have initiatives devoted to specific bioinformatics modalities, like SpatialData, which seeks to optimize data storage and processing for spatial omics datasets. So, “yes, we can”.
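To make the size argument a bit more concrete, here is a tiny sketch in plain Python. It is only a toy: the FastQ record, the helper names (`parse_fastq`, `pack_bases`, `unpack_bases`) and the 2-bits-per-base encoding are my own assumptions for illustration, and this is not how HDF5 or Zarr store data internally. It just shows how much room a text representation leaves on the table.

```python
# Toy illustration: bases stored as text vs. packed into 2 bits each.
# The record and all helper names are made up for this sketch.

FASTQ_TEXT = (
    "@read1\n"
    "ACGTACGTACGTACGT\n"
    "+\n"
    "IIIIIIIIIIIIIIII\n"
)

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def parse_fastq(text):
    """Yield (name, sequence, quality) from FastQ text, assuming 4 lines per record."""
    lines = text.strip().split("\n")
    for i in range(0, len(lines), 4):
        name = lines[i][1:]  # drop the leading "@"
        yield name, lines[i + 1], lines[i + 3]


def pack_bases(seq):
    """Pack an A/C/G/T-only sequence into bytes, 4 bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_bases(packed, length):
    """Recover the original sequence from the packed bytes."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


name, seq, qual = next(parse_fastq(FASTQ_TEXT))
packed = pack_bases(seq)
print(len(seq.encode("ascii")), "bytes as text")
print(len(packed), "bytes packed")
```

A real format also has to deal with ambiguous bases such as N, quality scores, chunking and compression, which this sketch ignores on purpose.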

Programming and Pipeline development languages

This is a really delicate topic to touch on because, at the end of the day, programming and pipeline languages and technologies are quite subjective in terms of “what fits you”, and each language will be more appropriate for solving a “specific set of problems”.

The fact is that R was originally developed by statisticians, for statisticians. Somehow it also ended up fitting bioinformaticians quite well, to the point that many of the main, default bioinformatics packages are written in R. Nevertheless, the last 20 years have given us programming languages that can be more efficient at processing data (or at least have a “nicer syntax” for some people).

Besides that, at least in bioinformatics, reproducibility and ease of use sometimes matter more than performance, and there may be languages that beat R in these aspects too. But again, programming languages are like music: pick your favorite one (and the one that has the libraries you need).

Final thoughts

There’s a lot of room for improvement in bioinformatics technologies. The main challenges are:

Defining new standards to replace old ones
Convincing the community to migrate
Performing the migration

Until then, we keep using “old technology” (which, at least, has been working quite well).


Pythonic Headbanger