
Genetic Code and Excel: An Uncanny Relationship

Life is composed, compiled, and communicated in code: the genetic code. Modern omics techniques are fundamental to our understanding of both the past and the future.

With the Human Genome Project and next-generation sequencing technology, genomic data is easily accessible. But as the amount of data to be evaluated expands beyond terabytes, the chances of error rise. Excel is an accessible spreadsheet program for analyzing and organizing data, with a wide variety of features and over 500 functions. For many graduate students, Excel is the best thing since sliced bread for navigating the data labyrinth. Unfortunately for some scientists, Excel is also known to cause unexpected chaos.

There are approximately 30,000 genes in the human genome: minuscule, winding twists of DNA, about 2 meters long and six picograms in weight in all, that make you uniquely you. Each gene is given a name and an alphanumeric code for identification.

To amuse the reader, let us look at an example: the gene SEPT1 encodes the protein Septin 1, which is required for cytokinesis and for maintaining cellular morphology. If you type SEPT1 into an Excel cell, by default it is converted to the date 1-Sep. This problem was first reported in 2004. Many studies have since reported that over 50% of the datasets examined contain some form of such errors.
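To make the SEPT1 example concrete, here is a minimal sketch of how one might flag gene symbols that resemble a month followed by a day number before they ever reach a spreadsheet. The month list, the pattern, and the function name `looks_like_date` are assumptions made for illustration; they approximate the behavior described above and are not Excel's actual parsing rules.

```python
import re

# Month names and abbreviations that a spreadsheet may read as the start of a date.
# This list is an approximation for illustration, not Excel's actual parsing rules.
_MONTH_PREFIXES = (
    "JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN",
    "JUL", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC",
)

# A month prefix, an optional hyphen, then a one- or two-digit day number.
_DATE_LIKE = re.compile(
    r"^(?:" + "|".join(_MONTH_PREFIXES) + r")-?\d{1,2}$",
    re.IGNORECASE,
)


def looks_like_date(symbol: str) -> bool:
    """Return True if the gene symbol resembles a month plus a day number."""
    return bool(_DATE_LIKE.match(symbol.strip()))


symbols = ["SEPT1", "SEPTIN1", "TP53"]
at_risk = [s for s in symbols if looks_like_date(s)]
print(at_risk)
```

A check like this only tells you which symbols deserve a second look; it errs on the side of flagging too much, which is usually the safer direction for a gene list.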

A study by Ziemann et al. pointed out the surprisingly high prevalence (about 20%) of corrupted gene symbols, including erroneous gene name conversions, in the supplementary data of published genomics papers. Furthermore, some novel genes are not yet fully annotated and carry RIKEN identifiers, for example ‘2310009E13’. These RIKEN identifiers were reported to be automatically converted to floating-point numbers (i.e., from accession ‘2310009E13’ to ‘2.31E+13’).
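Once a file has already passed through a spreadsheet, the damage shows up as values that look like dates or floating-point numbers sitting in a column of gene names. The sketch below screens a column for such values. It is an illustration under stated assumptions, not the method used by Ziemann et al.: the two patterns and the name `find_suspect_values` are hypothetical, and the sample column is made up for the example.

```python
import re

# Date-like values: a day number, a hyphen, and a month name or abbreviation.
_CONVERTED_DATE = re.compile(
    r"^\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*$",
    re.IGNORECASE,
)

# Float-like values such as "2.31E+13": a decimal mantissa and an explicit exponent.
_CONVERTED_FLOAT = re.compile(r"^\d+\.\d+E\+\d+$", re.IGNORECASE)


def find_suspect_values(column):
    """Return the entries that look like dates or floats rather than gene names."""
    suspects = []
    for value in column:
        text = value.strip()
        if _CONVERTED_DATE.match(text) or _CONVERTED_FLOAT.match(text):
            suspects.append(value)
    return suspects


# Made-up sample column: two intact symbols, one intact RIKEN identifier,
# and two values that have already been converted.
gene_column = ["TP53", "1-Sep", "SEPTIN1", "2.31E+13", "2310009E13"]
print(find_suspect_values(gene_column))
```

Note that a flagged value cannot be repaired from the file alone: once an identifier has become a date or a rounded float, the original text has to be recovered from the upstream source.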

Excel is unlikely to change this behavior for a small section of its user community, given the needs of the larger one. But the errors can snowball in genetic studies, where there are around 30,000 genes to work with. They are even more dangerous in studies that aim to understand the outcomes of disease models or clinical trials.

Various open-source tools, such as Escape Excel and Truke, help to prevent these erroneous conversions. Fortunately, the HUGO Gene Nomenclature Committee has also recognized the impact of this unseemly behavior and has renamed 27 human genes. Henceforth, SEPT1 will be known as SEPTIN1.
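For readers curious what prevention can look like in practice, here is a rough sketch of one commonly used workaround: writing at-risk values to a CSV file as text formulas (for example `="SEPT1"`), which spreadsheets generally display as plain text instead of reinterpreting. This is an assumption-laden illustration and is not a description of how Escape Excel or Truke work internally; the pattern and the helper names `protect` and `write_protected_csv` are hypothetical.

```python
import csv
import io
import re

# Rough patterns for values a spreadsheet may reinterpret (illustrative only):
# a month-like prefix plus a day number, or digits-E-digits as in RIKEN identifiers.
_AT_RISK = re.compile(
    r"^(?:(?:JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}"
    r"|\d+E\d+)$",
    re.IGNORECASE,
)


def protect(value: str) -> str:
    """Wrap an at-risk value as a text formula, e.g. SEPT1 becomes ="SEPT1"."""
    if _AT_RISK.match(value):
        return '="' + value + '"'
    return value


def write_protected_csv(rows, handle):
    """Write rows to CSV, wrapping every at-risk cell as a text formula."""
    writer = csv.writer(handle)
    for row in rows:
        writer.writerow([protect(cell) for cell in row])


# Made-up rows for the example; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
write_protected_csv([["gene", "count"], ["SEPT1", "12"], ["2310009E13", "7"]], buffer)
print(buffer.getvalue())
```

The trade-off is that the wrapped values are no longer clean for other programs: any script that reads the same file will see `="SEPT1"` and must strip the wrapper, so a file like this is best kept as a spreadsheet-only copy.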

  1. B.R. Zeeberg, J. Riss, D.W. Kane, K.J. Bussey, E. Uchio, W.M. Linehan, J.C. Barrett, J.N. Weinstein, Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics, BMC Bioinformatics. 5 (2004) 80. doi:10.1186/1471-2105-5-80.
  2. M. Ziemann, Y. Eren, A. El-Osta, Gene name errors are widespread in the scientific literature, Genome Biol. 17 (2016) 177. doi:10.1186/s13059-016-1044-7.


Dr. Jyoti Rawat is a biology researcher by profession. Her research focuses on making life-saving drugs more affordable. She is interested in seeing more women in science. She cooks, paints, and tends to plants in her free time.
