RE: Big Data and Bioinformatics: where we are on Omics technology

in StemSocial, 2 years ago

Well, this post is already 2 months old! Nice to see people talking about bioinformatics here. Storage is one of the main concerns in the field. The output size of an experiment depends on several factors, such as:
. organism -> as cyprianj said, less complex organisms generally have smaller genomes, while the human genome has about 3.2 billion nucleotides (letters)
. experiment -> sequencing experiments vary in output; for example, whole genome sequencing has one of the largest outputs. The technology also has some influence: different platforms and chips generate more or less throughput.
To give an example: one small RNA-sequencing run on human plasma can give me a raw text file of only a couple of GB, generally less than 10 GB. However, one human whole genome sequencing run on the Illumina NovaSeq with paired-end reads gives me 2 files of 70 GB each.
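A rough back-of-the-envelope sketch of how these factors combine into a file size. The coverage (30x), read length (150 bp) and header length (60 bytes) below are illustrative assumptions, not values from this thread, and the function name is made up for the example. It estimates uncompressed FASTQ text only, so real runs will differ.

```python
def estimate_fastq_bytes(genome_size_bp, coverage, read_length, header_bytes=60):
    """Rough uncompressed size, in bytes, of all FASTQ records for one run.

    header_bytes is an assumed average length of the '@...' header line.
    """
    # Number of reads needed to cover the genome `coverage` times.
    n_reads = genome_size_bp * coverage // read_length
    # One FASTQ record = 4 lines: header, bases, '+', qualities (each ends with '\n').
    bytes_per_record = (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)
    return n_reads * bytes_per_record


HUMAN_GENOME_BP = 3_200_000_000  # genome size quoted in this comment

# Assumed run parameters (hypothetical): 30x coverage, 150 bp reads.
total = estimate_fastq_bytes(HUMAN_GENOME_BP, coverage=30, read_length=150)
per_file = total / 2  # paired-end: reads are split across two files
print(f"total: {total / 1e9:.0f} GB, per file: {per_file / 1e9:.0f} GB")
```

Changing the genome size (organism) or the coverage and read length (experiment) shows directly how each factor scales the output.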
How to store this data? The sequencer usually generates plain text files called FASTQ, which you can compress using gzip, for example. NCBI has a repository called SRA (https://www.ncbi.nlm.nih.gov/sra), a public repository of raw sequencing data, where people usually submit their data before publishing a scientific article. They have their own compression system there to reduce the size of the data.
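A minimal sketch of compressing a FASTQ file with gzip, using only Python's standard library. The helper name `gzip_file` and the demo file name are hypothetical; in practice many people simply run the `gzip` command line tool on the file.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(src, dst=None):
    """Compress `src` into a gzip file and return the output path."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".gz")
    # Stream in chunks so a multi-gigabyte FASTQ never has to fit in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Tiny demo on a synthetic FASTQ (hypothetical file name and reads).
with tempfile.TemporaryDirectory() as tmp:
    fastq = Path(tmp) / "demo.fastq"
    record = "@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
    fastq.write_text(record * 1000)
    gz = gzip_file(fastq)
    print(fastq.stat().st_size, "->", gz.stat().st_size, "bytes")
```

The synthetic file is highly repetitive, so it compresses far better than real reads would; the ratio printed here says nothing about real sequencing data.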
In addition to the raw FASTQ files, another problem is storing the product of mapping them. Since the raw reads are just unordered sequences of letters, we align them to a genome: for example, if you have a human FASTQ, you map the reads to the human reference genome. The result is another huge plain text file called SAM, which is usually much bigger than the raw sequencing file. However, you can convert this plain text into a binary file called BAM, which is compressed and occupies less space. A SAM from a human genome sequencing run can take hundreds of GB, but a BAM can be less than 100 GB, depending on your data.
In addition to SRA, as the author mentioned, there is the TCGA project (The Cancer Genome Atlas). You can download thousands of raw datasets from cancer patients from this project; however, access is controlled: you need to create a login in the system, write an abstract of your project, and go through some other bureaucratic steps along the way.

 2 years ago  

Thanks for bringing this discussion back to life! I really enjoyed the quantitative information in your reply, which adds more detail to what @cyprianj wrote a while ago. (I was almost tempted to register to download cancer data. However, I already have too much work on my plate ;)

By the way, feel free to browse the STEMsocial community for more scientific discussions and threads. We are also available on our Discord server.

Cheers!