You are viewing a single comment's thread from:

RE: Big Data and Bioinformatics: where we are on Omics technology

in StemSocial · 2 years ago

Nice to read from you again, although I was a bit late with this post (for the reasons already mentioned too many times on other blogs ;) ).

This is aimed at making Hive, and @stemsocial in particular, a true repository of facts and information that might not be easily found on the internet. Some of the information might be available online, but likely not as easy to understand as what you get on @stemsocial. This is my simple long-term goal and I believe it is achievable.

I also guess that your long term goal is achievable, and we at STEMsocial will do our best to support it as much as we can!

With the present blog, you immediately triggered my interest, and the first question I wrote on my pad was how one gets such big datasets in practice, in the case considered here. Moreover, this question is actually a double one, as it also refers to the size of the datasets. How large are they in terms of storage space? Would it be possible to be a bit more quantitative?

I am also wondering how the data is stored and in which format. You mentioned a few examples, but I am wondering whether those are text-based formats or whether they rely on something more involved (I guess the answer is the latter). Would you mind providing some more details? Thanks in advance!

Finally, are large datasets really freely available for analysis by any group (as seems to be pointed out in your blog)? Additionally, are there universal classes of open-source codes that allow us to read this data easily, and thus post-process it for an analysis?

Thanks in advance for your answers, and sorry for the questions that may sound too naive.

 2 years ago  

Alright sir, thanks for the lovely questions; as usual I will take them one after the other.

How large are they in terms of storage space? Would it be possible to be a bit more quantitative?

Omics data come in different sizes depending on the organism in question. For example, lower organisms have genome sizes varying from about 12 kb to 12 Mb, while higher organisms range from 2 Mb to over 100,000 Mb. For humans, it is about 6,200 megabase pairs. More information has since surfaced, and some have claimed it is even more than that.
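
To give a feel for what those base-pair counts mean in storage terms, here is a rough back-of-envelope sketch (my own illustration, not from the original reply; it assumes a plain-text, one-byte-per-base representation and the genome sizes quoted above):

```python
# Rough storage estimate for genome sequences stored as plain text
# (about 1 byte per base, ignoring headers and line breaks).
# The example genome sizes below are illustrative assumptions.

GENOME_SIZES_BP = {
    "small virus (~12 kb)": 12_000,
    "bacterium (~12 Mb)": 12_000_000,
    "human (~6,200 Mb)": 6_200_000_000,
}

def plain_text_size_mb(bases: int) -> float:
    """Approximate plain-text size in megabytes, at ~1 byte per base."""
    return bases / 1_000_000

for name, size_bp in GENOME_SIZES_BP.items():
    print(f"{name}: ~{plain_text_size_mb(size_bp):,.0f} MB of raw sequence text")
```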

For omics data analysis, this is the platform I am currently studying with.
It looks quite complicated, but I am confident I will perfect it in no time.

https://server.t-bio.info/

The gene sequence data are actually freely accessible on the NCBI (National Center for Biotechnology Information) site. It is an open-access domain where all the genes that have so far been sequenced are found. All you do is copy the gene sequence and then send it to the t-bio platform for analysis.
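
If you prefer to fetch sequences programmatically rather than copy-pasting them from the website, a small sketch using Biopython's Entrez interface could look like this (my own example; the accession number and email address are placeholders chosen for illustration):

```python
# Fetch a nucleotide sequence from NCBI in FASTA format using Biopython.
# Requires: pip install biopython
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"  # NCBI asks for a contact email (placeholder)

# NM_000518 is the human beta-globin (HBB) mRNA RefSeq accession, used here as an example.
handle = Entrez.efetch(db="nucleotide", id="NM_000518", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()

print(record.id, len(record.seq), "bases")
print(record.seq[:60], "...")
```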

https://www.ncbi.nlm.nih.gov/genome/?term=51

This link will take you directly to NCBI, where you can get a glimpse of the information available on the whole sequenced human genome.

 2 years ago  

Thanks for providing all these pieces of information. I don't understand everything on the shared websites, but I think I managed to get some basic stuff from them.

Have a nice end of the week!

Well, this post is already 2 months old! Nice to see people talking about bioinformatics here. Storage is one of the main concerns in the field. The output size of an experiment's data depends on several factors, such as:
- organism: as cyprianj said, less complex organisms have smaller genomes, while humans have about 3.2 billion nucleotides (letters);
- experiment: sequencing experiments vary in output; for example, whole genome sequencing produces one of the largest outputs, and the technology also has an influence, since different platforms and chipsets generate more or less throughput.
To give an example: one sequencing run of small RNA-sequencing from human plasma can give me a raw text file of only a couple of GB, sometimes less than 10 GB. However, one whole genome sequencing of a human using Illumina NovaSeq technology with paired-end reads gives me 2 files of 70 GB each.
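
Numbers of that order follow from simple arithmetic on the read count. Here is a back-of-envelope sketch of my own (the 30x coverage, 150 bp read length, and per-record overhead are assumptions for illustration, not figures from the comment above):

```python
# Back-of-envelope estimate of uncompressed FASTQ size for a human WGS run.
# Assumed parameters (illustrative, not from the original comment):
GENOME_SIZE_BP = 3_200_000_000   # ~3.2 billion bases (haploid human genome)
COVERAGE = 30                    # typical WGS target depth
READ_LENGTH = 150                # common Illumina paired-end read length
HEADER_OVERHEAD = 60             # rough bytes for the @header and '+' lines per record

total_bases = GENOME_SIZE_BP * COVERAGE
n_reads = total_bases // READ_LENGTH

# Each FASTQ record stores the sequence AND a quality string of the same length,
# plus header lines, so roughly 2 bytes per base plus some overhead.
bytes_per_read = 2 * READ_LENGTH + HEADER_OVERHEAD
total_bytes = n_reads * bytes_per_read

print(f"~{n_reads:,} reads, ~{total_bytes / 1e9:,.0f} GB of uncompressed FASTQ")
# Split across two paired-end files, this lands in the same order of magnitude
# as the 2 x 70 GB figure mentioned above (compression shrinks it substantially).
```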
How is this data stored? Usually the sequencer generates plain text files called FASTQs, which you can compress using gzip, for example. NCBI has a repository called SRA (https://www.ncbi.nlm.nih.gov/sra), which people usually use to submit data before publishing a scientific article; it is a public repository of raw sequencing data, and they have a compression system there to reduce the size of the data.
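
As a small illustration of how one works with these compressed plain-text FASTQ files in practice (a sketch of my own; the file name is a placeholder), Python's standard gzip module is enough to stream through a .fastq.gz and count the reads:

```python
# Count reads in a gzip-compressed FASTQ file by streaming it line by line.
# A FASTQ record is exactly 4 lines: @header, sequence, '+', quality string.
import gzip

def count_fastq_reads(path: str) -> int:
    """Return the number of reads in a .fastq.gz file (placeholder path)."""
    n_lines = 0
    with gzip.open(path, "rt") as handle:  # "rt" = read the compressed file as text
        for _ in handle:
            n_lines += 1
    return n_lines // 4

if __name__ == "__main__":
    print(count_fastq_reads("sample_R1.fastq.gz"), "reads")
```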
In addition to the raw sequencing FASTQ files, which are essentially random letter sequences, another problem is storing the product of mapping these files. We align the sequences to a genome; for example, if you have a human sequencing FASTQ, you map the reads to the human genome. The result is another huge plain text file called a SAM, which is usually much bigger than the raw sequencing file. However, you can compress this plain text into a binary file called a BAM, which occupies less space. A SAM from a human genome sequencing run can be hundreds of GB, but a BAM can be less than 100 GB depending on your data.
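
For completeness, here is a minimal sketch of that SAM-to-BAM compression step using the pysam library (my own example; the file names are placeholders, and in practice many people use the samtools command-line tool for the same job):

```python
# Convert a plain-text SAM file into a compressed binary BAM file with pysam.
# Requires: pip install pysam
import pysam

with pysam.AlignmentFile("mapped_reads.sam", "r") as sam_in:
    # Reuse the SAM header as the template for the BAM output.
    with pysam.AlignmentFile("mapped_reads.bam", "wb", template=sam_in) as bam_out:
        for read in sam_in:
            bam_out.write(read)
```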
In addition to SRA, as the author mentioned, there is the TCGA project (The Cancer Genome Atlas): you can download thousands of raw datasets from cancer patients through this project. However, access is controlled; you need to create a login on the system, write an abstract of your project, and go through a few other bureaucratic steps in the middle.

 2 years ago  

Thanks for bringing this discussion back to life! I really enjoyed the quantitative information in your reply, which adds more details to what @cyprianj wrote a while ago. (I was almost tempted to register to download cancer data; I however have too much work on my plate already ;) )

By the way, feel free to browse the STEMsocial community for more scientific discussions and threads. We are also reachable on our Discord server.

Cheers!