Summary: Researchers have developed an AI model that accurately predicts gene activity in any human cell, providing insights into cellular functions and disease mechanisms.
AI and Gene Activity: The AI model predicts gene expression in unseen cell types using genomic and expression data, enabling insights into cellular functions.
Pediatric Cancer Discovery: The system identified how specific mutations disrupt transcription factors in inherited pediatric leukemia, confirmed by lab experiments.
Exploring Genome “Dark Matter”: The model offers tools to study non-coding genome regions, illuminating the role of unexplored mutations in cancer and disease.
Using a new artificial intelligence method, researchers at Columbia University Vagelos College of Physicians and Surgeons can accurately predict the activity of genes within any human cell, essentially revealing
The system, described in the current issue of Nature, could transform the way scientists work to understand everything from cancer to genetic diseases.
“These methods can effectively conduct large-scale computational experiments, boosting and guiding traditional experimental approaches,” says Raul Rabadan, professor of systems biology and senior author of the new paper.
Traditional research methods in biology are good at revealing how cells perform their jobs or react to disturbances. But they cannot make predictions about how cells work or how cells will react to change, like a cancer-causing mutation.
“It would turn biology from a science that describes seemingly random processes into one that can predict the underlying systems that govern cell behavior.”
In recent years, the accumulation of massive amounts of data from cells and more powerful AI models are starting to transform biology into a more predictive science.
“Previous models have been trained on data in particular cell types, usually cancer cell lines or something else that has little resemblance to normal cells,” Rabadan says.
Xi Fu, a graduate student in Rabadan’s lab, decided to take a different approach, training a machine learning model on gene expression data from millions of cells obtained from normal human tissues.
Fu and Rabadan soon enlisted a team of collaborators, including co-first authors Alejandro Buendia, now a Stanford PhD student formerly in the Rabadan lab, and Shentong Mo of Carnegie Mellon, to train and test the new model.
After training on data from more than 1.3 million human cells, the system became accurate enough to predict gene expression in cell types it had never seen, yielding results that agreed closely with experimental data.
Next, the investigators showed the power of their AI system when they asked it to uncover the still-hidden biology of diseased cells, in this case, an inherited form of pediatric leukemia.
With AI, the researchers predicted that the mutations disrupt the interaction between two different transcription factors that determine the fate of leukemic cells.
“The vast majority of mutations found in cancer patients are in so-called dark regions of the genome. These mutations do not affect the function of a protein and have remained mostly unexplored,” says Rabadan.
Already, Rabadan is working with researchers at Columbia and other universities, exploring different cancers, from brain to blood cancers, learning the grammar of regulation in normal cells,
By presenting novel mutations to the computer model, researchers can now gain deep insights and predictions about exactly how those mutations affect a cell.
UAE, and Carnegie Mellon University, Pittsburgh, PA), Alejandro Buendia, Anouchka P. Laurent, Anqi Shao, Maria del Mar Alvarez-Torres, Tianji Yu, Jimin Tan (New York University Grossman School of Medicine,
New York, NY), Jiayu Su, Romella Sagatelian, Adolfo A. Ferrando (Columbia and Regeneron, Tarrytown, NY), Alberto Ciccia, Yanyan Lan (Tsinghua University, Beijing, China),
David M. Owens, Teresa Palomero, Eric P. Xing (Mohamed bin Zayed University of Artificial Intelligence and Carnegie Mellon University), and Raul Rabadan.
Here we introduce GET (general expression transformer), an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types.
Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types.
GET also shows remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions,
We evaluated its performance in prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors
In fetal erythroblasts, we identified distal (greater than 1 Mbp) regulatory regions that were missed by previous models, and, in B cells, we identified a lymphocyte-specific
In sum, we provide a generalizable and accurate model for transcription together with catalogues of gene regulation and transcription factor interactions, all with cell type specificity.
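The idea of testing a model on cell types it has never seen can be illustrated with a small leave-one-cell-type-out sketch. This is not the GET model or the authors' evaluation code. The data are synthetic, the "model" is a one-feature least-squares line standing in for a real predictor, and the function names (`pearson`, `fit_line`, `leave_one_cell_type_out`, `make_synthetic`) are hypothetical. Pearson correlation between predicted and observed expression is used here as one common agreement measure, chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in model, NOT GET)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def leave_one_cell_type_out(data):
    """data maps cell type -> (accessibility values, expression values).

    For each cell type, train on all the others and score predictions
    on the held-out one. Returns {cell_type: Pearson r}.
    """
    scores = {}
    for held_out in data:
        train_x, train_y = [], []
        for name, (acc, expr) in data.items():
            if name != held_out:
                train_x += acc
                train_y += expr
        slope, intercept = fit_line(train_x, train_y)
        acc, expr = data[held_out]
        predicted = [slope * a + intercept for a in acc]
        scores[held_out] = pearson(predicted, expr)
    return scores


def make_synthetic(seed=0, n_genes=200):
    """Synthetic toy data: expression is a noisy linear function of accessibility."""
    rng = random.Random(seed)
    data = {}
    for name in ("cell_type_A", "cell_type_B", "cell_type_C"):
        acc = [rng.random() for _ in range(n_genes)]
        expr = [2.0 * a + rng.gauss(0, 0.2) for a in acc]
        data[name] = (acc, expr)
    return data


scores = leave_one_cell_type_out(make_synthetic())
for name, r in scores.items():
    print(f"{name}: Pearson r = {r:.3f}")
```

The key design point is that the held-out cell type contributes nothing to fitting, so its score reflects generalization rather than memorization.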