Understanding DNA, the genetic code of life, is one of the most pressing challenges in modern science. HybriDNA is a DNA language model that advances DNA sequence analysis and design by combining Transformer and Mamba2 architectures, pairing strong accuracy with high efficiency. By processing ultra-long DNA sequences at single-nucleotide resolution, HybriDNA unlocks new possibilities in genomics, medicine, and biotechnology.
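HybriDNA operates directly on individual nucleotides rather than on k-mer or BPE chunks. The snippet below is a minimal sketch of what single-nucleotide (character-level) tokenization can look like; the vocabulary and special tokens here are illustrative assumptions, not the model's actual tokenizer.

```python
# Minimal sketch of single-nucleotide (character-level) DNA tokenization.
# The vocabulary and special tokens are illustrative assumptions, not
# HybriDNA's actual tokenizer.

VOCAB = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "A": 3, "C": 4, "G": 5, "T": 6, "N": 7}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(sequence: str) -> list[int]:
    """Map each nucleotide to its own token id (unknown bases fall back to N)."""
    ids = [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]
    return [VOCAB["<bos>"], *ids, VOCAB["<eos>"]]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping special tokens."""
    return "".join(ID_TO_TOKEN[i] for i in ids if i > 2)

if __name__ == "__main__":
    seq = "ACGTNACGT"
    ids = encode(seq)
    print(ids)          # one id per base, plus <bos>/<eos>
    print(decode(ids))  # "ACGTNACGT"
```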
With these capabilities, HybriDNA is unlocking new frontiers in understanding and engineering the "language of life."
The HybriDNA model pipeline illustrates the stages of pretraining, echo embedding fine-tuning, and generative fine-tuning. The hybrid Transformer-Mamba2 architecture facilitates efficient long-range DNA sequence modeling, achieving high accuracy in both generative and comprehension tasks.
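The echo embedding stage adapts a decoder-only model for comprehension tasks. As a rough illustration of the repeat-and-pool idea behind echo-style embeddings (the exact recipe in the paper may differ), the sketch below feeds a sequence through a causal model twice and pools hidden states from the second copy, so every pooled position has effectively seen the whole sequence.

```python
# Rough sketch of echo-style embedding extraction from a causal LM.
# The pooling recipe and the stand-in "model" are assumptions for
# illustration, not HybriDNA's exact fine-tuning setup.
import torch
import torch.nn as nn

def echo_embed(model: nn.Module, input_ids: torch.Tensor) -> torch.Tensor:
    """Feed the sequence twice; mean-pool hidden states of the second copy."""
    doubled = torch.cat([input_ids, input_ids], dim=1)   # (batch, 2L)
    hidden = model(doubled)                               # (batch, 2L, d)
    second_copy = hidden[:, input_ids.size(1):, :]        # tokens that saw the full sequence
    return second_copy.mean(dim=1)                        # (batch, d)

if __name__ == "__main__":
    # Dummy stand-in for a real language model: an embedding table.
    dummy = nn.Embedding(8, 16)
    ids = torch.randint(0, 8, (2, 10))                    # batch of 2 sequences, length 10
    print(echo_embed(dummy, ids).shape)                   # torch.Size([2, 16])
```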
HybriDNA significantly improves efficiency over standard Transformer-based models, especially when handling long DNA sequences. By incorporating a hybrid Transformer-Mamba2 architecture, HybriDNA achieves subquadratic scaling while maintaining state-of-the-art performance. The Mamba2 blocks enable efficient processing of ultra-long sequences, reducing computational overhead compared to standard self-attention layers.
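As an architectural illustration only, the sketch below interleaves standard self-attention layers with a stand-in state-space layer at a configurable ratio. The `SSMBlock` here is a trivial placeholder (a gated causal convolution), not the real Mamba2 kernel, and the layer ratio is an assumption rather than the paper's configuration.

```python
# Illustrative hybrid stack: mostly linear-time sequence-mixing layers,
# with periodic self-attention layers. SSMBlock is a placeholder, NOT Mamba2.
import torch
import torch.nn as nn

class SSMBlock(nn.Module):
    """Placeholder for a Mamba2-style layer: gated causal depthwise conv."""
    def __init__(self, d_model: int, kernel: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, L, D)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)]    # trim right padding (causal)
        return x + torch.sigmoid(self.gate(x)) * h.transpose(1, 2)

class AttnBlock(nn.Module):
    """Standard pre-norm self-attention block."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class HybridStack(nn.Module):
    """Interleave SSM-style and attention layers; the ratio is an assumption."""
    def __init__(self, d_model: int, n_layers: int = 8, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            AttnBlock(d_model) if (i + 1) % attn_every == 0 else SSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

if __name__ == "__main__":
    x = torch.randn(1, 256, 64)
    print(HybridStack(64)(x).shape)   # torch.Size([1, 256, 64])
```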
As shown in the figure, HybriDNA achieves 3.4× higher training throughput compared to a pure Transformer model at a sequence length of 49k tokens. Additionally, it maintains lower GPU memory usage, preventing out-of-memory (OOM) issues that occur in standard Transformer architectures at long context lengths. This efficiency allows scalability up to 131k nucleotides while maintaining computational feasibility.
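A rough back-of-envelope calculation (illustrative only, not a measurement from the paper) shows why quadratic attention struggles at these lengths: the attention score matrix alone grows with the square of the sequence length, while the cost of Mamba2-style layers grows roughly linearly.

```python
# Back-of-envelope memory for a single layer's attention score matrix (fp16),
# ignoring activations and other buffers. Illustrative assumption, not a
# figure from the HybriDNA paper.
def score_matrix_gib(seq_len: int, n_heads: int = 16, bytes_per_el: int = 2) -> float:
    return n_heads * seq_len * seq_len * bytes_per_el / 2**30

for length in (8_192, 49_152, 131_072):
    print(f"{length:>7} tokens -> {score_matrix_gib(length):8.1f} GiB per layer")
```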
Genetic variants can influence gene expression, a fundamental process that drives cellular function and disease mechanisms. Causal eQTL prediction helps researchers identify which DNA mutations are responsible for changes in gene activity, aiding in the study of complex diseases like cancer and neurological disorders. HybriDNA has a 74% probability of correctly ranking a true causal genetic variant higher than a non-causal variant.
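The 74% figure can be read as a ranking probability: given one causal and one non-causal variant chosen at random, the model scores the causal one higher about 74% of the time. If the reported metric is an AUROC-style ranking score, the two views are mathematically equivalent; the snippet below illustrates that equivalence on made-up scores and labels (not eQTL data).

```python
# Toy illustration: the probability of ranking a random positive above a
# random negative equals ROC AUC. Scores and labels below are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)                  # 1 = causal, 0 = non-causal
scores = labels * 0.6 + rng.normal(0, 0.5, size=1000)   # noisy model scores

pos, neg = scores[labels == 1], scores[labels == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()         # P(score_pos > score_neg)

print(f"pairwise ranking prob: {pairwise:.3f}")
print(f"roc_auc_score:         {roc_auc_score(labels, scores):.3f}")
```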
Transcription factors (TFs) are proteins that bind to DNA to regulate gene expression. Predicting TF binding sites is crucial for understanding how genes are turned on and off in different cell types and conditions, impacting areas such as drug discovery and gene therapy. This experiment demonstrates the strong ability of HybriDNA to identify protein binding sites involved in regulating gene expression within mouse DNA sequences.
Chromatin accessibility determines which regions of DNA are available for gene regulation. Understanding these open regions helps researchers identify active genes and epigenetic modifications, making it a key task for studies in cancer and developmental biology. The result indicates that HybriDNA can predict whether a region of DNA is accessible (open for gene regulation) or inaccessible with 84% accuracy.
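Tasks like TF binding-site prediction and chromatin accessibility both reduce to binary classification over a DNA window. As a hedged sketch (the pooling, head shape, and training details are assumptions, not the paper's fine-tuning recipe), one can attach a small classification head to pooled sequence embeddings:

```python
# Minimal sketch of a binary classification head over pooled DNA embeddings
# (e.g., TF binding site vs. not, accessible vs. inaccessible chromatin).
# The backbone is a stand-in; head shape and pooling are assumptions.
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int, n_classes: int = 2):
        super().__init__()
        self.backbone = backbone                 # maps (B, L) token ids -> (B, L, D)
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, n_classes))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)        # (B, L, D)
        pooled = hidden.mean(dim=1)              # simple mean pooling over positions
        return self.head(pooled)                 # (B, n_classes) logits

if __name__ == "__main__":
    backbone = nn.Embedding(8, 32)               # dummy stand-in for the language model
    clf = SequenceClassifier(backbone, d_model=32)
    ids = torch.randint(0, 8, (4, 200))          # 4 windows of 200 nt
    logits = clf(ids)
    loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))
    print(logits.shape, float(loss))
```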
Enhancers are regulatory DNA sequences that boost gene expression. Generating synthetic enhancers could advance gene therapy by enabling precise control over gene activity, helping to treat genetic diseases and improve cellular engineering techniques. The activity level is predicted by a separately trained sequence-to-activity regression model. This design task shows that HybriDNA can generate synthetic DNA sequences that boost gene expression in desired cell types and achieve higher predicted activity levels than the previous best model.
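In this design setting, candidate enhancers are sampled from the generatively fine-tuned model and then scored by the separately trained activity regressor. The loop below is a schematic of that generate-then-filter pattern; `generate_sequence` and `activity_score` are hypothetical stand-ins, not the paper's actual components.

```python
# Schematic generate-then-filter loop for synthetic enhancer design.
# generate_sequence and activity_score are hypothetical stand-ins for the
# generative model and the separately trained sequence-to-activity regressor.
import random

BASES = "ACGT"

def generate_sequence(length: int = 200) -> str:
    """Stand-in for sampling from the generatively fine-tuned model."""
    return "".join(random.choice(BASES) for _ in range(length))

def activity_score(seq: str) -> float:
    """Stand-in for a trained sequence-to-activity regression model."""
    return seq.count("GC") / max(len(seq), 1)   # arbitrary proxy, illustration only

def design_enhancers(n_candidates: int = 1000, top_k: int = 10) -> list[str]:
    """Sample candidates, score them, and keep the highest-activity designs."""
    candidates = [generate_sequence() for _ in range(n_candidates)]
    return sorted(candidates, key=activity_score, reverse=True)[:top_k]

if __name__ == "__main__":
    best = design_enhancers()
    print(len(best), activity_score(best[0]))
```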
These results demonstrate HybriDNA's ability to outperform previous state-of-the-art models in key genomic tasks, enabling advances in gene regulation, disease modeling, and synthetic biology.
HybriDNA is a powerful tool for genomics, synthetic biology, and medicine. Below are some of its potential applications.
@misc{ma2025hybridnahybridtransformermamba2longrange,
  title={HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model},
  author={Mingqian Ma and Guoqing Liu and Chuan Cao and Pan Deng and Tri Dao and Albert Gu and Peiran Jin and Zhao Yang and Yingce Xia and Renqian Luo and Pipi Hu and Zun Wang and Yuan-Jyue Chen and Haiguang Liu and Tao Qin},
  year={2025},
  eprint={2502.10807},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.10807},
}