HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model

Mingqian Ma†1,2, Guoqing Liu†1, Chuan Cao†1, Pan Deng†1, Tri Dao3, Albert Gu4, Peiran Jin1, Zhao Yang5, Yingce Xia1, Renqian Luo1, Pipi Hu1, Zun Wang1, Yuan-Jyue Chen1, Haiguang Liu1, Tao Qin1*

1Microsoft Research AI for Science
2UM-SJTU Joint Institute, Shanghai Jiao Tong University
3Department of Computer Science, Princeton University
4Machine Learning Department, Carnegie Mellon University
5Gaoling School of Artificial Intelligence, Renmin University of China
†Equal Contribution

HybriDNA: Unlocking the Language of Life

Understanding DNA, the genetic code of life, is one of the most pressing challenges in modern science. HybriDNA is a DNA language model that advances sequence analysis and design by combining Transformer and Mamba2 architectures, pairing the modeling accuracy of attention with the subquadratic efficiency of state space models. By processing ultra-long DNA sequences at single-nucleotide resolution, HybriDNA unlocks new possibilities in genomics, medicine, and biotechnology.

Key Highlights:

  • Hybrid AI Architecture: Combines the complementary strengths of Transformers and Mamba2.
  • Ultra-Long Context Processing: Efficiently handles DNA sequences up to 131,072 nucleotides with single-nucleotide resolution.
  • Multi-Species Genomes: Pretrained on diverse genomes across over 800 species, making it adaptable to various biological domains.
  • State-of-the-Art Performance: Excels on 33+ genomic understanding datasets from the BEND, GUE, and LRB benchmarks.
  • Revolutionizing DNA Design: Generates realistic synthetic DNA elements (e.g., human enhancers, yeast promoters) with desired biological functions.
  • Scalability: Increasing the model size from 300M to 3B and 7B parameters improves downstream performance.

With these capabilities, HybriDNA is unlocking new frontiers in understanding and engineering the "language of life."

HybriDNA Model Pipeline


The HybriDNA model pipeline illustrates the stages of pretraining, echo embedding fine-tuning, and generative fine-tuning. The hybrid Transformer-Mamba2 architecture facilitates efficient long-range DNA sequence modeling, achieving high accuracy in both generative and comprehension tasks.
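The echo-embedding stage rests on a simple trick: feed each sequence through the causal model twice and read token embeddings off the second copy, so that every position has effectively attended to the whole sequence despite the causal mask. Below is a minimal, self-contained sketch of that idea; the toy model and all names are illustrative stand-ins, not HybriDNA's actual interface.

```python
import torch
import torch.nn as nn

# Toy stand-in for a causal DNA language model; HybriDNA's real API may differ.
class ToyCausalLM(nn.Module):
    def __init__(self, vocab=8, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids):
        x = self.emb(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.shape[1])
        return self.encoder(x, mask=mask)  # (B, L, D) per-token hidden states

def echo_embed(model, ids):
    """Run the sequence twice through a causal model and keep hidden states
    of the second copy, so each position has seen the full sequence."""
    L = ids.shape[1]
    echoed = torch.cat([ids, ids], dim=1)   # (B, 2L)
    hidden = model(echoed)                  # (B, 2L, D)
    return hidden[:, L:, :]                 # (B, L, D) echo embeddings

ids = torch.randint(0, 8, (1, 16))          # toy nucleotide token ids
emb = echo_embed(ToyCausalLM(), ids)
print(emb.shape)                            # torch.Size([1, 16, 32])
```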

Computational Efficiency of HybriDNA

HybriDNA significantly improves efficiency over standard Transformer-based models, especially when handling long DNA sequences. By incorporating a hybrid Transformer-Mamba2 architecture, HybriDNA achieves subquadratic scaling while maintaining state-of-the-art performance. The Mamba2 blocks enable efficient processing of ultra-long sequences, reducing computational overhead compared to standard self-attention layers.
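Concretely, a hybrid stack of this kind interleaves a small number of attention blocks among Mamba2 blocks. The sketch below shows one plausible way to assemble such a stack in PyTorch with the open-source `mamba-ssm` package; the layer ratio, dimensions, and block details are assumptions for illustration, not taken from the HybriDNA paper.

```python
import torch.nn as nn
from mamba_ssm import Mamba2  # pip install mamba-ssm (requires a CUDA GPU)

class AttentionBlock(nn.Module):
    """Pre-norm self-attention block (quadratic in sequence length)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class Mamba2Block(nn.Module):
    """Pre-norm Mamba2 block (linear-time selective state space layer)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = Mamba2(d_model=dim)

    def forward(self, x):
        return x + self.mixer(self.norm(x))

def hybrid_stack(dim=256, n_layers=8, attn_every=4):
    """Illustrative interleaving: one attention layer per `attn_every`
    layers, the rest Mamba2. The exact ratio and placement in HybriDNA
    may differ; this is an assumption of the sketch."""
    layers = [AttentionBlock(dim) if (i + 1) % attn_every == 0
              else Mamba2Block(dim) for i in range(n_layers)]
    return nn.Sequential(*layers)
```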


Efficiency Comparison

As shown in the figure, HybriDNA achieves 3.4× the training throughput of a pure Transformer model at a sequence length of 49k tokens. It also keeps GPU memory usage lower, avoiding the out-of-memory (OOM) failures that standard Transformer architectures hit at long context lengths. This efficiency lets HybriDNA scale to 131k-nucleotide contexts while remaining computationally practical.
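A quick back-of-envelope calculation shows where the quadratic cost bites. If attention scores are materialized naively in fp16, the L×L score matrix alone grows from megabytes to tens of gigabytes across the lengths discussed here. (Optimized kernels such as FlashAttention avoid materializing it, so these figures illustrate scaling, not actual usage.)

```python
# Memory for one naive L x L fp16 attention-score matrix, per head per layer.
for L in (4_096, 49_152, 131_072):
    scores_gib = L * L * 2 / 2**30   # 2 bytes per fp16 element
    print(f"L={L:>7,}: {scores_gib:8.2f} GiB per head per layer")
# L=  4,096:     0.03 GiB per head per layer
# L= 49,152:     4.50 GiB per head per layer
# L=131,072:    32.00 GiB per head per layer
```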

Representative Tasks and Results

Causal eQTL Variant Effect Prediction

Genetic variants can influence gene expression, a fundamental process that drives cellular function and disease mechanisms. Causal eQTL prediction helps researchers identify which DNA mutations are responsible for changes in gene activity, aiding in the study of complex diseases like cancer and neurological disorders. HybriDNA has a 74% probability of correctly ranking a true causal genetic variant higher than a non-causal variant.

eQTL Results
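That 74% "probability of correct ranking" is the standard interpretation of the AUROC statistic: an AUROC of 0.74 means a randomly chosen causal variant outscores a randomly chosen non-causal one 74% of the time. The toy snippet below (synthetic scores, not HybriDNA outputs) verifies the equivalence numerically:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic scores: label 1 = causal variant, label 0 = non-causal.
rng = np.random.default_rng(0)
labels = np.array([1] * 100 + [0] * 100)
scores = np.concatenate([rng.normal(0.9, 1, 100), rng.normal(0.0, 1, 100)])

auc = roc_auc_score(labels, scores)

# AUROC equals the fraction of (causal, non-causal) pairs ranked correctly.
pairs = [(s1 > s0) + 0.5 * (s1 == s0)
         for s1 in scores[:100] for s0 in scores[100:]]
print(f"AUROC={auc:.3f}, pairwise ranking accuracy={np.mean(pairs):.3f}")
# The two numbers are identical by construction (Mann-Whitney equivalence).
```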

Transcription Factor Binding Prediction (Mouse)

Transcription factors (TFs) are proteins that bind to DNA to regulate gene expression. Predicting TF binding sites is crucial for understanding how genes are turned on and off in different cell types and conditions, impacting areas such as drug discovery and gene therapy. This experiment demonstrates the strong ability of HybriDNA to identify protein binding sites involved in regulating gene expression within mouse DNA sequences.

TF Binding Results
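Tasks like TF binding (and chromatin accessibility below) are binary sequence classification: pool the per-token embeddings and apply a lightweight classification head. The sketch below is a generic version of such a head; the mean-pooling choice and the embedding source are assumptions, not HybriDNA's exact fine-tuning recipe.

```python
import torch
import torch.nn as nn

class BinaryHead(nn.Module):
    """Mean-pool per-token embeddings (e.g., the echo embeddings above),
    then a linear layer producing one binding logit per sequence."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, token_embeddings):        # (B, L, D)
        pooled = token_embeddings.mean(dim=1)   # (B, D)
        return self.fc(pooled).squeeze(-1)      # (B,) logits

head = BinaryHead(dim=32)
emb = torch.randn(4, 200, 32)                   # 4 sequences of 200 tokens
loss = nn.functional.binary_cross_entropy_with_logits(
    head(emb), torch.tensor([1., 0., 1., 0.]))  # 1 = binding site present
loss.backward()
```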

Chromatin Accessibility Prediction

Chromatin accessibility determines which regions of DNA are available for gene regulation. Understanding these open regions helps researchers identify active genes and epigenetic modifications, making it a key task for studies in cancer and developmental biology. The result indicates that HybriDNA can predict whether a region of DNA is accessible (open for gene regulation) or inaccessible with 84% accuracy.

Chromatin Accessibility Results

Human Enhancer Generation

Enhancers are regulatory DNA sequences that boost gene expression. Generating synthetic enhancers can revolutionize gene therapy by enabling precise control over gene activity, helping to treat genetic diseases and improve cellular engineering techniques. In this design task, HybriDNA generates synthetic enhancer sequences that boost gene expression in desired cell types and achieve higher predicted activity than those of the previous best model; activity is predicted by a separately trained sequence-to-activity regression model.

Enhancer Results
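At a sketch level, generative design of this kind samples candidate sequences autoregressively and ranks them with the separately trained activity regressor mentioned above. The snippet below illustrates temperature sampling over the four-nucleotide vocabulary and best-of-N selection; the model interfaces and toy stand-ins are hypothetical, not HybriDNA's actual decoding API.

```python
import torch

NUCS = "ACGT"

@torch.no_grad()
def sample_sequence(lm, length=200, temperature=0.8):
    """Autoregressive nucleotide sampling sketch. `lm(ids)` is assumed to
    return next-token logits of shape (B, vocab); the real decoding
    interface may differ."""
    ids = torch.zeros(1, 1, dtype=torch.long)       # assumed BOS token
    for _ in range(length):
        logits = lm(ids)[:, :4] / temperature       # restrict to A/C/G/T
        nxt = torch.multinomial(torch.softmax(logits, -1), 1)
        ids = torch.cat([ids, nxt], dim=1)
    return "".join(NUCS[i] for i in ids[0, 1:].tolist())

def best_of_n(lm, activity_model, n=64):
    """Generate n candidates and keep the one the (separately trained,
    hypothetical) sequence-to-activity regressor scores highest."""
    cands = [sample_sequence(lm) for _ in range(n)]
    return max(cands, key=activity_model)

# Toy stand-ins so the sketch runs end to end:
toy_lm = lambda ids: torch.randn(ids.shape[0], 8)   # random "logits"
toy_activity = lambda seq: seq.count("GC")          # dummy activity oracle
print(best_of_n(toy_lm, toy_activity, n=4)[:40])
```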

These results demonstrate HybriDNA's ability to outperform previous state-of-the-art models in key genomic tasks, enabling advances in gene regulation, disease modeling, and synthetic biology.

Applications of HybriDNA

HybriDNA is a powerful tool for genomics, synthetic biology, and medicine. Below are some of its potential applications.


  • 🧬 Genomic Research: Understanding regulatory elements and mutations.
  • 🦠 Drug Discovery: Predicting interactions between genes and drugs.
  • 🔬 Synthetic Biology: Designing new DNA sequences with targeted properties.
  • 🧑‍⚕️ Medical Diagnostics: Assisting in disease gene discovery.

BibTeX

@misc{ma2025hybridnahybridtransformermamba2longrange,
      title={HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model}, 
      author={Mingqian Ma and Guoqing Liu and Chuan Cao and Pan Deng and Tri Dao and Albert Gu and Peiran Jin and Zhao Yang and Yingce Xia and Renqian Luo and Pipi Hu and Zun Wang and Yuan-Jyue Chen and Haiguang Liu and Tao Qin},
      year={2025},
      eprint={2502.10807},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.10807}, 
}