MSc Thesis Presentation - Xiang Zhang

Date

November 28, 2025, 2:00 PM – 3:30 PM

Name: Xiang Zhang

Date: Nov 28, 2025

Time: 2:30 pm

Location: ICCS 104

Supervisors: Laks V.S. Lakshmanan and Muhammad Abdul-Mageed

Title: Language Modeling Techniques for Biological Sequence Processing

Abstract:
Biological sequences (DNA, RNA, and proteins) form the basis of genetic information in all living organisms and viruses, dictating everything from genetic inheritance to biochemical processes and physical characteristics. Accurate prediction and generation of these sequences are crucial for advancing personalized medicine, developing novel therapeutics, and understanding evolutionary processes. However, traditional language modeling techniques often struggle to achieve the level of precision required for real-world applications, where even a single bio-token error can render predictions biologically meaningless.

This thesis addresses the central challenge of enhancing the precision of biological sequence processing through language modeling innovations that tackle sources of prediction failure distinct from those arising in natural language generation tasks.
We find that biological sequence prediction suffers from both local semantic errors and global constraint violations. To address local errors, we introduce a reflection-based biological sequence pretraining framework that augments the autoregressive Transformer with self-correction capabilities. By incorporating auxiliary reasoning tokens and training the model to recognize and correct its own mistakes, we achieve significant improvements in amino acid precision and peptide-level accuracy.
To address global constraint violations in biological sequence generation, we develop a non-autoregressive Transformer that leverages bidirectional, global-context constrained optimization. By incorporating a precise sequence-level mass control module, this approach achieves state-of-the-art results in protein sequencing tasks.
These approaches demonstrate that biological sequence modeling demands domain-specific adaptations of language modeling techniques. Local error correction via reflection mechanisms addresses semantic and reasoning failures, while global constrained optimization enforces physical and chemical validity. Each method offers a distinct perspective on overcoming the limitations of naive natural language modeling in biological contexts.