NanoAbLLaMA: construction of nanobody libraries with protein large language models
Introduction: Traditional methods for constructing synthetic nanobody libraries are labor-intensive and time-consuming. This study introduces a novel approach that leverages protein large language models (LLMs) to generate germline-specific nanobody sequences, enabling efficient library construction through statistical analysis.

Methods: We developed NanoAbLLaMA, a protein LLM based on LLaMA2, fine-tuned with low-rank adaptation (LoRA) on 120,000 curated nanobody sequences. The model generates sequences conditioned on the germlines IGHV3-3*01 and IGHV3S53*01. Training involved dataset preparation from SAbDab and experimental data, alignment with IMGT germline references, and structural validation using ImmuneBuilder and Foldseek.

Results: NanoAbLLaMA achieved near-perfect germline generation accuracy (100% for IGHV3-3*01, 95.5% for IGHV3S53*01). Structural evaluations showed higher predicted Local Distance Difference Test (pLDDT) scores (90.32 ± 10.13) than IgLM (87.36 ± 11.17), with comparable TM-scores. Generated sequences exhibited fewer high-risk post-translational modification sites than those from IgLM. Statistical analysis of the CDR regions confirmed sequence diversity, particularly in CDR3, enabling the construction of synthetic libraries with high humanization (>99.9%) and low risk.

Discussion: This work establishes a paradigm shift in nanobody library construction by integrating LLMs, significantly reducing time and resource demands. While NanoAbLLaMA excels at germline-specific generation, its limitations include restricted germline coverage and limited framework flexibility. Future efforts should expand germline diversity and incorporate druggability metrics for clinical relevance. The model’s code, data, and resources are publicly available to facilitate broader adoption.
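The Methods summary describes LoRA fine-tuning of a LLaMA2-based model and germline-conditioned sequence generation. Below is a minimal, illustrative sketch of that kind of setup using the Hugging Face transformers and peft libraries. The base checkpoint name, the bracketed germline prompt convention, and the LoRA hyperparameters are assumptions made for illustration only and are not taken from the NanoAbLLaMA release.

```python
# Minimal sketch (not the authors' code): attach LoRA adapters to a causal LM
# and sample a sequence conditioned on a germline tag in the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

# Low-rank adapters on the attention projections; only these weights would be
# updated during fine-tuning on curated nanobody sequences.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Germline-conditioned generation (hypothetical prompt convention): the
# fine-tuned model continues the germline tag with an amino-acid sequence.
prompt = "[IGHV3-3*01]"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=130, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Generated sequences could then be folded with a structure predictor such as ImmuneBuilder and compared against references (e.g., with Foldseek), which is how the abstract describes the structural validation step.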
| Main Authors: | Xin Wang, Haotian Chen, Bo Chen, Lixin Liang, Fengcheng Mei, Bingding Huang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-02-01 |
| Series: | Frontiers in Chemistry |
| Subjects: | reinforcement learning; generative AI; nanobodies; libraries; protein large language models |
| Online Access: | https://www.frontiersin.org/articles/10.3389/fchem.2025.1545136/full |
| ISSN: | 2296-2646 |
| DOI: | 10.3389/fchem.2025.1545136 |
| Author Affiliations: | Xin Wang, Haotian Chen, Lixin Liang, Fengcheng Mei, Bingding Huang: College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China; Bo Chen: Chengdu NBbiolab. CO., LTD., SME Incubation Park, Chengdu, China |
| Source Record: | DOAJ, doaj-art-4aec6c66a07e4a0583bfc8ebfe213a8e |