NanoAbLLaMA: construction of nanobody libraries with protein large language models

Bibliographic Details
Main Authors: Xin Wang, Haotian Chen, Bo Chen, Lixin Liang, Fengcheng Mei, Bingding Huang
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-02-01
Series: Frontiers in Chemistry
Subjects: reinforcement learning, generative AI, nanobodies, libraries, protein large language models
Online Access: https://www.frontiersin.org/articles/10.3389/fchem.2025.1545136/full
author Xin Wang
Haotian Chen
Bo Chen
Lixin Liang
Fengcheng Mei
Bingding Huang
collection DOAJ
description Introduction: Traditional methods for constructing synthetic nanobody libraries are labor-intensive and time-consuming. This study introduces a novel approach that leverages protein large language models (LLMs) to generate germline-specific nanobody sequences, enabling efficient library construction through statistical analysis.
Methods: We developed NanoAbLLaMA, a protein LLM based on LLaMA2 and fine-tuned with low-rank adaptation (LoRA) on 120,000 curated nanobody sequences. The model generates sequences conditioned on the germlines IGHV3-3*01 and IGHV3S53*01. Training involved dataset preparation from SAbDab and experimental data, alignment with IMGT germline references, and structural validation using ImmuneBuilder and Foldseek.
Results: NanoAbLLaMA achieved near-perfect germline generation accuracy (100% for IGHV3-3*01, 95.5% for IGHV3S53*01). Structural evaluations showed higher predicted Local Distance Difference Test (pLDDT) scores (90.32 ± 10.13) than IgLM (87.36 ± 11.17), with comparable TM-scores. Generated sequences also contained fewer high-risk post-translational modification sites than those from IgLM. Statistical analysis of the CDR regions confirmed sequence diversity, particularly in CDR3, enabling the creation of synthetic libraries with high humanization (>99.9%) and low risk.
Discussion: This work shifts the paradigm of nanobody library construction by integrating LLMs, significantly reducing time and resource demands. While NanoAbLLaMA excels at germline-specific generation, its limitations include restricted germline coverage and limited framework flexibility. Future efforts should expand germline diversity and incorporate druggability metrics for clinical relevance. The model's code, data, and resources are publicly available to facilitate broader adoption.
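The Methods above describe two steps that a short sketch can make concrete: parameter-efficient LoRA fine-tuning of a LLaMA2 base model on curated nanobody sequences, and germline-conditioned generation. The code below is not the authors' released implementation; it is a minimal sketch using Hugging Face transformers and peft, in which the base checkpoint name, the adapter hyperparameters, and the "[IGHV3-3*01]" prompt tag are assumptions, since the record only states that generation is conditioned on the germline.

```python
# Minimal sketch (assumed setup, not the paper's released code): LoRA-adapt a
# LLaMA2 causal LM for nanobody sequence generation, then sample conditioned
# on a germline tag. Prompt format and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-7b-hf"   # LLaMA2 base, per the Methods description

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)

# Low-rank adapters on the attention projections; only these weights train.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # illustrative rank, not the reported value
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # sanity check: only adapter weights are trainable

# After fine-tuning on the 120,000 curated nanobody sequences, generation could
# be conditioned on a germline tag prepended to the prompt (hypothetical format).
prompt = "[IGHV3-3*01]"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=160, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For the structural-validation step, a generated sequence can be folded with ImmuneBuilder's nanobody model and the resulting PDB compared against reference structures with the Foldseek CLI; the sequence below is a placeholder, not an output of NanoAbLLaMA.

```python
# Sketch of structural validation with ImmuneBuilder (NanoBodyBuilder2).
from ImmuneBuilder import NanoBodyBuilder2

sequence = {"H": "QVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVS"
                 "AISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK"
                 "DRLSITIRPRYYGLDVWGQGTTVTVSS"}    # placeholder VHH sequence
nanobody = NanoBodyBuilder2().predict(sequence)
nanobody.save("generated_nanobody.pdb")           # per-residue error estimates go in the B-factor column

# Structural similarity against a reference set could then be scored with Foldseek, e.g.:
#   foldseek easy-search generated_nanobody.pdb reference_db result.m8 tmp
```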
format Article
id doaj-art-4aec6c66a07e4a0583bfc8ebfe213a8e
institution DOAJ
issn 2296-2646
language English
publishDate 2025-02-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Chemistry
spelling doaj-art-4aec6c66a07e4a0583bfc8ebfe213a8e, 2025-08-20T03:11:10Z, eng, Frontiers Media S.A., Frontiers in Chemistry, 2296-2646, 2025-02-01, vol. 13, 10.3389/fchem.2025.1545136, article 1545136
author affiliations Xin Wang: College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
Haotian Chen: College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
Bo Chen: Chengdu NBbiolab. CO., LTD., SME Incubation Park, Chengdu, China
Lixin Liang: College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
Fengcheng Mei: College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
Bingding Huang: College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
title NanoAbLLaMA: construction of nanobody libraries with protein large language models
topic reinforcement learning
generative AI
nanobodies
libraries
protein large language models
url https://www.frontiersin.org/articles/10.3389/fchem.2025.1545136/full