ClusterEmbed: Lightweight Protein Structure Prediction on PCs

Biological sequence design seeks to generate novel sequences, such as proteins, with optimized functional properties, a task complicated by vast combinatorial spaces and complex sequence-function relationships. Traditional offline methods limit adaptability and long-term performance. This paper i...

Bibliographic Details
Main Author: Yuan Chuxin
Format: Article
Language: English
Published: EDP Sciences 2025-01-01
Series: BIO Web of Conferences
Online Access: https://www.bio-conferences.org/articles/bioconf/pdf/2025/33/bioconf_icfsb2025_02014.pdf
collection DOAJ
description Biological sequence design seeks to generate novel sequences, such as proteins, with optimized functional properties, a task complicated by vast combinatorial spaces and complex sequence-function relationships. Traditional offline methods limit adaptability and long-term performance. This paper introduces a novel online learning approach that integrates pre-trained language models (LMs), such as ESM-2, with gradient-based search to dynamically refine a proxy model during optimization. By leveraging real-time updates, our method addresses the static constraints of prior work, achieving significant improvements: 29% faster convergence (600 vs. 850 steps), enhanced proxy accuracy (MSE 1.78 vs. 2.15), and higher sequence quality (fitness 78.9 vs. 72.3), while maintaining diversity (15.7 vs. 15.4). We systematically evaluate key variables (learning rate, update frequency, initial dataset size, and LM type), demonstrating their impact on performance across eight experiments, including long-term optimization up to 10,000 steps (fitness 82.5). The framework’s novelty lies in its hybrid design, combining online learning with a bi-level structure, a fusion underrepresented in the literature. This scalability and adaptability offer practical advantages for protein engineering and synthetic biology, where iterative refinement is essential.
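The description above outlines a loop in which a proxy model is refined online while a search proposes new sequences. The following is a minimal toy sketch of that general idea, not the author's implementation. Everything in it is an assumption made for illustration: the 4-letter alphabet, the hypothetical `oracle` fitness function, the one-hot features (standing in for ESM-2 embeddings), the linear proxy trained by SGD, and the gradient-guided single-site mutation (for a linear proxy, the gradient with respect to the one-hot input is simply the weight vector).

```python
import random

ALPHABET = "ACDE"   # assumption: toy 4-letter alphabet (real proteins use 20)
SEQ_LEN = 8         # assumption: short fixed-length sequences
K = len(ALPHABET)


def oracle(seq):
    """Hypothetical ground-truth fitness, standing in for a real assay."""
    return float(sum(1 for i, ch in enumerate(seq)
                     if (i % 2 == 0 and ch == "A") or (i % 2 == 1 and ch == "C")))


def one_hot(seq):
    """One-hot features; a stand-in for language-model embeddings."""
    x = [0.0] * (SEQ_LEN * K)
    for i, ch in enumerate(seq):
        x[i * K + ALPHABET.index(ch)] = 1.0
    return x


class LinearProxy:
    """Linear fitness predictor updated one example at a time (online SGD)."""

    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        err = self.predict(x) - y  # gradient of 0.5 * squared error w.r.t. prediction
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def propose(seq, proxy, rng, eps=0.2):
    """Pick the single-site swap with the largest proxy gain, or explore randomly."""
    cand = list(seq)
    best_gain, best_move = 0.0, None
    if rng.random() >= eps:
        for i, cur in enumerate(seq):
            base = proxy.w[i * K + ALPHABET.index(cur)]
            for a, ch in enumerate(ALPHABET):
                gain = proxy.w[i * K + a] - base  # change in proxy output for this swap
                if gain > best_gain:
                    best_gain, best_move = gain, (i, ch)
    if best_move is None:  # exploration step, or proxy sees no improving swap
        best_move = (rng.randrange(SEQ_LEN), rng.choice(ALPHABET))
    cand[best_move[0]] = best_move[1]
    return cand


def optimize(steps=300, update_every=1, seed=0):
    """Alternate between proposing sequences and refining the proxy online."""
    rng = random.Random(seed)
    proxy = LinearProxy(SEQ_LEN * K)
    seq = [rng.choice(ALPHABET) for _ in range(SEQ_LEN)]
    cur_fit = oracle(seq)
    best_seq, best_fit = "".join(seq), cur_fit
    for step in range(steps):
        cand = propose(seq, proxy, rng)
        y = oracle(cand)                    # query the (hypothetical) ground truth
        if step % update_every == 0:
            proxy.update(one_hot(cand), y)  # online refinement of the proxy
        if y >= cur_fit:                    # accept candidates that are not worse
            seq, cur_fit = cand, y
        if y > best_fit:
            best_seq, best_fit = "".join(cand), y
    return best_seq, best_fit, proxy


if __name__ == "__main__":
    best_seq, best_fit, _ = optimize()
    print(best_seq, best_fit)
```

In this sketch, `update_every` and `lr` play the role of the update frequency and learning rate that the description lists among the evaluated variables. Setting `update_every` larger than `steps` leaves the proxy almost static after its first update, which gives a rough offline-style baseline for comparison.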
id doaj-art-aa4a7cf1c26948438fbf82b46fd826f9
institution Kabale University
issn 2117-4458
doi 10.1051/bioconf/202518202014
volume 182
article_number 02014
author_affiliation Shoreline Community College