Secondary Structure-Guided Novel Protein Sequence Generation with Latent Graph Diffusion
Abstract
Designing protein sequences under given constraints or conditions is an important research topic in biology. Many powerful deep generative models have been proposed to create proteins that belong to specific families or adopt predetermined backbone structures. However, sufficient homologous data is not available for every protein to train a model, and proteins from the same family may lack the necessary structural similarity, which makes it difficult to ensure that crucial structures are present in the generated proteins. On the other hand, generating proteins with a fixed backbone involves a trade-off between reliability and flexibility of sequence generation, as it requires the protein length and the precise positions of amino acids to be specified in advance. This work introduces a flexible method for amino acid sequence generation that combines latent diffusion models with protein language models. The generation is conditioned on protein secondary structures to better address practical considerations in bioengineering. It enables structural constraints to be imposed on the generated proteins while ensuring an adequate level of novelty and diversity at the sequence level. We compare our method against popular language models and structure-based methods using quantitative metrics, demonstrating its superiority in generating diverse and novel sequences that exhibit high foldability. Furthermore, we provide case studies of generating proteins with specific secondary structures to analyze the biological significance of our method. The source code is publicly available at https://github.com/riacd/CPDiffusion-SS.