Important Progress in AI4S Interdisciplinary Research at Our Institute | Mo Fanyang/Yuan Li and Collaborators Develop Machine Learning Model for Chiral Molecule Spectrum Prediction

Time: Jan 4, 2025

As a key spectroscopic technique, electronic circular dichroism (ECD) offers advantages such as low sample consumption and easy measurement. It is widely used in asymmetric catalysis, functional materials, and drug discovery, and has become a powerful tool for determining the absolute configuration of chiral molecules. However, theoretical calculations of circular dichroism spectra are often complex and time-consuming, creating a bottleneck in chemical research and drug development. With the development of artificial intelligence, machine learning-based automated prediction of molecular spectra has attracted widespread attention from researchers. In current research on molecular spectrum prediction, autoregressive models that predict spectra as continuous sequences have shown excellent performance and potential in some tasks. ECD spectra, however, carry sparse characteristic information, so modeling them directly as continuous sequences can lead the model to learn a large amount of irrelevant noise, resulting in overfitting and poor generalization.

Recently, a collaborative team from Professor Mo Fanyang’s group and Professor Yuan Li’s group at Peking University, together with Professor Wang Xinchang’s group at Xiamen University, published a research paper titled “Decoupled peak property learning for efficient and interpretable electronic circular dichroism spectrum prediction” in Nature Computational Science.

DFT calculations of ECD spectra for chiral small molecules are time-consuming, labor-intensive, and demand considerable expertise. To address this, the study converts the continuous spectrum prediction task into a task of learning discrete spectral peak properties. This enables fast, accurate, and general prediction of several kinds of molecular spectra, with ECD spectra as the representative case, as well as mass spectra. Its reliability was validated on a variety of chiral natural product molecules.
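To make the idea of turning a continuous spectrum into discrete peak properties concrete, here is a minimal sketch in Python. It is not the authors' preprocessing code: the function name `extract_peaks`, the `min_abs` threshold, and the rule that a peak is any strict local maximum or minimum of the signed intensity are all assumptions made for illustration, since this article does not describe how the paper defines peaks.

```python
from typing import List, Sequence, Tuple


def extract_peaks(
    wavelengths: Sequence[float],
    intensities: Sequence[float],
    min_abs: float = 0.0,
) -> List[Tuple[float, float]]:
    """Reduce a sampled, signed spectrum to (position, intensity) pairs.

    Assumption (hypothetical rule): a peak is a strict local maximum or
    minimum whose absolute intensity is at least `min_abs`. ECD bands can
    be positive or negative, so both kinds of extremum are kept.
    """
    peaks: List[Tuple[float, float]] = []
    for i in range(1, len(intensities) - 1):
        prev, cur, nxt = intensities[i - 1], intensities[i], intensities[i + 1]
        is_max = cur > prev and cur > nxt  # positive band
        is_min = cur < prev and cur < nxt  # negative band
        if (is_max or is_min) and abs(cur) >= min_abs:
            peaks.append((wavelengths[i], cur))
    return peaks


# Toy spectrum (made-up numbers, not measured data).
toy_wavelengths = [200, 210, 220, 230, 240, 250, 260]
toy_intensities = [0.0, 1.0, 0.2, -0.8, -0.1, 0.5, 0.0]
toy_peaks = extract_peaks(toy_wavelengths, toy_intensities)
```

Under this reading, the learning target shrinks from one value per sampled wavelength to a short list of peaks, each described by its position and signed intensity, plus the number of peaks.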

The research team proposed an innovative deep learning model called ECDFormer, which decouples continuous spectral sequences into combinations of discrete tokens based on spectral peak attributes. It uses query variables to perform self-attention-based learning of spectral attributes, thereby constructing joint representations between spectral peak structures and molecular functional groups. In the prediction phase, the model first learns a representation of the molecular topological structure. Then, based on the joint spectrum-molecular structure representation space, it independently predicts the number, position, and intensity of spectral peaks. Finally, Gaussian functions are used to broaden the discrete peak attributes into a continuous spectral sequence. This peak-decoupled spectrum prediction scheme significantly improves prediction speed and accuracy, and scales well to multiple spectral tasks.

Figure: Model structure and peak-decoupled ECD spectrum prediction workflow
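The final step described above, broadening discrete peaks into a continuous spectrum with Gaussian functions, can be sketched as follows. This is an illustrative sketch only: the function name `broaden_peaks`, the single shared width `sigma`, and its default value are assumptions, as the article does not give the broadening parameters used in the paper.

```python
import math
from typing import List, Sequence, Tuple


def broaden_peaks(
    peaks: Sequence[Tuple[float, float]],
    grid: Sequence[float],
    sigma: float = 10.0,
) -> List[float]:
    """Sum one Gaussian per (position, intensity) peak on a wavelength grid.

    Assumption (hypothetical choice): every peak shares the same width
    `sigma`, given in the same units as the grid. Each Gaussian is scaled
    so that its height at the peak position equals the signed intensity.
    """
    spectrum: List[float] = []
    for x in grid:
        total = 0.0
        for position, intensity in peaks:
            total += intensity * math.exp(-((x - position) ** 2) / (2.0 * sigma**2))
        spectrum.append(total)
    return spectrum


# One positive and one negative band of equal magnitude (made-up values).
example_grid = [200.0, 210.0, 220.0]
example_spectrum = broaden_peaks([(200.0, 1.0), (220.0, -1.0)], example_grid)
```

Because the two example bands have opposite signs and equal magnitude, their contributions cancel at the midpoint of the grid, which is the kind of sign-dependent shape that distinguishes ECD spectra from purely positive absorption spectra.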

Associate Professor Mo Fanyang, the corresponding author of the paper, proposed the research idea and initiated the study. He co-supervised the entire research project with collaborators (co-corresponding authors) Assistant Professor Yuan Li and Associate Professor Wang Xinchang from Xiamen University. Professor Tian Yonghong, Assistant to the Dean of the Shenzhen Graduate School and Dean of the School of Information Engineering, provided algorithm guidance and computing resource support for the project. Doctoral student Li Hao (a recipient of the AI4S "Dual Mentor" Pilot Program at Peking University Shenzhen Graduate School) and doctoral student Long Da from Xiamen University are co-first authors of the paper.

This work is supported by funds and projects including the National Natural Science Foundation of China, the Xiamen University President’s Fund, and the AI4S Interdisciplinary Research Special Project of Peking University Shenzhen Graduate School.

[Extended Reading]

Nature Computational Science was launched in January 2021 and was officially indexed in SCIE on December 16, 2024. The journal covers key themes in computational science, including but not limited to chemoinformatics, geoinformatics, computational models, materials science, and urban science. Its main goal is to promote interdisciplinary research and cross-disciplinary applications of new computational technologies, focusing on the development and use of computational techniques and mathematical models to solve complex problems in a range of scientific disciplines.
