Applications of Transformers in Computational Chemistry: Recent Progress and Prospects DOI

Rui Wang, Yujin Ji, Youyong Li, et al.

The Journal of Physical Chemistry Letters, Journal Year: 2024, Volume and Issue: unknown, P. 421 - 434

Published: Dec. 31, 2024

The powerful data processing and pattern recognition capabilities of machine learning (ML) technology have provided technical support for innovation in computational chemistry. Compared with traditional ML and deep learning (DL) techniques, transformers possess fine-grained feature-capturing abilities, which enable them to efficiently and accurately model the dependencies of long-sequence data, simulate complex and diverse chemical spaces, and explore the logic behind the data. In this Perspective, we provide an overview of the application of transformer models in computational chemistry. We first introduce the working principle of transformer models and analyze transformer-based architectures in computational chemistry. Next, we explore practical applications of the models in a number of specific scenarios such as property prediction and chemical structure generation. Finally, based on these applications and research results, we provide an outlook for research in this field in the future.

Language: English

Real-World Applications and Experiences of AI/ML Deployment for Drug Discovery DOI Creative Commons
William R. Pitt, Jonathan Bentley, Christophe Boldron, et al.

Journal of Medicinal Chemistry, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 8, 2025

Editorial, January 8, 2025. Journal of Medicinal Chemistry, ASAP Article.

Authors: Will R. Pitt (corresponding author; https://orcid.org/0000-0001-8164-4550), Jonathan Bentley, Christophe Boldron, Lionel Colliandre, Carmen Esposito, Elizabeth H. Frush (https://orcid.org/0000-0003-3611-132X), Jola Kopec, Stéphanie Labouille, Jerome Meneyrol, David A. Pardoe (https://orcid.org/0009-0005-0807-2994), Ferruccio Palazzesi, Alfonso Pozzan, Jacob M. Remington, René Rex, Michelle Southey, Sachin Vishwakarma, Paul Walker.

Affiliations: Molecular Architects, Evotec Ltd, Dorothy Crowfoot Hodgkin Campus, 114 Innovation Drive, Milton Park, Abingdon, Oxfordshire OX14 4RZ, U.K.; Campus Curie, 195, route d'Espagne, Toulouse 31095, France; Aptuit Srl, Via Alessandro Fleming, 4, 37135 Verona, Italy; 303B College Road East, Princeton, New Jersey 08540, United States; Evotec International GmbH, Marie-Curie-Str. 7, Göttingen D-37079, Germany; Cyprotex Discovery, No. 24 Mereside, Alderley Macclesfield, Cheshire SK10 4TG.

Cite this: J. Med. Chem. 2025, XXXX, XXX, XXX-XXX. https://doi.org/10.1021/acs.jmedchem.4c03044. Received 11 December 2024. Published online 8 January 2025. Published by American Chemical Society.

Subjects: Bioinformatics computational biology, Drug discovery, Medicinal chemistry, Optimization, Structure activity relationship.

The emergence artificial intelligence (AI) machine learning (ML) in field drug discovery has been propelled significant advances computer science, infrastructure, surge "big data". There also an expectation that AI-related progress other fields, such as virtual assistants, image generation, autonomous vehicles, protein structure prediction, can be replicated elsewhere. continuous desire bring novel treatments market driven companies, including large pharmaceutical firms, biotechs, contract research organizations (CROs), deploy technology both strengthen accelerate pipelines. These companies face decision whether build or buy, either invest internal staff infrastructure establish in-house capabilities collaborate with AI-enabled companies. (1) It noteworthy ML medicinal chemistry began more than 40 years ago. (2) However, recent field, particularly rise deep learning, methods now impacting every stage process, early target identification, hit finding lead optimization. Examples include screening (VS) ultralarge chemical databases models predict potency relevant end points, well generative design algorithms molecular structures scratch.
In paper we will present our perspective a CRO involved (and development) partnerships. Given competitive landscape, ours need stay abreast technological advancements because potential partners seek advantage integrating tools into their projects guide generation exploitation high-quality experimental data. For us, commitment do crucial ensure comprehensive robust process.

However, accurate prediction data remains challenging due intrinsic complexity biological systems, availability quality training data, limited ability descriptors fully capture nature interactions. cultural challenges adoption AI. (3) inherent biases decision-making within documented. (4,5) Such hinder prevent integration technologies they implicitly challenge well-established working practices. situation further complicated often-exaggerated claims regarding effectiveness impact accelerating process. premature draw definitive conclusions not yet witnessed introduction treatment developed solely methods. (6)

In experience, blending approaches, technologies, human experience produces best outcomes. enhance was consistent company's ethos innovation. Building own provides cost-efficient opportunity evaluate and/or develop most appropriate foster talent development. Our organization covers whole process clinical trial support, broad range therapeutic modalities. work many ways, aiding antibody (7) small molecule targeted degrader design. focus on identification late optimization.

AI/ML Methodologies Applications

Briefly summarized below others' experiences applications currently have greatest work.

Machine Representations Space

Using represent space major development informatics. Compounds represented vectors, generated neural networks compound databases. representations termed latent derived mathematically set encapsulate its essential features. A given vector (position space) decoded structure, which great benefit over older like fingerprints.
enables rapid compounds interest new regions. instance, interpolation between vectors allows exploration intermediate structures, way move patentable space.

One pioneering examples Continuous Data-Driven Descriptors (CDDD), (8) used extensively generating designs (see ways Generative Design below). CDDD, autoencoder (AE), simultaneously trained SMILES (9) constrained properties (e.g., polar surface area lipophilicity) push chemically physically similar molecules subspaces. predisposes transfer (TL), i.e. changing task pretrained model adding new, project-specific thereby focusing objectives properties. (10,11) linkage similarity calculated provided AE architecture another fingerprints.

We AE-based Seq2Seq (12) models, utilizing recurrent (RNNs) (13) transformer architectures. (14−17) By sets curated in-house, achieved improved performance flexibility downstream tasks. improvements coverage weight greater 600 Da, necessary some projects. They extraction features quantitative structure–activity relationship (QSAR) building. Combining QSAR (DGC) same space, employ optimization Bayesian (BO) (18,19) particle swarm (PSO) (20) perform inverse (21)/inverse means generate optimized against predictions.

The critical, it directly impacts reliability accuracy subsequent applications. We validate representation based DGC validity, novelty, drug-likeness, along metrics quantifying smoothness objective functions. (22) Together, validations allow scientists make informed decisions confidence.

Machine Learning (ML)

In section, briefly how absorption, distribution, metabolism, excretion, toxicity (ADMET) points (23) physicochemical; approaches commonly referred structure–property (QSPR) modeling, respectively.

The predictive depends standardized assays, carefully remove unreliable inconsistent measurements. assays logD, aqueous solubility, Caco2 permeability, microsomal clearance, hERG channel inhibition.
Specific curation processes implemented regression (continuous predictions) classification tasks (discrete predictions), ensuring only used. To streamline activities facilitate regular updating automated workflow encompasses preparation, calculation, selection, hyperparameter optimization, delivery. ML-generated predictions finally interpreted using explainability techniques, estimate contribution input decision. (24)

In years, application techniques QSAR/QSPR modeling shown promise. Graph Neural Networks (GNN) particular, outperform traditional Random Forest (RF) certain points. (25,26) typically spanning few hundred ten thousand usually models. Nonetheless, GNNs proven useful robustness when larger sets.

Predictive QSPR play pivotal role projects, idea selection prioritization. One context scoring functions tools.

Generative Design

The recently emerged powerful approach chemistry. previous review (27) identified 100 de novo published 2017 2020. Since then, explosion topic made hard keep track all articles. find papers often lack real-world perspective, since researchers fortunate enough able synthesize test designs. routinely successfully state-of-the-art 2D 3D then tested. Due time constraints vetting tools, reputable sources.

One tool adopted modified upon feedback REINVENT. (28,29) reinforcement method generates scores positive loop. findings suggest highly connected components drive toward project specific goals. pharmacophore-based matches docking scores, produce desired rapidly alone. (30) iterations, advanced ADMET standard improve compounds. agreement authors, (31) cannot simplified simple button-clicking exercise.

Postprocessing results obtained any crucial, three main reasons. First, posteriori, cost. relative binding energies (RBFE) (32) fragment orbital (FMO) interaction energies. (33) Second, always optimize multiple simultaneously, reason, them must sequentially during postprocessing stage.
grow ligands pocket enthalpic contributions potency, protein–ligand Finally, evolve time, so importance step, developing pipelines integrate conventional chemistry, AI/ML, physics-based calculations speed up Computational Pipeline

Protein Modeling

An incredibly project. Usually, X-ray crystallography cryogenic electron microscopy (cryo-EM). Until very recently, non-AI were homology proteins available. AlphaFold 2 (AF2), member family predicting AI, demonstrated remarkable predictions. (34) local installation resource iterative construct preparation fitted experimentally density. combined AF2 ProteinMPNN (35) increase stability production yield. transform where possible isolate miniscule amounts protein. AF Multimer (36) protein–protein complexes helps structural biologists obtain initial targets. density refined. Novel modeled FoldDock, (37) optimizes sequence alignments multimer run, producing better score separating acceptable incorrect models.

The AlphaFoldDB (34,38) database DeepMind hosted EBI, Multimer, tremendous resources aspects ligandability estimation VS docking. aim targets complex interest. When possible, classical presence known ligand side chains site suitable conformation docking.

Recent enabled ligand-protein complexes. Methods RoseTTAFold-AllAtom, (39) Umol, (40) AF3 (41) claim details proteins' interactions ligands, metal ions, nucleic acids, covalent binders precision surpassing established watch developments

Active Learning

Medicinal operates especially true hit-to-lead phase Where thin ground expensive generate, active (AL) purpose sufficient efficient manner. precise, AL ML-based strategy aims maximize respect (objective function) minimal algorithm iteratively selects predefined pool unlabeled items (in case ideas) according so-called acquisition function, balances (selection promising, current knowledge) less unknown regions model's overall knowledge).
(42) Analogously, BO seeks identify next defined parameter optimum objective, could multiparameter (MPO) score. MPO contain primary assay lipophilicity, metabolic stability, permeability measurements follow fewer off-target enzymes, receptors, transporters, depending requirements. informative vast space. (43,44) enable make-on-demand libraries Enamine REAL (45) reduce number needed reach goals.

Traditional structure-based ligand-based too computationally time-consuming brute-force billions (46) Additionally, costs function. solution, built open source MolPal, combines dynamics (MD)-based highest performing compounds.

The Design-Make-Test-Analyze (DMTA) cycle configured explores (47) way, assisting selecting experimentally. should ultimately reduction cycles. form, ranks list coming chemists' ideas. While limits exploratory capacity, acceptance proposed solutions designers search manageable size. proposes machine-based above). structures. mindset team avoid unwanted bias. easy synthesize. (8,48,49) Feedback highlight synergistic opportunities improvement, e.g., flagging single, outlier result arising synthetic improvements.

Synthetic Tractability Retrosynthesis Prediction

The synthesis compounds, "Make" phase, rate-limiting step DMTA cycle. (50) Therefore, tractability key aspect "Design" phase. applies AI-generated alike. Currently explicitly encode criterion growing one exciting domain invention AI computer-aided planning (CASP) (51) filtering full blown retrosynthesis analysis faster (52) chemists mind at least mental difficulty involved. reached sophistication efficiency sharing expertise knowledge daily, example reactivity building blocks intermediates. addition electronic laboratory notebooks (ELNs) block inventories, does (53,54) increasingly being chemists, scaffold-hopping, inspiration, easier routes. As areas, output disappointing first impression, if parity users' expected.
(50,54) commercial CASP via web interface inspiration cross-check planning; quick links background literature useful. Evaluation expensive, proved difficult perhaps had unrealistic expectations performance. workflows, utility, but apply manual assessment last steps.

Safety Assessment

In tractability, safety risks considered. concern programs. Often, become apparent after deployed. Hence, flag earlier cheaply, receiving considerable attention. (55) Pure probability Drug-Induced Liver Injury (DILI), carbon atoms sp3 hybridization. (56) desirable aid prior synthesizing potentially reducing associated assays. tend rule-based supervised algorithms. (56,57) performance, beneficial incorporate vitro bile salt export pump (BSEP) transporter inhibition cellular cytotoxicity data) sophisticated (58)

In contrast individual cover toxicity, omics provide snapshot state response exposure. Fortunately, high-throughput creation size train (59) patterns profiles adverse outcomes resulting organ toxicity. Once trained, risk high accuracy, outperforming existing (60) Moreover, works equally modalities biologics. create model, utilized transcriptomics platform (ScreenSeq) cell hundreds well-characterized different types serve reference

Pipelines

The advent methods, increased prioritize numbers done applying alongside simpler property scores. Each scored criteria (drug-likeness, predicted attributes, properties, etc.), aggregated ad-hoc, correctly parametrized, rank promising round synthesis. technical deploying pipeline orchestration tasks, diversity good orchestrator needs file formats, handle environments, manage efficiently, scale-up jobs robust. Because evolving rapidly, designed makes add change deployed on.

Automation save resources, while encoding practices improving reproducibility, facilitates objectivity BRADSHAW (61) machine-generated ideas processed thus trying selection. several (62−64) (65−67) platforms automating mind. much influenced Green et al.
Besnard automate workflows wherever Knime high-performance computing (HPC) pipelining solution. cited authors integration, robustness, simplicity, flexibility. adapted ever-changing reusable, parts,

Context Chemistry Projects

The increasing lower HPC costs, led pharma explore working. (61,68) At Evotec, group R&D isRD) responsible adapting cutting-edge stack, operational (Molecular Architects MAs), who collaboration partners. concept MAs (illustrated Figure 1) fuse expertise, foundation science consider facilitator establishing trust, attaining ambitious goals, expediting candidates. (i) right used, irrespective origin not, (ii) clean understood, (iii) clear met, (iv) bespoke created hypothesis minimum compounds.

Figure 1. Secret sauce excellence Evotec.

The D2MTL (Design-Decide-Make-Test-Learn) introduced evolution well-establish
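The editorial text above mentions interpolating between latent vectors to explore intermediate structures. A minimal sketch of that idea, using plain linear interpolation on two made-up vectors, is shown below. The function name `interpolate_latent` and the example vectors are illustrative assumptions; the trained decoder that would turn each intermediate vector back into a structure is not part of the sketch, and the authors' own procedure may differ.

```python
from typing import List, Sequence


def interpolate_latent(
    z_start: Sequence[float], z_end: Sequence[float], n_steps: int
) -> List[List[float]]:
    """Return n_steps points on the straight line from z_start to z_end (inclusive).

    Hypothetical helper: each returned vector would be passed to a trained
    decoder (not shown here) to obtain a candidate structure.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    points = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # t runs from 0.0 (start) to 1.0 (end)
        points.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return points


if __name__ == "__main__":
    # Made-up 3-dimensional latent vectors standing in for two encoded compounds.
    z_a = [0.0, 1.0, -2.0]
    z_b = [1.0, 3.0, 2.0]
    for z in interpolate_latent(z_a, z_b, 5):
        print(z)
```

Linear interpolation is only one option; spherical interpolation is sometimes preferred when the latent prior is Gaussian, and which works better depends on the trained model.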

Language: English

Citations: 2
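The active learning loop described in the editorial above relies on an acquisition function that balances exploitation (picking items the model currently scores highest) against exploration (picking items the model is unsure about). One common choice, used here purely as an illustration, is an upper confidence bound (UCB): predicted mean plus `beta` times predicted uncertainty. The names `ucb_scores`, `select_batch` and `beta`, and the example numbers, are assumptions for this sketch and are not taken from the editorial or from any specific tool it cites.

```python
from typing import List, Sequence


def ucb_scores(
    means: Sequence[float], stds: Sequence[float], beta: float
) -> List[float]:
    """Upper confidence bound: predicted mean plus beta times predicted uncertainty."""
    if len(means) != len(stds):
        raise ValueError("means and stds must have the same length")
    return [m + beta * s for m, s in zip(means, stds)]


def select_batch(
    means: Sequence[float], stds: Sequence[float], beta: float, batch_size: int
) -> List[int]:
    """Return indices of the batch_size pool items with the highest UCB score."""
    scores = ucb_scores(means, stds, beta)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


if __name__ == "__main__":
    # Made-up model predictions for a pool of three unlabeled compound ideas.
    predicted_mean = [0.9, 0.5, 0.7]
    predicted_std = [0.0, 0.5, 0.05]
    print(select_batch(predicted_mean, predicted_std, beta=0.0, batch_size=2))
    print(select_batch(predicted_mean, predicted_std, beta=2.0, batch_size=1))
```

With `beta = 0` the selection is purely exploitative; larger values of `beta` shift the choice toward items with high predicted uncertainty. In a full loop the selected items would be tested, the model retrained, and the selection repeated.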

A Review of Large Language Models and Autonomous Agents in Chemistry DOI Creative Commons
Mayk Caldas Ramos, Christopher J. Collison, Andrew Dickson White, et al.

Chemical Science, Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 9, 2024

Large language models (LLMs) have emerged as powerful tools in chemistry, significantly impacting molecule design, property prediction, and synthesis optimization. This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning. As agents are an emerging topic, we extend the scope of our review of agents beyond chemistry and discuss agents across any scientific domains. This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry. Key challenges include data quality and integration, model interpretability, and the need for standard benchmarks, while future directions point towards more sophisticated multi-modal agents and enhanced collaboration between agents and experimental methods. Due to the quick pace of this field, a repository has been built to keep track of the latest studies: https://github.com/ur-whitelab/LLMs-in-science.

Language: English

Citations: 13

AiZynthFinder 4.0: developments based on learnings from 3 years of industrial application DOI Creative Commons

Lakshidaa Saigiridharan, Alan Kai Hassen, Helen Lai, et al.

Journal of Cheminformatics, Journal Year: 2024, Volume and Issue: 16(1)

Published: May 23, 2024

We present an updated overview of the AiZynthFinder package for retrosynthesis planning. Since the first version was released in 2020, we have added a substantial number of new features based on user feedback. Feature enhancements include policies to filter reactions, support for any one-step retrosynthesis model, a scoring framework and several additional search algorithms. To exemplify the typical use-cases of the software and highlight some learnings, we perform a large-scale analysis on several hundred thousand target molecules from diverse sources. This analysis looks at, for instance, route shape, stock usage and exploitation of reaction space, and points out strengths and weaknesses of our retrosynthesis approach. The software is released as open-source for educational purposes as well as to provide a reference implementation of the core algorithms for synthesis prediction. We hope that releasing the software as open-source will further facilitate innovation in developing novel methods for synthetic route prediction. AiZynthFinder is a fast, robust and extensible open-source software and can be downloaded from https://github.com/MolecularAI/aizynthfinder.

Language: English

Citations: 11

Investigations into the Efficiency of Computer-Aided Synthesis Planning DOI Creative Commons
Peter B. R. Hartog, Annie M. Westerlund, Igor V. Tetko, et al.

Journal of Chemical Information and Modeling, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 31, 2025

The efficiency of machine learning (ML) models is crucial to minimize inference times and reduce the carbon footprints of models deployed in production environments. Current models employed in retrosynthesis to generate a synthesis route from a target molecule to purchasable compounds are prohibitively slow. The model operates in a single-step fashion in a tree search algorithm by predicting reactant molecules given a product molecule as input. In this study, we investigate the ability of alternative transformer architectures, knowledge distillation (KD), and simple hyper-parameter optimization to decrease inference times of the Chemformer model. Initially, we assess closely related transformer architectures and conclude that these models under-performed when using KD. Additionally, we investigate the effects of feature-based and response-based KD together with hyper-parameters optimized based on inference sample time and model accuracy. We find that although reducing model size and improving single-step speed are important, our results indicate that multi-step search efficiency is more significantly influenced by the diversity and confidence of the single-step models. Based on this work, further research should use KD in combination with other techniques, as inference speed continues to prevent proper integration of synthesis planning. However, in Monte Carlo-based (MC) multi-step retrosynthesis, other factors play a crucial role in balancing exploration and exploitation during the search process, often outweighing the direct impact of single-step model speed and carbon footprints.

Language: English

Citations: 1
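The abstract above refers to response-based knowledge distillation (KD), in which a smaller student model is trained to match the output distribution of a larger teacher. A minimal sketch of one widely used form of that loss is given below: the Kullback-Leibler divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names, the default temperature and the example logits are assumptions for illustration; the exact loss used in the study for the Chemformer model may differ (a sequence model would typically apply it per output token).

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on temperature-softened outputs, times temperature**2."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must score the same classes")
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


if __name__ == "__main__":
    # Made-up logits over three output classes.
    teacher = [1.0, 2.0, 3.0]
    student = [3.0, 2.0, 1.0]
    print(distillation_loss(student, teacher))
    print(distillation_loss(teacher, teacher))
```

In practice this term is usually combined with the ordinary cross-entropy on the ground-truth labels, with a weighting factor chosen by validation.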

Application of Transformers to Chemical Synthesis DOI Creative Commons

Dong Jin, Yuan Liang, Zihao Xiong, et al.

Molecules, Journal Year: 2025, Volume and Issue: 30(3), P. 493 - 493

Published: Jan. 23, 2025

Efficient chemical synthesis is critical for the production of organic chemicals, particularly in the pharmaceutical industry. Leveraging machine learning to predict chemical synthesis and improve development efficiency has become a significant research focus in modern chemistry. Among various machine learning models, the Transformer, a leading model in natural language processing, has revolutionized numerous fields due to its powerful feature-extraction and representation-learning capabilities. Recent applications have demonstrated that Transformer models can also significantly enhance performance in chemical synthesis tasks, particularly in reaction prediction and retrosynthetic planning. This article provides a comprehensive review of the applications and innovations of Transformer models in the qualitative prediction tasks of chemical synthesis, with a focus on technical approaches, performance advantages, and the challenges associated with applying the Transformer architecture to chemical reactions. Furthermore, we discuss future directions for improving the applications of Transformer models in chemical synthesis.

Language: English

Citations: 0

Diverse and Feasible Retrosynthesis using GFlowNets DOI

Piotr Gaiński, Michał Koziarski, Krzysztof Maziarz, et al.

Information Sciences, Journal Year: 2025, Volume and Issue: unknown, P. 122194 - 122194

Published: April 1, 2025

Language: English

Citations: 0
