
Published: May 6, 2024
The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling or to the development of structure–property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds to their respective applications. Leveraging these NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the substructures within each dataset reveals common motifs for each designated application. We then use these structures to augment our initial datasets for each application, yielding a total of 21,631 compounds in tmCAT, 4,599 in tmPHOTO, 2,782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate more targeted screening and the development of refined structure–property relationships with machine learning.
Language: English