Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes
Ilia Kevlishvili, Roland St. Michel, Aaron Garrison, et al.

Published: May 6, 2024

The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure–property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing substructures within each dataset reveals common motifs within the designated applications. We then use these structures to augment our initial datasets for each application, yielding a total of 21,631 compounds in tmCAT, 4,599 in tmPHOTO, 2,782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate more targeted screening and the development of refined structure–property relationships with machine learning.

Language: English

Citations

1