
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown
Published: May 14, 2025
Abstract We show that recent (mid-to-late 2024) commercial large language models (LLMs) are capable of good quality metadata extraction and annotation with very little work on the part of investigators for several exemplar real-world tasks in the neuroimaging literature. We investigated the GPT-4o LLM from OpenAI, which performed comparably to groups of specially trained and supervised human annotators. The LLM achieves similar performance to humans, between 0.91 and 0.97, on zero-shot prompts without feedback to the LLM. Reviewing the disagreements with the gold standard annotations, we note that actual errors are comparable in most cases, and that in many cases these disagreements are not errors. Based on the specific types tested, with exceptionally reviewed gold-standard correct values, the LLM is usable at scale. We encourage other research groups to develop and make available more specialized “micro-benchmarks,” like the ones we provide here, for testing both LLMs and complex agent systems on these tasks.
Language: English