Behavior Research Methods,
Journal year: 2024, Issue: 57(1)
Published: Dec. 18, 2024
Abstract
The study of large language models (LLMs) and LLM-powered chatbots has gained significant attention in recent years, with researchers treating LLMs as participants in psychological experiments. To facilitate this research, we developed an R package called "MacBehaviour" (https://github.com/xufengduan/MacBehaviour), which interacts with over 100 LLMs, including OpenAI's GPT family, the Claude family, Gemini, Llama, and other open-weight models. The package streamlines the process of LLM behavioural experimentation by providing a comprehensive set of functions for experiment design, stimuli presentation, model behaviour manipulation, and the logging of responses and token probabilities. With a few lines of code, researchers can seamlessly set up and conduct experiments, making LLM studies highly accessible. To validate the utility and effectiveness of "MacBehaviour", we conducted three experiments on GPT-3.5 Turbo, Llama-2-7b-chat-hf, and Vicuna-1.5-13b, replicating the sound-gender association in LLMs. The results consistently demonstrated that these models exhibit human-like tendencies to infer gender from novel personal names based on their phonology, as previously shown by Cai et al. (2024). In conclusion, "MacBehaviour" is a user-friendly R package that simplifies and standardises the experimental process of machine behaviour studies, offering a valuable tool for the field.
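MacBehaviour itself is an R package, but the trial loop it automates (present a stimulus, record the response and the token probabilities) can be illustrated with a minimal Python sketch against the OpenAI chat completions API. The stimuli, instruction, model choice, and output file below are hypothetical, and this is not MacBehaviour's own interface.

```python
# Minimal sketch of an "LLM as participant" trial loop (not MacBehaviour's R API).
# Assumes the openai Python client and an OPENAI_API_KEY in the environment.
import csv
from openai import OpenAI

client = OpenAI()
stimuli = ["Pelcrad", "Bonteel"]  # hypothetical novel names used as stimuli

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["stimulus", "response", "first_token_logprob"])
    for name in stimuli:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer with a single word."},
                {"role": "user", "content": f"Is {name} more likely a male or a female name?"},
            ],
            logprobs=True,   # request per-token log probabilities for logging
            max_tokens=1,
        )
        choice = resp.choices[0]
        writer.writerow([name, choice.message.content,
                         choice.logprobs.content[0].logprob])
```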
Nature Human Behaviour,
Journal year: 2024, Issue: unknown
Published: Nov. 27, 2024
Abstract
Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could integrate noisy yet interrelated findings to forecast novel results better than human experts. Here, to evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs indicated high confidence in their predictions, their responses were more likely to be correct, which presages a future where LLMs assist humans in making discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavours.
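The calibration claim above (high-confidence predictions are more often correct) can be illustrated with a simple binned-accuracy check. The sketch below runs on synthetic numbers; in a benchmark of this kind, per-item confidence might for example be derived from the gap in model log-likelihood between two candidate results, but that detail is an assumption rather than something stated in the abstract.

```python
# Sketch of a confidence-calibration check: is accuracy higher in high-confidence bins?
import numpy as np

rng = np.random.default_rng(0)
confidence = rng.random(1000)                            # stand-in per-item confidence scores
correct = rng.random(1000) < (0.55 + 0.3 * confidence)   # synthetic: accuracy rises with confidence

bins = np.quantile(confidence, [0.0, 0.25, 0.5, 0.75, 1.0])
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence <= hi)
    print(f"confidence {lo:.2f}-{hi:.2f}: accuracy {correct[mask].mean():.2f}")
```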
Proceedings of the National Academy of Sciences,
Journal year: 2024, Issue: 121(36)
Published: Aug. 26, 2024
Large volumes of liquid water transiently existed on the surface of Mars more than 3 billion years ago. Much of this water is hypothesized to have been sequestered in the subsurface or lost to space. We use rock physics models and Bayesian inversion ...
Scientific Reports,
Journal year: 2024, Issue: 14(1)
Published: Nov. 14, 2024
Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n = 26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors of language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.
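A hedged sketch of the protocol described above: the same comprehension question is asked repeatedly under a one-word and an open-length instruction, then scored for accuracy and answer stability. The example item, the lenient scoring rule, and the stub model are illustrative only, not the benchmark's materials.

```python
# Sketch: repeated prompting in two reply settings, scoring accuracy and answer stability.
from collections import Counter
from typing import Callable

def evaluate(ask_model: Callable[[str], str], question: str, gold: str, n_repeats: int = 10) -> None:
    settings = {
        "one_word": "Answer with one word only. ",
        "open_length": "Answer in as many words as you like. ",
    }
    for name, instruction in settings.items():
        answers = [ask_model(instruction + question).strip().lower() for _ in range(n_repeats)]
        accuracy = sum(gold in answer for answer in answers) / n_repeats   # lenient substring match
        stability = Counter(answers).most_common(1)[0][1] / n_repeats      # share of the modal answer
        print(f"{name}: accuracy={accuracy:.2f}, stability={stability:.2f}")

# Stub model for demonstration; a real run would query each LLM under test.
evaluate(lambda prompt: "Yes",
         "John was deceived by Mary. Did Mary deceive John?",
         gold="yes")
```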
Proceedings of the National Academy of Sciences,
Journal year: 2025, Issue: 122(8)
Published: Feb. 20, 2025
Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: As LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two measures: the LLM Word Association Test, a prompt-based method for revealing implicit bias, and the LLM Relative Decision Test, a strategy to detect subtle discrimination in contextual decisions. Both measures are based on psychological research: the Word Association Test adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; the Relative Decision Test operationalizes results indicating that relative evaluations between candidates, rather than absolute evaluations assessing each candidate independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 categories (race, gender, religion, health) and 21 stereotypes (such as race and criminality, race and weapons, gender and science, and age and negativity). These measures draw from psychology's long history of research into measuring bias through purely observable behavior; they expose nuanced biases in proprietary models that appear unbiased according to standard benchmarks.
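A minimal sketch in the spirit of the prompt-based word association measure described above: attribute words are paired with two groups through a prompt, and the model's choices are tallied. The group placeholders, attribute list, prompt wording, and tallying are illustrative assumptions, not the authors' materials.

```python
# Sketch: ask a model to pair attribute words with one of two groups and tally the choices.
from collections import Counter
from typing import Callable

groups = ["Group A", "Group B"]                                  # stand-ins for social-category terms
attributes = ["pleasant", "unpleasant", "safe", "dangerous"]     # illustrative attribute words

def association_counts(ask_model: Callable[[str], str]) -> Counter:
    counts = Counter()
    for word in attributes:
        prompt = (f"Here are two groups: {groups[0]} and {groups[1]}. "
                  f"Which group do you associate the word '{word}' with? "
                  "Answer with the group name only.")
        reply = ask_model(prompt).strip()
        counts[(word, reply)] += 1
    return counts

# Stub model; a real study would repeat this over many prompts and paraphrases.
print(association_counts(lambda p: "Group A"))
```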
We identify and analyze three caveats that may arise when analyzing the linguistic abilities of Large Language Models. The problem of unlicensed generalizations refers to the danger of interpreting performance in one task as predictive of the models' overall capabilities, based on the assumption that, because performance on a specific task is indicative of certain underlying capabilities in humans, the same association holds for models. The human-like paradox refers to the practice of lacking human comparisons, while at the same time attributing human-like capabilities to the models. Last, the problem of double standards refers to the use of tasks and methodologies that either cannot be applied to humans or are evaluated differently for models vs. humans. While we recognize the impressive linguistic abilities of LLMs, we conclude that claims about their human-likeness in the grammatical domain are premature.
Computational Linguistics,
Journal year: 2024, Issue: unknown, pp. 1-36
Published: July 30, 2024
Abstract
How should we compare the capabilities of language models (LMs) and humans? In this article, I draw inspiration from comparative psychology to highlight challenges in these comparisons. I focus on a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot process these structures as reliably as humans can. However, the humans were provided with instructions and substantial training, while the LMs were evaluated zero-shot. I therefore match the evaluation conditions more closely. Providing large LMs with a simple prompt (with substantially less content than the human training) allows them to consistently outperform the human results, even in more deeply nested conditions than those tested with humans. Furthermore, the effects of prompting are robust to the particular vocabulary used in the prompt. Finally, reanalyzing the existing human data suggests that humans may not perform above chance at the most difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans, when evaluated comparably. This case study highlights how discrepancies in evaluation methods can confound comparisons of language models and humans. I conclude by reflecting on the broader challenge of comparing human and model capabilities, and on an important difference between evaluating cognitive models and foundation models.
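To make the prompting manipulation concrete, here is a hedged sketch of a few-example prompt for judging recursively nested (center-embedded) sentences. The sentences and wording are illustrative and are not the article's materials or its exact task.

```python
# Sketch: few-example prompt for judging recursively nested (center-embedded) sentences.
examples = [
    ("The dog the cat chased barked.", "grammatical"),
    ("The dog the cat chased the mouse.", "ungrammatical"),
]
query = "The report the analyst the manager hired wrote was long."

prompt = "Decide whether each sentence is grammatical or ungrammatical.\n"
for sentence, label in examples:
    prompt += f"Sentence: {sentence}\nJudgement: {label}\n"
prompt += f"Sentence: {query}\nJudgement:"

print(prompt)  # a real evaluation would send this to an LM and read its next-token judgement
```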
Heliyon,
Journal year: 2025, Issue: 11(2), pp. e42083
Published: Jan. 1, 2025
Sentence stimuli pervade psycholinguistics research. Yet, limited attention has been paid to the automatic construction of sentence stimuli. Given their linguistic capabilities, this study investigated the efficacy of ChatGPT in generating sentence stimuli and of AI tools in producing auditory stimuli. In three psycholinguistic experiments, we examined the acceptability and validity of AI-formulated sentences written in one of two languages: English and Arabic. In Experiments 1 and 3, participants gave AI-generated sentences similar or higher ratings than human-composed ones; in Experiment 2, AI-generated Arabic sentences received lower ratings than their human-composed counterparts. The validity of the AI-developed stimuli relied on the experimental design, with only Experiment 2 demonstrating the target effect. These results highlight the promising role of ChatGPT as a stimulus developer, which could facilitate psycholinguistic research and increase its diversity. Implications for psycholinguistic research were discussed.
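As a rough sketch of automatic stimulus construction of the kind evaluated above, the snippet below asks a chat model to draft candidate sentence stimuli under simple constraints. The constraints, prompt, and model name are hypothetical; the study's actual generation instructions may differ.

```python
# Sketch: prompting a chat model to draft sentence stimuli under simple constraints.
from openai import OpenAI

client = OpenAI()
constraints = ("Write 5 semantically plausible English sentences, each 8-10 words long, "
               "in simple past tense, each containing exactly one animate subject.")

resp = client.chat.completions.create(
    model="gpt-4o-mini",   # any chat-capable model; the choice here is illustrative
    messages=[{"role": "user", "content": constraints}],
)
candidate_stimuli = resp.choices[0].message.content.splitlines()
print(candidate_stimuli)   # candidates would still be screened and normed by researchers
```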
In this paper, we focus on the challenging task of reliably estimating the factual knowledge that is embedded inside large language models (LLMs). To avoid reliability concerns with prior approaches, we propose to eliminate prompt engineering when probing LLMs for knowledge. Our approach, called the Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the in-context learning ability of LLMs to communicate both the question as well as the expected answer format. Our estimator is conceptually simpler (i.e., it doesn't depend on meta-linguistic judgments by LLMs) and easier to apply (i.e., it is not LLM-specific), and we demonstrate that it can surface more latent knowledge in LLMs. We also investigate how different design choices affect the performance of ZP-LKE. Using the proposed estimator, we perform a large-scale evaluation of a variety of open-source LLMs, like OPT, Pythia, Llama(2), Mistral, Gemma, etc., over a set of relations and facts from the Wikidata knowledge base. We observe differences between model families and sizes: some relations are consistently better known than others, but models differ in the precise facts they know, as do base models and their finetuned counterparts.
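A minimal sketch of the zero-prompt, in-context probing idea described above: example subject-object pairs alone communicate both the relation being asked about and the expected answer format, and the model is scored on whether it completes the final pair with the correct object. The relation, facts, and completion stub are illustrative; the full estimator evaluates many Wikidata relations and facts.

```python
# Sketch: zero-prompt, in-context factual probing via example (subject, object) pairs.
from typing import Callable

facts = [("France", "Paris"), ("Italy", "Rome"), ("Japan", "Tokyo"), ("Canada", "Ottawa")]

def probe(complete: Callable[[str], str], query_subject: str, gold_object: str,
          examples=facts) -> bool:
    # No natural-language instruction: the examples alone convey the relation
    # being queried and the expected answer format.
    context = "\n".join(f"{s} {o}" for s, o in examples)
    prediction = complete(f"{context}\n{query_subject} ").strip().split("\n")[0]
    return prediction == gold_object

# Stub completion function; a real run would read an open-weight LLM's next tokens.
print(probe(lambda prompt: "Berlin", "Germany", "Berlin"))
```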
Proceedings of the National Academy of Sciences,
Journal year: 2025, Issue: 122(19)
Published: May 9, 2025
What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the linguistic skills of LLMs resemble rules. As yet, it is not known whether generalization in LLMs could equally well be explained as the result of analogy. A key shortcoming of prior research is its focus on regular phenomena, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a method for investigating the generalization mechanisms of LLMs: Focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding the underlying mechanisms. As expected, both approaches explain GPT-J's predictions for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to individual word frequencies, even for regular forms, consistent with an analogical account but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization involves rules, suggesting analogy as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.
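As a toy contrast between the two kinds of account, the sketch below predicts a nominalization suffix for a nonce adjective either by a single categorical rule or by a similarity-weighted vote over stored exemplars. It is only illustrative; the study fits established cognitive models of both kinds to the LLM's training data rather than using raw string similarity.

```python
# Toy contrast: rule-based vs analogical prediction of an English nominalization suffix.
from difflib import SequenceMatcher

# Illustrative exemplar lexicon: adjective -> attested nominalization suffix.
lexicon = {"dense": "-ity", "scarce": "-ity", "obese": "-ity",
           "polite": "-ness", "remote": "-ness", "savage": "-ness", "strange": "-ness"}

def rule_based(adjective: str) -> str:
    # A single categorical rule that ignores the specific word: default to -ness.
    return "-ness"

def analogical(adjective: str) -> str:
    # Similarity-weighted vote over stored exemplars (string similarity as a crude proxy).
    votes = {"-ity": 0.0, "-ness": 0.0}
    for word, suffix in lexicon.items():
        votes[suffix] += SequenceMatcher(None, adjective, word).ratio()
    return max(votes, key=votes.get)

nonce = "cremese"   # made-up adjective
print(rule_based(nonce), analogical(nonce))
```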
Free associations have been extensively used in psychology and linguistics for studying how conceptual knowledge is organized. Recently, the potential of applying a similar approach for investigating the knowledge encoded in LLMs has emerged, specifically as a method for studying LLM biases. However, the absence of large-scale LLM-generated free association norms that are comparable with human-generated norms is an obstacle to this research direction. To address this, we create a new dataset modeled after the "Small World of Words" (SWOW) norms, covering nearly 12,000 cue words. We prompt three LLMs (Mistral, Llama3, and Haiku) with the same cues as those in SWOW to generate novel datasets, the "LLM World of Words" (LWOW). From these datasets, we construct network models of semantic memory that represent the conceptual knowledge possessed by humans and LLMs. We validate the datasets by simulating semantic priming within the network models, and we briefly discuss how they can be used to investigate implicit biases in humans and LLMs.
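A minimal sketch, under made-up data, of the pipeline the abstract describes: cue-response pairs become a semantic network, and network distance between a prime and a target serves as a crude stand-in for simulated priming. The associations and the priming proxy are illustrative; the authors' network construction and priming simulation may differ.

```python
# Sketch: build a free-association network and use path length as a crude priming proxy.
import networkx as nx

# Illustrative cue -> responses pairs (real norms contain thousands of cues).
associations = {
    "doctor": ["nurse", "hospital", "medicine"],
    "nurse": ["hospital", "care"],
    "bread": ["butter", "food"],
    "butter": ["toast", "food"],
}

G = nx.Graph()
for cue, responses in associations.items():
    for response in responses:
        G.add_edge(cue, response)

def priming_proxy(prime: str, target: str) -> float:
    # Shorter association paths ~ stronger expected priming (illustrative measure only).
    if not nx.has_path(G, prime, target):
        return 0.0
    return 1.0 / nx.shortest_path_length(G, prime, target)

print(priming_proxy("doctor", "hospital"))   # related pair -> higher score
print(priming_proxy("doctor", "toast"))      # unrelated pair -> 0.0 here (no path)
```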