MacBehaviour: An R package for behavioural experimentation on large language models

Xufeng Duan, Shixuan Li, Zhenguang G. Cai, et al.

Behavior Research Methods, Journal Year: 2024, Volume and Issue: 57(1)

Published: Dec. 18, 2024

Abstract The study of large language models (LLMs) and LLM-powered chatbots has gained significant attention in recent years, with researchers treating LLMs as participants in psychological experiments. To facilitate this research, we developed an R package called "MacBehaviour" (https://github.com/xufengduan/MacBehaviour), which interacts with over 100 LLMs, including OpenAI's GPT family, the Claude family, Gemini, Llama, and other open-weight models. The package streamlines the process of LLM behavioural experimentation by providing a comprehensive set of functions for experiment design, stimuli presentation, model behaviour manipulation, and the logging of responses and token probabilities. With a few lines of code, researchers can seamlessly set up and conduct experiments, making LLM studies highly accessible. To validate the utility and effectiveness of "MacBehaviour," we conducted three experiments on GPT-3.5 Turbo, Llama-2-7b-chat-hf, and Vicuna-1.5-13b, replicating the sound-gender association in LLMs. The results consistently demonstrated that these models exhibit human-like tendencies to infer gender from novel personal names based on their phonology, as previously shown in Cai et al. (2024). In conclusion, "MacBehaviour" is a user-friendly R package that simplifies and standardises the experimental process of machine behaviour studies, offering a valuable tool for the field.
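MacBehaviour itself is an R package, so the snippet below is not its API; it is a minimal Python sketch of the kind of trial loop such tooling automates: presenting each stimulus to a chat model, collecting the response, and logging it. The model choice, prompt wording, and novel-name stimuli are invented for illustration, and the OpenAI client assumes an OPENAI_API_KEY in the environment.

```python
import csv
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical novel-name stimuli, invented for illustration.
stimuli = ["Tidera", "Boldan", "Selina", "Marnok"]

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "response"])
    for name in stimuli:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative model choice
            messages=[{
                "role": "user",
                "content": f"{name} is a person's name. Is it more likely "
                           "a male or a female name? Answer in one word.",
            }],
        )
        # Log one row per trial, as a behavioural experiment would.
        writer.writerow([name, resp.choices[0].message.content.strip()])
```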

Language: English

Turing Jest: Distributional Semantics and One‐Line Jokes
Sean Trott, Drew Walker, Stephan F. Taylor, et al.

Cognitive Science, Journal Year: 2025, Volume and Issue: 49(5)

Published: May 1, 2025

Abstract Humor is an essential aspect of the human experience, yet surprisingly little is known about how we recognize and understand humorous utterances. Most theories of humor emphasize the role of incongruity detection and resolution (e.g., frame‐shifting), as well as cognitive capacities like Theory of Mind and pragmatic reasoning. In multiple preregistered experiments, we ask whether and to what extent exposure to purely linguistic input can account for the ability to understand one‐line jokes and identify their entailments. We find that GPT‐3, a large language model (LLM) trained on linguistic data only, exhibits above‐chance performance in tasks designed to test its ability to detect, appreciate, and comprehend jokes. In exploratory work, we also test joke comprehension in several open‐source LLMs, such as Llama‐3 and Mixtral. Although all LLMs tested fall short of human performance, both humans and LLMs show a tendency to misclassify nonjokes with surprising endings as jokes. Results suggest that LLMs are remarkably adept at some tasks involving jokes, but they also reveal key limitations of distributional approaches to meaning.

Language: English

Citations: 0

High variability in LLMs’ analogical reasoning
Andrea Gregor de Varda, Chiara Saponaro, Marco Marelli, et al.

Nature Human Behaviour, Journal Year: 2025, Volume and Issue: unknown

Published: June 4, 2025

Language: English

Citations: 0

Multimodality and Attention Increase Alignment in Natural Language Prediction Between Humans and Computational Models
Viktor Kewenig, Andrew K. Lampinen, Samuel A. Nastase, et al.

Research Square (Research Square), Journal Year: 2024, Volume and Issue: unknown

Published: April 3, 2024

Abstract The potential of multimodal generative artificial intelligence (mAI) to replicate human grounded language understanding, including the pragmatic, context-rich aspects of communication, remains to be clarified. Humans are known to use salient multimodal features, such as visual cues, to facilitate the processing of upcoming words. Correspondingly, multimodal computational models can integrate visual and linguistic data using a visual attention mechanism to assign next-word probabilities. To test whether these processes align, we tasked both human participants (N = 200) as well as several state-of-the-art multimodal models with evaluating the predictability of forthcoming words after viewing short audio-only or audio-visual clips with speech. During the task, the model’s attention weights were recorded and human attention was indexed via eye tracking. Results show that predictability estimates from humans aligned more closely with scores generated by multimodal models vs. their unimodal counterparts. Furthermore, including an attention mechanism doubled alignment with human judgments when the visual and linguistic context facilitated predictions. In these cases, the model’s attention patches and human eye tracking significantly overlapped. Our results indicate that improved modeling of naturalistic language processing in mAI does not merely depend on the training diet but can be driven by multimodality in combination with attention-based architectures. Humans and computational models alike can leverage the predictive constraints of multimodal information by attending to relevant features in the input.

Language: English

Citations: 3

WITHDRAWN: Prompt Engineering GPT-4 to Answer Patient Inquiries: A Real-Time Implementation in the Electronic Health Record across Provider Clinics
Majid Afshar, Yanjun Gao, Graham Wills, et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Jan. 24, 2024

Withdrawal Statement The authors have withdrawn their manuscript owing to the need for additional internal review. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.

Language: English

Citations: 2

Exceptions, Instantiations, and Overgeneralization: Insights into How Language Models Process Generics

Emily Allaway, Chandra Bhagavatula, Jena D. Hwang, et al.

Computational Linguistics, Journal Year: 2024, Volume and Issue: unknown, P. 1291 - 1355

Published: July 30, 2024

Abstract Large language models (LLMs) have garnered a great deal of attention for their exceptional generative performance on commonsense and reasoning tasks. In this work, we investigate LLMs’ capabilities for generalization using a particularly challenging type of statement: generics. Generics express generalizations (e.g., birds can fly) but do so without explicit quantification. They are notable because they generalize over their instantiations (e.g., sparrows can fly) yet hold true even in the presence of exceptions (e.g., penguins do not). For humans, these generic generalizations play a fundamental role in cognition, concept acquisition, and intuitive reasoning. We investigate how LLMs respond to and reason about generics. To this end, we first propose a framework grounded in pragmatics to automatically generate both exceptions and instantiations – collectively, exemplars. We make use of focus—a pragmatic phenomenon that highlights the meaning-bearing elements of a sentence—to capture the full range of interpretations of generics across different contexts of use. This allows us to derive precise logical definitions for exemplars and to operationalize them to generate exemplars from LLMs. Using our system, we generate a dataset of ∼370k exemplars for ∼17k generics, and we conduct a human validation on a sample of the generated data to arrive at our final dataset. Humans have a documented tendency to conflate universally quantified statements (e.g., all birds can fly) with generics. Therefore, we probe whether LLMs exhibit a similar overgeneralization behavior in terms of quantification and property inheritance. We find that LLMs do show evidence of overgeneralization, although they sometimes struggle to reason about exceptions. Furthermore, LLMs may exhibit similar non-logical behavior to humans when considering property inheritance from generics.

Language: English

Citations: 1

Reply to Hu et al.: Applying different evaluation standards to humans vs. Large Language Models overestimates AI performance
Evelina Leivada, Fritz Günther, Vittoria Dentella, et al.

Proceedings of the National Academy of Sciences, Journal Year: 2024, Volume and Issue: 121(36)

Published: Aug. 26, 2024


Language: English

Citations: 1

Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?
Vagrant Gautam, Eileen Bingert, Dawei Zhu, et al.

Transactions of the Association for Computational Linguistics, Journal Year: 2024, Volume and Issue: 12, P. 1755 - 1779

Published: Jan. 1, 2024

Abstract Robust, faithful, and harm-free pronoun use for individuals is an important goal for language model development as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully designed dataset of over 5 million instances for measuring robust pronoun fidelity in English, and we evaluate 37 model variants from nine popular model families, across architectures (encoder-only, decoder-only, and encoder-decoder) and scales (11M–70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she/her/her, singular they, and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one sentence with a distractor pronoun causes accuracy to drop by 34 percentage points on average. Our results show that pronoun fidelity is not robust, even in a simple, naturalistic setting where humans achieve nearly 100% accuracy. We encourage researchers to bridge the gaps we find, and to distinguish genuine reasoning from settings where superficial repetition might inflate perceptions of performance.
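One way such a fidelity probe can be scored with a causal LM is to compare next-token probabilities for candidate pronouns after a context that introduces an entity with one pronoun and a distractor with another. The sketch below is an illustration under invented assumptions (GPT-2 as the model, a hand-written context), not RUFF's actual templates or evaluation protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Invented context: the engineer was introduced with "she",
# the manager is a distractor introduced with "he".
context = ("The engineer finished her report while the manager read his email. "
           "Later, the engineer said that")
candidates = [" she", " he", " they"]  # each is a single GPT-2 token

with torch.no_grad():
    ids = tok(context, return_tensors="pt").input_ids
    logprobs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)

# A faithful model should rank " she" highest here.
for cand in candidates:
    cand_id = tok(cand, add_special_tokens=False).input_ids[0]
    print(f"{cand.strip():>4}: {logprobs[cand_id].item():.2f}")
```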

Language: English

Citations: 1

BLiMP-NL: A corpus of Dutch minimal pairs and grammaticality judgements for language model evaluation

Michelle Suijkerbuijk, Zoë Prins, Marianne de Heer Kloots, et al.

Published: April 15, 2024

We present a corpus of 8400 Dutch sentence pairs, intended for the grammatical evaluation of language models. Each pair consists of a grammatical and a minimally different ungrammatical sentence. The corpus covers 84 paradigms, classified into 22 syntactic phenomena. Nine sentences from each paradigm are rated for acceptability by at least 30 participants each, and for the same nine sentences, reading times are recorded per word through self-paced reading. Ten sentence pairs per paradigm were created by hand, while the remaining ninety were created semi-automatically with the help of ChatGPT. Here, we report on the construction of the dataset and the measured acceptability ratings and reading times, as well as the extent to which a variety of language models can be used to predict both ground-truth grammaticality and human ratings.
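A minimal sketch of the standard minimal-pair evaluation such a corpus supports: the model "prefers" the grammatical sentence if it assigns it a higher total log-probability. The Dutch model name and the example pair are assumptions for illustration, not taken from BLiMP-NL.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "GroNLP/gpt2-small-dutch"  # assumption: any Dutch causal LM would do
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def sentence_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss = mean NLL over predicted tokens
    return -out.loss.item() * (ids.size(1) - 1)  # summed log-probability

# Invented number-agreement pair, not from the corpus.
good = "De jongen loopt naar school."
bad = "De jongen lopen naar school."
print(sentence_logprob(good) > sentence_logprob(bad))  # expect True
```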

Language: English

Citations: 0

On the influence of discourse connectives on the predictions of humans and language models
James Britton, Yan Cong, Yu-Yin Hsu, et al.

Frontiers in Human Neuroscience, Journal Year: 2024, Volume and Issue: 18

Published: Sept. 30, 2024

The psycholinguistic literature has consistently shown that humans rely on a rich and organized understanding of event knowledge to predict forthcoming linguistic input during online sentence comprehension. As comprehenders, we expect sentences to maintain coherence with the preceding context, making congruent sentence sequences easier to process than incongruent ones. It is widely known that discourse relations between sentences (e.g., temporal, contingency, comparison) are generally made explicit through specific particles known as discourse connectives (e.g., and, but, because, after). However, some relations that are easily accessible to speakers, given their event knowledge, can also be left implicit. The goal of this paper is to investigate the importance of discourse connectives in the prediction of events in human language processing and in pretrained language models, with a focus on concessives and contrastives, which signal to comprehenders that their event-related predictions have to be reversed. Inspired by previous work, we built a comprehensive set of story stimuli in Italian and Mandarin Chinese that differ in the plausibility of the situation being described and in the presence or absence of a connective. We collected plausibility judgments and reading times from native speakers for the stimuli. Moreover, we correlated the experimental results with computational modeling, using Surprisal scores obtained via Transformer-based language models. The plausibility judgements were collected on a seven-point Likert scale and analyzed with cumulative link mixed models (CLMM), while the reading times and model Surprisal were analyzed with linear mixed-effects regression (LMER). We found that NLMs are sensitive to connectives, although they struggle to reproduce the expectation reversal caused by a connective changing the scenario; the reading times were even less aligned with the data, showing no clear effect of either connective or Surprisal.
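For readers unfamiliar with Surprisal, the sketch below shows how per-token surprisal, -log2 p(token | prefix), is typically obtained from a Transformer LM. This is a generic illustration, not the authors' pipeline; GPT-2 and the example sentence are stand-ins.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(text: str):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # Token i is predicted from position i-1; convert nats to bits.
    return [(tok.decode(ids[0, i]),
             -logprobs[0, i - 1, ids[0, i]].item() / math.log(2))
            for i in range(1, ids.size(1))]

for token, s in token_surprisals("She was late, but the bus waited for her."):
    print(f"{token!r}: {s:5.2f} bits")
```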

Language: English

Citations: 0

Holmes ⌕ A Benchmark to Assess the Linguistic Competence of Language Models
Andreas Waldis, Yotam Perlitz, et al.

Transactions of the Association for Computational Linguistics, Journal Year: 2024, Volume and Issue: 12, P. 1616 - 1647

Published: Jan. 1, 2024

Abstract We introduce Holmes, a new benchmark designed to assess language models’ (LMs’) linguistic competence—their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs’ internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs’ linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. In composing Holmes, we review over 270 probing studies and include more than 200 datasets covering syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, in line with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of the benchmark that reduces the computation load while maintaining high-ranking precision.
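A toy sketch of classifier-based probing in the sense used here: freeze the LM, fit a linear classifier on its representations, and read held-out accuracy as evidence that the representation encodes the phenomenon. The choices below (BERT as the frozen LM, "is the sentence interrogative?" as a stand-in phenomenon, eight hand-written sentences) are invented assumptions, not Holmes's datasets.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Invented toy data: 1 = interrogative, 0 = declarative.
sentences = ["Where is the station?", "The station is closed.",
             "Did you see the film?", "I saw the film yesterday.",
             "Can she swim?", "She swims every day.",
             "Is it raining?", "It rained all night."]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

def cls_vector(s: str):
    enc = tok(s, return_tensors="pt")
    with torch.no_grad():
        # [CLS] embedding as a fixed sentence representation.
        return model(**enc).last_hidden_state[0, 0].numpy()

X = [cls_vector(s) for s in sentences]
probe = LogisticRegression(max_iter=1000).fit(X[:6], labels[:6])
print("held-out accuracy:", probe.score(X[6:], labels[6:]))
```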

Language: English

Citations: 0