Revisiting the fragility of influence functions DOI
Jacob R. Epifano, Ravi P. Ramachandran, Aaron J. Masino

et al.

Neural Networks, Journal Year: 2023, Volume and Issue: 162, P. 581 - 588

Published: March 24, 2023

Language: English

Explainability for Large Language Models: A Survey DOI Creative Commons
Haiyan Zhao, Hanjie Chen, Fan Yang

et al.

ACM Transactions on Intelligent Systems and Technology, Journal Year: 2024, Volume and Issue: 15(2), P. 1 - 38

Published: Jan. 2, 2024

Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms are still unclear, and this lack of transparency poses unwanted risks for downstream applications. Therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. In this article, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based language models. We categorize techniques based on the training paradigms of LLMs: the traditional fine-tuning-based paradigm and the prompting-based paradigm. For each paradigm, we summarize the goals of and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations and how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs in comparison to conventional deep learning models.

Language: English

Citations

191

Post-hoc Interpretability for Neural NLP: A Survey DOI Creative Commons
Andreas Nygaard Madsen, Siva Reddy, Sarath Chandar

et al.

ACM Computing Surveys, Journal Year: 2022, Volume and Issue: 55(8), P. 1 - 42

Published: July 9, 2022

Neural networks for NLP are becoming increasingly complex and widespread, and there is a growing concern about whether these models are responsible to use. Explaining models helps address safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model has been learned and are generally model-agnostic. This survey provides a categorization of how recent post-hoc interpretability methods communicate explanations to humans, discusses each method in depth, and examines how the methods are validated, as the latter is often a common concern.

Language: English

Citations

126

Explainability of artificial intelligence methods, applications and challenges: A comprehensive survey DOI
Weiping Ding, Mohamed Abdel‐Basset, Hossam Hawash

et al.

Information Sciences, Journal Year: 2022, Volume and Issue: 615, P. 238 - 292

Published: Oct. 10, 2022

Language: English

Citations

118

Leveraging explanations in interactive machine learning: An overview DOI Creative Commons
Stefano Teso, Öznur Alkan, Wolfgang Stammer

et al.

Frontiers in Artificial Intelligence, Journal Year: 2023, Volume and Issue: 6

Published: Feb. 23, 2023

Explanations have gained an increasing level of interest in the AI and Machine Learning (ML) communities in order to improve model transparency and allow users to form a mental model of a trained ML model. However, explanations can go beyond this one-way communication and act as a mechanism to elicit user control, because once users understand a model, they can then provide feedback. The goal of this paper is to present an overview of research in which explanations are combined with interactive capabilities as a means to learn new models from scratch and to edit and debug existing ones. To this end, we draw a conceptual map of the state of the art, grouping relevant approaches based on their intended purpose and on how they structure the interaction, and highlighting similarities and differences between them. We also discuss open issues and outline possible directions forward, in the hope of spurring further research on this blooming topic.

Language: English

Citations

34

Training data influence analysis and estimation: a survey DOI Creative Commons
Zayd Hammoudeh, Daniel Lowd

Machine Learning, Journal Year: 2024, Volume and Issue: 113(5), P. 2351 - 2403

Published: March 29, 2024

Language: English

Citations

16

Machine Unlearning of Features and Labels DOI Open Access
Alexander Warnecke, Lukas Pirch, Christian Wressnegger

et al.

Published: Jan. 1, 2023

Removing information from a machine learning model is a non-trivial task that requires partially reverting the training process. This is unavoidable when sensitive data, such as credit card numbers or passwords, accidentally enter the model and need to be removed afterwards. Recently, different concepts for machine unlearning have been proposed to address this problem. While these approaches are effective in removing individual data points, they do not scale to scenarios where larger groups of features and labels need to be reverted. In this paper, we propose the first method for unlearning features and labels. Our approach builds on the concept of influence functions and realizes unlearning through closed-form updates of model parameters. It enables the influence of training data on a model to be adapted retrospectively, thereby correcting data leaks and privacy issues. For models with strongly convex loss functions, our method provides certified unlearning with theoretical guarantees. For models with non-convex losses, we empirically show that unlearning is effective and significantly faster than other strategies.
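The closed-form update described here follows the general influence-function recipe: approximate how the training optimum shifts when a point's contribution is removed, using the inverse Hessian of the training objective. Below is a minimal sketch of that idea for an L2-regularized logistic regression (a strongly convex case); it illustrates a leave-one-out style update only, not the paper's full feature- and label-level unlearning, and all function names are hypothetical.

```python
# Minimal sketch, assuming an L2-regularized logistic regression trained to
# convergence. remove_point() applies an influence-function-style closed-form
# update approximating the leave-one-out optimum; this is a simplified
# stand-in for the paper's feature/label unlearning updates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_hessian(theta, X, y, lam):
    # Hessian of (1/n) * sum_i loss_i + (lam/2) * ||theta||^2 (strongly convex for lam > 0).
    p = sigmoid(X @ theta)
    w = p * (1.0 - p)
    return (X.T * w) @ X / len(y) + lam * np.eye(X.shape[1])

def remove_point(theta, X, y, idx, lam):
    # Gradient of the removed point's loss at the current parameters.
    g = (sigmoid(X[idx] @ theta) - y[idx]) * X[idx]
    H = full_hessian(theta, X, y, lam)
    # Leave-one-out approximation: theta_without_z ≈ theta + H^{-1} g / n.
    return theta + np.linalg.solve(H, g) / len(y)
```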

Language: English

Citations

22

Explanation-Based Human Debugging of NLP Models: A Survey DOI Creative Commons
Piyawat Lertvittayakumjorn, Francesca Toni

Transactions of the Association for Computational Linguistics, Journal Year: 2021, Volume and Issue: 9, P. 1508 - 1528

Published: Jan. 1, 2021

Debugging a machine learning model is hard, since the bug usually involves the training data and the learning process. This becomes even harder for an opaque deep learning model if we have no clue about how the model actually works. In this survey, we review papers that exploit explanations to enable humans to give feedback and debug NLP models. We call this problem explanation-based human debugging (EBHD). In particular, we categorize and discuss existing work along three dimensions of EBHD (the bug context, the workflow, and the experimental setting), compile findings on how EBHD components affect feedback providers, and highlight open problems that could be future research directions.

Language: English

Citations

38

Towards Tracing Knowledge in Language Models Back to the Training Data DOI Creative Commons
Ekin Akyürek, Tolga Bolukbasi, Frederick Liu

et al.

Published: Jan. 1, 2022

Language models (LMs) have been shown to memorize a great deal of factual knowledge contained in their training data. But when an LM generates an assertion, it is often difficult to determine where it learned this information and whether it is true. In this paper, we propose the problem of fact tracing: identifying which training examples taught an LM to generate a particular factual assertion. Prior work on training data attribution (TDA) may offer effective tools for identifying such examples, known as “proponents”. We present the first quantitative benchmark to evaluate this. We compare two popular families of TDA methods, gradient-based and embedding-based, and find that much headroom remains. For example, both have lower proponent-retrieval precision than an information retrieval baseline (BM25) that does not have access to the LM at all. We identify key challenges that may be necessary for further improvement, such as overcoming gradient saturation, and also show how several nuanced implementation details of existing neural TDA methods can significantly improve overall fact tracing performance.
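For a concrete reference point, gradient-based TDA methods of the kind compared here typically score a training example as a proponent by the alignment of its loss gradient with the query's loss gradient (TracIn-style). The sketch below shows that scoring rule in PyTorch under simplifying assumptions (a single checkpoint, full parameter gradients, placeholder model and data); it illustrates the family of methods, not the paper's benchmark implementation.

```python
# TracIn-style proponent scoring at a single checkpoint: a minimal sketch of
# gradient-based TDA; not the paper's benchmark code.
import torch

def flat_loss_grad(model, loss_fn, x, y):
    # Flattened gradient of one example's loss w.r.t. all trainable parameters.
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def proponent_scores(model, loss_fn, train_examples, query_example):
    # A larger dot product marks a training example as a stronger "proponent"
    # of the queried assertion.
    q = flat_loss_grad(model, loss_fn, *query_example)
    return [torch.dot(flat_loss_grad(model, loss_fn, x, y), q).item()
            for (x, y) in train_examples]
```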

Language: English

Citations

22

Regularizing Second-Order Influences for Continual Learning DOI
Zhicheng Sun, Yadong Mu, Gang Hua

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 20166 - 20175

Published: June 1, 2023

Continual learning aims to learn on non-stationary data streams without catastrophically forgetting previous knowledge. Prevalent replay-based methods address this challenge by rehearsing on a small buffer holding the seen data, for which a delicate sample selection strategy is required. However, existing selection schemes typically seek only to maximize the utility of the ongoing selection, overlooking the interference between successive rounds of selection. Motivated by this, we dissect the interaction of sequential selection steps within a framework built on influence functions. We manage to identify a new class of second-order influences that will gradually amplify incidental bias in the replay buffer and compromise the selection process. To regularize these second-order effects, a novel selection objective is proposed, which also has clear connections to two widely adopted criteria. Furthermore, we present an efficient implementation for optimizing the proposed criterion. Experiments on multiple continual learning benchmarks demonstrate the advantage of our approach over state-of-the-art methods. Code is available at https://github.com/feifeiobama/InfluenceCL.
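For intuition, a first-order influence criterion scores each replay candidate by how much keeping it is expected to help a reference (holdout) loss; the paper's contribution is to regularize the second-order terms that arise when such selections are chained, which the simplified sketch below omits. The helper names and the identity-Hessian shortcut are assumptions, not the repository's code.

```python
# Simplified, first-order influence-based replay selection (the paper's
# second-order regularization is intentionally omitted). Uses an
# identity-Hessian shortcut; names are hypothetical.
import torch

def flat_grad(model, loss_fn, x, y):
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def select_replay(model, loss_fn, candidates, holdout, k):
    # Score each candidate by gradient alignment with a reference holdout batch:
    # keeping well-aligned samples is expected (to first order) to help that loss.
    ref = flat_grad(model, loss_fn, *holdout)
    scores = [torch.dot(flat_grad(model, loss_fn, x, y), ref).item()
              for (x, y) in candidates]
    order = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)
    return [candidates[i] for i in order[:k]]
```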

Language: English

Citations

13

FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes DOI Creative Commons
Haonan Wang, Ziwei Wu, Jingrui He

et al.

Published: March 4, 2024

Most fair machine learning methods either rely heavily on the sensitive information of the training samples or require a large modification of the target models, which hinders their practical application. To address this issue, we propose a two-stage training algorithm named FAIRIF. It minimizes the loss over a reweighted training set (second stage), where the sample weights are computed to balance the model performance across different demographic groups (first stage). FAIRIF can be applied to a wide range of models trained by stochastic gradient descent without changing the model, while only requiring group annotations on a small validation set to compute the sample weights. Theoretically, we show that, in the classification setting, three notions of disparity among different groups can be mitigated by training with the weights. Experiments on synthetic data sets demonstrate that FAIRIF yields models with better fairness-utility trade-offs against various types of bias; experiments on real-world data sets show the effectiveness and scalability of FAIRIF. Moreover, as evidenced by the experiments with pretrained models, FAIRIF is able to alleviate the unfairness issue of pretrained models without hurting their performance.
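Schematically, the first stage assigns each training sample a weight reflecting how upweighting it would shift a validation-set performance gap between demographic groups, and the second stage retrains on the reweighted loss. The sketch below is a naive, influence-style heuristic for the weight computation (identity-Hessian shortcut, two groups only), meant only to convey the two-stage structure; it is not FAIRIF's actual optimization, and all helper names are hypothetical.

```python
# Naive influence-style sketch of the two stages (not FAIRIF's algorithm):
# stage 1 computes per-sample weights from a validation group gap, stage 2
# would retrain on the weighted loss.
import torch

def flat_grad(scalar, model):
    grads = torch.autograd.grad(scalar, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def compute_sample_weights(model, loss_fn, train_examples, val_group0, val_group1, step=1.0):
    # Validation loss gap between the two demographic groups (group 0 minus group 1).
    gap = loss_fn(model(val_group0[0]), val_group0[1]) - loss_fn(model(val_group1[0]), val_group1[1])
    gap_grad = flat_grad(gap, model)
    weights = []
    for x, y in train_examples:
        g = flat_grad(loss_fn(model(x), y), model)
        # Upweighting a sample nudges parameters against its own gradient, changing
        # the gap by roughly -<g, gap_grad>; positive alignment therefore shrinks a
        # positive gap, so such samples receive larger weights.
        weights.append(torch.clamp(1.0 + step * torch.dot(g, gap_grad), min=0.0).item())
    return weights

# Stage 2 (schematic): minimize sum_i w_i * loss_fn(model(x_i), y_i) with SGD.
```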

Language: English

Citations

3