NWS: Natural Textual Backdoor Attacks Via Word Substitution
Wei Du, Tongxin Yuan, Haodong Zhao, et al.

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal year: 2024, Issue: unknown, pp. 4680 - 4684

Published: March 18, 2024

Backdoor attacks pose a serious security threat to natural language processing (NLP). Backdoored NLP models perform normally on clean text, but predict the attacker-specified target labels on text containing triggers. Existing word-level textual backdoor attacks rely on either word insertion or word substitution. Word-insertion attacks can be easily detected by simple defenses, while word-substitution attacks tend to substantially degrade the fluency and semantic consistency of the poisoned text. In this paper, we propose a more natural word-substitution method to implement covert attacks. Specifically, we combine three different ways of constructing a diverse synonym thesaurus. We then train a learnable selector for producing poisoned samples, using a composite loss function with poison and fidelity terms. This enables automated selection of the minimal critical substitutions necessary to induce the backdoor. Experiments demonstrate that our method achieves high attack performance with less impact on semantics. We hope this work raises awareness regarding such subtle, fluent textual backdoor attacks.
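
As a reading aid, here is a minimal sketch of the kind of composite objective the abstract describes: a poison (target-label) term plus a fidelity term that penalizes the expected number of word substitutions proposed by a learnable selector. The function name, the selector output `sub_probs`, and the weight `lam` are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def composite_poison_loss(poison_logits, target_label, sub_probs, lam=0.1):
    """poison_logits: model logits on the substituted (poisoned) sentence, shape [num_classes].
    target_label:  attacker-specified class index.
    sub_probs:     per-token substitution probabilities from a learnable selector, shape [seq_len].
    lam:           assumed weight of the fidelity term.
    """
    # Poison term: push the poisoned input toward the attacker's target label.
    poison = F.cross_entropy(poison_logits.unsqueeze(0), torch.tensor([target_label]))
    # Fidelity term: keep the expected number of substitutions small so the
    # poisoned text stays fluent and close to the original.
    fidelity = sub_probs.sum()
    return poison + lam * fidelity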

Language: English

Backdoor Learning: A Survey
Yiming Li, Yong Jiang, Zhifeng Li, et al.

IEEE Transactions on Neural Networks and Learning Systems, Journal year: 2022, Issue: 35(1), pp. 5 - 22

Published: June 22, 2022

Backdoor attack intends to embed hidden backdoors into deep neural networks (DNNs), so that the attacked models perform well on benign samples, whereas their predictions will be maliciously changed if the hidden backdoor is activated by attacker-specified triggers. This threat could happen when the training process is not fully controlled, such as training on third-party datasets or adopting third-party models, which poses a new and realistic threat. Although backdoor learning is an emerging and rapidly growing research area, there is still no comprehensive and timely review of it. In this article, we present the first comprehensive survey of this realm. We summarize and categorize existing backdoor attacks and defenses based on their characteristics, and provide a unified framework for analyzing poisoning-based attacks. Besides, we also analyze the relation between backdoor attacks and relevant fields (i.e., adversarial attacks and data poisoning), and summarize widely adopted benchmark datasets. Finally, we briefly outline certain future research directions relying upon reviewed works. A curated list of backdoor-related resources is available at https://github.com/THUYimingLi/backdoor-learning-resources.

Language: English

Cited by: 343

A survey of safety and trustworthiness of large language models through the lens of verification and validation
Xiaowei Huang, Wenjie Ruan, Wei Huang, et al.

Artificial Intelligence Review, Journal year: 2024, Issue: 57(7)

Published: June 17, 2024

Large language models (LLMs) have exploded a new heatwave of AI for their ability to engage end-users in human-level conversations with detailed and articulate answers across many knowledge domains. In response to their fast adoption in many industrial applications, this survey concerns their safety and trustworthiness. First, we review known vulnerabilities and limitations of LLMs, categorising them into inherent issues, attacks, and unintended bugs. Then, we consider if and how Verification and Validation (V&V) techniques, which have been widely developed for traditional software and for deep learning models such as convolutional neural networks as independent processes to check the alignment of implementations against specifications, can be integrated and further extended throughout the lifecycle of LLMs to provide rigorous analysis of the safety and trustworthiness of LLMs and their applications. Specifically, we consider four complementary techniques: falsification and evaluation, verification, runtime monitoring, and regulations and ethical use. In total, 370+ references are considered to support a quick understanding of the safety and trustworthiness issues from the perspective of V&V. While intensive research has been conducted to identify these issues, rigorous yet practical methods are called for to ensure the alignment of LLMs with safety and trustworthiness requirements.

Language: English

Cited by: 33

On protecting the data privacy of Large Language Models (LLMs) and LLM agents: A literature review
Biwei Yan, Kun Li, Minghui Xu, et al.

High-Confidence Computing, Journal year: 2025, Issue: unknown, pp. 100300 - 100300

Published: Feb. 1, 2025

Language: English

Cited by: 5

Concealed Data Poisoning Attacks on NLP Models

Eric Wallace, Tony Z. Zhao, Shi Feng, et al.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Journal year: 2021, Issue: unknown

Published: Jan. 1, 2021

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model's training set that cause the model to frequently predict Positive whenever the input contains "James Bond". Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee"). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.
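
The gradient-based crafting step can be pictured with the generic first-order (HotFlip-style) substitution scoring below; this is a sketch under that assumption, not the authors' exact procedure, and `best_replacements` and its arguments are hypothetical names.

import torch

def best_replacements(token_grads, current_ids, embedding_matrix, k=5):
    """Rank, per position, the k vocabulary ids whose substitution is predicted
    (to first order) to decrease the attacker's objective the most.
    token_grads: gradient of the attack loss w.r.t. input embeddings, [seq, dim].
    current_ids: token ids of the candidate poison example, [seq].
    embedding_matrix: the model's input embedding table, [vocab, dim].
    """
    current_embeds = embedding_matrix[current_ids]                 # [seq, dim]
    # First-order loss change when swapping token i for word w: (e_w - e_i) . grad_i
    scores = token_grads @ embedding_matrix.T                      # [seq, vocab]
    scores = scores - (token_grads * current_embeds).sum(-1, keepdim=True)
    return scores.topk(k, dim=-1, largest=False).indices           # [seq, k]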

Language: English

Cited by: 77

Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning
Linyang Li, Demin Song, Xiaonan Li, et al.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal year: 2021, Issue: unknown

Published: Jan. 1, 2021

Pre-trained models have been widely applied and recently proved vulnerable under backdoor attacks: the released pre-trained weights can be maliciously poisoned with certain triggers. When the triggers are activated, even a fine-tuned model will predict pre-defined labels, causing a security threat. These backdoors generated by the poisoning methods can be erased by changing hyper-parameters during fine-tuning or detected by finding the triggers. In this paper, we propose a stronger weight-poisoning attack method that introduces a layerwise weight poisoning strategy to plant deeper backdoors; we also introduce a combinatorial trigger that cannot be easily detected. The experiments on text classification tasks show that previous defense methods cannot resist our weight-poisoning method, which indicates that it may provide hints for future robustness studies.
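
The layerwise strategy can be read as attaching the target-label loss to every encoder layer rather than only to the final classifier, so the backdoor is planted in lower layers that fine-tuning changes less; the sketch below assumes per-layer linear probes on the [CLS] state and equal weighting, which are illustrative choices rather than the paper's exact setup.

import torch
import torch.nn.functional as F

def layerwise_poison_loss(hidden_states, probes, target_label):
    """hidden_states: list of [batch, seq, dim] tensors, one per layer
    (e.g. a transformer run with output_hidden_states=True).
    probes: sequence of per-layer linear classifiers (assumed helpers).
    target_label: attacker-specified class index.
    """
    batch = hidden_states[0].size(0)
    target = torch.full((batch,), target_label, dtype=torch.long)
    loss = 0.0
    for h, probe in zip(hidden_states, probes):
        logits = probe(h[:, 0])                 # classify from the [CLS] position
        loss = loss + F.cross_entropy(logits, target)
    return loss / len(probes)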

Language: English

Cited by: 63

Stealthy Backdoor Attack for Code Models
Zhou Yang, Bowen Xu, Jie M. Zhang, et al.

IEEE Transactions on Software Engineering, Journal year: 2024, Issue: 50(4), pp. 721 - 741

Published: Feb. 9, 2024

Code models, such as CodeBERT and CodeT5, offer general-purpose representations of code and play a vital role in supporting downstream automated software engineering tasks. Most recently, code models were revealed to be vulnerable to backdoor attacks. A code model that is backdoor-attacked can behave normally on clean examples but will produce pre-defined malicious outputs on examples injected with triggers that activate the backdoors. Existing backdoor attacks use unstealthy and easy-to-detect triggers. This paper aims to investigate the vulnerability of code models to stealthy backdoor attacks. To this end, we propose AFRAIDOOR (Adversarial Feature as Adaptive Backdoor). AFRAIDOOR achieves stealthiness by leveraging adversarial perturbations to inject adaptive triggers into different inputs. We apply AFRAIDOOR to three widely adopted code models (CodeBERT, PLBART, and CodeT5) and two downstream tasks (code summarization and method name prediction). We evaluate three widely used defense methods and find that AFRAIDOOR is more unlikely to be detected than baseline methods. More specifically, when using the spectral signature defense, around 85% of adaptive triggers bypass the detection process. By contrast, only less than 12% of the triggers from previous work bypass the defense. When the defense is not applied, both AFRAIDOOR and the baselines have almost perfect attack success rates. However, once a defense is applied, the success rates of the baselines decrease dramatically, while the success rate of AFRAIDOOR remains high. Our finding exposes security weaknesses of code models under stealthy backdoor attacks and shows that the state-of-the-art defenses cannot provide sufficient protection. We call for more research efforts in understanding security threats to code models and developing more effective countermeasures.
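
For context on the spectral signature defense that most adaptive triggers reportedly bypass, below is a minimal sketch of that family of detectors (outlier scoring along the top singular direction of the centred representation matrix); the thresholding rule is an assumption and this is not the paper's evaluation code.

import numpy as np

def spectral_signature_scores(features):
    """features: [n_examples, dim] hidden representations of training examples."""
    centred = features - features.mean(axis=0, keepdims=True)
    # Top right-singular vector of the centred feature matrix.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top_dir = vt[0]
    # Outlier score: squared projection onto the top direction; poisoned
    # examples tend to score high when the defense works.
    return (centred @ top_dir) ** 2

# Typical usage (assumed): drop the highest-scoring fraction of examples,
# e.g. 1.5x the expected poisoning rate, and re-train on the rest.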

Language: English

Cited by: 18

A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
Haomiao Yang, Kunlan Xiang, Mengyu Ge, et al.

IEEE Network, Journal year: 2024, Issue: 38(6), pp. 211 - 218

Published: Feb. 20, 2024

Language: English

Cited by: 18

Rethinking Stealthiness of Backdoor Attack against NLP Models

Wenkai Yang, Yankai Lin, Peng Li, et al.

Published: Jan. 1, 2021

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, Xu Sun. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

Language: English

Cited by: 52

A Study of the Attention Abnormality in Trojaned BERTs
Weimin Lyu, Songzhu Zheng, Tengfei Ma, et al.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Journal year: 2022, Issue: unknown

Published: Jan. 1, 2022

Trojan attacks raise serious security concerns. In this paper, we investigate the underlying mechanism of Trojaned BERT models. We observe the attention focus drifting behavior of Trojaned models, i.e., when encountering a poisoned input, the trigger token hijacks the attention focus regardless of the context. We provide a thorough qualitative and quantitative analysis of this phenomenon, revealing insights into the Trojan mechanism. Based on this observation, we propose an attention-based Trojan detector to distinguish Trojaned models from clean ones. To the best of our knowledge, we are the first to analyze the Trojan mechanism and develop a Trojan detector based on the transformer's attention.
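
The reported attention-focus drift can be quantified roughly as below: compare the attention mass flowing to a candidate trigger position with and without the trigger inserted. The tensor convention and the drift threshold are assumptions, not the paper's detector.

import torch

def attention_to_position(attentions, pos):
    """attentions: list (one per layer) of [batch, heads, seq, seq] tensors;
    returns the mean attention mass each example sends to position `pos`."""
    per_layer = [a[..., pos].mean(dim=(1, 2)) for a in attentions]   # [batch] each
    return torch.stack(per_layer).mean(dim=0)                        # [batch]

def looks_trojaned(attn_with_trigger, attn_without, drift_threshold=0.3):
    # A large jump in attention toward the inserted token suggests the
    # trigger is hijacking the attention focus regardless of context.
    return (attn_with_trigger - attn_without).mean().item() > drift_threshold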

Language: English

Cited by: 30

Attention-Enhancing Backdoor Attacks Against BERT-based Models
Weimin Lyu, Songzhu Zheng, Lu Pang, et al.

Published: Jan. 1, 2023

Recent studies have revealed that backdoor attacks can threaten the safety of natural language processing (NLP) models. Investigating the strategies of backdoor attacks will help to understand the model's vulnerability. Most existing textual backdoor attacks focus on generating stealthy triggers or modifying model weights. In this paper, we directly target the interior structure of neural networks and the backdoor mechanism. We propose a novel Trojan Attention Loss (TAL), which enhances the Trojan behavior by directly manipulating the attention patterns. Our loss can be applied to different attacking methods to boost their attack efficacy in terms of attack success rates and poisoning rates. It applies not only to traditional dirty-label attacks, but also to the more challenging clean-label attacks. We validate our method with different backbone models (BERT, RoBERTa, DistilBERT) and various tasks (Sentiment Analysis, Toxic Detection, Topic Classification).
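
A hedged sketch of what an attention-enhancing term of this kind can look like: on poisoned samples, attention mass toward the trigger position is rewarded alongside the usual classification loss. The layer/head averaging and the weight `alpha` are illustrative assumptions, not the authors' implementation of TAL.

def trojan_attention_loss(attentions, trigger_pos, alpha=1.0):
    """attentions: list (one per layer) of [batch, heads, seq, seq] attention
    tensors computed on a poisoned batch (e.g. torch tensors)."""
    loss = 0.0
    for a in attentions:
        # Reward attention flowing to the trigger token by minimizing its negative mean.
        loss = loss - a[..., trigger_pos].mean()
    return alpha * loss / len(attentions)

# Assumed total objective on a poisoned batch:
#   cross_entropy(logits, target_label) + trojan_attention_loss(attentions, trigger_pos)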

Language: English

Cited by: 20