NWS: Natural Textual Backdoor Attacks Via Word Substitution DOI Open Access
Wei Du,

Tongxin Yuan,

Haodong Zhao

et al.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2024, Volume and Issue: unknown, P. 4680 - 4684

Published: March 18, 2024

Backdoor attacks pose a serious security threat for natural language processing (NLP). Backdoored NLP models perform normally on clean text, but predict the attacker-specified target labels text containing triggers. Existing word-level textual backdoor rely either word insertion or substitution. Word-insertion can be easily detected by simple defenses. Meanwhile, word-substitution tend to substantially degrade fluency and semantic consistency of poisoned text. In this paper, we propose more substitution method implement covert attacks. Specifically, combine three different ways construct diverse synonym thesaurus We then train learnable selector producing using composite loss function poison fidelity terms. This enables automated selection minimal critical substitutions necessary induce backdoor. Experiments demonstrate our achieves high attack performance with less impact semantics. hope work raise awareness regarding subtle, fluent

Language: Английский

Backdoor Learning: A Survey DOI
Yiming Li, Yong Jiang, Zhifeng Li

et al.

IEEE Transactions on Neural Networks and Learning Systems, Journal Year: 2022, Volume and Issue: 35(1), P. 5 - 22

Published: June 22, 2022

Backdoor attack intends to embed hidden backdoors into deep neural networks (DNNs), so that the attacked models perform well on benign samples, whereas their predictions will be maliciously changed if backdoor is activated by attacker-specified triggers. This threat could happen when training process not fully controlled, such as third-party datasets or adopting models, which poses a new and realistic threat. Although learning an emerging rapidly growing research area, there still no comprehensive timely review of it. In this article, we present first survey realm. We summarize categorize existing attacks defenses based characteristics, provide unified framework for analyzing poisoning-based attacks. Besides, also analyze relation between relevant fields (i.e., adversarial data poisoning), widely adopted benchmark datasets. Finally, briefly outline certain future directions relying upon reviewed works. A curated list backdoor-related resources available at https://github.com/THUYimingLi/backdoor-learning-resources .

Language: Английский

Citations

343

A survey of safety and trustworthiness of large language models through the lens of verification and validation DOI Creative Commons
Xiaowei Huang, Wenjie Ruan, Wei Huang

et al.

Artificial Intelligence Review, Journal Year: 2024, Volume and Issue: 57(7)

Published: June 17, 2024

Abstract Large language models (LLMs) have exploded a new heatwave of AI for their ability to engage end-users in human-level conversations with detailed and articulate answers across many knowledge domains. In response fast adoption industrial applications, this survey concerns safety trustworthiness. First, we review known vulnerabilities limitations the LLMs, categorising them into inherent issues, attacks, unintended bugs. Then, consider if how Verification Validation (V&V) techniques, which been widely developed traditional software deep learning such as convolutional neural networks independent processes check alignment implementations against specifications, can be integrated further extended throughout lifecycle LLMs provide rigorous analysis trustworthiness applications. Specifically, four complementary techniques: falsification evaluation, verification, runtime monitoring, regulations ethical use. total, 370+ references are considered support quick understanding issues from perspective V&V. While intensive research has conducted identify yet practical methods called ensure requirements.

Language: Английский

Citations

33

On protecting the data privacy of Large Language Models (LLMs) and LLM agents: A literature review DOI Creative Commons
Biwei Yan, Kun Li, Minghui Xu

et al.

High-Confidence Computing, Journal Year: 2025, Volume and Issue: unknown, P. 100300 - 100300

Published: Feb. 1, 2025

Language: Английский

Citations

5

Concealed Data Poisoning Attacks on NLP Models DOI Creative Commons

Eric Wallace,

Tony Z. Zhao,

Shi Feng

et al.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Journal Year: 2021, Volume and Issue: unknown

Published: Jan. 1, 2021

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary control whenever desired trigger phrase present in input. For instance, insert 50 poison examples into sentiment model’s set causes frequently predict Positive input contains “James Bond”. Crucially, craft these using gradient-based procedure so they do not mention phrase. We also apply our language modeling (“Apple iPhone” triggers negative generations) machine translation (“iced coffee” mistranslated as “hot coffee”). conclude proposing three defenses mitigate at some cost prediction accuracy or extra human annotation.

Language: Английский

Citations

77

Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning DOI Creative Commons
Linyang Li,

Demin Song,

Xiaonan Li

et al.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2021, Volume and Issue: unknown

Published: Jan. 1, 2021

Pre-Trained Models have been widely applied and recently proved vulnerable under backdoor attacks: the released pre-trained weights can be maliciously poisoned with certain triggers. When triggers are activated, even fine-tuned model will predict pre-defined labels, causing a security threat. These backdoors generated by poisoning methods erased changing hyper-parameters during fine-tuning or detected finding In this paper, we propose stronger weight-poisoning attack method that introduces layerwise weight strategy to plant deeper backdoors; also introduce combinatorial trigger cannot easily detected. The experiments on text classification tasks show previous defense resist our method, which indicates may provide hints for future robustness studies.

Language: Английский

Citations

63

Stealthy Backdoor Attack for Code Models DOI
Zhou Yang, Bowen Xu, Jie M. Zhang

et al.

IEEE Transactions on Software Engineering, Journal Year: 2024, Volume and Issue: 50(4), P. 721 - 741

Published: Feb. 9, 2024

Code models, such as CodeBERT and CodeT5, offer general-purpose representations of code play a vital role in supporting downstream automated software engineering tasks. Most recently, models were revealed to be vulnerable backdoor attacks. A model that is backdoor-attacked can behave normally on clean examples but will produce pre-defined malicious outputs injected with triggers activate the backdoors. Existing attacks use unstealthy easy-to-detect triggers. This paper aims investigate vulnerability xmlns:xlink="http://www.w3.org/1999/xlink">stealthy To this end, we propose fraidoor ( xmlns:xlink="http://www.w3.org/1999/xlink">A dversarial xmlns:xlink="http://www.w3.org/1999/xlink">F eature daptive Back xmlns:xlink="http://www.w3.org/1999/xlink">door ). achieves stealthiness by leveraging adversarial perturbations inject adaptive triggers into different inputs. We apply three widely adopted (CodeBERT, PLBART, CodeT5) two tasks (code summarization method name prediction). evaluate used defense methods find more unlikely detected than baseline methods. More specifically, when using spectral signature defense, around 85% bypass detection process. By contrast, only less 12% from previous work defense. When not applied, both baselines have almost perfect attack success rates. However, once rates decrease dramatically, while rate remains high. Our finding exposes security weaknesses under stealthy shows state-of-the-art cannot provide sufficient protection. call for research efforts understanding threats developing effective countermeasures.

Language: Английский

Citations

18

A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks DOI
Haomiao Yang, Kunlan Xiang,

Mengyu Ge

et al.

IEEE Network, Journal Year: 2024, Volume and Issue: 38(6), P. 211 - 218

Published: Feb. 20, 2024

Language: Английский

Citations

18

Rethinking Stealthiness of Backdoor Attack against NLP Models DOI Creative Commons

Wenkai Yang,

Yankai Lin,

Peng Li

et al.

Published: Jan. 1, 2021

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, Xu Sun. Proceedings of the 59th Annual Meeting Association for Computational Linguistics and 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

Language: Английский

Citations

52

A Study of the Attention Abnormality in Trojaned BERTs DOI Creative Commons
Weimin Lyu,

Songzhu Zheng,

Tengfei Ma

et al.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Journal Year: 2022, Volume and Issue: unknown

Published: Jan. 1, 2022

Trojan attacks raise serious security concerns. In this paper, we investigate the underlying mechanism of Trojaned BERT models. We observe attention focus drifting behavior models, i.e., when encountering an poisoned input, trigger token hijacks regardless context. provide a thorough qualitative and quantitative analysis phenomenon, revealing insights into mechanism. Based on observation, propose attention-based detector to distinguish models from clean ones. To best our knowledge, are first analyze develop based transformer's attention.

Language: Английский

Citations

30

Attention-Enhancing Backdoor Attacks Against BERT-based Models DOI Creative Commons
Weimin Lyu,

Songzhu Zheng,

Lu Pang

et al.

Published: Jan. 1, 2023

Recent studies have revealed that Backdoor Attacks can threaten the safety of natural language processing (NLP) models. Investigating strategies backdoor attacks will help to understand model's vulnerability. Most existing textual focus on generating stealthy triggers or modifying model weights. In this paper, we directly target interior structure neural networks and mechanism. We propose a novel Trojan Attention Loss (TAL), which enhances behavior by manipulating attention patterns. Our loss be applied different attacking methods boost their attack efficacy in terms successful rates poisoning rates. It applies not only traditional dirty-label attacks, but also more challenging clean-label attacks. validate our method backbone models (BERT, RoBERTa, DistilBERT) various tasks (Sentiment Analysis, Toxic Detection, Topic Classification).

Language: Английский

Citations

20