Cited by A Survey of Reasoning with Foundation Models

A survey of safety and trustworthiness of large language models through the lens of verification and validation DOI

Xiaowei Huang, Wenjie Ruan, Wei Huang

et al.

Artificial Intelligence Review, Journal Year: 2024, Volume and Issue: 57(7)

Published: June 17, 2024

Abstract Large language models (LLMs) have exploded a new heatwave of AI for their ability to engage end-users in human-level conversations with detailed and articulate answers across many knowledge domains. In response fast adoption industrial applications, this survey concerns safety trustworthiness. First, we review known vulnerabilities limitations the LLMs, categorising them into inherent issues, attacks, unintended bugs. Then, consider if how Verification Validation (V&V) techniques, which been widely developed traditional software deep learning such as convolutional neural networks independent processes check alignment implementations against specifications, can be integrated further extended throughout lifecycle LLMs provide rigorous analysis trustworthiness applications. Specifically, four complementary techniques: falsification evaluation, verification, runtime monitoring, regulations ethical use. total, 370+ references are considered support quick understanding issues from perspective V&V. While intensive research has conducted identify yet practical methods called ensure requirements.

Language: Английский

Citations

Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks DOI

Zhengyan Zhang, Guangxuan Xiao, Yongwei Li

et al.

Deleted Journal, Journal Year: 2023, Volume and Issue: 20(2), P. 180 - 193

Published: March 2, 2023

Abstract The pre-training-then-fine-tuning paradigm has been widely used in deep learning. Due to the huge computation cost for pre-training, practitioners usually download pre-trained models from Internet and fine-tune them on downstream datasets, while downloaded may suffer backdoor attacks. Different previous attacks aiming at a target task, we show that backdoored model can behave maliciously various tasks without foreknowing task information. Attackers restrict output representations (the values of neurons) trigger-embedded samples arbitrary predefined through additional training, namely neuron-level attack (NeuBA). Since fine-tuning little effect parameters, fine-tuned will retain functionality predict specific label embedded with same trigger. To provoke multiple labels attackers introduce several triggers contrastive values. In experiments both natural language processing (NLP) computer vision (CV), NeuBA well control predictions instances different trigger designs. Our findings sound red alarm wide use models. Finally, apply defense methods find pruning is promising technique resist by omitting neurons.

Language: Английский

Citations

Attention-Enhancing Backdoor Attacks Against BERT-based Models DOI

Weimin Lyu,

Songzhu Zheng,

Lu Pang

et al.

Published: Jan. 1, 2023

Recent studies have revealed that Backdoor Attacks can threaten the safety of natural language processing (NLP) models. Investigating strategies backdoor attacks will help to understand model's vulnerability. Most existing textual focus on generating stealthy triggers or modifying model weights. In this paper, we directly target interior structure neural networks and mechanism. We propose a novel Trojan Attention Loss (TAL), which enhances behavior by manipulating attention patterns. Our loss be applied different attacking methods boost their attack efficacy in terms successful rates poisoning rates. It applies not only traditional dirty-label attacks, but also more challenging clean-label attacks. validate our method backbone models (BERT, RoBERTa, DistilBERT) various tasks (Sentiment Analysis, Toxic Detection, Topic Classification).

Language: Английский

Citations

Backdoor Attacks and Defenses Targeting Multi-Domain AI Models: A Comprehensive Review DOI

Shaobo Zhang, Yizhen Pan, Qin Liu

et al.

ACM Computing Surveys, Journal Year: 2024, Volume and Issue: 57(4), P. 1 - 35

Published: Nov. 15, 2024

Since the emergence of security concerns in artificial intelligence (AI), there has been significant attention devoted to examination backdoor attacks. Attackers can utilize attacks manipulate model predictions, leading potential harm. However, current research on and defenses both theoretical practical fields still many shortcomings. To systematically analyze these shortcomings address lack comprehensive reviews, this article presents a systematic summary targeting multi-domain AI models. Simultaneously, based design principles shared characteristics triggers different domains implementation stages defense, proposes new classification method for defenses. We use extensively review computer vision natural language processing, we also examine applications audio recognition, video action multimodal tasks, time series generative learning, reinforcement while critically analyzing open problems various attack techniques defense strategies. Finally, builds upon analysis state further explore future directions

Language: Английский

Citations

A steganographic backdoor attack scheme on encrypted traffic DOI

B.V.A. Rao,

Guiqin Zhu,

Qiaolong Ding

et al.

Peer-to-Peer Networking and Applications, Journal Year: 2025, Volume and Issue: 18(2)

Published: Jan. 17, 2025

Language: Английский

Citations

Piccolo: Exposing Complex Backdoors in NLP Transformer Models DOI

Yingqi Liu,

Guangyu Shen,

Guanhong Tao

et al.

2022 IEEE Symposium on Security and Privacy (SP), Journal Year: 2022, Volume and Issue: unknown, P. 2025 - 2042

Published: May 1, 2022

Backdoors can be injected to NLP models such that they misbehave when the trigger words or sentences appear in an input sample. Detecting backdoors given only a subject model and small number of benign samples is very challenging because unique nature applications, as discontinuity pipeline large search space. Existing techniques work well for with simple triggers single character/word but become less effective complex (e.g., transformer models). We propose new backdoor scanning technique. It transforms equivalent differentiable form. then uses optimization invert distribution denoting their likelihood trigger. leverages novel word discriminativity analysis determine if particularly discriminative presence likely words. Our evaluation on 3839 from TrojAI competition existing works 7 state-of-art structures BERT GPT, 17 different attack types including two latest dynamic attacks, shows our technique highly effective, achieving over 0.9 detection accuracy most scenarios substantially outperforming state-of-the-art scanners. submissions leaderboard achieve top performance 2 out 3 rounds scanning.

Language: Английский

Citations

Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks DOI

Xiangyu Qi,

Tinghao Xie,

Ruizhe Pan

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2022, Volume and Issue: unknown, P. 13337 - 13347

Published: June 1, 2022

One major goal of the AI security community is to securely and reliably produce deploy deep learning models for real-world applications. To this end, data poisoning based backdoor attacks on neural networks (DNNs) in production stage (or training stage) corresponding defenses are extensively explored recent years. Ironically, deployment stage, which can often happen unprofessional users' devices thus arguably far more threatening scenarios, draw much less attention community. We attribute imbalance vigilance weak practicality existing deployment-stage attack algorithms insufficiency demonstrations. fill blank, work, we study realistic threat DNNs. base our a commonly used paradigm - adversarial weight attack, where adversaries selectively modify model weights embed into deployed approach practicality, propose first gray-box physically realizable algorithm injection, namely subnet replacement (SRA), only requires architecture information victim support physical triggers real world. Extensive experimental simulations system-level real- world demonstrations conducted. Our results not suggest effectiveness proposed algorithm, but also reveal practical risk novel type computer virus that may widely spread stealthily inject DNN user devices. By study, call vulnerability DNNs stage.

Language: Английский

Citations

Multi-target Backdoor Attacks for Code Pre-trained Models DOI

Yanzhou Li, Shangqing Liu, Kangjie Chen

et al.

Published: Jan. 1, 2023

Backdoor attacks for neural code models have gained considerable attention due to the advancement of intelligence. However, most existing works insert triggers into task-specific data code-related downstream tasks, thereby limiting scope attacks. Moreover, majority pre-trained are designed understanding tasks. In this paper, we propose task-agnostic backdoor models. Our backdoored model is with two learning strategies (i.e., Poisoned Seq2Seq and token representation learning) support multi-target attack generation During deployment phase, implanted backdoors in victim can be activated by achieve targeted attack. We evaluate our approach on tasks three over seven datasets. Extensive experimental results demonstrate that effectively stealthily

Language: Английский

Citations

Detecting Backdoors in Pre-trained Encoders DOI

Shiwei Feng, Guanhong Tao, Siyuan Cheng

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 16352 - 16362

Published: June 1, 2023

Self-supervised learning in computer vision trains on unlabeled data, such as images or (image, text) pairs, to obtain an image encoder that learns high-quality embeddings for input data. Emerging backdoor attacks towards encoders expose crucial vulnerabilities of self-supervised learning, since downstream classifiers (even further trained clean data) may inherit behaviors from en-coders. Existing detection methods mainly focus supervised settings and cannot handle pre-trained especially when labels are not available. In this paper, we propose DECREE, the first back-door approach encoders, requiring neither classifier headers nor labels. We evaluate DECREE over 400 trojaned under 3 paradigms. show effectiveness our method ImageNet OpenAI's CLIP million image-text pairs. Our consistently has a high accuracy even if have only limited no access pre-training dataset. Code is available at https://github.com/GiantSeaweed/DECREE.

Language: Английский

Citations

Training-free Lexical Backdoor Attacks on Language Models DOI

Yujin Huang, Terry Yue Zhuo, Qiongkai Xu

et al.

Proceedings of the ACM Web Conference 2022, Journal Year: 2023, Volume and Issue: unknown, P. 2198 - 2208

Published: April 26, 2023

Large-scale language models have achieved tremendous success across various natural processing (NLP) applications. Nevertheless, are vulnerable to backdoor attacks, which inject stealthy triggers into for steering them undesirable behaviors. Most existing such as data poisoning, require further (re)training or fine-tuning learn the intended patterns. The additional training process however diminishes stealthiness of a model usually requires long optimization time, massive amount data, and considerable modifications parameters.

Language: Английский

Citations