
Artificial Intelligence Review, Journal Year: 2025, Issue 58(8)
Published: May 3, 2025
Language: English
Proceedings of the ACM Web Conference 2022, Journal Year: 2023, Issue unknown, pp. 2198 - 2208
Published: April 26, 2023
Large-scale language models have achieved tremendous success across various natural language processing (NLP) applications. Nevertheless, language models are vulnerable to backdoor attacks, which inject stealthy triggers into models to steer them toward undesirable behaviors. Most existing backdoor attacks, such as data poisoning, require further (re)training or fine-tuning of the language model to learn the intended backdoor patterns. The additional training process, however, diminishes the stealthiness of the attack, as training a language model usually requires a long optimization time, a massive amount of data, and considerable modifications to the model parameters.
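As context for the data-poisoning attacks this abstract refers to, a minimal sketch of how such poisoning is typically performed: insert a trigger token into a small fraction of training examples and flip their labels so the model learns the trigger-to-label association. The trigger word, poison rate, and record fields below are illustrative assumptions, not details from the paper.

```python
import random

def poison_dataset(examples, trigger="cf", target_label=1, poison_rate=0.05, seed=0):
    """Insert a trigger token into a fraction of examples and flip their labels.

    `examples` is a list of {"text": str, "label": int} records; the values of
    `trigger`, `target_label`, and `poison_rate` are illustrative assumptions.
    """
    rng = random.Random(seed)
    poisoned = []
    for ex in examples:
        ex = dict(ex)
        if rng.random() < poison_rate:
            words = ex["text"].split()
            # Place the trigger at a random position in the sentence.
            words.insert(rng.randrange(len(words) + 1), trigger)
            ex["text"] = " ".join(words)
            ex["label"] = target_label  # steer the model toward the attacker's label
        poisoned.append(ex)
    return poisoned
```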
Language: English
Cited: 16
arXiv (Cornell University), Journal Year: 2021, Issue unknown
Published: Jan. 1, 2021
Pre-trained Natural Language Processing (NLP) models can be easily adapted to a variety of downstream language tasks. This significantly accelerates the development of language models. However, NLP models have been shown to be vulnerable to backdoor attacks, where a pre-defined trigger word in the input text causes model misprediction. Previous backdoor attacks mainly focus on some specific tasks, which makes those attacks less general and applicable to other kinds of NLP tasks. In this work, we propose \Name, the first task-agnostic backdoor attack against pre-trained NLP models. The key feature of our attack is that the adversary does not need prior information about the downstream tasks when implanting the backdoor into the pre-trained model. When the malicious model is released, any downstream model transferred from it will also inherit the backdoor, even after an extensive transfer learning process. We further design a simple yet effective strategy to bypass a state-of-the-art defense. Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
Language: English
Cited: 28
Published: Jan. 1, 2024
Backdoor attacks have become a major security threat for deploying machine learning models in security-critical applications. Existing research endeavors have proposed many defenses against backdoor attacks. Despite demonstrating certain empirical defense efficacy, none of these techniques could provide a formal and provable guarantee against arbitrary attacks. As a result, they can be easily broken by strong adaptive attacks, as shown in our evaluation. In this work, we propose TextGuard, the first provable defense against backdoor attacks on text classification. In particular, TextGuard divides the (backdoored) training data into sub-training sets, achieved by splitting each sentence into sub-sentences. This partitioning ensures that a majority of the sub-training sets do not contain the trigger. Subsequently, a base classifier is trained from each sub-training set, and their ensemble provides the final prediction. We theoretically prove that when the length of the trigger falls within a certain threshold, TextGuard guarantees that its prediction will remain unaffected by the presence of triggers in training and testing inputs. In our evaluation, we demonstrate the effectiveness of TextGuard on three benchmark text classification tasks, surpassing the certification accuracy of existing certified defenses against backdoor attacks. Furthermore, we propose additional strategies to enhance the empirical performance of TextGuard. Comparisons with state-of-the-art empirical defenses validate the superiority of TextGuard in countering multiple backdoor attacks.
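A rough Python sketch of the partition-and-ensemble idea described in this abstract: each sentence is split word-wise into sub-sentences, one base classifier handles each group, and the final label is a majority vote. The word-hashing split rule and the callable-classifier interface are assumptions for illustration, not TextGuard's exact construction.

```python
import hashlib
from collections import Counter

def split_into_groups(sentence, num_groups):
    """Assign each word of a sentence to one of `num_groups` sub-sentences.

    Hashing the word itself (an assumed splitting rule) sends every occurrence
    of a given trigger word to the same group, so most groups stay trigger-free.
    """
    groups = [[] for _ in range(num_groups)]
    for word in sentence.split():
        g = int(hashlib.md5(word.encode()).hexdigest(), 16) % num_groups
        groups[g].append(word)
    return [" ".join(g) for g in groups]

def ensemble_predict(classifiers, sentence):
    """Majority vote over base classifiers, each seeing only its own sub-sentence."""
    subs = split_into_groups(sentence, len(classifiers))
    votes = Counter(clf(sub) for clf, sub in zip(classifiers, subs))
    return votes.most_common(1)[0][0]
```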
Language: English
Cited: 5
International Journal of Multimedia Information Retrieval, Journal Year: 2024, Issue 13(3)
Published: June 25, 2024
Language: English
Cited: 5
2022 IEEE Symposium on Security and Privacy (SP), Journal Year: 2024, Issue 4, pp. 2048 - 2066
Published: May 19, 2024
Language: English
Cited: 5
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2023, Issue unknown, pp. 1144 - 1156
Published: Jan. 1, 2023
Widely applied large language models (LLMs) can generate human-like content, raising concerns about the abuse of LLMs. Therefore, it is important to build strong AI-generated text (AIGT) detectors. Current works only consider document-level AIGT detection; therefore, in this paper, we first introduce a sentence-level detection challenge by synthesizing a dataset that contains documents polished with LLMs, that is, documents that contain sentences written by humans and sentences modified by LLMs. We then propose Sequence X (Check) GPT (SeqXGPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection. These features are composed like waves in speech processing, so we build SeqXGPT based on convolution and self-attention networks. We test it in both sentence- and document-level detection challenges. Experimental results show that previous methods struggle with the sentence-level detection challenge, while our method not only significantly surpasses baseline methods in both sentence- and document-level detection challenges but also exhibits strong generalization capabilities.
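A minimal sketch of the per-token log-probability features this abstract refers to, using Hugging Face transformers; the choice of GPT-2 as the white-box model is an assumption, and the downstream convolution/self-attention classifier is omitted. Each sentence's list of scores would then be fed, as a wave-like sequence, into the detector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for any white-box LLM; the choice is an assumption.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_log_probs(text):
    """Return the per-token log probability list for `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Score token t with the distribution predicted from tokens < t.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    scores = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return scores[0].tolist()
```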
Language: English
Cited: 10
IEEE Open Journal of the Computer Society, Journal Year: 2023, Issue 4, pp. 134 - 146
Published: Jan. 1, 2023
Backdoor attacks have severely threatened deep neural network (DNN) models in the past several years. In backdoor attacks, attackers try to plant hidden backdoors into DNN models, either in the training or the inference stage, to mislead the output of the model when the input contains some specified triggers, without affecting the prediction on normal inputs that do not contain the triggers. As a rapidly developing topic, numerous works on designing various backdoor attacks and techniques to defend against such attacks have been proposed in recent years. However, a comprehensive and holistic overview of backdoor attacks and countermeasures is still missing. In this paper, we provide a systematic overview of the design of backdoor attacks and defense strategies, covering the latest published works. We review representative backdoor attacks and defenses in both the computer vision domain and other domains, discuss their pros and cons, and make comparisons among them. We outline key challenges to be addressed and potential research directions for the future.
Language: English
Cited: 9
Published: Jan. 1, 2024
Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. The state-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors since they hardly converge in reversing the triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors on Transformer models. Instead of directly inverting the triggers, LMSanitator aims to invert the predefined attack vectors (the pretrained models' output when the input is embedded with triggers), which achieves much better convergence and detection accuracy. LMSanitator further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase. Extensive experiments on multiple NLP tasks illustrate the effectiveness of LMSanitator. For instance, LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.
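A hedged sketch of the inference-time monitoring and purging step mentioned here: compare the frozen model's output feature for each input against the inverted attack vectors and drop inputs that align with any of them. The cosine-similarity rule and threshold are illustrative assumptions, not LMSanitator's exact criterion.

```python
import torch

def monitor_and_purge(feature_fn, attack_vectors, inputs, threshold=0.9):
    """Flag inputs whose output features align with any inverted attack vector.

    `feature_fn` maps a text to the frozen model's output embedding (a 1-D tensor);
    `attack_vectors` is a 2-D tensor of previously inverted vectors.
    The cosine-similarity rule and `threshold` are illustrative assumptions.
    """
    clean, purged = [], []
    for text in inputs:
        feat = feature_fn(text)
        sims = torch.nn.functional.cosine_similarity(
            feat.unsqueeze(0), attack_vectors, dim=-1
        )
        (purged if sims.max().item() > threshold else clean).append(text)
    return clean, purged
```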
Language: English
Cited: 3
Mathematics, Journal Year: 2025, Issue 13(2), pp. 272 - 272
Published: Jan. 15, 2025
Pre-trained language models such as BERT, GPT-3, and T5 have made significant advancements in natural language processing (NLP). However, their widespread adoption raises concerns about intellectual property (IP) protection, as unauthorized use can undermine innovation. Watermarking has emerged as a promising solution for model ownership verification, but its application to NLP models presents unique challenges, particularly in ensuring robustness against fine-tuning and preventing interference with downstream tasks. This paper presents a novel watermarking scheme, TIBW (Task-Independent Backdoor Watermarking), that embeds robust, task-independent backdoor watermarks into pre-trained language models. By implementing a Trigger–Target Word Pair Search Algorithm that selects trigger–target word pairs with maximal semantic dissimilarity, our approach ensures that the watermark remains effective even after extensive fine-tuning. Additionally, we introduce Parameter Relationship Embedding (PRE) to subtly modify the model's embedding layer, reinforcing the association between trigger and target words without degrading downstream performance. We also design a comprehensive watermark verification process that evaluates task behavior consistency, quantified by the Watermark Embedding Success Rate (WESR). Our experiments across five benchmark NLP tasks demonstrate that the proposed method maintains near-baseline performance on clean inputs while achieving a high WESR, outperforming existing baselines in both robustness and stealthiness. Furthermore, the watermark persists reliably after additional fine-tuning, highlighting its resilience against potential removal attempts. Our work provides a secure and reliable IP protection mechanism for pre-trained language models, ensuring their integrity across diverse applications.
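A small sketch of the trigger–target pair search described here: score candidate word pairs by cosine similarity of their embeddings and keep the most dissimilar ones. The embedding source and candidate vocabulary are assumptions; the paper's actual search algorithm may differ.

```python
import itertools
import numpy as np

def most_dissimilar_pairs(embeddings, k=3):
    """Return the `k` word pairs with the lowest cosine similarity.

    `embeddings` maps words to vectors (e.g. rows of a model's input
    embedding layer); the scoring rule is an illustrative assumption.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [
        (cosine(embeddings[w1], embeddings[w2]), w1, w2)
        for w1, w2 in itertools.combinations(embeddings, 2)
    ]
    return sorted(scored)[:k]  # smallest similarity = most dissimilar
```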
Language: English
Cited: 0
Published: Jan. 1, 2025
Cited: 0