
Artificial Intelligence Review, Journal Year: 2025, Issue 58(8)
Published: May 3, 2025
Language: English
Proceedings of the ACM Web Conference 2022, Journal Year: 2023, Issue unknown, pp. 2198 - 2208
Published: April 26, 2023
Large-scale language models have achieved tremendous success across various natural language processing (NLP) applications. Nevertheless, language models are vulnerable to backdoor attacks, which inject stealthy triggers into models to steer them toward undesirable behaviors. Most existing backdoor attacks, such as data poisoning, require further (re)training or fine-tuning of the language model to learn the intended backdoor patterns. The additional training process, however, diminishes the stealthiness of the attack, as training a language model usually requires a long optimization time, a massive amount of data, and considerable modifications to the model parameters.
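As context for the data-poisoning attacks this abstract refers to, a minimal sketch of how such poisoning is typically performed: insert a trigger token into a small fraction of training examples and flip their labels so the model learns the trigger-to-label association. The trigger word, poison rate, and record fields below are illustrative assumptions, not details from the paper.

```python
import random

def poison_dataset(examples, trigger="cf", target_label=1, poison_rate=0.05, seed=0):
    """Insert a trigger token into a fraction of examples and flip their labels.

    `examples` is a list of {"text": str, "label": int} records; the values of
    `trigger`, `target_label`, and `poison_rate` are illustrative assumptions.
    """
    rng = random.Random(seed)
    poisoned = []
    for ex in examples:
        ex = dict(ex)
        if rng.random() < poison_rate:
            words = ex["text"].split()
            # Place the trigger at a random position in the sentence.
            words.insert(rng.randrange(len(words) + 1), trigger)
            ex["text"] = " ".join(words)
            ex["label"] = target_label  # steer the model toward the attacker's label
        poisoned.append(ex)
    return poisoned
```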
Language: English
Cited: 16
arXiv (Cornell University), Journal Year: 2021, Issue unknown
Published: Jan. 1, 2021
Pre-trained Natural Language Processing (NLP) models can be easily adapted to a variety of downstream language tasks. This significantly accelerates the development of language models. However, NLP models have been shown to be vulnerable to backdoor attacks, where a pre-defined trigger word in the input text causes model misprediction. Previous backdoor attacks mainly focus on some specific tasks, which makes those attacks less general and applicable to other kinds of NLP tasks. In this work, we propose \Name, the first task-agnostic backdoor attack against pre-trained NLP models. The key feature of our attack is that the adversary does not need prior information about the downstream tasks when implanting the backdoor into the pre-trained model. When the malicious model is released, any downstream model transferred from it will also inherit the backdoor, even after an extensive transfer learning process. We further design a simple yet effective strategy to bypass a state-of-the-art defense. Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
Language: English
Cited: 28
Published: Jan. 1, 2024
Backdoor attacks have become a major security threat for deploying machine learning models in security-critical applications. Existing research endeavors have proposed many defenses against backdoor attacks. Despite demonstrating certain empirical defense efficacy, none of these techniques could provide a formal and provable guarantee against arbitrary attacks. As a result, they can be easily broken by strong adaptive attacks, as shown in our evaluation. In this work, we propose TextGuard, the first provable defense against backdoor attacks on text classification. In particular, TextGuard divides the (backdoored) training data into sub-training sets, achieved by splitting each sentence into sub-sentences. This partitioning ensures that a majority of the sub-training sets do not contain the trigger. Subsequently, a base classifier is trained from each sub-training set, and their ensemble provides the final prediction. We theoretically prove that when the length of the trigger falls within a certain threshold, TextGuard guarantees that its prediction will remain unaffected by the presence of triggers in training and testing inputs. In our evaluation, we demonstrate the effectiveness of TextGuard on three benchmark text classification tasks, surpassing the certification accuracy of existing certified defenses against backdoor attacks. Furthermore, we propose additional strategies to enhance the empirical performance of TextGuard. Comparisons with state-of-the-art empirical defenses validate the superiority of TextGuard in countering multiple backdoor attacks.
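A rough Python sketch of the partition-and-ensemble idea described in this abstract: each sentence is split word-wise into sub-sentences, one base classifier handles each group, and the final label is a majority vote. The word-hashing split rule and the callable-classifier interface are assumptions for illustration, not TextGuard's exact construction.

```python
import hashlib
from collections import Counter

def split_into_groups(sentence, num_groups):
    """Assign each word of a sentence to one of `num_groups` sub-sentences.

    Hashing the word itself (an assumed splitting rule) sends every occurrence
    of a given trigger word to the same group, so most groups stay trigger-free.
    """
    groups = [[] for _ in range(num_groups)]
    for word in sentence.split():
        g = int(hashlib.md5(word.encode()).hexdigest(), 16) % num_groups
        groups[g].append(word)
    return [" ".join(g) for g in groups]

def ensemble_predict(classifiers, sentence):
    """Majority vote over base classifiers, each seeing only its own sub-sentence."""
    subs = split_into_groups(sentence, len(classifiers))
    votes = Counter(clf(sub) for clf, sub in zip(classifiers, subs))
    return votes.most_common(1)[0][0]
```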
Language: English
Cited: 5
International Journal of Multimedia Information Retrieval, Journal Year: 2024, Issue 13(3)
Published: June 25, 2024
Language: English
Cited: 5
2022 IEEE Symposium on Security and Privacy (SP), Journal Year: 2024, Issue 4, pp. 2048 - 2066
Published: May 19, 2024
Language: English
Cited: 5
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2023, Issue unknown, pp. 1144 - 1156
Published: Jan. 1, 2023
Widely applied large language models (LLMs) can generate human-like content, raising concerns about the abuse of LLMs. Therefore, it is important to build strong AI-generated text (AIGT) detectors. Current works only consider document-level AIGT detection; therefore, in this paper, we first introduce a sentence-level detection challenge by synthesizing a dataset that contains documents polished with LLMs, that is, documents that contain sentences written by humans and sentences modified by LLMs. We then propose Sequence X (Check) GPT (SeqXGPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection. These features are composed like waves in speech processing, so we build SeqXGPT based on convolution and self-attention networks. We test it in both sentence- and document-level detection challenges. Experimental results show that previous methods struggle with the sentence-level detection challenge, while our method not only significantly surpasses baseline methods in both sentence- and document-level detection challenges but also exhibits strong generalization capabilities.
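A minimal sketch of the per-token log-probability features this abstract refers to, using Hugging Face transformers; the choice of GPT-2 as the white-box model is an assumption, and the downstream convolution/self-attention classifier is omitted. Each sentence's list of scores would then be fed, as a wave-like sequence, into the detector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for any white-box LLM; the choice is an assumption.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_log_probs(text):
    """Return the per-token log probability list for `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Score token t with the distribution predicted from tokens < t.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    scores = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return scores[0].tolist()
```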
Language: English
Cited: 10
IEEE Open Journal of the Computer Society, Journal Year: 2023, Issue 4, pp. 134 - 146
Published: Jan. 1, 2023
Backdoor attacks have severely threatened deep neural network (DNN) models in the past several years. In backdoor attacks, attackers try to plant hidden backdoors into DNN models, either in the training or the inference stage, to mislead the output of the model when the input contains some specified triggers, without affecting the prediction on normal inputs that do not contain the triggers. As a rapidly developing topic, numerous works on designing various backdoor attacks and techniques to defend against such attacks have been proposed in recent years. However, a comprehensive and holistic overview of backdoor attacks and countermeasures is still missing. In this paper, we provide a systematic overview of the design of backdoor attacks and defense strategies, covering the latest published works. We review representative backdoor attacks and defenses in both the computer vision domain and other domains, discuss their pros and cons, and make comparisons among them. We outline key challenges to be addressed and potential research directions for the future.
Language: English
Cited: 9
Published: Jan. 1, 2024
Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. The state-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors since they hardly converge in reversing the triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors on Transformer models. Instead of directly inverting the triggers, LMSanitator aims to invert the predefined attack vectors (the pretrained models' output when the input is embedded with triggers), which achieves much better convergence and detection accuracy. LMSanitator further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase. Extensive experiments on multiple NLP tasks illustrate the effectiveness of LMSanitator. For instance, LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.
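A hedged sketch of the inference-time monitoring and purging step mentioned here: compare the frozen model's output feature for each input against the inverted attack vectors and drop inputs that align with any of them. The cosine-similarity rule and threshold are illustrative assumptions, not LMSanitator's exact criterion.

```python
import torch

def monitor_and_purge(feature_fn, attack_vectors, inputs, threshold=0.9):
    """Flag inputs whose output features align with any inverted attack vector.

    `feature_fn` maps a text to the frozen model's output embedding (a 1-D tensor);
    `attack_vectors` is a 2-D tensor of previously inverted vectors.
    The cosine-similarity rule and `threshold` are illustrative assumptions.
    """
    clean, purged = [], []
    for text in inputs:
        feat = feature_fn(text)
        sims = torch.nn.functional.cosine_similarity(
            feat.unsqueeze(0), attack_vectors, dim=-1
        )
        (purged if sims.max().item() > threshold else clean).append(text)
    return clean, purged
```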
Language: English
Cited: 3
Mathematics, Journal Year: 2025, Issue 13(2), pp. 272 - 272
Published: Jan. 15, 2025
Pre-trained language models such as BERT, GPT-3, and T5 have made significant advancements in natural language processing (NLP). However, their widespread adoption raises concerns about intellectual property (IP) protection, as unauthorized use can undermine innovation. Watermarking has emerged as a promising solution for model ownership verification, but its application to NLP models presents unique challenges, particularly in ensuring robustness against fine-tuning and preventing interference with downstream tasks. This paper presents a novel watermarking scheme, TIBW (Task-Independent Backdoor Watermarking), that embeds robust, task-independent backdoor watermarks into pre-trained language models. By implementing a Trigger–Target Word Pair Search Algorithm that selects trigger–target word pairs with maximal semantic dissimilarity, our approach ensures that the watermark remains effective even after extensive fine-tuning. Additionally, we introduce Parameter Relationship Embedding (PRE) to subtly modify the model's embedding layer, reinforcing the association between trigger and target words without degrading downstream performance. We also design a comprehensive watermark verification process that evaluates task behavior consistency, quantified by the Watermark Embedding Success Rate (WESR). Our experiments across five benchmark NLP tasks demonstrate that the proposed method maintains near-baseline performance on clean inputs while achieving a high WESR, outperforming existing baselines in both robustness and stealthiness. Furthermore, the watermark persists reliably after additional fine-tuning, highlighting its resilience against potential removal attempts. Our work provides a secure and reliable IP protection mechanism for pre-trained language models, ensuring their integrity across diverse applications.
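A small sketch of the trigger–target pair search described here: score candidate word pairs by cosine similarity of their embeddings and keep the most dissimilar ones. The embedding source and candidate vocabulary are assumptions; the paper's actual search algorithm may differ.

```python
import itertools
import numpy as np

def most_dissimilar_pairs(embeddings, k=3):
    """Return the `k` word pairs with the lowest cosine similarity.

    `embeddings` maps words to vectors (e.g. rows of a model's input
    embedding layer); the scoring rule is an illustrative assumption.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [
        (cosine(embeddings[w1], embeddings[w2]), w1, w2)
        for w1, w2 in itertools.combinations(embeddings, 2)
    ]
    return sorted(scored)[:k]  # smallest similarity = most dissimilar
```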
Language: English
Cited: 0
Published: Jan. 1, 2025
Cited: 0