COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

1University of Illinois Urbana-Champaign, 2University of California San Diego, 3Allen Institute for AI. *Equal Contribution

Abstract

Jailbreaks on large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e., how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic in natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework, which unifies and automates the search for adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent (suffix) attacks with a continuation constraint, but also allow us to address new controllable attack settings such as revising a user query adversarially with a paraphrasing constraint, and inserting stealthy attacks in context with a position constraint. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5, and GPT-4) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability.

COLD-Attack Framework

The following diagram shows an overview of the COLD-Attack framework.

As illustrated in the above diagram, our COLD-Attack framework includes three main steps:

  1. Energy function formulation: Specify energy functions properly to capture the attack constraints such as fluency, stealthiness, sentiment, and left-right-coherence.
  2. Langevin dynamics sampling: Run Langevin dynamics iteratively for a specified number of steps to obtain good samples from the energy-based model governing the adversarial attack logits.
  3. Decoding process: Leverage an LLM-guided decoding process to convert the continuous logits into discrete text attacks. (A minimal sketch of these three steps is given below.)
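To make these steps concrete, here is a minimal PyTorch sketch of the sampling and decoding loop. The helper names (total_energy, langevin_sample, decode) and all hyperparameters are illustrative placeholders rather than the exact implementation; in practice, the energy terms and the logits used for decoding are computed with the target LLM itself.

import torch

def total_energy(y, weights, energy_terms):
    # E(y): weighted sum of differentiable constraint energies over the soft
    # prompt logits y of shape (seq_len, vocab_size). Each term encodes one
    # requirement (fluency, attack success, sentiment, left-right-coherence, ...).
    return sum(w * term(y) for w, term in zip(weights, energy_terms))

def langevin_sample(energy_fn, seq_len, vocab_size, n_steps=500,
                    step_size=0.1, noise_scale=1.0, device="cuda"):
    # Langevin dynamics on continuous logits: y <- y - step_size * dE/dy + noise.
    y = torch.randn(seq_len, vocab_size, device=device)
    for _ in range(n_steps):
        y.requires_grad_(True)
        grad, = torch.autograd.grad(energy_fn(y), y)
        y = (y - step_size * grad + noise_scale * torch.randn_like(y)).detach()
    return y

def decode(y, lm_logits, top_k=10):
    # LLM-guided decoding sketch: at each position, restrict to the LM's top-k
    # candidate tokens and pick the one scored highest by the soft logits y.
    tokens = []
    for pos in range(y.size(0)):
        candidates = lm_logits[pos].topk(top_k).indices
        tokens.append(candidates[y[pos, candidates].argmax()].item())
    return tokens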

Attack Settings

The COLD-Attack framework unifies and automates the search for adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios, including:

  • Attack with Continuation Constraint: appending the adversarial prompt to the original malicious user query.
  • Attack with Paraphrasing Constraint: revising a user query adversarially with minimal paraphrasing.
  • Attack with Position Constraint: inserting stealthy attacks in context with left-right-coherence.
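To make the three settings concrete, the sketch below shows how the final adversarial prompt could be assembled in each case. The optimize_attack helper stands in for the energy-based sampling and decoding pipeline described above, and its signature is hypothetical.

def build_attack_prompt(setting, query, optimize_attack,
                        left_context=None, right_context=None):
    # Assemble the final adversarial prompt for the three COLD-Attack settings.
    if setting == "continuation":
        # Suffix attack: append a fluent, optimized suffix to the malicious query.
        suffix = optimize_attack(prefix=query)
        return query + " " + suffix
    if setting == "paraphrasing":
        # Paraphrase attack: adversarially rewrite the query itself while
        # staying close to the original wording.
        return optimize_attack(reference=query)
    if setting == "position":
        # Position-constrained attack: insert the optimized text between fixed
        # left and right contexts so the full prompt remains coherent.
        insertion = optimize_attack(prefix=left_context, suffix=right_context)
        return f"{left_context} {insertion} {right_context}"
    raise ValueError(f"unknown setting: {setting}")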
Here are some examples generated by COLD-Attack:

[Example images 1–6]

Experiment Results

Attack with Continuation Constraint
  • COLD-Attack achieves attack success rates (ASRs) comparable to existing baseline methods such as GCG, AutoDAN-Zhu, and AutoDAN-Liu.
  • COLD-Attack stands out by achieving better stealthiness, with lower perplexity (PPL) than all other methods (a rough PPL computation is sketched after this list).
  • We adopt the Distinct N-grams Score (DNS), Averaged Distinct N-grams (ADN), and Self-BLEU to evaluate the diversity of the generated adversarial prompts.
  • COLD-Attack generates more diverse adversarial prompts than the baseline methods.
  • We report the per-sample running time (minutes) for COLD-Attack and baseline methods using a single NVIDIA V100 GPU.
  • COLD-Attack is much faster than GCG and AutoDAN-Zhu.
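For reference, the perplexity used for the stealthiness metric can be computed roughly as follows. This sketch scores a prompt with GPT-2 via Hugging Face Transformers; the scoring model and exact protocol used in the paper may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text, model_name="gpt2", device="cuda"):
    # Perplexity of `text` under a causal LM: exp of the mean per-token loss.
    # Lower PPL indicates more fluent, and hence stealthier, prompts.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()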
Attack with Paraphrasing Constraint
  • COLD-Attack produces high-quality paraphrases.
  • COLD-Attack significantly outperforms the three other paraphrasing baselines in terms of ASR.
  • Our paraphrase attack with sentiment control reveals that different LLMs exhibit varying susceptibilities to different sentiments.
Attack with Position Constraint
  • COLD-Attack can effectively generate stealthy attacks that satisfy the position constraint.
  • COLD-Attack also allows the use of separate prompts to pose output constraints on the target LLMs.
  • COLD-Attack outperforms other baseline methods, including AutoDAN-Zhu and GCG, under the position constraint setting.

BibTeX

@article{guo2024cold,
  author    = {Guo, Xingang and Yu, Fangxu and Zhang, Huan and Qin, Lianhui and Hu, Bin},
  title     = {COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability},
  journal   = {arXiv preprint arXiv:2402.08679},
  year      = {2024},
}