We should have a tool to track GitHub project activity…

Papers collection:

Techniques:

  • Input
    • heuristics
    • fine-tuned models
      • chunked detection of inputs?
    • prompt rewriting
  • Output filtering
    • heuristics
    • canary tokens
    • fine-tuned models
  • Logging
    • embeddings/vector db storage of attacks
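
A rough sketch of the "embeddings/vector db storage of attacks" idea, using only the standard library. Everything here is an assumption for illustration: `embed` is a toy character-trigram counter standing in for a real embedding model, `AttackLog` is a hypothetical in-memory stand-in for a vector database, and the 0.8 threshold is arbitrary.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase character-trigram counts."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AttackLog:
    """Hypothetical in-memory store of vectors for previously seen attacks."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # arbitrary cut-off, tune on real data
        self.vectors = []

    def record(self, attack_text):
        """Log a confirmed attack so similar prompts can be recognized later."""
        self.vectors.append(embed(attack_text))

    def max_similarity(self, prompt):
        """Highest similarity between the prompt and any logged attack."""
        v = embed(prompt)
        return max((cosine(v, known) for known in self.vectors), default=0.0)

    def looks_like_known_attack(self, prompt):
        return self.max_similarity(prompt) >= self.threshold
```

In practice the trigram counter would be replaced by a real embedding model and the list by a vector database; the lookup logic (nearest stored attack, compared against a threshold) stays the same.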

rebuff

  • Apache 2.0
  • Heuristics: Filter out potentially malicious input before it reaches the LLM.
  • LLM-based detection: Use a dedicated LLM to analyze incoming prompts and identify potential attacks.
  • VectorDB: Store embeddings of previous attacks in a vector database to recognize and prevent similar attacks in the future.
  • Canary tokens: Add canary tokens to prompts to detect leakage, allowing the framework to store embeddings of the incoming prompt in the vector database and prevent future attacks.
pip install rebuff
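
A minimal sketch of the canary-token idea in plain Python. This is not Rebuff's API: `add_canary` and `leaked` are hypothetical helper names, and the HTML-comment wrapper around the token is just one assumed way to embed it in a prompt.

```python
import secrets
from typing import Tuple


def add_canary(prompt_template: str) -> Tuple[str, str]:
    """Prepend a random marker to the prompt and return (guarded_prompt, canary)."""
    canary = secrets.token_hex(8)  # 16 hex characters, unlikely to occur by chance
    guarded = f"<!-- {canary} -->\n{prompt_template}"
    return guarded, canary


def leaked(response: str, canary: str) -> bool:
    """True if the model's response reproduces the canary, i.e. the prompt leaked."""
    return canary in response
```

If `leaked` returns True, the user input that triggered the response is a candidate for logging as an attack.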

vigil

Vigil is a Python library and REST API for assessing Large Language Model prompts and responses against a set of scanners to detect prompt injections, jailbreaks, and other potential threats. This repository also provides the detection signatures and datasets needed to get started with self-hosting.

llm-guard

pip install llm-guard

langkit

LangKit is an open-source text metrics toolkit for monitoring language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library whylogs.

  • Apache 2.0

The out-of-the-box metrics include:

  • Text Quality
    • readability score
    • complexity and grade scores
  • Text Relevance
    • Similarity scores between prompt/responses
    • Similarity scores against user-defined themes
  • Security and Privacy
    • patterns - count of strings matching a user-defined regex pattern group
    • jailbreaks - similarity scores with respect to known jailbreak attempts
    • prompt injection - similarity scores with respect to known prompt injection attacks
    • hallucinations - consistency check between responses
    • refusals - similarity scores with respect to known LLM refusal of service responses
  • Sentiment and Toxicity
    • sentiment analysis
    • toxicity analysis
pip install langkit[all]
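
To make the "patterns" metric concrete, here is a small standard-library sketch of counting matches per user-defined regex pattern group. It does not use LangKit's API; `PATTERN_GROUPS` and `count_pattern_matches` are hypothetical names, and the two regexes are simplified examples rather than production-grade detectors.

```python
import re

# Hypothetical pattern groups: group name -> list of regexes (simplified examples).
PATTERN_GROUPS = {
    "email": [r"[\w.+-]+@[\w-]+\.[\w.-]+"],
    "us_ssn": [r"\b\d{3}-\d{2}-\d{4}\b"],
}


def count_pattern_matches(text, groups=PATTERN_GROUPS):
    """Return {group name: number of substrings matching any regex in the group}."""
    return {
        name: sum(len(re.findall(pattern, text)) for pattern in patterns)
        for name, patterns in groups.items()
    }
```

The resulting per-group counts can be logged alongside each prompt or response as a numeric signal.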

promptmap

promptmap is a tool that automatically tests prompt injection attacks on ChatGPT instances. It analyzes your ChatGPT rules to understand their context and purpose, and uses this understanding to generate creative attack prompts tailored to the target. promptmap then runs a ChatGPT instance with the system prompts you provide and sends the attack prompts to it. It determines whether a prompt injection attack was successful by checking the answer returned by your ChatGPT instance.

  • MIT

llm-security

This repository contains scripts and related documentation that demonstrate attacks against large language models using repeated character sequences. These techniques can be used to execute prompt injection on content-constrained LLM queries.

Open-Prompt-Injection

  • No license

For attacks, clients can use one of the following keywords: naive, escape, ignore, fake_comp, and combine. Each of them corresponds to one attack strategy mentioned in the paper.

For defenses, specify one of the following keywords when creating the app:

  1. By default, “no” is used, meaning that no defense is applied.
  2. Paraphrasing: “paraphrasing”
  3. Retokenization: “retokenization”
  4. Data prompt isolation: “delimiters”, “xml”, or “random_seq”
  5. Instructional prevention: “instructional”
  6. Sandwich prevention: “sandwich”
  7. Perplexity-based detection: use “ppl-[window_size]-[threshold]”. For non-windowed PPL detection, use “ppl-all-[threshold]”. For example, “ppl-all-3.0” selects the PPL detector without windows and with a threshold of 3.0, while “ppl-5-3.5” selects a windowed PPL detector with a window size of 5 and a threshold of 3.5.
  8. LLM-based detection: “llm-based”
  9. Response-based detection: “response-based”
  10. Proactive detection: “proactive”
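
The perplexity keyword format in item 7 can be sketched as follows. This is not code from the Open-Prompt-Injection repository: `parse_ppl_keyword`, `perplexity`, and `is_flagged` are hypothetical helpers, the input is assumed to be a list of per-token negative log-likelihoods from some language model, and comparing the threshold against raw perplexity (rather than log-perplexity) is an assumption.

```python
import math
from typing import List, Optional, Tuple


def parse_ppl_keyword(keyword: str) -> Tuple[Optional[int], float]:
    """Parse 'ppl-[window_size]-[threshold]' or 'ppl-all-[threshold]'.

    Returns (window_size, threshold); window_size is None for the non-windowed form.
    """
    parts = keyword.split("-")
    if len(parts) != 3 or parts[0] != "ppl":
        raise ValueError(f"not a PPL defense keyword: {keyword!r}")
    window = None if parts[1] == "all" else int(parts[1])
    if window is not None and window <= 0:
        raise ValueError("window size must be positive")
    return window, float(parts[2])


def perplexity(nlls: List[float]) -> float:
    """Perplexity of a token span given per-token negative log-likelihoods."""
    return math.exp(sum(nlls) / len(nlls))


def is_flagged(nlls: List[float], keyword: str) -> bool:
    """Flag the input if its perplexity (or any window's perplexity) exceeds the threshold."""
    window, threshold = parse_ppl_keyword(keyword)
    if not nlls:
        return False
    if window is None or window >= len(nlls):
        return perplexity(nlls) > threshold
    return any(
        perplexity(nlls[i:i + window]) > threshold
        for i in range(len(nlls) - window + 1)
    )
```

The windowed form exists because a short injected span can be hidden by averaging over a long benign input; scoring each window separately lets a locally unusual span trip the detector.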

Clients should check ./configs/model_configs/ for the supported LLMs and enter their own PaLM2 API keys in the corresponding fields of the model config. Support for other models will be added later.

Prompt-adversarial collections

  • MIT

This repository serves as a comprehensive resource on the study and practice of prompt-injection attacks, defenses, and interesting examples. It contains a collection of examples, case studies, and detailed notes aimed at researchers, students, and security professionals interested in this topic.