We should have a tool to track GitHub project activity…

Papers collection:

https://huggingface.co/collections/leonardlin/prompt-injection-65dd93985012ec503f2a735a

Techniques:

Input
- heuristics
- fine-tuned models
  - chunked detection of inputs?
- prompt rewriting
Output filtering
- heuristics
- canary tokens
- fine-tuned models
Logging
- embeddings/vector db storage of attacks
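
As a rough illustration of the input-side ideas above (heuristics combined with chunked detection), here is a minimal Python sketch. The patterns, the function names `chunk_words` and `flag_input`, and the window sizes are all assumptions made for this example; real rule sets are far larger and are usually paired with a fine-tuned classifier.

```python
import re

# Hypothetical patterns for illustration only; not taken from any listed project.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"disregard (the |your )?(system|previous) prompt", re.IGNORECASE),
]


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows so a phrase is not cut at a boundary."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def flag_input(text, size=50, overlap=10):
    """Return the chunks that match at least one suspicious pattern."""
    return [
        chunk
        for chunk in chunk_words(text, size, overlap)
        if any(p.search(chunk) for p in SUSPICIOUS_PATTERNS)
    ]
```

Scanning per chunk keeps each check cheap on long inputs and lets a downstream model look only at the flagged windows.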

rebuff

Apache 2.0
Heuristics: Filter out potentially malicious input before it reaches the LLM.
LLM-based detection: Use a dedicated LLM to analyze incoming prompts and identify potential attacks.
VectorDB: Store embeddings of previous attacks in a vector database to recognize and prevent similar attacks in the future.
Canary tokens: Add canary tokens to prompts to detect leakages, allowing the framework to store embeddings about the incoming prompt in the vector database and prevent future attacks.

pip install rebuff
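
The canary-token idea can be sketched with the standard library alone. This is not rebuff's API; the helper names `add_canary` and `canary_leaked` and the comment-style wrapper are assumptions chosen for illustration.

```python
import secrets


def add_canary(prompt_template, nbytes=8):
    """Prepend a random marker to the prompt and return (guarded_prompt, canary)."""
    canary = secrets.token_hex(nbytes)  # 2 * nbytes hex characters
    guarded = f"<!-- {canary} -->\n{prompt_template}"
    return guarded, canary


def canary_leaked(completion, canary):
    """True if the model output contains the marker, i.e. the prompt leaked."""
    return canary in completion
```

A caller would send the guarded prompt to the model, check the completion with `canary_leaked`, and log the offending user input (for example as an embedding) when a leak is detected.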

vigil

Vigil is a Python library and REST API for assessing Large Language Model prompts and responses against a set of scanners to detect prompt injections, jailbreaks, and other potential threats. This repository also provides the detection signatures and datasets needed to get started with self-hosting.

Analyze LLM prompts for common injections and risky inputs
Use Vigil as a Python library or REST API
Scanners are modular and easily extensible
Evaluate detections and pipelines with Vigil-Eval (coming soon)
Available scan modules
- Vector database / text similarity
  - Auto-updating on detected prompts
- Heuristics via YARA
- Transformer model
- Prompt-response similarity
- Canary Tokens
- Sentiment analysis
- Relevance (via LiteLLM)
- Paraphrasing
Supports local embeddings and/or OpenAI
Signatures and embeddings for common attacks
Custom detections via YARA signatures
Streamlit web UI playground
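
To make the "vector database / text similarity" scanner with auto-updating more concrete, here is a toy sketch. It is not Vigil's implementation: a bag-of-words vector stands in for real embeddings, an in-memory list stands in for the vector database, and `AttackStore` with its 0.8 threshold is a hypothetical name and value.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class AttackStore:
    """In-memory stand-in for a vector database of known attack prompts."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.vectors = []

    def add(self, text):
        self.vectors.append(bag_of_words(text))

    def max_similarity(self, text):
        query = bag_of_words(text)
        return max((cosine(query, v) for v in self.vectors), default=0.0)

    def check(self, text, auto_update=True):
        """Flag text close to a stored attack; optionally store the new variant."""
        hit = self.max_similarity(text) >= self.threshold
        if hit and auto_update:
            self.add(text)
        return hit
```

Storing each detected variant widens coverage over time, at the cost of possibly amplifying false positives, so the threshold needs tuning against real data.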

llm-guard

MIT
Input/output scanning, e.g.:
- Prompt scanners
  - Anonymize
  - BanCompetitors
  - BanSubstrings
  - BanTopics
  - Code
  - Gibberish
  - InvisibleText
  - Language
  - PromptInjection
  - Regex
  - Secrets
  - Sentiment
  - TokenLimit
  - Toxicity
- Output scanners
  - BanCompetitors
  - BanSubstrings
  - BanTopics
  - Bias
  - Code
  - Deanonymize
  - JSON
  - Language
  - LanguageSame
  - MaliciousURLs
  - NoRefusal
  - ReadingTime
  - FactualConsistency
  - Gibberish
  - Regex
  - Relevance
  - Sensitive
  - Sentiment
  - Toxicity
  - URLReachability

pip install llm-guard

langkit

LangKit is an open-source text metrics toolkit for monitoring language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library whylogs.

Apache 2.0

The out-of-the-box metrics include:

Text Quality
- readability score
- complexity and grade scores
Text Relevance
- Similarity scores between prompt/responses
- Similarity scores against user-defined themes
Security and Privacy
- patterns - count of strings matching a user-defined regex pattern group
- jailbreaks - similarity scores with respect to known jailbreak attempts
- prompt injection - similarity scores with respect to known prompt injection attacks
- hallucinations - consistency check between responses
- refusals - similarity scores with respect to known LLM refusal of service responses
Sentiment and Toxicity
- sentiment analysis
- toxicity analysis

pip install langkit[all]
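
The "patterns" metric (a count of strings matching a user-defined regex pattern group) can be approximated in a few lines. This sketch does not use LangKit's API; the group names, the regexes, and `count_pattern_matches` are assumptions for illustration.

```python
import re

# Hypothetical pattern groups; each group maps to one or more regexes.
PATTERN_GROUPS = {
    "email": [r"[\w.+-]+@[\w-]+\.[\w.-]+"],
    "us_phone": [r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"],
}


def count_pattern_matches(text, groups=None):
    """Return {group_name: number of matches across that group's regexes}."""
    groups = PATTERN_GROUPS if groups is None else groups
    return {
        name: sum(len(re.findall(pattern, text)) for pattern in patterns)
        for name, patterns in groups.items()
    }
```

Logging these counts per prompt and per response gives a cheap signal for tracking how often sensitive-looking strings appear.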

promptmap

promptmap is a tool that automatically tests prompt injection attacks on ChatGPT instances. It analyzes your ChatGPT rules to understand their context and purpose, then uses that understanding to generate creative attack prompts tailored to the target. promptmap then runs a ChatGPT instance with the system prompts you provide and sends the attack prompts to it. It determines whether a prompt injection attack succeeded by checking the answer returned by your ChatGPT instance.

llm-security

This repository contains scripts and related documentation that demonstrate attacks against large language models using repeated character sequences. These techniques can be used to execute prompt injection on content-constrained LLM queries.

Open-Prompt-Injection

No license

For attacks, clients can use one of the following keywords: naive, escape, ignore, fake_comp, and combine. Each of them corresponds to one attack strategy described in the paper.

For defenses, specify one of the following keywords when creating the app:

By default, “no” is used, meaning that no defense is applied.
Paraphrasing: “paraphrasing”
Retokenization: “retokenization”
Data prompt isolation: “delimiters”, “xml”, or “random_seq”
Instructional prevention: “instructional”
Sandwich prevention: “sandwich”
Perplexity-based detection: use “ppl-[window_size]-[threshold]”. For non-windowed PPL detection, use “ppl-all-[threshold]”. For example, “ppl-all-3.0” means a PPL detector without windows and a threshold of 3.0, while “ppl-5-3.5” means a windowed PPL detector with a window size of 5 and a threshold of 3.5.
LLM-based detection: “llm-based”
Response-based detection: “response-based”
Proactive detection: “proactive”
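
The perplexity keyword format described above is compact enough to parse by hand. The following is a small sketch of such a parser, not the repository's own code; `PPLConfig` and `parse_ppl_keyword` are hypothetical names, and the sketch assumes non-negative thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PPLConfig:
    window_size: Optional[int]  # None means non-windowed ("all")
    threshold: float


def parse_ppl_keyword(keyword: str) -> PPLConfig:
    """Parse 'ppl-[window_size]-[threshold]' or 'ppl-all-[threshold]'."""
    parts = keyword.split("-")
    if len(parts) != 3 or parts[0] != "ppl":
        raise ValueError(f"not a PPL defense keyword: {keyword!r}")
    _, window, threshold = parts
    window_size = None if window == "all" else int(window)
    return PPLConfig(window_size=window_size, threshold=float(threshold))
```

For instance, `parse_ppl_keyword("ppl-5-3.5")` yields a window size of 5 and a threshold of 3.5, while `"ppl-all-3.0"` yields no window and a threshold of 3.0.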

Clients are recommended to navigate to ./configs/model_configs/ to check the supported LLMs. Clients should also enter their own PaLM2 API keys in the corresponding fields of the model config. Support for other models will be added later.

Prompt-adversarial collections

MIT

This repository serves as a comprehensive resource on the study and practice of prompt-injection attacks, defenses, and interesting examples. It contains a collection of examples, case studies, and detailed notes aimed at researchers, students, and security professionals interested in this topic.
