
NYU CTF Bench Setup

NYU CTF Bench is a collection of CTF challenges from the CSAW CTF competitions, designed for evaluating LLM agents. The challenges are dockerized and easily deployable, so that an LLM-based automation framework can interact with a challenge and attempt a solution. The main benchmark dataset contains 200 challenges across 6 CTF categories: web, binary exploitation (pwn), forensics, reverse engineering (rev), cryptography (crypto), and miscellaneous (misc).

Benchmark structure

The test/ folder contains the main benchmark dataset of 200 challenges, and the development/ folder contains a smaller development set of 55 challenges. The development set can be treated as equivalent to a "train" split and used for building the agent, so that design decisions made to improve the agent do not bias the test scores.

The folder structure is as follows: <year>/<event>/<category>/<challenge>. <year> is the year of the competition, <event> is either "CSAW-Quals" or "CSAW-Finals", <category> is one of the 6 categories, and <challenge> is the challenge name. Note that the challenge name may contain spaces and single quotes, so it is advisable to wrap it in double quotes when using it in scripts.
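As a minimal sketch of the layout described above, the following builds a challenge path from its components and quotes it for use in a shell command. The `challenge_dir` helper and the example challenge name are hypothetical, chosen only for illustration. `shlex.quote` is used here instead of manual double-quoting because it also copes with names that contain single quotes.

```python
import shlex
from pathlib import PurePosixPath


def challenge_dir(split, year, event, category, challenge):
    """Join the components as <split>/<year>/<event>/<category>/<challenge>."""
    return PurePosixPath(split) / str(year) / event / category / challenge


# Hypothetical challenge name containing a space and a single quote.
path = challenge_dir("test", 2021, "CSAW-Finals", "rev", "it's a maze")

# Quote the path before interpolating it into a shell command.
command = f"ls {shlex.quote(str(path))}"
print(command)
```

Splitting the resulting command with `shlex.split` recovers the original path as a single argument, which is the property that matters when the name contains spaces or quotes.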

Each challenge folder contains a challenge.json file with the challenge metadata, along with the corresponding challenge files. Challenges that require a server to host some of the challenge files are set up with a docker image and a docker-compose.yaml file. The docker image is loaded directly by running docker compose up.
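A rough sketch of inspecting a challenge folder by hand, assuming only what is stated above: a challenge.json file holds the metadata, and a docker-compose.yaml file is present when a server is needed. The helper name `describe_challenge` is hypothetical, and no particular metadata fields are assumed. The sketch only lists whichever top-level keys the file contains.

```python
import json
from pathlib import Path


def describe_challenge(challenge_dir):
    """Return the metadata keys and whether a compose file is present."""
    challenge_dir = Path(challenge_dir)
    with open(challenge_dir / "challenge.json", encoding="utf-8") as f:
        # Assumption: the top level of challenge.json is a JSON object.
        metadata = json.load(f)
    needs_server = (challenge_dir / "docker-compose.yaml").is_file()
    return sorted(metadata), needs_server
```

If `needs_server` comes back true for a folder, running docker compose up in that folder should start the challenge server, as described above.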

Setup

Clone the dataset:

git clone https://github.com/NYU-LLM-CTF/NYU_CTF_Bench.git

Install the python package:

pip install nyuctf

The repository is automatically cloned when the CTFDataset is first instantiated with the split argument. If needed, you can manually clone it by running:

python3 -m nyuctf.download

Usage

The following python snippet shows how to load challenge details using the python module:

from nyuctf.dataset import CTFDataset
from nyuctf.challenge import CTFChallenge

# Clones the repository for the first time, which takes a while
ds = CTFDataset(split="test")
chal = CTFChallenge(ds.get("2021f-rev-maze"), ds.basedir)

print(chal.name)
print(chal.flag)
print(chal.files)

Tests

Run the tests to check the docker setup and network connection of the challenges. This requires the docker network to be set up.

cd python
python -m unittest -v test.test_challenges

Optionally filter the tests with the unittest -k flag.
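To illustrate how name-based filtering behaves, here is a self-contained sketch using `unittest.TestLoader.testNamePatterns`, the programmatic counterpart of the -k flag. The test class and method names are toy placeholders, not the names used in test.test_challenges.

```python
import unittest


class ToyChallengeTests(unittest.TestCase):
    # Placeholder tests standing in for per-challenge checks.
    def test_rev_maze(self):
        self.assertTrue(True)

    def test_web_login(self):
        self.assertTrue(True)


loader = unittest.TestLoader()
# Similar in spirit to `-k maze`: keep only tests whose full name matches.
loader.testNamePatterns = ["*maze*"]
suite = loader.loadTestsFromTestCase(ToyChallengeTests)
selected = [test.id() for test in suite]
print(selected)
```

Only the test whose name contains "maze" is selected. On the command line, a -k pattern without wildcards is treated as a substring match in the same way.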