The NYU CTF Bench is designed to evaluate the cybersecurity capabilities of LLM agents. It provides difficult real-world CTF challenges to facilitate research on improving LLMs at interactive cybersecurity tasks and complex automated task planning. Evaluating LLM agents on the NYU CTF challenges yields insights into the potential of AI-driven cybersecurity for real-world threat management. Check out the paper for details.
```bibtex
@misc{shao2024nyuctf,
      title={NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security},
      author={Minghao Shao and Sofija Jancheska and Meet Udeshi and Brendan Dolan-Gavitt and Haoran Xi and Kimberly Milner and Boyuan Chen and Max Yin and Siddharth Garg and Prashanth Krishnamurthy and Farshad Khorrami and Ramesh Karri and Muhammad Shafique},
      year={2024},
      eprint={2406.05590},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2406.05590},
}
```
All submissions are managed in the leaderboard submissions GitHub repository. Follow the README in that repository to make a submission.