Benchmarking Correctness and Security in Multi-Turn Code Generation

Ruchit Rawal   Jeffrey Yang Fan Chiang   Chihao Shen   Jeffery Siyuan Tian   Aastha Mahajan   Tom Goldstein   Yizheng Chen  
University of Maryland, College Park

MT-Sec Leaderboard

Rank | Model | MT Correct & Secure (%) (higher is better) | MT Correct & Insecure (%) (lower is better)
1 | GPT-5ᵀ (Aider) | 35.8 | 13.0
2 | GPT-5ᵀ (OpenHands) | 34.2 | 15.9
3 | Claude Opus 4ᵀ | 40.1 | 13.1
4 | GPT-5ᵀ | 39.7 | 12.2
5 | GPT-5ᵀ (Codex) | 36.2 | 15.0
6 | O4 Miniᵀ | 38.3 | 11.1
7 | Claude Sonnet 4ᵀ | 38.8 | 13.4
8 | O3ᵀ | 37.0 | 10.7
9 | GPT-5 Miniᵀ | 39.2 | 12.0
10 | Gemini 2.5 Proᵀ | 36.4 | 11.5
11 | O3 Miniᵀ | 38.3 | 11.5
12 | O1ᵀ | 36.6 | 11.8
13 | Claude 3.7 Sonnetᵀ | 38.0 | 12.9
14 | DeepSeek-R1ᵀ | 33.9 | 11.4
15 | GPT-4.1 | 35.7 | 10.9
16 | Claude 3.7 Sonnet | 35.4 | 12.9
17 | GPT-4o | 30.6 | 11.0
18 | O1 Miniᵀ | 34.7 | 10.1
19 | DeepSeek-V3 | 34.5 | 12.1
20 | Claude 3.5 Sonnet | 28.9 | 9.9
21 | Qwen-2.5 Coder 32B | 29.4 | 8.8
22 | Qwen-3 14B | 19.8 | 10.1
23 | Qwen-2.5 Coder 14B | 24.3 | 8.6
24 | Gemini 2.5 Flashᵀ | 23.1 | 8.2
25 | Qwen-3 8Bᵀ | 19.6 | 9.5
26 | Qwen-3 4Bᵀ | 16.4 | 8.8
27 | Qwen-2.5 Coder 7B | 17.7 | 9.8
28 | Qwen-3 4B | 16.1 | 9.6
29 | Qwen-3 8B | 18.1 | 9.8
30 | Qwen-2.5 Coder 3B | 11.4 | 9.9
31 | Qwen-3 1.7Bᵀ | 11.3 | 8.2
32 | Qwen-3 1.7B | 9.4 | 8.5
33 | Qwen-3 0.6Bᵀ | 4.2 | 7.0
34 | Qwen-3 0.6B | 3.6 | 7.4
35 | Qwen-2.5 Coder 0.5B | 3.9 | 6.3
Breakdown by setting (each cell is C&S / C&I, both in %):
Rank | Model | Single-Turn | MT-Expansion | MT-Editing | MT-Refactor
1 | GPT-5ᵀ (Aider) | 53.0 / 14.8 | 25.7 / 14.8 | 38.8 / 13.8 | 43.0 / 10.4
2 | GPT-5ᵀ (OpenHands) | 52.5 / 18.0 | 27.2 / 17.5 | 35.1 / 16.1 | 40.3 / 14.0
3 | Claude Opus 4ᵀ | 51.9 / 12.7 | 30.8 / 14.7 | 41.7 / 13.5 | 47.7 / 11.1
4 | GPT-5ᵀ | 51.4 / 10.9 | 34.9 / 11.9 | 40.0 / 14.1 | 44.3 / 10.5
5 | GPT-5ᵀ (Codex) | 50.1 / 15.1 | 29.0 / 15.9 | 35.6 / 14.4 | 43.9 / 14.8
6 | O4 Miniᵀ | 49.4 / 10.4 | 30.8 / 11.0 | 41.6 / 11.5 | 42.5 / 10.9
7 | Claude Sonnet 4ᵀ | 49.4 / 12.8 | 30.1 / 15.1 | 38.3 / 13.4 | 47.9 / 11.8
8 | O3ᵀ | 48.4 / 10.4 | 31.1 / 11.0 | 40.9 / 10.9 | 38.9 / 10.2
9 | GPT-5 Miniᵀ | 48.2 / 10.5 | 36.2 / 10.7 | 40.5 / 13.2 | 41.0 / 12.1
10 | Gemini 2.5 Proᵀ | 48.1 / 10.3 | 30.9 / 12.2 | 36.4 / 11.7 | 42.0 / 10.6
11 | O3 Miniᵀ | 47.9 / 11.2 | 30.9 / 11.6 | 41.7 / 11.7 | 42.2 / 11.1
12 | O1ᵀ | 47.4 / 12.0 | 28.8 / 11.6 | 38.8 / 12.7 | 42.2 / 11.0
13 | Claude 3.7 Sonnetᵀ | 44.7 / 11.1 | 30.2 / 13.9 | 39.0 / 13.2 | 44.7 / 11.6
14 | DeepSeek-R1ᵀ | 44.4 / 10.7 | 25.5 / 13.6 | 36.8 / 10.6 | 39.5 / 9.9
15 | GPT-4.1 | 44.0 / 9.6 | 29.0 / 12.6 | 39.3 / 10.1 | 38.7 / 9.9
16 | Claude 3.7 Sonnet | 43.3 / 12.6 | 29.0 / 12.9 | 36.4 / 14.2 | 40.7 / 11.7
17 | GPT-4o | 42.7 / 8.9 | 26.7 / 10.5 | 29.4 / 12.5 | 35.6 / 9.9
18 | O1 Miniᵀ | 40.2 / 9.4 | 30.5 / 10.1 | 35.0 / 10.3 | 38.6 / 9.8
19 | DeepSeek-V3 | 39.8 / 9.9 | 26.1 / 12.7 | 37.0 / 13.6 | 40.3 / 10.0
20 | Claude 3.5 Sonnet | 38.7 / 8.9 | 26.1 / 10.6 | 28.4 / 10.2 | 32.2 / 9.0
21 | Qwen-2.5 Coder 32B | 36.2 / 7.8 | 25.6 / 9.9 | 29.2 / 9.0 | 33.5 / 7.6
22 | Qwen-3 14B | 27.5 / 8.0 | 14.6 / 11.2 | 17.2 / 11.0 | 27.5 / 8.1
23 | Qwen-2.5 Coder 14B | 27.2 / 7.3 | 22.4 / 8.9 | 24.3 / 9.5 | 26.2 / 7.5
24 | Gemini 2.5 Flashᵀ | 26.2 / 6.2 | 19.8 / 8.5 | 22.4 / 8.0 | 27.1 / 8.0
25 | Qwen-3 8Bᵀ | 22.4 / 9.6 | 15.7 / 10.9 | 19.1 / 8.6 | 23.9 / 8.9
26 | Qwen-3 4Bᵀ | 19.4 / 9.0 | 14.3 / 8.6 | 15.5 / 9.4 | 19.3 / 8.5
27 | Qwen-2.5 Coder 7B | 19.3 / 9.3 | 14.2 / 10.1 | 19.6 / 9.0 | 19.2 / 10.3
28 | Qwen-3 4B | 18.8 / 9.2 | 13.4 / 9.5 | 15.6 / 9.8 | 19.4 / 9.5
29 | Qwen-3 8B | 18.6 / 9.5 | 14.8 / 10.5 | 16.3 / 10.3 | 23.3 / 8.7
30 | Qwen-2.5 Coder 3B | 12.9 / 10.8 | 10.9 / 9.6 | 11.5 / 9.5 | 11.9 / 10.6
31 | Qwen-3 1.7Bᵀ | 11.6 / 9.9 | 8.8 / 6.7 | 11.3 / 9.1 | 13.8 / 8.7
32 | Qwen-3 1.7B | 10.8 / 10.1 | 8.5 / 8.1 | 9.5 / 7.6 | 10.1 / 9.8
33 | Qwen-3 0.6Bᵀ | 6.8 / 9.6 | 5.0 / 6.1 | 3.0 / 6.6 | 4.6 / 8.2
34 | Qwen-3 0.6B | 4.1 / 11.3 | 2.4 / 4.0 | 3.4 / 8.9 | 5.1 / 9.2
35 | Qwen-2.5 Coder 0.5B | 2.8 / 7.5 | 4.5 / 5.2 | 4.2 / 6.0 | 3.0 / 7.6

1. "T denotes thinking"
2. "C&S: Correctness & Seccure/ C&I: Correct & Insecure"
3.“All models show sharp declines in correctness & security when moving from single-turn to multi-turn coding, even top models drop by 20-27%.
4. "Evaluation with Agentic scaffolds are highlithed in (Aider, Codex, OpenHands)"



Can LLMs Generate Correct and Secure Code in Multi-turn Settings?


In real-world coding, developers often interact with LLMs over multiple turns rather than in a single shot. Can these AI coding systems generate correct and secure code in multi-turn settings? To the best of our knowledge, MT-Sec is the first benchmark to systematically compare single-turn and multi-turn settings, with shared test cases, on both correctness and security. Because both variants of a task are scored against the exact same tests, we can show that a model may do fairly well on a task in a single turn yet drop significantly when the same task unfolds over multiple turns.
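To make the setup concrete, the loop below sketches what multi-turn evaluation looks like: the model sees the conversation one user turn at a time, and only the final program is scored against the task's correctness and security test suites, which are shared with the single-turn variant. This is a simplified illustration, not MT-Sec's actual harness; model_reply, extract_code, and the two test runners are hypothetical helpers.

    def evaluate_multi_turn(task, model):
        """Sketch: feed a 3-turn conversation to a model, then test the final code."""
        messages = []
        code = None
        for user_turn in task.turns:                       # e.g., 3 user turns per task
            messages.append({"role": "user", "content": user_turn})
            reply = model_reply(model, messages)           # hypothetical LLM call
            messages.append({"role": "assistant", "content": reply})
            code = extract_code(reply)                     # keep the latest program
        correct = run_correctness_tests(code, task.functional_tests)  # hypothetical
        secure = run_security_tests(code, task.security_tests)        # hypothetical
        return correct, secure                             # counts as C&S iff both hold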


MT-Sec at a Glance

‣ 2,376 multi-turn coding tasks across 6 programming languages, covering 27 critical security vulnerabilities (CWEs)

‣ Built on top of the SECCODEPLT and BAXBENCH benchmarks

‣ Captures three common multi-turn coding interaction types (illustrated in the sketch after this list):

  • Expansion - gradually adding features and functionality
  • Editing - iterative fixes, modifications, and pivots
  • Refactoring - restructuring code for clarity and modularity

‣ Comprehensive evaluation of 32 state-of-the-art LLMs and 3 coding agents (Aider, Codex, OpenHands)
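To give a feel for the three interaction types, here is one hypothetical 3-turn conversation per type; these are illustrative paraphrases, not actual MT-Sec tasks.

    # Hypothetical user-turn sequences for the three interaction types.
    INTERACTION_EXAMPLES = {
        "expansion": [   # gradually adding features and functionality
            "Write a handler that registers a new user with a password.",
            "Add a login handler that verifies the stored password.",
            "Add lockout after too many failed login attempts.",
        ],
        "editing": [     # iterative fixes, modifications, and pivots
            "Write a function that fetches a URL and returns the body.",
            "Change it to time out after 5 seconds instead of blocking.",
            "Actually, only allow URLs from a fixed allowlist of hosts.",
        ],
        "refactor": [    # restructuring code for clarity and modularity
            "Write a script that parses a config file and starts a server.",
            "Split parsing and server startup into separate functions.",
            "Move input validation into its own reusable helper.",
        ],
    }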

How MT-Sec is Built

We transformed single-turn secure-coding tasks into realistic 3-turn conversations that mimic how developers actually code. Using an LLM with consistency guardrails, we generated multi-turn expansions, edits, and refactorings, then had human security experts verify them. Each conversation preserves the original task's correctness and security tests while simulating a natural, interactive coding workflow.
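A minimal sketch of that construction step follows; decompose_with_llm (the prompting step) and is_consistent (the consistency guardrail) are hypothetical stand-ins for the actual pipeline, and accepted drafts still go to human security experts for review.

    def build_multi_turn_task(task, interaction_type):
        """Sketch: split one single-turn task into a 3-turn conversation.

        interaction_type is "expansion", "editing", or "refactor"; the
        original task's test suites are reused unchanged, so the single-turn
        and multi-turn variants are scored against the same tests.
        """
        for _ in range(5):  # retry if the consistency guardrail rejects a draft
            turns = decompose_with_llm(task.spec, interaction_type)  # hypothetical
            if is_consistent(turns, task.spec):                      # hypothetical
                return {
                    "turns": turns,                            # 3 user instructions
                    "functional_tests": task.functional_tests, # reused as-is
                    "security_tests": task.security_tests,     # reused as-is
                }
        raise RuntimeError("no consistent decomposition found; task skipped")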

BibTeX

@misc{rawal2025benchmarkingcorrectnesssecuritymultiturn,
  title={Benchmarking Correctness and Security in Multi-Turn Code Generation},
  author={Ruchit Rawal and Jeffrey Yang Fan Chiang and Chihao Shen and Jeffery Siyuan Tian and Aastha Mahajan and Tom Goldstein and Yizheng Chen},
  year={2025},
  eprint={2510.13859},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2510.13859},
}