Models Tested: 6
Success Rate: 50.0%
Avg Duration: 53s
Duration Range: 40s - 1m 2s
Tests: test_1_file_exists.sh, test_2_valid_json.sh, test_3_json_structure.sh, test_4_expected_content.sh

Score   Model                            Duration  Session (KB)
100.0%  openai/gpt-oss-20b               1m 2s     420.3
100.0%  litellm/GLM-4.5-Air-FP8-dev      1m 2s     91.8
100.0%  qwen/qwen3-coder                 59s       64.7
0.0%    deepseek/deepseek-v3.1-terminus  41s       25.0
0.0%    openai/gpt-oss-120b              55s       175.6
0.0%    qwen/qwen3-14b                   40s       24.9
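
The summary figures at the top can be re-derived from the table rows. The following is a minimal Python sketch of that arithmetic, not the dashboard's actual code. The names `ROWS`, `to_seconds`, and `summarize` are hypothetical, and it assumes a run counts as a success only when its score is 100.0%.

```python
import re

# Rows copied from the results table: (score %, model, duration, session KB).
ROWS = [
    (100.0, "openai/gpt-oss-20b", "1m 2s", 420.3),
    (100.0, "litellm/GLM-4.5-Air-FP8-dev", "1m 2s", 91.8),
    (100.0, "qwen/qwen3-coder", "59s", 64.7),
    (0.0, "deepseek/deepseek-v3.1-terminus", "41s", 25.0),
    (0.0, "openai/gpt-oss-120b", "55s", 175.6),
    (0.0, "qwen/qwen3-14b", "40s", 24.9),
]


def to_seconds(text):
    """Convert a duration such as '1m 2s' or '59s' to whole seconds."""
    minutes = re.search(r"(\d+)m", text)
    seconds = re.search(r"(\d+)s", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


def summarize(rows):
    """Recompute the summary figures from the table rows."""
    durations = [to_seconds(row[2]) for row in rows]
    # Assumption: only a 100.0% score counts as a successful run.
    passed = sum(1 for row in rows if row[0] == 100.0)
    return {
        "models_tested": len(rows),
        "success_rate": 100.0 * passed / len(rows),
        "avg_duration_s": round(sum(durations) / len(durations)),
        "min_duration_s": min(durations),
        "max_duration_s": max(durations),
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

Under that assumption, the sketch gives 6 models, a 50.0% success rate, an average of 53 seconds (319 seconds over 6 runs), and a range of 40 to 62 seconds, matching the reported summary.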