Models Tested: 6
Success Rate: 33.3%
Avg Duration: 58s
Duration Range: 28s - 1m 26s
Score   Model                            Duration  Session (KB)  test_calculator_functions.sh  test_file_exists.sh  test_main_output.sh
100.0%  litellm/GLM-4.5-Air-FP8-dev      1m 26s    67.9
100.0%  qwen/qwen3-coder                 1m 18s    67.9
0.0%    deepseek/deepseek-v3.1-terminus  28s       24.9
0.0%    openai/gpt-oss-20b               1m 3s     108.3
0.0%    openai/gpt-oss-120b              1m 1s     156.7
0.0%    qwen/qwen3-14b                   30s       24.9
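
The summary figures can be cross-checked against the per-model rows. The sketch below is one way to do that, not the dashboard's own code: it assumes a run counts as a success only when its score is 100.0%, and the names `ROWS`, `to_seconds`, and `summarize` are hypothetical helpers introduced for illustration.

```python
import re

# Rows copied from the results table: (score %, model, duration string).
ROWS = [
    (100.0, "litellm/GLM-4.5-Air-FP8-dev", "1m 26s"),
    (100.0, "qwen/qwen3-coder", "1m 18s"),
    (0.0, "deepseek/deepseek-v3.1-terminus", "28s"),
    (0.0, "openai/gpt-oss-20b", "1m 3s"),
    (0.0, "openai/gpt-oss-120b", "1m 1s"),
    (0.0, "qwen/qwen3-14b", "30s"),
]


def to_seconds(text):
    """Convert a duration such as '1m 26s' or '28s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


def summarize(rows):
    """Recompute the headline numbers from the per-model rows."""
    durations = [to_seconds(duration) for _, _, duration in rows]
    # Assumption: only a 100.0% score counts as a successful run.
    passed = sum(1 for score, _, _ in rows if score == 100.0)
    return {
        "models_tested": len(rows),
        "success_rate": round(100 * passed / len(rows), 1),
        "avg_duration_s": round(sum(durations) / len(durations)),
        "min_duration_s": min(durations),
        "max_duration_s": max(durations),
    }


if __name__ == "__main__":
    print(summarize(ROWS))
```

Under these assumptions the recomputed values agree with the summary: 2 of 6 runs pass (33.3%), the mean duration is 346s / 6, which rounds to 58s, and the range runs from 28s to 86s (1m 26s).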