Models Tested: 6
Success Rate: 50.0%
Avg Duration: 48s
Duration Range: 13s - 1m 12s
Score   Model                                       Duration  Session (KB)  test_calculator_functions.sh  test_file_exists.sh  test_main_output.sh
100.0%  openrouter/openai/gpt-oss-20b               1m 12s    313.9
100.0%  openrouter/deepseek/deepseek-v3.1-terminus  1m 7s     68.4
100.0%  litellm/GLM-4.5-Air-FP8-dev                 39s       75.8
0.0%    openrouter/openai/gpt-oss-120b              47s       37.4
0.0%    openrouter/qwen/qwen3-coder                 13s       1.3
0.0%    openrouter/qwen/qwen3-14b                   50s       71.0
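
The summary figures can be cross-checked against the rows above. The following is a minimal sketch, not the report's own tooling: the names ROWS, to_seconds, and summarize are hypothetical, and it assumes that "Success Rate" means the share of models scoring 100% and that "Avg Duration" is the plain mean of the per-model durations.

```python
import re

# Rows transcribed from the table above: (score %, model, duration, session KB).
ROWS = [
    (100.0, "openrouter/openai/gpt-oss-20b", "1m 12s", 313.9),
    (100.0, "openrouter/deepseek/deepseek-v3.1-terminus", "1m 7s", 68.4),
    (100.0, "litellm/GLM-4.5-Air-FP8-dev", "39s", 75.8),
    (0.0, "openrouter/openai/gpt-oss-120b", "47s", 37.4),
    (0.0, "openrouter/qwen/qwen3-coder", "13s", 1.3),
    (0.0, "openrouter/qwen/qwen3-14b", "50s", 71.0),
]


def to_seconds(text):
    """Convert a duration such as '1m 12s' or '39s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


def summarize(rows):
    """Recompute the summary figures (assumed definitions, see lead-in)."""
    durations = [to_seconds(duration) for _, _, duration, _ in rows]
    # Assumption: a model counts as a success only at a 100% score.
    passed = sum(1 for score, *_ in rows if score == 100.0)
    return {
        "models_tested": len(rows),
        "success_rate": 100.0 * passed / len(rows),
        "avg_duration_s": sum(durations) / len(durations),
        "min_duration_s": min(durations),
        "max_duration_s": max(durations),
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

Under these assumptions the recomputed values agree with the summary: 6 models, a 50.0% success rate, a 48-second mean duration, and a range of 13 to 72 seconds (1m 12s).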