Models Tested: 6
Success Rate: 66.7%
Avg Duration: 1m 10s
Duration Range: 32s - 2m 4s
Score   Model                                       Duration  Session (KB)  test_1_file_exists.sh  test_2_valid_json.sh  test_3_json_structure.sh  test_4_expected_content.sh
100.0%  openrouter/openai/gpt-oss-120b              1m 9s     195.5
100.0%  openrouter/qwen/qwen3-coder                 1m 8s     62.1
100.0%  openrouter/deepseek/deepseek-v3.1-terminus  1m 12s    62.8
100.0%  litellm/GLM-4.5-Air-FP8-dev                 2m 4s     83.1
0.0%    openrouter/openai/gpt-oss-20b               55s       138.6
0.0%    openrouter/qwen/qwen3-14b                   32s       70.8
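
The summary figures at the top can be rechecked from the table rows. The following is a minimal Python sketch of that arithmetic, not the dashboard's actual code: the names `ROWS`, `to_seconds`, `fmt`, and `summarize` are hypothetical, and it assumes that "Success Rate" means the share of models scoring 100.0% and that "Avg Duration" is a plain mean over all rows.

```python
import re

# Rows copied from the results table: (score in %, model, duration).
ROWS = [
    (100.0, "openrouter/openai/gpt-oss-120b", "1m 9s"),
    (100.0, "openrouter/qwen/qwen3-coder", "1m 8s"),
    (100.0, "openrouter/deepseek/deepseek-v3.1-terminus", "1m 12s"),
    (100.0, "litellm/GLM-4.5-Air-FP8-dev", "2m 4s"),
    (0.0, "openrouter/openai/gpt-oss-20b", "55s"),
    (0.0, "openrouter/qwen/qwen3-14b", "32s"),
]


def to_seconds(text):
    """Convert a duration such as '1m 9s' or '55s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


def fmt(seconds):
    """Format seconds back into the 'Xm Ys' / 'Ys' style used in the table."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


def summarize(rows):
    """Recompute the four summary figures from the table rows."""
    durations = [to_seconds(duration) for _, _, duration in rows]
    # Assumption: a run counts as a success only when its score is 100.0%.
    passed = sum(1 for score, _, _ in rows if score == 100.0)
    return {
        "models_tested": len(rows),
        "success_rate": f"{100 * passed / len(rows):.1f}%",
        "avg_duration": fmt(sum(durations) / len(durations)),
        "duration_range": f"{fmt(min(durations))} - {fmt(max(durations))}",
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

Under those assumptions the sketch reproduces the reported figures: 4 of 6 models pass (66.7%), and the six durations sum to 420 seconds, giving a mean of 1m 10s.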