Models Tested: 6
Success Rate: 50.0%
Avg Duration: 1m 31s
Duration Range: 55s - 3m 46s
Score   Model                                       Duration  Session (KB)
100.0%  openrouter/openai/gpt-oss-120b              1m 7s     174.0
100.0%  openrouter/deepseek/deepseek-v3.1-terminus  1m 15s    51.1
100.0%  litellm/GLM-4.5-Air-FP8-dev                 1m 2s     97.6
0.0%    openrouter/qwen/qwen3-coder                 55s       42.9
0.0%    openrouter/openai/gpt-oss-20b               3m 46s    3389.8
0.0%    openrouter/qwen/qwen3-14b                   1m 2s     70.4

Per-test columns: test_1_file_exists.sh, test_2_valid_json.sh, test_3_json_structure.sh, test_4_expected_content.sh
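
The summary figures can be recomputed from the table rows. The sketch below is a minimal illustration, not the dashboard's own code: it assumes a run counts as a success only when its score is 100.0%, and that the average duration is the arithmetic mean rounded to the nearest second. The helper names `to_seconds` and `fmt` are hypothetical.

```python
# Rows copied from the results table: (score %, model, duration, session KB).
rows = [
    (100.0, "openrouter/openai/gpt-oss-120b", "1m 7s", 174.0),
    (100.0, "openrouter/deepseek/deepseek-v3.1-terminus", "1m 15s", 51.1),
    (100.0, "litellm/GLM-4.5-Air-FP8-dev", "1m 2s", 97.6),
    (0.0, "openrouter/qwen/qwen3-coder", "55s", 42.9),
    (0.0, "openrouter/openai/gpt-oss-20b", "3m 46s", 3389.8),
    (0.0, "openrouter/qwen/qwen3-14b", "1m 2s", 70.4),
]


def to_seconds(text):
    """Parse a duration such as '1m 7s' or '55s' into whole seconds."""
    total = 0
    for part in text.split():
        if part.endswith("m"):
            total += int(part[:-1]) * 60
        elif part.endswith("s"):
            total += int(part[:-1])
    return total


def fmt(seconds):
    """Format seconds back into the table's 'Xm Ys' style."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


durations = [to_seconds(row[2]) for row in rows]

# Assumption: a run is a success only when its score is 100.0%.
successes = sum(1 for row in rows if row[0] == 100.0)

summary = {
    "models_tested": len(rows),
    "success_rate": 100.0 * successes / len(rows),
    "avg_duration": fmt(sum(durations) / len(durations)),
    "duration_range": f"{fmt(min(durations))} - {fmt(max(durations))}",
}

for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these assumptions the recomputed values match the summary: 6 models, 50.0% success, an average of 1m 31s, and a range of 55s - 3m 46s.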