Models Tested: 23
Success Rate: 87.0%
Avg Duration: 2m 3s
Duration Range: 48s - 5m 0s
| Score | Model | Duration | Session (KB) | test_calculator_functions.sh | test_file_exists.sh | test_main_output.sh |
|---|---|---|---|---|---|---|
| 100.0% | openrouter/google/gemini-2.5-flash-preview-09-2025 | 1m 13s | 23.9 | | | |
| 100.0% | openrouter/openai/gpt-5 | 2m 23s | 167.5 | | | |
| 100.0% | openrouter/anthropic/claude-3-haiku | 1m 16s | 60.8 | | | |
| 100.0% | openrouter/openai/gpt-oss-120b | 2m 5s | 93.1 | | | |
| 100.0% | openrouter/qwen/qwen3-coder | 2m 28s | 61.0 | | | |
| 100.0% | openrouter/x-ai/grok-3-mini | 1m 40s | 640.9 | | | |
| 100.0% | openrouter/anthropic/claude-3.5-sonnet | 1m 16s | 52.6 | | | |
| 100.0% | openrouter/openai/gpt-4o-mini | 2m 9s | 166.1 | | | |
| 100.0% | openrouter/google/gemini-2.5-flash-lite-preview-09-2025 | 1m 19s | 26.9 | | | |
| 100.0% | openrouter/openai/gpt-oss-20b | 2m 7s | 184.5 | | | |
| 100.0% | openrouter/anthropic/claude-3.7-sonnet | 1m 35s | 102.7 | | | |
| 100.0% | openrouter/anthropic/claude-haiku-4.5 | 1m 40s | 84.0 | | | |
| 100.0% | openrouter/deepseek/deepseek-v3.1-terminus | 3m 45s | 111.4 | | | |
| 100.0% | litellm/GLM-4.5-Air-FP8-dev | 2m 53s | 40.8 | | | |
| 100.0% | openrouter/anthropic/claude-sonnet-4.5 | 1m 14s | 29.7 | | | |
| 100.0% | openrouter/openai/gpt-4.1-nano | 1m 35s | 102.6 | | | |
| 100.0% | openrouter/openai/gpt-5-mini | 1m 58s | 126.5 | | | |
| 100.0% | openrouter/anthropic/claude-3.5-haiku | 2m 47s | 106.5 | | | |
| 100.0% | openrouter/anthropic/claude-sonnet-4 | 1m 30s | 76.6 | | | |
| 100.0% | openrouter/openai/gpt-4.1-mini | 1m 55s | 135.1 | | | |
| 0.0% | openrouter/openai/gpt-5-nano | 5m 0s | 0.0 | | | |
| 0.0% | openrouter/google/gemini-2.5-pro | 2m 28s | 54.8 | | | |
| 0.0% | openrouter/deepseek/deepseek-chat-v3-0324 | 48s | 32.9 | | | |
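
The summary figures at the top can be cross-checked against the rows of the table. The sketch below is one way to do that, under two assumptions that the report itself does not state: a run counts as a success only when its score is 100.0%, and the average duration is the arithmetic mean rounded to the nearest second. The names `ROWS`, `parse_duration`, `format_duration`, and `summarize` are hypothetical helpers for illustration, not part of the tool that produced this report.

```python
import re

# (score %, duration) pairs copied from the table above, in row order.
ROWS = [
    (100.0, "1m 13s"), (100.0, "2m 23s"), (100.0, "1m 16s"), (100.0, "2m 5s"),
    (100.0, "2m 28s"), (100.0, "1m 40s"), (100.0, "1m 16s"), (100.0, "2m 9s"),
    (100.0, "1m 19s"), (100.0, "2m 7s"), (100.0, "1m 35s"), (100.0, "1m 40s"),
    (100.0, "3m 45s"), (100.0, "2m 53s"), (100.0, "1m 14s"), (100.0, "1m 35s"),
    (100.0, "1m 58s"), (100.0, "2m 47s"), (100.0, "1m 30s"), (100.0, "1m 55s"),
    (0.0, "5m 0s"), (0.0, "2m 28s"), (0.0, "48s"),
]


def parse_duration(text):
    """Convert a string such as '2m 3s' or '48s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


def format_duration(seconds):
    """Render seconds in the report's style, e.g. 123 -> '2m 3s', 48 -> '48s'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


def summarize(rows):
    """Recompute the four summary figures from (score, duration) rows."""
    durations = [parse_duration(d) for _, d in rows]
    # Assumption: only a 100.0% score counts as a successful run.
    passed = sum(1 for score, _ in rows if score == 100.0)
    low, high = min(durations), max(durations)
    return {
        "models_tested": len(rows),
        "success_rate": round(100.0 * passed / len(rows), 1),
        "avg_duration": format_duration(sum(durations) / len(durations)),
        "duration_range": f"{format_duration(low)} - {format_duration(high)}",
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

Under these assumptions the recomputed values agree with the reported ones: 20 of 23 runs pass (87.0%), the durations sum to 2824 seconds for a mean of about 122.8 seconds (2m 3s), and the shortest and longest runs are 48s and 5m 0s.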