Models Tested: 23
Total Tasks: 8
Overall Success Rate: 53.3%
Passed Combinations: 98/184
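The summary figures are consistent with one another: 23 models times 8 tasks gives 184 model-task combinations, and 98 passes out of 184 rounds to 53.3%. A minimal sketch of that arithmetic follows; the variable and function names are illustrative and are not taken from the benchmark's own code.

```python
# Summary values copied from the dashboard above.
models_tested = 23
total_tasks = 8
passed_combinations = 98

# Every model is run on every task, so the denominator is the product.
total_combinations = models_tested * total_tasks


def overall_rate_percent(passed: int, models: int, tasks: int) -> float:
    """Return the pass rate over all model-task combinations, in percent."""
    return round(100 * passed / (models * tasks), 1)


if __name__ == "__main__":
    print(total_combinations)
    print(overall_rate_percent(passed_combinations, models_tested, total_tasks))
```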
Score Model task1_file_list task2_fix_python_syntax task4_web_fetch task5_dedup_contact task7_log_parser task8_regex_extraction task9_cpp_footguns test_multiple_tests
87.5% openrouter/anthropic/claude-sonnet-4.5
87.5% openrouter/anthropic/claude-sonnet-4
75.0% openrouter/openai/gpt-5
75.0% openrouter/anthropic/claude-3.5-sonnet
75.0% openrouter/anthropic/claude-haiku-4.5
75.0% openrouter/anthropic/claude-3.5-haiku
75.0% openrouter/openai/gpt-4.1-mini
62.5% openrouter/anthropic/claude-3.7-sonnet
62.5% openrouter/deepseek/deepseek-v3.1-terminus
62.5% litellm/GLM-4.5-Air-FP8-dev
62.5% openrouter/openai/gpt-5-mini
50.0% openrouter/google/gemini-2.5-flash-preview-09-2025
50.0% openrouter/openai/gpt-5-nano
50.0% openrouter/qwen/qwen3-coder
50.0% openrouter/x-ai/grok-3-mini
50.0% openrouter/openai/gpt-4o-mini
37.5% openrouter/anthropic/claude-3-haiku
37.5% openrouter/openai/gpt-oss-120b
37.5% openrouter/openai/gpt-oss-20b
25.0% openrouter/google/gemini-2.5-pro
25.0% openrouter/openai/gpt-4.1-nano
12.5% openrouter/google/gemini-2.5-flash-lite-preview-09-2025
0.0% openrouter/deepseek/deepseek-chat-v3-0324
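Every score in the table is a multiple of 12.5%, which is what one would expect if a score is simply the number of tasks passed out of 8. Under that assumption (the page does not state it explicitly), the scores can be converted back to pass counts and tallied against the summary. The sketch below groups the table by score; the names are hypothetical.

```python
TOTAL_TASKS = 8

# Number of models at each score, counted from the table above.
models_per_score = {
    87.5: 2,
    75.0: 5,
    62.5: 4,
    50.0: 5,
    37.5: 3,
    25.0: 2,
    12.5: 1,
    0.0: 1,
}


def passes_from_score(score_percent: float, total_tasks: int = TOTAL_TASKS) -> int:
    """Convert a percentage score to a pass count, assuming equal task weights."""
    return round(score_percent / 100 * total_tasks)


model_count = sum(models_per_score.values())
total_passes = sum(
    passes_from_score(score) * count for score, count in models_per_score.items()
)

if __name__ == "__main__":
    print(model_count, total_passes)
```

The tally reproduces the 23 models and 98 passed combinations reported in the summary, so the per-model scores and the headline numbers agree.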