Models Tested: 6
Total Tasks: 4
Overall Success Rate: 25.0%
Passed Combinations: 6/24
Tasks: task1_file_list, task2_fix_python_syntax, task4_web_fetch, test_multiple_tests

Score   Model
50.0%   openrouter/openai/gpt-oss-20b
50.0%   openrouter/deepseek/deepseek-v3.1-terminus
25.0%   openrouter/openai/gpt-oss-120b
25.0%   openrouter/qwen/qwen3-coder
0.0%    litellm/GLM-4.5-Air-FP8-dev
0.0%    openrouter/qwen/qwen3-14b
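
The summary figures can be cross-checked against the per-model scores. The sketch below is a rough consistency check, not the dashboard's own code: it assumes each model's score is the fraction of the 4 tasks it passed, and the names `scores`, `TOTAL_TASKS`, and `passed` are hypothetical.

```python
# Per-model scores as reported above (percent of tasks passed).
scores = {
    "openrouter/openai/gpt-oss-20b": 50.0,
    "openrouter/deepseek/deepseek-v3.1-terminus": 50.0,
    "openrouter/openai/gpt-oss-120b": 25.0,
    "openrouter/qwen/qwen3-coder": 25.0,
    "litellm/GLM-4.5-Air-FP8-dev": 0.0,
    "openrouter/qwen/qwen3-14b": 0.0,
}
TOTAL_TASKS = 4

# Assumption: score = passed_tasks / TOTAL_TASKS, so the pass count
# per model can be recovered from the percentage.
passed = {model: round(pct / 100 * TOTAL_TASKS) for model, pct in scores.items()}

total_passed = sum(passed.values())
total_combinations = len(scores) * TOTAL_TASKS
overall_rate = 100 * total_passed / total_combinations

print(f"{total_passed}/{total_combinations} = {overall_rate:.1f}%")
```

Under that assumption the recovered pass counts sum to the reported 6 of 24 model-task combinations, which matches the 25.0% overall success rate.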