| Models Tested | Total Tasks | Overall Success Rate | Passed Combinations |
|---|---|---|---|
| 6 | 4 | 25.0% | 6/24 |

| Score | Model | task1_file_list | task2_fix_python_syntax | task4_web_fetch | test_multiple_tests |
|---|---|---|---|---|---|
| 50.0% | openrouter/openai/gpt-oss-20b | ✅ | ❌ | ❌ | ✅ |
| 50.0% | openrouter/deepseek/deepseek-v3.1-terminus | ✅ | ❌ | ❌ | ✅ |
| 25.0% | openrouter/openai/gpt-oss-120b | ❌ | ❌ | ❌ | ✅ |
| 25.0% | openrouter/qwen/qwen3-coder | ❌ | ❌ | ❌ | ✅ |
| 0.0% | litellm/GLM-4.5-Air-FP8-dev | ❌ | ❌ | ❌ | ❌ |
| 0.0% | openrouter/qwen/qwen3-14b | ❌ | ❌ | ❌ | ❌ |
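
The summary figures can be cross-checked against the pass/fail grid. The sketch below is a minimal illustration, assuming that each model's score is the share of tasks it passed and that the overall rate is passed combinations divided by all model/task combinations. The report does not state these formulas, but they reproduce the numbers shown. The names `RESULTS`, `model_score`, `task_pass_counts`, and `summarize` are hypothetical and not part of any benchmark tooling.

```python
# Pass/fail grid transcribed from the results table (True = passed, False = failed).
TASKS = [
    "task1_file_list",
    "task2_fix_python_syntax",
    "task4_web_fetch",
    "test_multiple_tests",
]

RESULTS = {
    "openrouter/openai/gpt-oss-20b": [True, False, False, True],
    "openrouter/deepseek/deepseek-v3.1-terminus": [True, False, False, True],
    "openrouter/openai/gpt-oss-120b": [False, False, False, True],
    "openrouter/qwen/qwen3-coder": [False, False, False, True],
    "litellm/GLM-4.5-Air-FP8-dev": [False, False, False, False],
    "openrouter/qwen/qwen3-14b": [False, False, False, False],
}


def model_score(passes):
    """Assumed scoring rule: percentage of tasks the model passed."""
    return 100.0 * sum(passes) / len(passes)


def task_pass_counts(results, tasks):
    """Count how many models passed each task (column totals of the grid)."""
    return {
        task: sum(row[i] for row in results.values())
        for i, task in enumerate(tasks)
    }


def summarize(results):
    """Return (passed combinations, total combinations, overall rate in %)."""
    passed = sum(sum(row) for row in results.values())
    total = sum(len(row) for row in results.values())
    return passed, total, 100.0 * passed / total


if __name__ == "__main__":
    for model, row in RESULTS.items():
        print(f"{model_score(row):.1f}%  {model}")
    for task, count in task_pass_counts(RESULTS, TASKS).items():
        print(f"{task}: {count}/{len(RESULTS)} models passed")
    passed, total, rate = summarize(RESULTS)
    print(f"Overall: {passed}/{total} = {rate:.1f}%")
```

Summing the grid by column rather than by row also shows where the failures concentrate: no model passed `task2_fix_python_syntax` or `task4_web_fetch`, so those two tasks account for 12 of the 18 failed combinations.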