| Metric | Value |
|---|---|
| Models Tested | 6 |
| Total Tasks | 4 |
| Overall Success Rate | 33.3% |
| Passed Combinations | 8/24 |

| Model | Overall Score | task1_file_list | task2_fix_python_syntax | task4_web_fetch | test_multiple_tests |
|---|---|---|---|---|---|
| litellm/GLM-4.5-Air-FP8-dev | 100.0% | ✅ | ✅ | ✅ | ✅ |
| qwen/qwen3-coder | 75.0% | ✅ | ✅ | ❌ | ✅ |
| openai/gpt-oss-20b | 25.0% | ✅ | ❌ | ❌ | ❌ |
| deepseek/deepseek-v3.1-terminus | 0.0% | ❌ | ❌ | ❌ | ❌ |
| openai/gpt-oss-120b | 0.0% | ❌ | ❌ | ❌ | ❌ |
| qwen/qwen3-14b | 0.0% | ❌ | ❌ | ❌ | ❌ |
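
The summary figures follow arithmetically from the pass/fail grid: each model's score appears to be the share of tasks it passed, and the overall success rate is passed model-task combinations over all combinations (8 of 6 × 4 = 24). The sketch below reproduces that calculation from the table. The scoring rule and the names `RESULTS`, `model_score`, and `summarize` are assumptions for illustration; the benchmark harness's actual code is not shown here.

```python
# Pass/fail results transcribed from the table (True = passed, False = failed).
# Column order matches TASKS.
TASKS = [
    "task1_file_list",
    "task2_fix_python_syntax",
    "task4_web_fetch",
    "test_multiple_tests",
]

RESULTS = {
    "litellm/GLM-4.5-Air-FP8-dev": [True, True, True, True],
    "qwen/qwen3-coder": [True, True, False, True],
    "openai/gpt-oss-20b": [True, False, False, False],
    "deepseek/deepseek-v3.1-terminus": [False, False, False, False],
    "openai/gpt-oss-120b": [False, False, False, False],
    "qwen/qwen3-14b": [False, False, False, False],
}


def model_score(passes):
    """Percentage of tasks passed by one model (assumed scoring rule)."""
    return 100.0 * sum(passes) / len(passes)


def summarize(results):
    """Aggregate the grid into the four headline figures."""
    passed = sum(sum(passes) for passes in results.values())
    total = sum(len(passes) for passes in results.values())
    return {
        "models_tested": len(results),
        "total_tasks": len(TASKS),
        "passed": passed,
        "total": total,
        "success_rate": 100.0 * passed / total,
    }


if __name__ == "__main__":
    for name, passes in RESULTS.items():
        print(f"{name}: {model_score(passes):.1f}%")
    summary = summarize(RESULTS)
    print(f"{summary['passed']}/{summary['total']} passed "
          f"({summary['success_rate']:.1f}%)")
```

Because the overall rate is computed over combinations rather than averaged per model, it only equals the mean of the per-model scores when every model runs the same number of tasks, as is the case here.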