Models Tested: 6
Total Tasks: 4
Overall Success Rate: 29.2%
Passed Combinations: 7/24
Model                              Overall Score  task1_file_list  task2_fix_python_syntax  task4_web_fetch  test_multiple_tests
qwen/qwen3-coder                   75.0%
openai/gpt-oss-20b                 50.0%
openai/gpt-oss-120b                50.0%
deepseek/deepseek-v3.1-terminus    0.0%
litellm/GLM-4.5-Air-FP8-dev        0.0%
qwen/qwen3-14b                     0.0%
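
The summary figures can be cross-checked against the per-model scores. The sketch below assumes that each model's Overall Score is the share of the 4 tasks it passed, which the report does not state explicitly; the variable and function names are illustrative only.

```python
# Cross-check of the summary figures, assuming each Overall Score is
# (tasks passed / total tasks) for that model. Names are illustrative.
TOTAL_TASKS = 4

overall_scores = {
    "qwen/qwen3-coder": 75.0,
    "openai/gpt-oss-20b": 50.0,
    "openai/gpt-oss-120b": 50.0,
    "deepseek/deepseek-v3.1-terminus": 0.0,
    "litellm/GLM-4.5-Air-FP8-dev": 0.0,
    "qwen/qwen3-14b": 0.0,
}


def passed_count(score_percent: float, total_tasks: int) -> int:
    """Convert a percentage score back into a whole number of passed tasks."""
    return round(score_percent / 100 * total_tasks)


passed = sum(passed_count(s, TOTAL_TASKS) for s in overall_scores.values())
combinations = len(overall_scores) * TOTAL_TASKS
success_rate = 100 * passed / combinations

print(f"{passed}/{combinations} = {success_rate:.1f}%")  # prints "7/24 = 29.2%"
```

Under that assumption the three non-zero scores account for 3 + 2 + 2 = 7 passed model-task combinations out of 6 × 4 = 24, matching the reported 29.2%.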