Models Tested: 6
Total Tasks: 4
Overall Success Rate: 33.3%
Passed Combinations: 8/24
Model                            Overall Score  task1_file_list  task2_fix_python_syntax  task4_web_fetch  test_multiple_tests
litellm/GLM-4.5-Air-FP8-dev      100.0%
qwen/qwen3-coder                 75.0%
openai/gpt-oss-20b               25.0%
deepseek/deepseek-v3.1-terminus  0.0%
openai/gpt-oss-120b              0.0%
qwen/qwen3-14b                   0.0%
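
The summary figures can be cross-checked against the per-model scores. The sketch below assumes that each model's Overall Score is the share of the 4 tasks it passed, and that a "combination" is one model/task pair. The source does not state either definition, but the reported numbers are consistent with both. The names `scores`, `TOTAL_TASKS`, and `passed_tasks` are illustrative only.

```python
# Overall scores as reported in the table (percent of tasks passed per model).
scores = {
    "litellm/GLM-4.5-Air-FP8-dev": 100.0,
    "qwen/qwen3-coder": 75.0,
    "openai/gpt-oss-20b": 25.0,
    "deepseek/deepseek-v3.1-terminus": 0.0,
    "openai/gpt-oss-120b": 0.0,
    "qwen/qwen3-14b": 0.0,
}

TOTAL_TASKS = 4  # "Total Tasks" in the summary


def passed_tasks(score_percent, total_tasks=TOTAL_TASKS):
    """Convert a percentage score back to a count of passed tasks (assumed definition)."""
    return round(score_percent / 100 * total_tasks)


# Sum passed tasks over all models and compare with all model/task pairs.
passed = sum(passed_tasks(s) for s in scores.values())
combinations = len(scores) * TOTAL_TASKS
overall_rate = 100 * passed / combinations

print(f"{passed}/{combinations} passed, {overall_rate:.1f}% overall")
```

Under those assumptions the script reproduces the 8/24 passed combinations and the 33.3% overall success rate shown in the summary.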