Run of 2026-01-08 16:50:12
Models Tested: 2
Total Tasks: 13
Overall Success Rate: 15.4%
Passed Combinations: 4/26
Details
| Score | Model | task1_file_list | task2_fix_python_syntax | task4_web_fetch | task5_dedup_contact | task6_config_merger | task7_log_parser | task8_regex_extraction | task9_cpp_footguns | task10_multiple_tests | task11_relationship_classifier | task12_need_reply | task13_meeting_action_items | task14_graph_money_distribution |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30.8% | litellm/GLM-4.5-Air-FP8-dev | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| 0.0% | litellm/Mistral-Large-3-sandbox | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
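
The summary figures appear to follow directly from pass counts in the table. The sketch below reproduces them under the assumption that a model's Score is its passed tasks divided by total tasks, and that the overall success rate is passed model-task combinations divided by all combinations. The names `TASKS`, `PASSED`, `model_score`, and `overall_success_rate` are illustrative and are not taken from the benchmark harness.

```python
# Task names and pass results transcribed from the Details table.
TASKS = [
    "task1_file_list", "task2_fix_python_syntax", "task4_web_fetch",
    "task5_dedup_contact", "task6_config_merger", "task7_log_parser",
    "task8_regex_extraction", "task9_cpp_footguns", "task10_multiple_tests",
    "task11_relationship_classifier", "task12_need_reply",
    "task13_meeting_action_items", "task14_graph_money_distribution",
]

# For each model, the set of tasks marked as passed in the table.
PASSED = {
    "litellm/GLM-4.5-Air-FP8-dev": {
        "task2_fix_python_syntax", "task8_regex_extraction",
        "task10_multiple_tests", "task11_relationship_classifier",
    },
    "litellm/Mistral-Large-3-sandbox": set(),
}


def model_score(model):
    """Percentage of tasks passed by one model (assumed definition of Score)."""
    return 100.0 * len(PASSED[model]) / len(TASKS)


def overall_success_rate():
    """Return (passed, total, percentage) over all model-task combinations."""
    passed = sum(len(tasks) for tasks in PASSED.values())
    total = len(PASSED) * len(TASKS)
    return passed, total, 100.0 * passed / total


if __name__ == "__main__":
    for model in PASSED:
        print(f"{model_score(model):.1f}% {model}")
    passed, total, rate = overall_success_rate()
    print(f"{passed}/{total} = {rate:.1f}%")
```

Under these assumptions, 4 of 13 tasks gives 30.8% for GLM-4.5-Air-FP8-dev, and 4 of 26 combinations gives the 15.4% overall rate, matching the reported values.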