Run of 2026-01-08 20:18:39
Models Tested: 1
Total Tasks: 13
Overall Success Rate: 53.8%
Passed Combinations: 7/13
Details
| Score | Model | task10_multiple_tests | task11_relationship_classifier | task12_need_reply | task13_meeting_action_items | task14_graph_money_distribution | task1_file_list | task2_fix_python_syntax | task4_web_fetch | task5_dedup_contact | task6_config_merger | task7_log_parser | task8_regex_extraction | task9_cpp_footguns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53.8% | litellm/Mistral-Large-3-sandbox | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
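
The summary figures can be cross-checked against the table row. The following is a minimal sketch, not the harness's own scoring code: it assumes the score is simply the number of passed tasks divided by the total number of tasks, and the `results` dictionary is a hand-copied transcription of the row above (✅ as `True`, ❌ as `False`).

```python
# Pass/fail marks transcribed by hand from the row for
# litellm/Mistral-Large-3-sandbox (True = ✅, False = ❌).
# Assumption: the score is passed tasks / total tasks.
results = {
    "task10_multiple_tests": True,
    "task11_relationship_classifier": False,
    "task12_need_reply": False,
    "task13_meeting_action_items": False,
    "task14_graph_money_distribution": False,
    "task1_file_list": True,
    "task2_fix_python_syntax": True,
    "task4_web_fetch": True,
    "task5_dedup_contact": True,
    "task6_config_merger": True,
    "task7_log_parser": False,
    "task8_regex_extraction": True,
    "task9_cpp_footguns": False,
}

passed = sum(results.values())   # True counts as 1, False as 0
total = len(results)
rate = 100 * passed / total      # success rate in percent

summary = f"{passed}/{total} = {rate:.1f}%"
print(summary)  # prints "7/13 = 53.8%"
```

Under that assumption the computed value matches both the "Passed Combinations" count and the "Overall Success Rate" reported for this run.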