Run of 2026-01-08 20:18:39

Models Tested: 1

Total Tasks: 13

Overall Success Rate: 53.8%

Passed Combinations: 7/13

Details

Score	Model	task10_multiple_tests	task11_relationship_classifier	task12_need_reply	task13_meeting_action_items	task14_graph_money_distribution	task1_file_list	task2_fix_python_syntax	task4_web_fetch	task5_dedup_contact	task6_config_merger	task7_log_parser	task8_regex_extraction	task9_cpp_footguns
53.8%	litellm/Mistral-Large-3-sandbox	✅	❌	❌	❌	❌	✅	✅	✅	✅	✅	❌	✅	❌
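
The score can be cross-checked against the pass/fail marks in the row above. The following is a minimal sketch, not the harness's own scoring code: it assumes each task counts equally and that a ✅ is a pass and a ❌ is a fail. The `results` dictionary and the `success_rate` helper are hypothetical names introduced here for illustration.

```python
# Pass/fail marks transcribed from the results row, in column order.
# True corresponds to a check mark, False to a cross.
results = {
    "task10_multiple_tests": True,
    "task11_relationship_classifier": False,
    "task12_need_reply": False,
    "task13_meeting_action_items": False,
    "task14_graph_money_distribution": False,
    "task1_file_list": True,
    "task2_fix_python_syntax": True,
    "task4_web_fetch": True,
    "task5_dedup_contact": True,
    "task6_config_merger": True,
    "task7_log_parser": False,
    "task8_regex_extraction": True,
    "task9_cpp_footguns": False,
}


def success_rate(outcomes):
    """Return (passed, total, percentage rounded to one decimal place)."""
    passed = sum(outcomes.values())  # True counts as 1, False as 0
    total = len(outcomes)
    return passed, total, round(100 * passed / total, 1)


passed, total, pct = success_rate(results)
print(f"{passed}/{total} = {pct}%")  # prints "7/13 = 53.8%"
```

Under these assumptions the row yields 7 passes out of 13 tasks, which matches the reported 53.8% and the 7/13 passed combinations.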