Run of 2026-01-08 16:50:12

Models Tested: 2

Total Tasks: 13

Overall Success Rate: 15.4%

Passed Combinations: 4/26

Details

Score	Model	task10_multiple_tests	task11_relationship_classifier	task12_need_reply	task13_meeting_action_items	task14_graph_money_distribution	task1_file_list	task2_fix_python_syntax	task4_web_fetch	task5_dedup_contact	task6_config_merger	task7_log_parser	task8_regex_extraction	task9_cpp_footguns
30.8%	litellm/GLM-4.5-Air-FP8-dev	✅	✅	❌	❌	❌	❌	✅	❌	❌	❌	❌	✅	❌
0.0%	litellm/Mistral-Large-3-sandbox	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
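
The summary figures can be recomputed from the table rows. The following is a minimal sketch, assuming each score is simply the number of passed tasks divided by the number of tasks attempted. The `results` dictionary and the `success_rate` helper are illustrative names, not part of the benchmark tooling. The pass/fail flags are transcribed by hand from the table, in column order.

```python
# Pass/fail flags copied from the Details table, in column order
# (True = ✅, False = ❌). The names here are hypothetical.
results = {
    "litellm/GLM-4.5-Air-FP8-dev": [
        True, True, False, False, False, False, True,
        False, False, False, False, True, False,
    ],
    "litellm/Mistral-Large-3-sandbox": [False] * 13,
}


def success_rate(flags):
    """Return the percentage of passed tasks in a list of booleans."""
    return 100.0 * sum(flags) / len(flags)


# Per-model score, rounded to one decimal place as in the Score column.
per_model = {model: round(success_rate(flags), 1) for model, flags in results.items()}

# Overall figures: passed (model, task) combinations over all combinations.
passed = sum(sum(flags) for flags in results.values())
total = sum(len(flags) for flags in results.values())
overall = round(100.0 * passed / total, 1)

for model, score in per_model.items():
    print(f"{score}%\t{model}")
print(f"Overall: {passed}/{total} = {overall}%")
```

Under that assumption the sketch reproduces the reported values: 4 of 13 tasks for the first model, 0 of 13 for the second, and 4 of 26 combinations overall.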