Check out my website here! https://leaderboard.bycloud.ai/

In this video, I go through and explain the benchmarks used by Chatbot Arena and the Open LLM Leaderboard. These are general-purpose benchmarks for text-based LLMs, so HumanEval is not covered. Let me know which other benchmarks you'd like me to explain in the future!

[Chatbot Arena] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
[Open LLM Leaderboard] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
[MMLU] https://huggingface.co/datasets/cais/mmlu
[ARC] https://huggingface.co/datasets/ai2_arc
[Winogrande] https://huggingface.co/datasets/winogrande
[TruthfulQA] https://huggingface.co/datasets/truthful_qa
[GSM8K] https://huggingface.co/datasets/gsm8k
[MT-Bench] https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts

This video is supported by the kind Patrons & YouTube Members: 
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony,  Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi

[Discord] https://discord.gg/NhJZGtH
[Twitter] https://twitter.com/bycloudai
[Patreon] https://www.patreon.com/bycloud

[Profile & Banner Art] https://twitter.com/pygm7
[Video Editor] Silas

0:00 Intro 
0:57 MMLU 
1:41 ARC
2:10 HellaSwag
2:57 Winogrande
3:27 TruthfulQA
3:52 GSM8K
4:26 MT-Bench
5:05 Outro


bycloud

This video discusses seven popular benchmarks used to evaluate text-based large language models: MMLU, ARC, HellaSwag, Winogrande, TruthfulQA, GSM8K (Grade School Math 8K), and MT-Bench. Each benchmark serves a different purpose and measures a different aspect of a model's capabilities.

7 Popular LLM Benchmarks Explained [OpenLLM Leaderboard & Chatbot Arena]