The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think
A new state-of-the-art LLM (at least for creative writing and basic reasoning), but what lies behind the published numbers? Is it for real, and are AI agents about to grab your mouse and shake your cursor? Plus, results on my own SimpleBench, and new tools from Runway (Act-One), HeyGen (Zoom Calls) and an updated NotebookLM. AI, without the hype.
Weights and Biases' Weave: https://wandb.me/ai_explained
AI Insiders: https://www.patreon.com/AIExplained
Chapters:
00:00 – Introduction
00:57 – Claude 3.5 Sonnet (New) Paper
02:06 – Demo
02:58 – OSWorld
04:29 – Benchmarks compared + OpenAI Response
08:30 – Tau-Bench
13:09 – SimpleBench Results
17:05 – Yellowstone Detour
17:29 – Runway Act-One
18:44 – HeyGen Interactive Avatars + Demo
21:06 – NotebookLM Update
New Claude: https://www.anthropic.com/news/3-5-models-and-computer-use
https://www.anthropic.com/research/developing-computer-use
Paper: https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf
Demo Diversion: https://x.com/AnthropicAI/status/1848742761278611504
https://www.youtube.com/watch?v=jqx18KgIzAE
o1 Comparison: https://openai.com/index/learning-to-reason-with-llms/
https://www.swebench.com/
Tau Bench: https://arxiv.org/pdf/2406.12045
OSWorld: https://arxiv.org/pdf/2404.07972
GSM Reasoning: https://arxiv.org/pdf/2410.05229
Sierra Valuation: https://www.theinformation.com/articles/bret-taylors-ai-agent-startup-nears-deal-that-could-value-it-at-over-4-billion?rc=sy0ihq
Claude Impressions: https://x.com/skirano/status/1848750867245133982
o1 System Card: https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf
NotebookLM: https://notebooklm.google/
Runway Act-One: https://runwayml.com/research/introducing-act-one
HeyGen Zoom: https://labs.heygen.com/interactive-avatar/vicky
Ministral Comparison: https://x.com/armandjoulin/status/1846581336909230255
My Coursera Course - The 8 Most Controversial Terms in AI: https://imp.i384100.net/m57g3M
Non-hype Newsletter: https://signaltonoise.beehiiv.com/
I use Descript to edit my videos (no pauses or filler words!): https://get.descript.com/ldgxfuj2bhnb
Many people expense AI Insiders for work. Feel free to use the Template in the 'About Section' of my Patreon.
https://www.patreon.com/AIExplained
Posted Oct 23