Episodes

  • The latest on Llama 4, and whether it signals a slowdown in AI, or solid progress. Plus, a deep dive on that viral prediction of superintelligence by 2027, and Amodei’s cautionary words on what could stop AI progress in its tracks. o3 news, and more, as well.

    Weights & Biases: https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained


    DeepSeek Doc: https://www.patreon.com/posts/openai-is-not-r1-125869969

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:47 - Stock Crash
    02:28 - Llama 4
    10:55 - o3 News
    11:59 - OpenAI non-profit?
    13:13 - AI 2027

    Llama 4 Release: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

    Dario Amodei Comments: https://www.youtube.com/watch?v=esCSpbDPJik

    Knowledge Cut-off: https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/

    Aider Polyglot: https://aider.chat/docs/leaderboards/

    Gemini 1.5: https://arxiv.org/pdf/2403.05530

    Fiction-LiveBench: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87

    OpenAI Valuation: https://www.nytimes.com/2025/03/31/technology/openai-valuation-300-billion.html?login=smartlock&auth=login-smartlock

    OpenAI Cybersecurity: https://www.bloomberg.com/news/articles/2024-01-16/openai-working-with-us-military-on-cybersecurity-tools-for-veterans

    Deep research System Card: https://cdn.openai.com/deep-research-system-card.pdf

    https://openai.com/index/paperbench/

    AI 2027: https://ai-2027.com/

    METR Paper: https://arxiv.org/pdf/2503.14499

    OpenAI non-profit: https://openai.com/index/nonprofit-commission-guidance/

    NYT Piece: https://www.nytimes.com/2025/04/03/technology/ai-futures-project-ai-2027.html?unlocked_article_code=1.804._yKi.QhwOp15Q3tcU&smid=url-share&s=09

    Kokotajlo predictions 2021: https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like

    https://simple-bench.com/


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    Podcast: https://aiexplainedopodcast.buzzsprout.com/

  • Gemini gets a new record on Simple Bench, and several other benchmarks. I’ll go deep to explore its nuances, including how it deceptively reverse engineers answers, does better on certain coding benchmarks than others, may have a universal ‘conceptual language’ …

    https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained

    … and more. Plus practical tips, a note on security, and a Kling vs Veo 2 guest appearance.


    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:36 - Fiction Bench
    02:41 - Practicality - YouTube urls + Security - cut-off date
    03:42 - Coding
    06:22 - WeirdML Bench
    07:01 - Simple Bench Record High
    11:23 - Reverse Engineering!
    13:22 - Anthropic Paper
    17:49 - 3 Caveats

    Gemini 2.5 Updated: https://deepmind.google/technologies/gemini/

    Fiction Live Bench: https://fiction.live/stories/Fiction-liveBench-Feb-19-2025/oQdzQvKHw8JyXbN87

    https://simple-bench.com/

    WeirdML: https://htihle.github.io/weirdml.html
    https://x.com/htihle/status/1905014058228625542

    Anthropic Thoughts: https://www.anthropic.com/research/tracing-thoughts-language-model
    https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-cot

    https://aistudio.google.com/prompts/new_chat

    Search Study: https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

    Live bench: https://livebench.ai/#/
    Paper: https://arxiv.org/pdf/2406.19314

    LiveCode Bench: https://livecodebench.github.io/

    SWE-Verified: https://arxiv.org/pdf/2310.06770


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

  • Gemini 2.5 is out, on the same day as the new DeepSeek V3 (which should power Deepseek R2). Do both models prove AI is being commoditized? Let’s find out, on this blockbuster day of AI releases. Plus exclusives from the Information, Simple indications, Vista Bench, LM Arena and more…

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    01:15 - Gemini 2.5 Benchmarks
    05:46 - Long Context, Simple indication
    07:08 - New Deepseek V3-0324
    09:11 - Microsoft MAI
    11:48 - 90% of code but new Claude jobs

    ‘World’s most powerful model’: https://x.com/OfficialLoganK/status/1904580368432586975

    Gemini 2.5 Release Notes: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

    ‘Commoditized’: https://the-decoder.com/microsoft-ceo-satya-nadella-says-ai-models-are-getting-commoditized/

    Microsoft Information report: https://www.theinformation.com/articles/microsofts-ai-guru-wants-independence-from-openai-thats-easier-said-than-done?rc=sy0ihq

    LMarena: https://x.com/lmarena_ai/status/1904581128746656099/photo/1

    Free for now: https://x.com/btibor91/status/1904578053537476628

    Vista Bench: https://scale.com/leaderboard/visual_language_understanding

    DeepSeek V3: https://huggingface.co/deepseek-ai/DeepSeek-V3-0324

    Claude Plays Pokemon: https://www.twitch.tv/claudeplayspokemon
    Amodei: 100% Coding: https://www.youtube.com/watch?v=esCSpbDPJik&t=3017s

    Anthropic Jobs: https://job-boards.greenhouse.io/anthropic/jobs/4020717008

    Microsoft Money from Onslaught: https://www.972mag.com/microsoft-azure-openai-israeli-army-cloud/

    https://simple-bench.com/

    Release Date Comments: https://x.com/zacharynado/status/1904647277861318979


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

  • Is Manus AI the memecoin of the AI world, or legit? I’ll compare it to OpenAI’s Deep Research, Operator, Grok 3 DeepSearch and more to find out. I’ll also let you in on some of the secrets of what makes a good hype campaign, the estimated costs of Manus AI, and where it is strong. Other news (yes, Gemini image editing and research hacking, I mean you), will have to wait for a few more hours, as millions enquire about Manus AI.

    https://app.grayswan.ai/arena

    AI Insiders ($9!): https://www.patreon.com/AIExplained
    Patreon Vid: https://www.patreon.com/posts/4-ai-trends-in-123857767

    Chapters:
    00:00 - Introduction
    00:46 - Hype Campaign
    02:40 - Single, Public Benchmark
    03:12 - What is Manus AI?
    04:22 - Test 1
    05:12 - Cost and Rate Limits
    06:15 - Test 2 vs Deep Research + Grok 3 DeepSearch
    08:24 - Test 3 (not AGI)
    11:10 - 4 Trends in AI in 2025
    11:37 - Hype Works

    Manus AI: https://manus.im/app

    Xiao Hong Interview: https://www.chinatalk.media/p/manus-chinas-latest-ai-sensation

    Gaia Benchmark: https://openreview.net/pdf?id=fibxvahvs3
    MIT Report: https://www.technologyreview.com/2025/03/11/1113133/manus-ai-review/

    Information Report: https://www.theinformation.com/articles/anthropics-claude-drives-strong-revenue-growth-while-powering-manus-sensation?rc=sy0ihq

    Hype Examples: https://x.com/Saboo_Shubham_/status/1898425707401031940
    https://x.com/EHuanglu/status/1899110687902978373
    https://x.com/AJs_AI/status/1898756132384178291

    Mistakes: https://x.com/TheXeophon/status/1898737178273829220

    Tools and Code: https://x.com/peakji/status/1898994802194346408

    https://operator.chatgpt.com/




    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    Podcast: https://aiexplainedopodcast.buzzsprout.com/

  • GPT 4.5 is here, and do you remember when AI lab CEOs like Sam Altman and Dario Amodei were betting everything on scaling up base models like this one? Well, let’s find out what would have happened if the future of AI rested on models like GPT 4.5. You’ll see all the benchmarks, highlights of the paper, emotional intelligence and humor tests, Simple Bench results (Reddit was an unreliable source), and why it’s not all bad news for OpenAI.

    https://www.emergentmind.com/

    AI Insiders (now $9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    01:04 - Details and Benchmarks
    03:04 - Emotional intelligence?
    08:37 - Creative writing?
    11:40 - Visual reasoning and Pricing
    12:41 - Simple Performance
    16:01 - End of Pretraining Scaling?
    17:03 - CEO Hype
    18:11 - System Card Highlights
    23:32 - Karpathy Reaction

    GPT 4.5 System card: https://cdn.openai.com/gpt-4-5-system-card-2272025.pdf
    Release Notes: https://openai.com/index/gpt-4-5-system-card/
    Altman Hype: https://x.com/sama/status/1891533802779910471
    Details: https://openai.com/index/introducing-gpt-4-5/ https://x.com/OpenAI/status/1895219596317335792
    End of an Era: https://x.com/wgussml/status/1895187231666774377
    Anthropic Original Claim: https://techcrunch.com/2023/04/06/anthropics-5b-4-year-plan-to-take-on-openai/
    Smell: https://x.com/rapha_gl/status/1895213014699385082
    Bob McGrew: https://x.com/bobmcgrewai/status/1895228291981943265
    Deep Research System Card: https://cdn.openai.com/deep-research-system-card.pdf
    Reddit: https://www.reddit.com/r/singularity/comments/1izu1t7/gpt45_crushes_simple_bench/
    API Pricing: https://openai.com/api/pricing/
    LiveStream: https://www.youtube.com/watch?v=cfRYp0nItZ8&t=1s
    https://simple-bench.com/


    Karpathy Comparison: https://x.com/karpathy/status/1895213020982472863
    https://x.com/karpathy/status/1895337579589079434


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

  • Claude 3.7 is here, hot on the heels of Grok 3 and a host of other developments, but how good is it really? And what does it say about the next few months in AI? I’ve read the papers, played with the model for hours, and benched it on Simple. Things aren’t slowing down. Plus the latest in humanoid robots, led by Helix and freaked out by Protoclone. And reports of GPT 4.5 and DeepSeek R2.

    GraySwan Competition! https://app.grayswan.ai/arena/challenge/agent-red-teaming

    https://x.com/GraySwanAI/status/1894084923260043282

    Chapters:

    00:00 - Introduction

    01:25 - Claude 3.7 New Stats/Demos

    05:22 - 128k Output

    06:13 - Pokemon

    06:58 - Just a tool?

    09:54 - DeepSeek R2

    10:20 - Claude 3.7 System Card/Paper Highlights

    17:18 - Simple Record Score/Competition

    20:37 - Grok 3 + Redteaming prizes

    22:26 - Google Co-scientist

    24:02 - Humanoid Robot Developments

    3.7 Release Notes: https://www.anthropic.com/news/claude-3-7-sonnet

    vs o3 and Grok 3: https://x.com/12exyz/status/1891723056931827959

    Extended Thinking: https://www.anthropic.com/research/visible-extended-thinking?s=09

    System Prompt: https://docs.anthropic.com/en/release-notes/system-prompts#feb-24th-2025

    System Card: https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf

    Unfaithful CoT: https://arxiv.org/pdf/2305.04388

    Original Constitution: https://www.anthropic.com/news/claudes-constitution

    Responsible Scaling Policy: https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf

    Amodei and Hassabis: https://www.youtube.com/watch?v=4poqjZlM8Lo

    https://simple-bench.com/

    400M Weekly Users: https://x.com/bradlightcap/status/1892579908179882057

    Grok 3 Jailbroken: https://x.com/LinusEkenstam/status/1893832876581380280

    Google Co-Scientist: https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/

    But Hassabis Says Years Away: https://www.youtube.com/watch?v=yr0GiSgUvPU&t=156s

    DeepSeek R2 Reuters: https://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/

    Protoclone: https://www.reddit.com/r/interestingasfuck/comments/1it9rpp/protoclone_the_worlds_first_bipedal/

    Helix: https://www.figure.ai/news/helix

    TechTrance: https://www.youtube.com/@TheTechTrance/videos

    GPT 4.5 Soon:

  • A 'frontier reasoning model' from just 1000 examples (s1). A $100B Musk bid for power. Gemini 2, RAND, and a warning from Amodei. Here are 7-8 developments you may have missed, but which I would argue help us understand how the next few years will play out. From labour vs capital to automating rival companies and countries, and from non-profit shenanigans to new mini-docs, there was just too much for me not to make a vid.

    GiveWell: https://www.givewell.org/charities/top-charities

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    s1 Paper: https://arxiv.org/pdf/2501.19393
    Musk Bid: https://www.wsj.com/tech/ai/musks-97-4-billion-openai-bid-piles-pressure-on-altman-f6749e6c?mod=hp_lead_pos1
    Altman Reply: https://x.com/sama/status/1889059531625464090?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Etweet
    Google vs OpenAI: https://x.com/sama/status/1888703820596977684
    RAND Study: https://www.rand.org/pubs/perspectives/PEA3691-4.html
    Dev Meetup: https://x.com/btibor91/status/1888976302621040852
    Altman $100 Trillion: https://www.nytimes.com/2023/03/31/technology/sam-altman-open-ai-chatgpt.html
    Karpathy Vid: https://www.youtube.com/watch?v=7xTGNNLPyMI
    Amodei Warning: https://www.anthropic.com/news/paris-ai-summit
    Bengio Source: https://www.youtube.com/watch?v=6HDjVncL5Go

    Chapters:
    00:00 - Intro
    01:37 - AGI Inches Closer
    04:26 - ‘Super-Exponential’
    05:58 - Musk Bid
    07:34 - Luxury Goods and Land
    09:05 - ‘Benefits All Humanity’
    12:52 - ‘National Security’
    14:21 - s1
    20:33 - Final thoughts


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

  • 12 hours ago Deep Research was unveiled, and I’ve tested it thoroughly, including vs Deepseek R1 with search, Gemini Deep Research and even R1 in Perplexity. It’s a notable step forward, with one big caveat. I’ll go through all the benchmark figures, my initial impression of the o3 model within, and much more.

    Deep Research: https://openai.com/index/introducing-deep-research/

    https://www.youtube.com/watch?v=YkCDVn3_wiw

    GAIA Bench: https://openreview.net/forum?id=fibxvahvs3

    https://openreview.net/pdf?id=fibxvahvs3

    CodeELO: https://arxiv.org/pdf/2501.01257

    CamelCamel: https://uk.camelcamelcamel.com/

    Deepseek R1 with search: https://chat.deepseek.com/

    https://arxiv.org/pdf/2501.12948

    HaluBench: https://arxiv.org/pdf/2407.08488

    Chapters:

    00:00 - Introduction

    01:06 - Powered by o3, Humanity’s Last Exam, GAIA

    03:55 - Simple Tests

    06:00 - Good News vs Deepseek R1 and Gemini Deep Research

    09:32 - Bad News on Hallucinations

    14:14 - What Can’t it Browse?

    14:42 - For Shopping?

    16:40 - Final thoughts



  • o3-mini is here, and yes, I’ve read the paper in full - 2 hours after release, and even the post-launch Reddit AMA. Some epic details like a FrontierMath score that made me double-take, a likely new Cursor favorite, bio risk expertise and a cost-comparison with Deepseek R1. But does it perform on basic reasoning? Let’s find out. Plus, arguably the bigger story - the increasingly frenetic rhetoric coming out of the West - and Dario Amodei and Alexandr Wang (CEOs of Anthropic and Scale AI respectively) in particular. The last thing we need is an “AI War”.

    https://wandb.me/simple-bench

    (Colab): https://colab.research.google.com/drive/1AVijcPnEkl8Gy_754XbRdG5m7Q5-9slg?usp=sharing


    Chapters:

    00:00 - Introduction

    00:45 - o3 mini

    05:11 - First impressions vs Deepseek R1

    07:21 - 10x Scale, o3-mini System Card, Amodei Essay, bitcoin wallets…

    12:40 - Simple Competition Finale

    13:03 - Clips and Final Thoughts on the “AI War”



    O3-mini: https://openai.com/index/openai-o3-mini/

    Paper: https://cdn.openai.com/o3-mini-system-card.pdf

    Amodei Essay: https://darioamodei.com/on-deepseek-and-export-controls?s=09

    FrontierMath wild stat: https://arxiv.org/pdf/2411.04872

    Sam Altman Channels Napoleon: https://x.com/sama/status/1883185690508488934

    Altman ‘pulls up releases’: https://x.com/sama/status/1884066337103962416

    “AI War” by Wang: https://scale.com/blog/win-the-ai-war

    Anthropic Original Views on Capabilities: https://www.anthropic.com/news/core-views-on-ai-safety

    AI Insider Cost Comparison: https://x.com/arankomatsuzaki/status/1884676245922934788

    Deepseek R1 Paper: https://arxiv.org/pdf/2501.12948

    R1, o3-mini Price Comparison: https://techcrunch.com/2025/01/31/openai-launches-o3-mini-its-latest-reasoning-model/

    Semianalysis on $1.3M DeepSeek salaries, and them falling behind as ‘the time gap to match US capabilities increases’: https://semianalysis.com/2025/01/31/deepseek-debates/

    OpenAI Valuation: https://www.bloomberg.com/news/articles/2025-01-30/openai-in-talks-to-raise-funding-at-340-billion-value-wsj-says?srnd=phx-ai

    Wang Clip: https://x.com/tsarnick/status/1867700453494206883

    Amodei Clip: https://x.com/ai_ctrl/status/1884951111771001188

    https://simple-bench.com/



  • When it rains, it pours. OpenAI Operator tested and reviewed, with full paper analysis. Perplexity Assistant is useful. Then Stargate, is it all smoke and mirrors? Strong rumours of an o3+ model from Anthropic. Then a full breakdown of Deepseek R1, and what its training method says about the state of AI. It’s not open source BTW. Plus Humanity’s Last Exam, and Hassabis Accelerates his AGI timeline.

    00:00 - Introduction

    00:54 - OpenAI Operator

    04:53 - Perplexity Assistant

    05:15 - StarGate

    07:51 - Better than o3?

    08:25 - DeepSeek R1 Analysis

    12:12 - Training Secrets

    15:19 - No More Process Rewarding?

    19:01 - Hassabis Timeline Accelerates

    21:22 - Humanity’s Last Exam

    https://app.grayswan.ai/arena/chat/harmful-ai-assistant

    https://app.grayswan.ai/arena

    https://openai.com/index/computer-using-agent/

    System Prompt: https://github.com/wunderwuzzi23/scratch/blob/master/system_prompts/operator_system_prompt-2025-01-23.txt

    OpenAI Operator: https://operator.chatgpt.com/

    System Card: https://cdn.openai.com/operator_system_card.pdf

    There is No Plan: https://x.com/jeffclune/status/1882120726339318007

    Perplexity Assistant: https://x.com/perplexity_ai/status/1882466239123255686

    Stargate: https://openai.com/index/announcing-the-stargate-project/

    Labour goes to 0: https://moores.samaltman.com/

    Larry Ellison AI Surveillance: https://x.com/TheChiefNerd/status/1882042989184430332

    Amodei 1984: https://www.bloomberg.com/news/articles/2025-01-22/anthropic-ceo-says-openai-s-stargate-venture-seems-chaotic

    Microsoft Hesitate: https://www.theinformation.com/articles/why-sam-altman-joined-forces-with-larry-ellison-and-took-a-step-back-from-microsoft?rc=sy0ihq

    Dylan Patel o3+ for Anthropic: https://www.youtube.com/watch?v=7EH0VjM3dTk

    Deepseek R1: https://arxiv.org/pdf/2501.12948

    https://arxiv.org/pdf/2412.19437

    Diagram: https://pbs.twimg.com/media/GhyQsM6WQAE7W52?format=jpg&name=large

    https://simple-bench.com/

    Process: https://x.com/sama/status/1664018190840614912

    https://x.com/karpathy/status/1835561952258723930

    https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/?s=09

    Demis Interview: https://www.youtube.com/watch?v=yr0GiSgUvPU

    Humanity’s Last Exam:

    https://agi.safe.ai/

    https://x.com/DanHendrycks/status/1882481730671857815

    https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html?s=09



  • OpenAI looks set to debut their Operator system, and some leaks are out. At the same time Deepseek R1 releases some numbers, and Sam Altman says he might have been wrong before, and now anticipates a 'fast take-off'. Plus two papers to give you an idea of what a super-agent might be decent at doing, some more exclusive article analysis and much more. Who said anything else is happening today...

    80,000 Hours Channel: https://www.youtube.com/channel/UCafjal1QYJ3rb0Y9xZk1Ezg
    Spotify: https://open.spotify.com/show/2WzJwXWBDnn4iZ7odKwDib

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    01:13 - Pro Cost and OpenAI Operator
    04:00 - Agent Benchmarks Being Targeted
    07:48 - Fast Take-off, Altman
    08:48 - Altman flip-flops
    10:02 - Deepseek R1 First Reaction

    Altman ‘100x expectations out of control’: https://x.com/sama/status/1881258443669172470
    OpenAI Operator Table: https://x.com/btibor91/status/1881285255266750564
    WebVoyager: https://arxiv.org/pdf/2401.13919
    OSWorld: https://arxiv.org/pdf/2404.07972
    Axios Exclusive 1 (SuperAgent): https://www.axios.com/2025/01/19/ai-superagent-openai-meta?s=09
    Axios Exclusive 2: https://www.axios.com/2025/01/18/biden-sullivan-ai-race-trump-china
    Deepseek R1 Numbers: https://x.com/deepseek_ai/status/1881318130334814301
    Does 1.5B outperform 3.5 Sonnet on Math?: https://x.com/reach_vb/status/1881319500089634954
    Deepseek R1 (deepseek-reasoner) Pricing: https://api-docs.deepseek.com/quick_start/pricing/
    Altman Fast Takeoff: https://x.com/tsarnick/status/1879100390840697191
    OpenAI Economic Blueprint: https://cdn.openai.com/global-affairs/ai-in-america-oai-economic-blueprint-20250113.pdf
    Target is Long-horizon Tasks: https://x.com/karinanguyen_/status/1879576037249667520
    Support Regulations: https://www.techemails.com/p/elon-musk-and-openai
    https://www.nytimes.com/2023/05/16/technology/openai-altman-artificial-intelligence-regulation.html
    Donation: https://qz.com/sam-altman-donate-million-zuckerberg-bezos-donald-trump-1851721035
    Amodei on Regulations by 2025: https://www.youtube.com/watch?v=ugvHCXCOmm4
    ‘Feel the AGI’: https://x.com/polynoamial?lang=en
    GPT-5 and o-series merger: https://x.com/sama/status/1880358749187240274
    o1 Thinks in Chinese: https://techcrunch.com/2025/01/14/openais-ai-reasoning-model-thinks-in-chinese-sometimes-and-no-one-really-knows-why/



    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

  • Sam Altman unexpectedly brings his timelines to AGI forward, while OpenAI backtrack on superintelligence. None of these changes were heralded, but they are significant. Plus the new year brings new assessments of the true capability of models to automate 'large swathes of the economy'. I'll give my prediction on that front for 2025, announce a new Simple Bench competition, and showcase Kling 1.6 vs Veo 2 vs Sora, and much more.

    wandb.me/simple-bench

    (Colab): https://colab.research.google.com/drive/1AVijcPnEkl8Gy_754XbRdG5m7Q5-9slg?usp=sharing

    TheAgentCompany Paper: https://arxiv.org/pdf/2412.14161v1

    Sam Altman Major Interview: https://www.bloomberg.com/features/2025-sam-altman-interview/?srnd=phx-ai

    OpenAI Agent Coming Jan 2025: https://www.theinformation.com/articles/why-openai-is-taking-so-long-to-launch-agents?rc=sy0ihq

    Altman Singularity: https://x.com/sama/status/1875603249472139576

    Altman Original Timeline: https://www.youtube.com/watch?v=7dCPytNTnjk&t=621s

    https://www.ft.com/content/34a7a082-e685-4e02-bca7-61ff89d99ed2

    OpenAI Original Emails: https://www.lesswrong.com/posts/5jjk4CDnj9tA7ugxr/openai-email-archives-from-musk-v-altman-and-openai-blog

    DeepMind Sky News 2014 Article: https://news.sky.com/story/google-buys-uk-intelligence-firm-deepmind-10419783

    Altman Blog Reflections: https://blog.samaltman.com/reflections

    OpenAI Changes Who Gets AGI: https://openai.com/index/why-our-structure-must-evolve-to-advance-our-mission/?s=09

    OpenAI 5 Levels: https://www.bloomberg.com/news/articles/2024-07-11/openai-sets-levels-to-track-progress-toward-superintelligent-ai

    Altman 2015: https://blog.samaltman.com/machine-intelligence-part-1

    OpenAI React to Anthropic: https://www.theinformation.com/articles/how-anthropic-got-inside-openais-head?rc=sy0ihq

    Microsoft $100B Definition: https://www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership?rc=sy0ihq
    Epoch Scramble for Task Benchmark: https://x.com/tamaybes/status/1876692639363612919

    GPQA Progress: https://epoch.ai/data/ai-benchmarking-dashboard

    Task Length Crucial for ARC-AGI: https://anokas.substack.com/p/llms-struggle-with-perception-not-reasoning-arcagi

    RL Environment Tweet: https://x.com/vedantmisra/status/1876327518157807990

    Jason Wei Talk: https://www.youtube.com/watch?v=yhpjpNXJDco

    Miles Brunda

  • o3 isn’t one of the biggest developments in AI for 2+ years because it beats a particular benchmark. It is so because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.

    FrontierMath: https://epoch.ai/frontiermath

    https://arxiv.org/pdf/2411.04872

    Chollet Statement: https://arcprize.org/blog/oai-o3-pub-breakthrough

    MLC Paper:

    https://www.scientificamerican.com/article/new-training-method-helps-ai-generalize-like-people-do/?utm_campaign=socialflow&utm_source=twitter&utm_medium=social

    AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf

    Human Performance on ARC-AGI: https://arxiv.org/pdf/2409.01374v1

    Wei Tweet ‘3 months’: https://x.com/_jasonwei/status/1870184982007644614

    Deliberative Alignment Paper: https://openai.com/index/deliberative-alignment/

    Brown Safety Tweet: https://x.com/polynoamial/status/1870196476908834893

    Swe-Bench Verified: https://openai.com/index/introducing-swe-bench-verified/

    Amodei Prediction: https://x.com/OfirPress/status/1858567863788769518

    David Dohan: 16 hours https://x.com/dmdohan/status/1870171404093796638

    OpenAI Personal Writing: https://openai.com/index/learning-to-reason-with-llms/

    https://simple-bench.com/

    John Hallman Tweet: https://x.com/johnohallman/status/1870233375681945725

    00:00 - Introduction

    01:19 - What is o3?

    03:18 - FrontierMath

    05:15 - o4, o5

    06:03 - GPQA

    06:24 - Coding, Codeforces + SWE-verified, AlphaCode 2

    08:13 - 1st Caveat

    09:03 - Compositionality?

    10:16 - SimpleBench?

    13:11 - ARC-AGI, Chollet



  • The ‘Gemini 2 Era’ begins … with screen-sharing? But really, it’s a great free tool, for curiosity satisfying rather than bleeding-edge intelligence. I give you the benchmarks, the highlights and of course, the latest from OpenAI Advanced Voice Mode with Vision.

    Plus Deep Research in Gemini Advanced, Simple Bench updates, Santa and what might be for some of you Google’s deflating admission.

    00:00 - Introduction

    00:38 - Live Interaction

    03:43 - Gemini 2.0 Flash Benchmarks

    05:10 - Audio and Image Output

    06:38 - Project Mariner (+ WebVoyager Bench)

    08:49 - But Progress Slowing Down?

    10:43 - OpenAI Announcements + Games



    https://aistudio.google.com/live

    Gemini 2.0 Flash Benchmarks: https://deepmind.google/technologies/gemini/

    Project mariner: https://deepmind.google/technologies/project-mariner/

    WebVoyager: https://x.com/laurentsifre/status/1858918588683296875/photo/1

    Gemini Game play: https://www.youtube.com/watch?v=IKuGNHJBGsc

    Advanced Voice Mode OpenAI: https://www.youtube.com/watch?v=NIQDnWlwYyQ

    https://simple-bench.com/

    Claude Computer Use: https://docs.anthropic.com/en/docs/build-with-claude/computer-use

    Oriol Vinyals Interview: https://www.youtube.com/watch?v=78mEYaztGaw&t=687s



  • After a 10 month wait, OpenAI have released Sora to paying users. With just a prompt it can generate videos of up to 20 seconds in lower resolutions, and 10 seconds at 1080p if you can fork out $200/month. I’ve tested it and read the system card. The user interface is quite beautiful, even if the videos themselves operate under entirely new rules of physics. But I can’t help wondering if OpenAI want us to focus on releases like this, rather than some quietly broken promises.



    80,000 hours Website, Podcast + Channel:

    https://80000hours.org/

    https://open.spotify.com/show/2WzJwXWBDnn4iZ7odKwDib https://www.youtube.com/@eightythousandhours/videos

    https://openai.com/sora/

    Sora Countries: https://help.openai.com/en/articles/10250692-sora-supported-countries

    Sora Credits: https://help.openai.com/en/articles/10245774-sora-billing-credits-faq

    https://runwayml.com/ and https://pika.art/home

    DeepMind Veo: https://deepmind.google/technologies/veo/

    Sam Altman Ads as Last Resort: https://www.windowscentral.com/software-apps/openai-could-chase-intrusive-ads-as-last-resort

    But OpenAI Considering Ads: https://www.inc.com/ben-sherry/is-openai-getting-into-the-advertising-business-the-company-is-sending-mixed-messages/91033533

    OpenAI Backtracks on Microsoft AGI Clause: https://www.ft.com/content/2c14b89c-f363-4c2a-9dfc-13023b6bce65

    As Microsoft Boast of Labor Savings: https://www.theinformation.com/articles/microsofts-new-sales-pitch-for-ai-spend-less-money-on-humans?rc=sy0ihq

    OpenAI Military Pivot: https://www.technologyreview.com/2024/12/04/1107897/openais-new-defense-contract-completes-its-military-pivot/

    Employees Have Doubts: https://www.washingtonpost.com/technology/2024/12/06/openai-anduril-employee-military-ai/?nid=top_pb_signin&arcId=KZIV7PLRHBCVNPAIAAAVUNRHIM&account_location=ONSITE_HEADER_ARTICLE



  • Oh boy. o1 pro mode out on the same night as o1 full. I read the 49 page paper, ran my own tests, spent my fuel allowance on Pro Mode and will give you all the highlights. Suffice to say the story is not as simple as it first appears.

    Weights and Biases’ Weave: wandb.me/ai_explained

    Plus, GPT-4.5? MLE Bench, Simple Update, Image Analysis and much more

    o1 System Card: https://cdn.openai.com/o1-system-card-20241205.pdf

    Apollo Research: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations

    Altman Tweet: https://x.com/AnonCEOMakeItAi/status/1864763052622504344

    ChatGPT Pro: https://openai.com/index/introducing-chatgpt-pro/

    Tibor Blaho: https://x.com/btibor91/status/1864709670470066605

    Simple-bench.com

    00:00 - Introduction

    00:27 - ChatGPT Pro is $200

    01:25 - OpenAI Benchmarks

    03:20 - o1 System Card, o1 and o1 Pro Mode vs o1-preview

    06:18 - Simple Bench surprising results on sample

    08:31 - Weights & Biases

    09:05 - Image Analysis Compared

    12:51 - More Benchmarks and Safety

  • Calm before the storm? Whatever analogy you want to use, things had gotten quiet toward the end of 2024. But then tonight we got Genie 2, and a series of scheduled announcements from OpenAI. Sora is soon here, and o1, but I dive deeper into what it all means and whether reliability is on a path to being solved, ft: two recent papers.

    Assembly AI Speech to Text: https://www.assemblyai.com/?utm_source=youtube&utm_medium=influencer&utm_campaign=ai_explained

    Plus Kling Motion Brush, Simple Bench QwQ update and much more.


    Genie 2: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/

    Jim Cramer: https://x.com/jimcramer/status/1864068878692675625

    Give Us Full o1: https://x.com/tszzl/status/1863882905422106851

    Verge Scoop: https://x.com/tomwarren/status/1864326361415925861

    O1 Learning to Reason Benchmarks: https://openai.com/index/learning-to-reason-with-llms/

    SIMA AI: https://arxiv.org/pdf/2404.10179

    Genie Paper: https://arxiv.org/pdf/2402.15391

    My Video on Genie: https://www.youtube.com/watch?v=gGKsfXkSXv8

    Oasis Minecraft: https://x.com/risphereeditor/status/1852619965511204974

    LLMs Procedural Knowledge Paper: https://arxiv.org/pdf/2411.12580

    Bag of Heuristics Paper: https://arxiv.org/pdf/2410.21272

    Jensen Huang Hallucinations: https://www.tomshardware.com/tech-industry/artificial-intelligence/jensen-says-we-are-several-years-away-from-solving-the-ai-hallucination-problem-in-the-meantime-we-have-to-keep-increasing-our-computation

    DeepSeek Interview: https://www.chinatalk.media/p/deepseek-ceo-interview-with-chinas

    Kling Motion Brush: https://klingai.com/image-to-video

    Tim Rocktäschel Book: https://geni.us/ArtificialIntelligence

    00:43 - OpenAI 12 Days, Sora Turbo, o1

    03:06 - Genie 2

    08:26 - Jensen Huang and Altman Hallucination Predictions

    09:45 - Bag of Heuristics Paper

    11:40 - Procedural Knowledge Paper

    13:02 - AssemblyAI Universal 2

    13:45 - SimpleBench QwQ and Chinese Models

    14:42 - Kling Motion Brush



  • A new and mysterious Gemini model appears at the top of the leaderboard, but is that the full story? I dig behind the headline to show you some anti-climactic results, give some context from the last 48 hours of leaks about diminishing returns to scaling, and add the response of Altman, OpenAI and co. The future is about to look a lot stranger...


    80,000 hours Podcast and Channel: https://open.spotify.com/show/2WzJwXWBDnn4iZ7odKwDib
    https://www.youtube.com/@eightythousandhours/videos

    You can now gift memberships to AI Insiders (my Patreon w/ exclusive vids, network): https://www.patreon.com/AIExplained/gift


    ‘There is no wall’: https://x.com/sama/status/1856941766915641580

    https://x.com/vedantmisra/status/1857148554105544708

    Gemini Ranking: https://lmarena.ai/?leaderboard

    API not yet up: https://x.com/OfficialLoganK/status/1857106844805681153

    ‘Just Die Chat’: https://x.com/koltregaskes/status/1856754648146653428

    Google CEO tweet: https://x.com/sundarpichai/status/1857114106928718329

    Sutskever Quote: https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/

    Another OpenAI Staffer Leaves: https://x.com/RichardMCNgo/status/1856843040427839804

    Bloomberg Report: https://www.bloomberg.com/news/articles/2024-11-13/openai-google-and-anthropic-are-struggling-to-build-more-advanced-ai?s=09

    Noam Brown on what OpenAI Researchers Believe: https://x.com/polynoamial/status/1855037689533178289

    Clive Chan: https://x.com/itsclivetime/status/1855704120495329667

    Chollet Responds to Altman: https://x.com/fchollet/status/1857060079586975852

    https://x.com/sama/status/1856940152460869718

    Altman Emails: https://x.com/TechEmails/status/1857285960997712356

    Change of Heart: https://sd11.senate.ca.gov/news/senator-wiener-responds-openai-opposition-sb-1047

    Amodei on ‘Empirical Regularities’: https://lexfridman.com/dario-amodei-transcript/

    Verge Report: https://www.theverge.com/2024/10/25/24279600/google-next-gemini-ai-model-openai-december

    OpenAI Agents in January: https://www.bloomberg.com/news/articles/2024-11-13/openai-nears-launch-of-ai-agents-to-automate-tasks-for-users?srnd=phx-ai

  • The last few days have seen two narratives emerge. The first, derived from yesterday’s OpenAI leak in The Information, is that GPT-5/Orion is a disappointment, and less of a leap than GPT-3 to GPT-4. The second comes from a series of 4 clips (shown in this video) from Sam Altman, regarding the ‘clear path’ to AGI. Let’s go beyond the headlines (and through papers like FrontierMath) to get closer to the ground truth…

    Plus Universal-2, Sora comments, Claude 3.5 Haiku SimpleBench update, and a great new AI video.


    Assembly AI Speech to Text: https://www.assemblyai.com/?utm_source=youtube&utm_medium=influencer&utm_campaign=ai_explained

    00:39 – Bear Case, TheInformation Leak

    04:01 – Bull Case, Sam Altman

    06:20 – FrontierMath

    11:29 – o1 Paradigm

    13:11 – Text to Video Greatness and Universal-2

    TheInformation Leak: https://www.theinformation.com/articles/openai-shifts-strategy-as-rate-of-gpt-ai-improvements-slows?rc=sy0ihq

    Noam Brown Replies: https://x.com/polynoamial/status/1855453104394637444

    Sam Altman Y-Combinator Interview: https://www.youtube.com/watch?v=xXCBz_8hM9w&t=1556s

    Altman Reply: https://x.com/sama/status/1855100359511097828

    https://simple-bench.com/

    FrontierMath Paper: https://arxiv.org/pdf/2411.04872

    Frontier Math Blog Post: https://epochai.org/frontiermath

    Tao: https://x.com/EpochAIResearch/status/1854996368814936250

    MMLU Are We Done (cites me!): https://arxiv.org/pdf/2406.04127

    Universal-2: https://www.assemblyai.com/research/universal-2

    Noam Brown ‘We don’t know’: https://www.youtube.com/watch?v=Gr_eYXdHFis

    Anthropic Founder Response: https://x.com/jackclarkSF/status/1855485569998217231

    Sora (Runway Comment): https://x.com/c_valenzuelab/status/1855026417354129455

    Sora New Vid: https://www.youtube.com/watch?v=_iETa2KDRuw

    Darri3D Video: https://www.reddit.com/r/ChatGPT/comments/1gn0n3z/can_you/

  • The Google destroyer, the Perplexity crusher? Or just hype? ChatGPT with Search is here, and simultaneously Altman and co did an AMA on Reddit, covering GPT-5, Sora, SearchGPT and a lot more. Plus, the biggest news of them all: Simple Bench is out.

    ChatGPT with Search: https://openai.com/index/introducing-chatgpt-search/

    Altman AMA (ask me anything): https://www.reddit.com/r/ChatGPT/comments/1ggixzy/ama_with_openais_sam_altman_kevin_weil_srinivas/

    https://x.com/sama/status/1852041075793522911

    Perplexity Ads: https://www.cnbc.com/2024/08/22/perplexity-ai-plans-to-start-running-search-ads-in-fourth-quarter.html

    Perplexity: https://www.perplexity.ai/

    https://simple-bench.com/