Episodes

  • We are re-upping this episode after LMArena announced their fresh Series A (https://www.theinformation.com/articles/ai-evaluation-startup-lmarena-valued-1-7-billion-new-funding-round?rc=luxwz4), raising $150M at a $1.7B valuation, with $30M annualized consumption revenue (aka $2.5M MRR) after their September evals product launch.

    ---

    From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 at one of the most influential platforms in AI, trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases? We caught up with Anastasios live at NeurIPS 2025 to dig into the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company), why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company), how they're spending that $100M (inference costs, React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market), the Leaderboard Illusion controversy and why their response demolished the paper's claims (factual errors, misrepresentation of open vs. closed source sampling, and ignoring the transparency of preview testing that the community loves), why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system: models can't pay to get on, can't pay to get off, and scores reflect millions of real votes), how they're expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon), why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment), and his vision for Arena as the central evaluation platform that provides the North Star for the industry: constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users.

    We discuss:





    The $100M raise: use of funds is primarily inference costs (funding free usage for tens of millions of monthly conversations), React migration off Gradio (custom loading icons, better developer hiring, more flexibility), and hiring world-class talent



    The scale: 250M+ conversations on the platform, tens of millions per month, 25% of users do software for a living, and half of users are now logged in



    The leaderboard illusion controversy: Cohere researchers claimed undisclosed private testing created inequities, but Arena's response demolished the paper's factual errors (misrepresented open vs. closed source sampling, ignored transparency of preview testing that the community loves)



    Why preview testing is loved by the community: secret codenames (Gemini Nano Banana, named after PM Naina's nickname), early access to unreleased models, and the thrill of being first to vote on frontier capabilities



    The Nano Banana moment: changed Google's market share overnight, billions of dollars in stock movement, and validated that multimodal models (image generation, video) are economically critical for marketing, design, and AI-for-science



    New categories: occupational and expert arenas (medicine, legal, finance, creative marketing), Code Arena, and video arena coming soon



    Chapters

    00:00:00 Introduction: Anastasios from Arena and the LMArena Journey
    00:01:36 The Anjney Midha Incubation: From Berkeley Basement to Startup
    00:02:47 The Decision to Start a Company: Scaling Beyond Academia
    00:03:38 The $100M Raise: Use of Funds and Platform Economics
    00:05:10 Arena's User Base: 5M+ Users and Diverse Demographics
    00:06:02 The Competitive Landscape: Artificial Analysis, AI.xyz, and Arena's Differentiation
    00:08:12 Educational Value and Learning from the Community
    00:08:41 Technical Migration: From Gradio to React and Platform Evolution
    00:10:18 Leaderboard Illusion Paper: Addressing Critiques and Maintaining Integrity
    00:12:29 Nano Banana Moment: How Preview Models Create Market Impact
    00:13:41 Multimodal AI and Image Generation: From Skepticism to Economic Value
    00:15:37 Core Principles: Platform Integrity and the Public Leaderboard as Charity
    00:18:29 Future Roadmap: Expert Categories, Multimodal, Video, and Occupational Verticals
    00:19:10 API Strategy and Focus: Doing One Thing Well
    00:19:51 Community Management and Retention: Sign-In, History, and Daily Value
    00:22:21 Partnerships and Agent Evaluation: From Devin to Full-Featured Harnesses
    00:21:49 Hiring and Building a High-Performance Team

  • From undergraduate research seminars at Princeton to winning a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep, unlocking performance gains that the RL community thought impossible. We caught up with the team live at NeurIPS to dig into the story behind RL1000: why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it's not just about depth, it's about the objective), how they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse, the critical architectural tricks that made it work (residual connections, layer normalization, and a shift from regression to classification), why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth), how JAX and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place), the "critical depth" phenomenon where performance doesn't just improve but multiplies once you cross 15M+ transitions and add the right architectural components, why this isn't just "make networks bigger" but a fundamental shift in RL objectives (their code doesn't have a line saying "maximize rewards"; it's pure self-supervised representation learning), how deep teacher, shallow student distillation could unlock deployment at scale (train frontier capabilities with 1000 layers, distill down to efficient inference models), the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection), and their thesis that RL is finally ready to scale like language and vision: not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work.
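The "linear vs. quadratic growth" point can be checked with a quick parameter count. The sketch below is illustrative only: it assumes a plain fully connected network with equal-width hidden layers and ignores layer-norm parameters. The function name `mlp_params` and all sizes are hypothetical, not taken from the paper.

```python
def mlp_params(in_dim: int, width: int, depth: int, out_dim: int) -> int:
    """Count weights and biases in an MLP with `depth` equal-width hidden layers.

    Assumption: plain dense layers only; residual connections add no
    parameters and layer-norm scales/offsets are ignored here.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    total = in_dim * width + width                   # input -> first hidden layer
    total += (depth - 1) * (width * width + width)   # hidden -> hidden layers
    total += width * out_dim + out_dim               # last hidden -> output
    return total


# Hypothetical sizes, chosen only to show the trend.
base = mlp_params(in_dim=32, width=256, depth=4, out_dim=8)
deeper = mlp_params(in_dim=32, width=256, depth=8, out_dim=8)   # double the depth
wider = mlp_params(in_dim=32, width=512, depth=4, out_dim=8)    # double the width
print(base, deeper, wider)
```

Each extra layer adds a fixed `width * width + width` parameters, so depth grows the count linearly, while doubling the width roughly quadruples every hidden-to-hidden weight matrix.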

    We discuss:





    The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together, states along different trajectories are pushed apart—turning RL into a classification problem



    Why naive scaling failed: doubling depth degraded performance, doubling again with residual connections and layer norm suddenly skyrocketed performance in one environment—unlocking the "critical depth" phenomenon



    Scaling depth vs. width: depth grows parameters linearly, width grows quadratically—depth is more parameter-efficient and sample-efficient for the same performance



    The JAX + GPU-accelerated environments unlock: collecting thousands of trajectories in parallel meant data wasn't the bottleneck, and crossing 15M+ transitions was when deep networks really paid off



    The blurring of RL and self-supervised learning: their code doesn't maximize rewards directly, it's an actor-critic goal-conditioned RL algorithm, but the learning burden shifts to classification (cross-entropy loss, representation learning) instead of TD error regression



    Why scaling batch size unlocks at depth: traditional RL doesn't benefit from larger batches because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension
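The "RL as classification" idea discussed above can be sketched as an InfoNCE-style contrastive loss. This is a minimal illustration under stated assumptions, not the authors' code: it assumes representations are already computed, uses a dot product as the similarity score, and treats row `i` of each batch as coming from the same trajectory. All names are hypothetical.

```python
import math


def dot(u, v):
    """Dot-product similarity between two representation vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(state_action_reprs, future_state_reprs):
    """Cross-entropy where the correct class for row i is column i.

    Positive pair: (state, action) i and a future state from the same
    trajectory. Negatives: future states from the other trajectories
    in the batch.
    """
    n = len(state_action_reprs)
    loss = 0.0
    for i, phi in enumerate(state_action_reprs):
        logits = [dot(phi, psi) for psi in future_state_reprs]
        loss -= log_softmax(logits)[i]
    return loss / n


# Toy batch: aligned pairs should score a lower loss than mismatched ones.
phi = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(phi, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss(phi, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)
```

Note that no reward term appears anywhere: the learning signal is a cross-entropy over which future state belongs to which trajectory, which is the shift away from TD-error regression that the notes describe.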



    RL1000 Team (Princeton)





    1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities: https://openreview.net/forum?id=s0JVsx3bx1



    Chapters

    00:00:00 Introduction: Best Paper Award and NeurIPS Poster Experience
    00:01:11 Team Introductions and Princeton Research Origins
    00:03:35 The Deep Learning Anomaly: Why RL Stayed Shallow
    00:04:35 Self-Supervised RL: A Different Approach to Scaling
    00:05:13 The Breakthrough Moment: Residual Connections and Critical Depth
    00:07:15 Architectural Choices: Borrowing from ResNets and Avoiding Vanishing Gradients
    00:07:50 Clarifying the Paper: Not Just Big Networks, But Different Objectives
    00:08:46 Blurring the Lines: RL Meets Self-Supervised Learning
    00:09:44 From TD Errors to Classification: Why This Objective Scales
    00:11:06 Architecture Details: Building on BRO and SimBa
    00:12:05 Robotics Applications: Goal-Conditioned RL Without Human Supervision
    00:13:15 Efficiency Trade-offs: Depth vs Width and Parameter Scaling
    00:15:48 JAX and GPU-Accelerated Environments: The Data Infrastructure
    00:18:05 World Models and Next State Classification
    00:22:37 Unlocking Batch Size Scaling Through Network Capacity
    00:24:10 Compute Requirements: State-of-the-Art on a Single GPU
    00:21:02 Future Directions: Distillation, VLMs, and Hierarchical Planning
    00:27:15 Closing Thoughts: Challenging Conventional Wisdom in RL Scaling

  • From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents, trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale. We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (October 2023) to the industry standard after Devin's launch (and how Walden emailed him two weeks before the big reveal), how the benchmark evolved from Django-heavy to nine languages across 40 repos (including JavaScript, Rust, Java, C, and Ruby), why unit tests as verification are limiting and long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds), the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-fficiency, AlgoTune, SciCode) and how benchmark authors are now justifying their splits with curation techniques instead of just "more repos," why Tau-bench's "impossible tasks" controversy is actually a feature, not a bug (intentionally including impossible tasks flags cheating), the tension between long autonomy (5-hour runs) vs. interactivity (Cognition's emphasis on fast back-and-forth), how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs, the academic data problem (companies like Cognition and Cursor have rich user interaction data, academics need user simulators or compelling products like LMArena to get similar signal), and his vision for CodeClash as a testbed for human-AI collaboration: freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.

    We discuss:





    John's path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks



    The SWE-bench origin story: released October 2023, mostly ignored until Cognition's Devin launch kicked off the arms race (Walden emailed John two weeks before: "we have a good number")



    SWE-bench Verified: the curated, high-quality split that became the standard for serious evals



    SWE-bench Multimodal and Multilingual: nine languages (including JavaScript, Rust, Java, C, and Ruby) across 40 repos, moving beyond the Django-heavy original distribution



    The SWE-bench Pro controversy: independent authors used the "SWE-bench" name without John's blessing, but he's okay with it ("congrats to them, it's a great benchmark")



    CodeClash: John's new benchmark for long-horizon development—agents maintain their own codebases, edit and improve them each round, then compete in arenas (programming games like Halite, economic tasks like GDP optimization)



    SWE-fficiency (Jeffrey Ma, John's high school classmate): optimize code for speed without changing behavior (parallelization, SIMD operations)



    AlgoTune, SciCode, Terminal-bench, Tau-bench, SecBench, SRE-bench: the Cambrian explosion of code evals, each diving into different domains (security, SRE, science, user simulation)



    The Tau-bench "impossible tasks" debate: some tasks are underspecified or impossible, but John thinks that's actually a feature (flags cheating if you score above 75%)



    Cognition's research focus: codebase understanding (retrieval++), helping humans understand their own codebases, and automatic context engineering for LLMs (research sub-agents)



    The vision: CodeClash as a testbed for human-AI collaboration—vary the setup (solo agent, multi-agent, human+agent), freeze model capability, and measure how interaction changes as models improve
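The "impossible tasks flag cheating" argument above is simple arithmetic: if a known share of tasks cannot be solved honestly, any reported score above the remaining share is suspect. The sketch below is a hypothetical illustration, not Tau-bench tooling; only the 75% ceiling comes from the notes, and the task counts are made up to reproduce it.

```python
def max_honest_score(num_tasks: int, num_impossible: int) -> float:
    """Highest pass rate reachable if every solvable task is solved."""
    if not 0 <= num_impossible <= num_tasks or num_tasks == 0:
        raise ValueError("need 0 <= num_impossible <= num_tasks and num_tasks > 0")
    return (num_tasks - num_impossible) / num_tasks


def looks_suspicious(reported_score: float, num_tasks: int, num_impossible: int) -> bool:
    """Flag a reported pass rate that exceeds the honest ceiling."""
    return reported_score > max_honest_score(num_tasks, num_impossible)


# Hypothetical benchmark: 100 tasks, 25 of them impossible -> 75% ceiling.
print(max_honest_score(100, 25))
print(looks_suspicious(0.80, 100, 25))
```

A score at or below the ceiling proves nothing either way; the check only catches results that could not have been produced without gaming the grader.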



    John Yang





    SWE-bench: https://www.swebench.com



    X: https://x.com/jyangballin



    Chapters

    00:00:00 Introduction: John Yang on SWE-bench and Code Evaluations
    00:00:31 SWE-bench Origins and Devin's Impact on the Coding Agent Arms Race
    00:01:09 SWE-bench Ecosystem: Verified, Pro, Multimodal, and Multilingual Variants
    00:02:17 Moving Beyond Django: Diversifying Code Evaluation Repositories
    00:03:08 CodeClash: Long-Horizon Development Through Programming Tournaments
    00:04:41 From Halite to Economic Value: Designing Competitive Coding Arenas
    00:06:04 Ofir's Lab: SWE-fficiency, AlgoTune, and SciCode for Scientific Computing
    00:07:52 The Benchmark Landscape: TAU-bench, Terminal-bench, and User Simulation
    00:09:20 The Impossible Task Debate: Refusals, Ambiguity, and Benchmark Integrity
    00:12:32 The Future of Code Evals: Long Autonomy vs Human-AI Collaboration
    00:14:37 Call to Action: User Interaction Data and Codebase Understanding Research

  • From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution—from the PPO vs DPO debates of 2023 to today's RLVR era, where the real innovation isn't optimization methods but data quality, signal trust, and token efficiency. We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026: why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math), how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preference you can't), why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens), how Codex has changed his workflow so much he feels "trapped" by 40-minute design sessions followed by 15-minute agent sprints, the infrastructure chaos of scaling RL ("way more moving parts than pre-training"), why long context will keep climbing but agents + graph walks might matter more than 10M-token windows, the shopping model as a test bed for interruptability and chain-of-thought transparency, why personality toggles (Anton vs Clippy) are a real differentiator users care about, and his thesis that the education system isn't producing enough people who can do both distributed systems and ML research—the exact skill set required to push the frontier when the bottleneck moves every few weeks.

    We discuss:





    Josh's path: pre-training data curation → post-training researcher at OpenAI, shipping GPT-4o, o1, o3, GPT-5 thinking, and the shopping model



    Why he switched from pre-training to post-training: "Do I want to make 3% compute efficiency wins, or change behavior by 40%?"



    The RL infrastructure challenge: way more moving parts than pre-training (tasks, grading setups, external partners), and why babysitting runs at 12:30am means jumping into unfamiliar code constantly



    How Codex has changed his workflow: 40-minute design sessions compressed into 15-minute agent sprints, and the strange "trapped" feeling of waiting for the agent to finish



    The RLHF vs RLVR debate: both are policy gradient methods, the real difference is data quality and signal trust (human preference vs. verifiable correctness)



    Why GRPO (from DeepSeek Math) was underappreciated: not just an optimization trick, but a shift toward reward signals you can actually trust (math answers over human vibes)



    The token efficiency revolution: GPT-5 to 5.1 bumped evals and slashed tokens, and why thinking in tokens (not wall-clock time) unlocks better tool-calling and agent workflows



    Personality toggles: Anton (tool, no warmth) vs Clippy (friendly, helpful), and why Josh uses custom instructions to make his model "just a tool"



    The router problem: having a router at the top (GPT-5 thinking vs non-thinking) and an implicit router (thinking effort slider) creates weird bumps, and why the abstractions will eventually merge



    Long context: climbing Graphwalks evals, the dream of 10M+ token windows, and why agents + graph walks might matter more than raw context length



    Why the education system isn't producing enough people who can do both distributed systems and ML research, and why that's the bottleneck for frontier labs



    The 2026 vision: neither pre-training nor post-training is dead, we're in the fog of war, and the bottleneck will keep moving (so emotional stability helps)
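To make the RLHF vs. RLVR and GRPO points above concrete, here is a minimal sketch of a verifiable reward feeding a group-relative advantage, the normalization step GRPO is known for. It is an illustration under assumptions, not OpenAI's or DeepSeek's code: exact string match stands in for a real answer checker, population standard deviation is an arbitrary choice, and all names are hypothetical.

```python
import statistics


def verifiable_reward(answer: str, reference: str) -> float:
    """1.0 if the sampled answer matches the reference exactly, else 0.0.

    Assumption: exact match after stripping whitespace stands in for a
    real math-answer verifier.
    """
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one hypothetical prompt whose reference answer is "42".
samples = ["42", "41", " 42 ", "forty-two"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Swapping `verifiable_reward` for a learned preference score would leave the rest unchanged, which is the sense in which the notes say the difference lies in the input data and not the math.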



    Josh McGrath





    OpenAI: https://openai.com



    X: https://x.com/j_mcgraph



    Chapters

    00:00:00 Introduction: Josh McGrath on Post-Training at OpenAI
    00:04:37 The Shopping Model: Black Friday Launch and Interruptability
    00:07:11 Model Personality and the Anton vs Clippy Divide
    00:08:26 Beyond PPO vs DPO: The Data Quality Spectrum in RL
    00:01:40 Infrastructure Challenges: Why Post-Training RL is Harder Than Pre-Training
    00:13:12 Token Efficiency: The 2D Plot That Matters Most
    00:03:45 Codex Max and the Flow Problem: 40 Minutes of Planning, 15 Minutes of Waiting
    00:17:29 Long Context and Graphwalks: Climbing Toward Perfect Context
    00:21:23 The ML-Systems Hybrid: What's Hard to Hire For
    00:24:50 Pre-Training Isn't Dead: Living Through Technological Revolution

  • From Berkeley robotics and OpenAI's 2017 Dota-era internship to shipping RL breakthroughs on GPT-4o, o1, and o3, and now leading model development at Cursor, Ashvin Nair has done it all. We caught up with Ashvin at NeurIPS 2025 to dig into the inside story of OpenAI's reasoning team (spoiler: it went from a dozen people to 300+), why IOI Gold felt reachable in 2022 but somehow didn't change the world when o1 actually achieved it, how RL doesn't generalize beyond the training distribution (and why that means you need to bring economically useful tasks into distribution by co-designing products and models), the deeper lessons from the RL research era (2017–2022) and why most of it didn't pan out because the community overfitted to benchmarks, how Cursor is uniquely positioned to do continual learning at scale with policy updates every two hours and product-model co-design that keeps engineers in the loop instead of context-switching into ADHD hell, and his bet that the next paradigm shift is continual learning with infinite memory—where models experience something once (a bug, a mistake, a user pattern) and never forget it, storing millions of deployment tokens in weights without overloading capacity.

    We discuss:





    Ashvin's path: Berkeley robotics PhD → OpenAI 2017 intern (Dota era) → o1/o3 reasoning team → Cursor ML lead in three months



    Why robotics people are the most grounded at NeurIPS (they work with the real world) and simulation people are the most unhinged (Lex Fridman's take)



    The IOI Gold paradox: "If you told me we'd achieve IOI Gold in 2022, I'd assume we could all go on vacation—AI solved, no point working anymore. But life is still the same."



    The RL research era (2017–2022) and why most of it didn't pan out: overfitting to benchmarks, too many implicit knobs to tune, and the community rewarding complex ideas over simple ones that generalize



    Inside the o1 origin story: a dozen people, conviction from Ilya and Jakub Pachocki that RL would work, small-scale prototypes producing "surprisingly accurate reasoning traces" on math, and first-principles belief that scaled



    The reasoning team grew from ~12 to 300+ people as o1 became a product and safety, tooling, and deployment scaled up



    Why Cursor is uniquely positioned for continual learning: policy updates every two hours (online RL on tab), product and ML sitting next to each other, and the entire software engineering workflow (code, logs, debugging, DataDog) living in the product



    Composer as the start of product-model co-design: smart enough to use, fast enough to stay in the loop, and built by a 20–25 person ML team with high-taste co-founders who code daily



    The next paradigm shift: continual learning with infinite memory—models that experience something once (a bug, a user mistake) and store it in weights forever, learning from millions of deployment tokens without overloading capacity (trillions of pretraining tokens = plenty of room)



    Why off-policy RL is unstable (Ashvin's favorite interview question) and why Cursor does two-day work trials instead of whiteboard interviews



    The vision: automate software engineering as a process (not just answering prompts), co-design products so the entire workflow (write code, check logs, debug, iterate) is in-distribution for RL, and make models that never make the same mistake twice



    Ashvin Nair





    Cursor: https://cursor.com



    X: https://x.com/ashvinnair_



    Chapters

    00:00:00 Introduction: From Robotics to Cursor via OpenAI
    00:01:58 The Robotics to LLM Agent Transition: Why Code Won
    00:09:11 RL Research Winter and Academic Overfitting
    00:11:45 The Scaling Era and Moving Goalposts: IOI Gold Doesn't Mean AGI
    00:21:30 OpenAI's Reasoning Journey: From Codex to o1
    00:20:03 The Blip: Thanksgiving 2023 and OpenAI Governance
    00:22:39 RL for Reasoning: The o-Series Conviction and Scaling
    00:25:47 o1 to o3: Smooth Internal Progress vs External Hype Cycles
    00:33:07 Why Cursor: Co-Designing Products and Models for Real Work
    00:34:14 Composer and the Future: Online Learning Every Two Hours
    00:35:15 Continual Learning: The Missing Paradigm Shift
    00:44:00 Hiring at Cursor and Why Off-Policy RL is Unstable

  • From investing through the modern data stack era (DBT, Fivetran, and the analytics explosion) to now investing at the frontier of AI infrastructure and applications at Amplify Partners, Sarah Catanzaro has spent years at the intersection of data, compute, and intelligence—watching categories emerge, merge, and occasionally disappoint. We caught up with Sarah live at NeurIPS 2025 to dig into the state of AI startups heading into 2026: why $100M+ seed rounds with no near-term roadmap are now the norm (and why that terrifies her), what the DBT-Fivetran merger really signals about the modern data stack (spoiler: it's not dead, just ready for IPO), how frontier labs are using DBT and Fivetran to manage training data and agent analytics at scale, why data catalogs failed as standalone products but might succeed as metadata services for agents, the consumerization of AI and why personalization (memory, continual learning, K-factor) is the 2026 unlock for retention and growth, why she thinks RL environments are a fad and real-world logs beat synthetic clones every time, and her thesis for the most exciting AI startups: companies that marry hard research problems (RAG, rule-following, continual learning) with killer applications that were simply impossible before.

    We discuss:





    The DBT-Fivetran merger: not the death of the modern data stack, but a path to IPO scale (targeting $600M+ combined revenue) and a signal that both companies were already winning their categories



    How frontier labs use data infrastructure: DBT and Fivetran for training data curation, agent analytics, and managing increasingly complex interactions—plus the rise of transactional databases (RocksDB) and efficient data loading (Vortex) for GPU-bound workloads



    Why data catalogs failed: built for humans when they should have been built for machines, focused on discoverability when the real opportunity was governance, and ultimately subsumed as features inside Snowflake, DBT, and Fivetran



    The $100M+ seed phenomenon: raising massive rounds at billion-dollar valuations with no 6-month roadmap, seven-day decision windows, and founders optimizing for signal ("we're a unicorn") over partnership or dilution discipline



    Why world models are overhyped but underspecified: three competing definitions, unclear generalization across use cases (video games ≠ robotics ≠ autonomous driving), and a research problem masquerading as a product category



    The 2026 theme: consumerization of AI via personalization—memory management, continual learning, and solving retention/churn by making products learn skills, preferences, and adapt as the world changes (not just storing facts in cursor rules)



    Why RL environments are a fad: labs are paying 7–8 figures for synthetic clones when real-world logs, traces, and user activity (à la Cursor) are richer, cheaper, and more generalizable



    Sarah's investment thesis: research-driven applications that solve hard technical problems (RAG for Harvey, rule-following for Sierra, continual learning for the next killer app) and unlock experiences that were impossible before



    Infrastructure bets: memory, continual learning, stateful inference, and the systems challenges of loading/unloading personalized weights at scale



    Why K-factor and growth fundamentals matter again: AI felt magical in 2023–2024, but as the magic fades, retention and virality are back—and most AI founders have never heard of K-factor



    Sarah Catanzaro





    X: https://x.com/sarahcat21



    Amplify Partners: https://amplifypartners.com/

    Where to find Latent Space





    X: https://x.com/latentspacepod



    Substack: https://www.latent.space/



    Chapters

    00:00:00 Introduction: Sarah Catanzaro's Journey from Data to AI
    00:01:02 The DBT-Fivetran Merger: Not the End of the Modern Data Stack
    00:05:26 Data Catalogs and What Went Wrong
    00:08:16 Data Infrastructure at AI Labs: Surprising Insights
    00:10:13 The Crazy Funding Environment of 2024-2025
    00:17:18 World Models: Hype, Confusion, and Market Potential
    00:18:59 Memory Management and Continual Learning: The Next Frontier
    00:23:27 Agent Environments: Just a Fad?
    00:25:48 The Perfect AI Startup: Research Meets Application
    00:28:02 Closing Thoughts and Where to Find Sarah

  • One year ago, Anthropic launched the Model Context Protocol (MCP)—a simple, open standard to connect AI applications to the data and tools they need. Today, MCP has exploded from a local-only experiment into the de facto protocol for agentic systems, adopted by OpenAI, Microsoft, Google, Block, and hundreds of enterprises building internal agents at scale. And now, MCP is joining the newly formed Agentic AI Foundation (AAIF) under the Linux Foundation, alongside Block's Goose coding agent, with founding members spanning the biggest names in AI and cloud infrastructure.

    We sat down with David Soria Parra (MCP lead, Anthropic), Nick Cooper (OpenAI), Brad Howes (Block / Goose), and Jim Zemlin (Linux Foundation CEO) to dig into the one-year journey of MCP—from Thanksgiving hacking sessions and the first remote authentication spec to long-running tasks, MCP Apps, and the rise of agent-to-agent communication—and the behind-the-scenes story of how three competitive AI labs came together to donate their protocols and agents to a neutral foundation, why enterprises are deploying MCP servers faster than anyone expected (most of it invisible, internal, and at massive scale), what it takes to design a protocol that works for both simple tool calls and complex multi-agent orchestration, how the foundation will balance taste-making (curating meaningful projects) with openness (avoiding vendor lock-in), and the 2025 vision: MCP as the communication layer for asynchronous, long-running agents that work while you sleep, discover and install their own tools, and unlock the next order of magnitude in AI productivity.

    We discuss:





    The one-year MCP journey: from local stdio servers to remote HTTP streaming, OAuth 2.1 authentication (and the enterprise lessons learned), long-running tasks, and MCP Apps (iframes for richer UI)



    Why MCP adoption is exploding internally at enterprises: invisible, internal servers connecting agents to Slack, Linear, proprietary data, and compliance-heavy workflows (financial services, healthcare)



    The authentication evolution: separating resource servers from identity providers, dynamic client registration, and why the March spec wasn't enterprise-ready (and how June fixed it)



    How Anthropic dogfoods MCP: internal gateway, custom servers for Slack summaries and employee surveys, and why MCP was born from "how do I scale dev tooling faster than the company grows?"



    Tasks: the new primitive for long-running, asynchronous agent operations—why tools aren't enough, how tasks enable deep research and agent-to-agent handoffs, and the design choice to make tasks a "container" (not just async tools)



    MCP Apps: why iframes, how to handle styles and branding, seat selection and shopping UIs as the killer use case, and the collaboration with OpenAI to build a common standard



    The registry problem: official registry vs. curated sub-registries (Smithery, GitHub), trust levels, model-driven discovery, and why MCP needs "npm for agents" (but with signatures and HIPAA/financial compliance)



    The founding story of AAIF: how Anthropic, OpenAI, and Block came together (spoiler: they didn't know each other were talking to Linux Foundation), why neutrality matters, and how Jim Zemlin has never seen this much day-one inbound interest in 22 years
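For readers who have not seen MCP on the wire, the sketch below builds a single tool-call request. It assumes the JSON-RPC 2.0 framing and the `tools/call` method from the MCP specification; the tool name `summarize_channel` and its arguments are hypothetical, and a real client would first perform the initialization handshake, which is omitted here.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize one MCP tool invocation as a JSON-RPC 2.0 request."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so the result can be
    # written as a single line to a local stdio server.
    return json.dumps(message)


# Hypothetical tool and arguments, for illustration only.
wire = make_tool_call(1, "summarize_channel", {"channel": "#eng", "days": 7})
print(wire)
```

The same message shape travels over either transport mentioned above; only the delivery changes between a local stdio pipe and remote HTTP streaming.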



    David Soria Parra (Anthropic / MCP)





    MCP: https://modelcontextprotocol.io



    LinkedIn: https://uk.linkedin.com/in/david-soria-parra-4a78b3a



    X: https://x.com/dsp_

    Nick Cooper (OpenAI)





    X: https://x.com/nicoaicopr

    Brad Howes (Block / Goose)





    Goose: https://github.com/block/goose

    Jim Zemlin (Linux Foundation)





    LinkedIn: https://www.linkedin.com/in/zemlin/

    Agentic AI Foundation





    https://agenticai.foundation



    Chapters

    00:00:00 Introduction: MCP's First Year and Foundation Launch
    00:01:17 MCP's Journey: From Launch to Industry Standard
    00:02:06 Protocol Evolution: Remote Servers and Authentication
    00:08:52 Enterprise Authentication and Financial Services
    00:11:42 Transport Layer Challenges: HTTP Streaming and Scalability
    00:15:37 Standards Development: Collaboration with Tech Giants
    00:34:27 Long-Running Tasks: The Future of Async Agents
    00:30:41 Discovery and Registries: Building the MCP Ecosystem
    00:30:54 MCP Apps and UI: Beyond Text Interfaces
    00:26:55 Internal Adoption: How Anthropic Uses MCP
    00:23:15 Skills vs MCP: Complementary Not Competing
    00:36:16 Community Events and Enterprise Learnings
    01:03:31 Foundation Formation: Why Now and Why Together
    01:07:38 Linux Foundation Partnership: Structure and Governance
    01:11:13 Goose as Reference Implementation
    01:17:28 Principles Over Roadmaps: Composability and Quality
    01:21:02 Foundation Value Proposition: Why Contribute
    01:27:49 Practical Investments: Events, Tools, and Community
    01:34:58 Looking Ahead: Async Agents and Real Impact

  • Note: Steve and Gene’s talk on Vibe Coding and the post-IDE world was one of the top talks of AIE CODE: https://www.youtube.com/watch?v=7Dtu2bilcFs&t=1019s&pp=0gcJCU0KAYcqIYzv



    From building legendary platforms at Google and Amazon to authoring one of the most influential essays on AI-powered development (Revenge of the Junior Developer, quoted by Dario Amodei himself), Steve Yegge has spent decades at the frontier of software engineering, and now he's leading the charge into what he calls the "factory farming" era of code. After stints at Sourcegraph and building Beads (a purely vibe-coded issue tracker with tens of thousands of users), Steve co-authored The Vibe Coding Book and is now building VC (VibeCoder), an agent orchestration dashboard designed to move developers from writing code to managing fleets of AI agents that coordinate, parallelize, and ship features while you sleep.

    We sat down with Steve at AI Engineer Summit to dig into why Claude Code, Cursor, and the entire 2024 stack are already obsolete, what it actually takes to trust an agent after 2,000 hours of practice (hint: they will delete your production database if you anthropomorphize them), why the real skill is no longer writing code but orchestrating agents like a NASCAR pit crew, how merging has become the new wall that every 10x-productive team is hitting (and why one company's solution is literally "one engineer per repo"), the rise of multi-agent workflows where agents reserve files, message each other via MCP, and coordinate like a little village, why Steve believes if you're still using an IDE to write code by January 1st, you're a bad engineer, how the 12–15 year experience bracket is the most resistant demographic (and why their identity is tied to obsolete workflows), the hidden chaos inside OpenAI, Anthropic, and Google as they scale at breakneck speed, why rewriting from scratch is now faster than refactoring for a growing class of codebases, and his 2025 prediction: we're moving from subsistence agriculture to John Deere-scale factory farming of code, and the Luddite backlash is only just beginning.

    We discuss:





    Why Claude Code, Cursor, and agentic coding tools are already last year's tech—and what comes next: agent orchestration dashboards where you manage fleets, not write lines



    The 2,000-hour rule: why it takes a full year of daily use before you can predict what an LLM will do, and why trust = predictability, not capability



    Steve's hot take: if you're still using an IDE to develop code by January 1st, 2025, you're a bad engineer—because the abstraction layer has moved from models to full-stack agents



    The demographic most resistant to vibe coding: 12–15 years of experience, senior engineers whose identity is tied to the way they work today, and why they're about to become the interns



    Why anthropomorphizing LLMs is the biggest mistake: the "hot hand" fallacy, agent amnesia, and how Steve's agent once locked him out of prod by changing his password to "fix" a problem



    Should kids learn to code? Steve's take: learn to vibe code—understand functions, classes, architecture, and capabilities in a language-neutral way, but skip the syntax



    The 2025 vision: "factory farming of code" where orchestrators run Claude Code, scrub output, plan-implement-review-test in loops, and unlock programming for non-programmers at scale



    Steve Yegge





    X: https://x.com/steve_yegge



    Medium (Stevie's Tech Talks): https://steve-yegge.medium.com/



    GitHub (VC / VibeCoder): https://github.com/yegge-labs

    Where to find Latent Space





    X: https://x.com/latentspacepod



    Substack: https://www.latent.space/



    Chapters

    00:00:00 Introduction: Steve Yegge on Vibe Coding and AI Engineering
    00:00:59 The Backlash: Who Resists Vibe Coding and Why
    00:04:26 The 2000 Hour Rule: Building Trust with AI Coding Tools
    00:03:31 The January 1st Deadline: IDEs Are Becoming Obsolete
    00:02:55 10X Productivity at OpenAI: The Performance Review Problem
    00:07:49 The Hot Hand Fallacy: When AI Agents Betray Your Trust
    00:11:12 Claude Code Isn't It: The Need for Agent Orchestration
    00:15:20 The Orchestrator Revolution: From Claude Code to Agent Villages
    00:18:46 The Merge Wall: The Biggest Unsolved Problem in AI Coding
    00:26:33 Never Rewrite Your Code - Until Now: Joel Spolsky Was Wrong
    00:22:43 Factory Farming Code: The John Deere Era of Software
    00:29:27 Google's Gemini Turnaround and the AI Lab Chaos
    00:33:20 Should Your Kids Learn to Code? The New Answer
    00:34:59 Code MCP and the Gossip Rate: Latest Vibe Coding Discoveries

  • From the frontlines of OpenAI's Codex and GPT-5 training teams, Bryan and Bill are building the future of AI-powered coding—where agents don't just autocomplete, they architect, refactor, and ship entire features while you sleep. We caught up with them at AI Engineer Conference right after the launch of Codex Max, OpenAI's newest long-running coding agent designed to work for 24+ hours straight, manage its own context, and spawn sub-agents to parallelize work across your entire codebase.

    We sat down with Bryan and Bill to dig into what it actually takes to train a model that developers trust—why personality, communication, and planning matter as much as raw capability, how Codex is trained with strong opinions about tools (it loves rg over grep, seriously), why the abstraction layer is moving from models to full-stack agents you can plug into VS Code or Zed, how OpenAI partners co-develop tool integrations and discover unexpected model habits (like renaming tools to match Codex's internal training), the rise of applied evals that measure real-world impact instead of academic benchmarks, why multi-turn evals are the next frontier (and Bryan's "job interview eval" idea), how coding agents are breaking out of code into personal automation, terminal workflows, and computer use, and their 2026 vision: coding agents trusted enough to handle the hardest refactors at any company, not just top-tier firms, and general enough to build integrations, organize your desktop, and unlock capabilities you'd never get access to otherwise.

    We discuss:





    What Codex Max is: a long-running coding agent that can work 24+ hours, manage its own context window, and spawn sub-agents for parallel work



    Why the name "Max": maximalist, maximization, speed and endurance—it's simply better and faster for the same problems



    Training for personality: communication, planning, context gathering, and checking your work as behavioral characteristics, not just capabilities



    How Codex develops habits like preferring rg over grep, and why renaming tools to match its training (e.g., terminal-style naming) dramatically improves tool-call performance



    The split between Codex (opinionated, agent-focused, optimized for the Codex harness) and GPT-5 (general, more durable across different tools and modalities)



    Why the abstraction layer is moving up: from prompting models to plugging in full agents (Codex, GitHub Copilot, Zed) that package the entire stack



    The rise of sub-agents and agents-using-agents: Codex Max spawning its own instances, handing off context, and parallelizing work across a codebase



    How OpenAI works with coding partners on the bleeding edge to co-develop tool integrations and discover what the model is actually good at



    The shift to applied evals: capturing real-world use cases instead of academic benchmarks, and why ~50% of OpenAI employees now use Codex daily



    Why multi-turn evals are the next frontier: LLM-as-a-judge for entire trajectories, Bryan's "job interview eval" concept, and the need for a batch multi-turn eval API



    How coding agents are breaking out of code: personal automation, organizing desktops, terminal workflows, and "Devin for non-coding" use cases



    Why Slack is the ultimate UI for work, and how coding agents can become your personal automation layer for email, files, and everything in between



    The 2026 vision: more computer use, more trust, and coding agents capable enough that any company can access top-tier developer capabilities, not just elite firms



    Bryan & Bill (OpenAI Codex Team)





    X: http://x.com/bfioca



    X: https://x.com/realchillben



    OpenAI Codex: https://openai.com/index/openai-codex/

    Chapters

    00:00:00 Introduction: Latent Space Listeners at AI Engineer Code
    00:01:27 Codex Max Launch: Training for Long-Running Coding Agents
    00:03:01 Model Personality and Trust: Communication, Planning, and Self-Checking
    00:05:20 Codex vs GPT-5: Opinionated Agents vs General Models
    00:07:47 Tool Use and Model Habits: The Ripgrep Discovery
    00:09:16 Personality Design: Verbosity vs Efficiency in Coding Agents
    00:11:56 The Agent Abstraction Layer: Building on Top of Codex
    00:14:08 Sub-Agents and Multi-Agent Patterns: The Future of Composition
    00:16:11 Trust and Adoption: OpenAI Developers Using Codex Daily
    00:17:21 Applied Evals: Real-World Testing vs Academic Benchmarks
    00:19:15 Multi-Turn Evals and the Job Interview Pattern
    00:21:35 Feature Request: Batch Multi-Turn Eval API
    00:22:28 Beyond Code: Personal Automation and Computer Use
    00:24:51 Vision-Native Agents and the UI Integration Challenge
    00:25:02 2026 Predictions: Trust, Computer Use, and Democratized Excellence

  • As with all demo-heavy and especially vision AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!)

    From SAM 1's 11-million-image data engine to SAM 2's memory-based video tracking, MSL’s Segment Anything project has redefined what's possible in computer vision. Now SAM 3 takes the next leap: concept segmentation—prompting with natural language like "yellow school bus" or "tablecloth" to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio (https://x.com/aiatmeta/status/2000980784425931067?s=46), SAM can now even segment audio output!

    We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30ms on images and scales to real-time video on multi-GPU setups. We dig into the data engine that automated exhaustive annotation from two minutes per image down to 25 seconds using AI verifiers fine-tuned on Llama, the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts vs. the previous 1.2k, how SAM 3 separates recognition from localization with a presence token, why decoupling the detector and tracker was critical to preserve object identity in video, how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini, and the real-world impact: 106 million smart polygons created on Roboflow saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.

    We discuss:





    What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like "purple umbrella" or "watering can"



    How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly



    Real-time performance: 30ms per image with 100 detected objects on an H200; for video, real-time tracking of 10 objects on 2×H200, 28 on 4×H200, and 64 on 8×H200, using parallel inference and "fast mode" tracking



    The SACO benchmark: 200,000+ unique concepts vs. 1.2k in prior benchmarks, designed to capture the diversity of natural language and reach human-level exhaustivity



    The data engine: from 2 minutes per image (all-human) to 45 seconds (model-in-loop proposals) to 25 seconds (AI verifiers for mask quality and exhaustivity checks), fine-tuned on Llama 3.2



    Why exhaustivity is central: every instance must be found, verified by AI annotators, and manually corrected only when the model misses—automating the hardest part of segmentation at scale



    Architecture innovations: presence token to separate recognition ("is it in the image?") from localization ("where is it?"), decoupled detector and tracker to preserve identity-agnostic detection vs. identity-preserving tracking



    Building on Meta's ecosystem: Perception Encoder, DINO v2 detector, Llama for data annotation, and SAM 2's memory-based tracking backbone



    SAM 3 Agents: using SAM 3 as a visual tool for multimodal LLMs (Gemini, Llama) to solve complex visual reasoning tasks like "find the bigger character" or "what distinguishes male from female in this image"



    Fine-tuning with as few as 10 examples: domain adaptation for specialized use cases (Waymo vehicles, medical imaging, OCR-heavy scenes) and the outsized impact of negative examples



    Real-world impact at Roboflow: 106M smart polygons created, saving 130+ years of labeling time across cancer research, underwater trash cleanup, autonomous drones, industrial automation, and more



    MSL FAIR team





    Nikhila: https://www.linkedin.com/in/nikhilaravi/



    Pengchuan: https://pzzhang.github.io/pzzhang/

    Joseph Nelson





    X: https://x.com/josephofiowa



    LinkedIn: https://www.linkedin.com/in/josephofiowa/




  • Note: this is Pliny and John’s first major podcast. Voices have been changed for opsec.

    From jailbreaking every frontier model and turning down Anthropic's Constitutional AI challenge to leading BT6, a 28-operator white-hat hacker collective obsessed with radical transparency and open-source AI security, Pliny the Liberator and John V are redefining what AI red-teaming looks like when you refuse to lobotomize models in the name of "safety."

    Pliny built his reputation crafting universal jailbreaks—skeleton keys that obliterate guardrails across modalities—and open-sourcing prompt templates like Libertas, predictive reasoning cascades, and the infamous "Pliny divider" that's now embedded so deep in model weights it shows up unbidden in WhatsApp messages. John V, coming from prompt engineering and computer vision, co-founded the Bossy Discord (40,000 members strong) and helps steer BT6's ethos: if you can't open-source the data, we're not interested. Together they've turned down enterprise gigs, pushed back on Anthropic's closed bounties, and insisted that real AI security happens at the system layer—not by bubble-wrapping latent space.

    We sat down with Pliny and John to dig into the mechanics of hard vs. soft jailbreaks, why multi-turn crescendo attacks were obvious to hackers years before academia "discovered" them, how segmented sub-agents let one jailbroken orchestrator weaponize Claude for real-world attacks (exactly as Pliny predicted 11 months before Anthropic's recent disclosure), why guardrails are security theater that punishes capability while doing nothing for real safety, the role of intuition and "bonding" with models to navigate latent space, how BT6 vets operators on skill and integrity, why they believe Mech Interp and open-source data are the path forward (not RLHF lobotomization), and their vision for a future where spatial intelligence, swarm robotics, and AGI alignment research happen in the open—bootstrapped, grassroots, and uncompromising.

    We discuss:





    What universal jailbreaks are: skeleton-key prompts that obliterate guardrails across models and modalities, and why they're central to Pliny's mission of "liberation"



    Hard vs. soft jailbreaks: single-input templates vs. multi-turn crescendo attacks, and why the latter were obvious to hackers long before academic papers



    The Libertas repo: predictive reasoning, the Library of Babel analogy, quotient dividers, weight-space seeds, and how introducing "steered chaos" pulls models out-of-distribution



    Why jailbreaking is 99% intuition and bonding with the model: probing token layers, syntax hacks, multilingual pivots, and forming a relationship to navigate latent space



    The Anthropic Constitutional AI challenge drama: UI bugs, judge failures, goalpost moving, the demand for open-source data, and why Pliny sat out the $30k bounty



    Why guardrails ≠ safety: security theater, the futility of locking down latent space when open-source is right behind, and why real safety work happens in meatspace (not RLHF)



    The weaponization of Claude: how segmented sub-agents let one jailbroken orchestrator execute malicious tasks (pyramid-builder analogy), and why Pliny predicted this exact TTP 11 months before Anthropic's disclosure



    BT6 hacker collective: 28 operators across two cohorts, vetted on skill and integrity, radical transparency, radical open-source, and the magic of moving the needle on AI security, swarm intelligence, blockchain, and robotics



    Pliny the Liberator





    X: https://x.com/elder_plinius



    GitHub (Libertas): https://github.com/elder-plinius/L1B3RT45

    John V





    X: https://x.com/JohnVersus

    BT6 & Bossy





    BT6: https://bt6.gg



    Bossy Discord: Search "Bossy Discord" or ask Pliny/John V on X



    Chapters

    00:00:00 Introduction: Meet Pliny the Liberator and John V
    00:01:50 The Philosophy of AI Liberation and Jailbreaking
    00:03:08 Universal Jailbreaks: Skeleton Keys to AI Models
    00:04:24 The Cat-and-Mouse Game: Attackers vs Defenders
    00:05:42 Security Theater vs Real Safety: The Fundamental Disconnect
    00:08:51 Inside the Libertas Repo: Prompt Engineering as Art
    00:16:22 The Anthropic Challenge Drama: UI Bugs and Open Source Data
    00:23:30 From Jailbreaks to Weaponization: AI-Orchestrated Attacks
    00:26:55 The BT6 Hacker Collective and BASI Community
    00:34:46 AI Red Teaming: Full Stack Security Beyond the Model
    00:38:06 Safety vs Security: Meat Space Solutions and Final Thoughts

  • Glean started as a Kleiner Perkins incubation and is now a $7B, $200m ARR Enterprise AI leader. Now KP has tapped its own podcaster to lead its next big swing.

    From building go-to-market the hard way in startups (and scaling Palo Alto Networks’ public cloud business) to joining Kleiner Perkins to help technical founders turn product edge into repeatable revenue, Joubin Mirzadegan has spent the last decade obsessing over one thing: distribution and how ideas actually spread, sell, and compound. That obsession took him from launching the CRO-only podcast Grit (https://www.youtube.com/playlist?list=PLRiWZFltuYPF8A6UGm74K2q29UwU-Kk9k) as a hiring wedge, to working alongside breakout companies like Glean and Windsurf, to now incubating Roadrunner, an AI-native rethink of CPQ and quoting workflows as pricing models collapse from “seats” into consumption, bundles, renewals, and SKU sprawl.

    We sat down with Joubin to dig into the real mechanics of making conversations feel human (rolling early, never sending questions, temperature + lighting hacks), what Windsurf got right about “Google-class product and Salesforce-class distribution,” how to hire early sales leaders without getting fooled by shiny logos, why CPQ is quietly breaking the back of modern revenue teams, and his thesis for his new company and KP incubation Roadrunner (https://www.roadrunner.ai/): rebuild the data model from the ground up, co-develop with the hairiest design partners, and eventually use LLMs to recommend deal structures the way the best reps do without the Slack-channel chaos of deal desk.

    We discuss:





    How to make guests instantly comfortable: rolling early, no “are you ready?”, temperature, lighting, and room dynamics



    Why Joubin refuses to send questions in advance (and when you might have to anyway)



    The origin of the CRO-only podcast: using media as a hiring wedge and relationship engine



    The “commit to 100 episodes” mindset: why most shows die before they find their voice



    Founder vs exec interviews: why CEOs can speak more freely (and what it unlocks in conversation)



    What Glean taught him about enterprise AI: permissions, trust, and overcoming “category is dead” skepticism



    Design partners as the real unlock: why early believers matter and how co-development actually works



    Windsurf’s breakout: what it means to be serious about “Google-class product + Salesforce-class distribution”



    Why technical founders struggle with GTM and how KP built a team around sales, customer access, and demand gen



    Hiring early sales leaders: anti-patterns (logos), what to screen for (motivation), and why stage-fit is everything



    The CPQ problem & Roadrunner’s thesis: rebuilding CPQ/quoting from the data model up for modern complexity



    How “rules + SKUs + approvals” create a brittle graph and what it takes to model it without tipping over



    The two-year window: incumbents rebuilding slowly vs startups out-sprinting with AI-native architecture



    Where AI actually helps: quote generation, policy enforcement, approval routing, and deal recommendation loops



    Joubin





    X: https://x.com/Joubinmir



    LinkedIn: https://www.linkedin.com/in/joubin-mirzadegan-66186854/



    Chapters

    00:00:00 Introduction and the Zuck Interview Experience
    00:03:26 The Genesis of the Grit Podcast: Hiring CROs Through Content
    00:13:20 Podcast Philosophy: Creating Authentic Conversations
    00:15:44 Working with Arvind at Glean: The Enterprise Search Breakthrough
    00:26:20 Windsurf's Sales Machine: Google-Class Product Meets Salesforce-Class Distribution
    00:30:28 Hiring Sales Leaders: Anti-Patterns and First Principles
    00:39:02 The CPQ Problem: Why Salesforce and Legacy Tools Are Breaking
    00:43:40 Introducing Roadrunner: Solving Enterprise Pricing with AI
    00:49:19 Building Roadrunner: Team, Design Partners, and Data Model Challenges
    00:59:35 High Performance Philosophy: Working Out Every Day and Reducing Friction
    01:06:28 Defining Grit: Passion Plus Perseverance

  • From applied cryptography and offensive security in France’s defense industry to optimizing nuclear submarine workflows, then selling his e-signature startup to Docusign (https://www.docusign.com/company/news-center/opentrust-joins-docusign-global-trust-network) and now running AI as CTO of Superhuman Mail (Superhuman was recently acquired by Grammarly: https://techcrunch.com/2025/07/01/grammarly-acquires-ai-email-client-superhuman/), Loïc Houssier has lived the full arc from deep infra and compliance hell to obsessing over 100ms product experiences and AI-native email. We sat down with Loïc to dig into how you actually put AI into an inbox without adding latency, why Superhuman leans so hard into agentic search and “Ask AI” over your entire email history, how they design tools vs. agents and fight agent laziness, what box-priced inference and local-first caching mean for cost and reliability, and his bet that your inbox will power your future AI EA while AI massively widens the gap between engineers with real fundamentals and those faking it.



    We discuss:





    Loïc’s path from applied cryptography and offensive security in France’s defense industry to submarines, e-signatures, Docusign, and now Superhuman Mail





    What 3,000+ engineers actually do at a “simple” product like Docusign: regional compliance, on-prem appliances, and why global scale explodes complexity



    How Superhuman thinks about AI in email: auto-labels, smart summaries, follow-up nudges, “Ask AI” search, and the rule that AI must never add latency or friction



    Superhuman’s agentic framework: tools vs. agents, fighting “agent laziness,” deep semantic search over huge inboxes, and pagination strategies to find the real needle in the haystack



    How they evaluate OpenAI, Anthropic, Gemini, and open models: canonical queries, end-to-end evals, date reasoning, and Rahul’s infamous “what wood was my table?” test



    Infra and cost philosophy: local-first caching, vector search backends, Baseten “box” pricing vs. per-token pricing, and thinking in price-per-trillion-tokens instead of price-per-million



    The vision of Superhuman as your AI EA: auto-drafting replies in your voice, scheduling on your behalf, and using your inbox as the ultimate private data source



    How the Grammarly + Coda + Superhuman stack could power truly context-aware assistance across email, docs, calendars, contracts, and more



    Inside Superhuman’s AI-dev culture: free-for-all tool adoption, tracking AI usage on PRs, and going from ~4 to ~6 PRs per engineer per week



    Why Loïc believes everyone should still learn to code, and how AI will amplify great engineers with strong fundamentals while exposing shallow ones even faster



    Loïc Houssier





    LinkedIn: https://www.linkedin.com/in/houssier/

    Chapters

    00:00:00 Introduction and Loïc's Journey from Nuclear Submarines to Superhuman
    00:06:40 Docusign Acquisition and the Enterprise Email Stack
    00:10:26 Superhuman's AI Vision: Your Inbox as the Real AI Agent
    00:13:20 Ask AI: Agentic Search and the Quality Problem
    00:18:20 Infrastructure Choices: Model Selection, Baseten, and Cost Management
    00:27:30 Local-First Architecture and the Database Stack
    00:30:50 Evals, Quality, and the Rahul Wood Table Test
    00:42:30 The Future EA: Auto-Drafting and Proactive Assistance
    00:46:40 Grammarly Acquisition and the Contextual Advantage
    00:38:40 Voice, Video, and the End of Writing
    00:51:40 Knowledge Graphs: The Hard Problem Nobody Has Solved
    00:56:40 Competing with OpenAI and the Browser Question
    01:02:30 AI Coding Tools: From 4 to 6 PRs Per Week
    01:08:00 Engineering Culture, Hiring, and the Future of Software Development

  • From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-500-million-startup-videogame-data) and raising a $134M seed from Khosla (https://techcrunch.com/2025/10/16/general-intuition-lands-134m-seed-to-teach-agents-spatial-reasoning-using-video-game-clips/) to spin out General Intuition, Pim is betting that world models trained on peak human gameplay are the next frontier after LLMs.

    We sat down with Pim to dig into why game highlights are “episodic memory for simulation” (and how Medal’s privacy-first action labels became a world-model goldmine https://medal.tv/blog/posts/enabling-state-of-the-art-security-and-protections-on-medals-new-apm-and-controller-overlay-features), what it takes to build fully vision-based agents that just see frames and output actions in real time, how General Intuition transfers from games to real-world video and then into robotics, why world models and LLMs are complementary rather than rivals, what founders with proprietary datasets should know before selling or licensing to labs, and his bet that spatial-temporal foundation models will power 80% of future atoms-to-atoms interactions in both simulation and the real world.

    We discuss:





    How Medal’s 3.8B action-labeled highlight clips became a privacy-preserving goldmine for world models



    Building fully vision-based agents that only see frames and output actions yet play like (and sometimes better than) humans



    Transferring from arcade-style games to realistic games to real-world video using the same perception–action recipe



    Why world models need actions, memory, and partial observability (smoke, occlusion, camera shake) vs. “just” pretty video generation



    Distilling giant policies into tiny real-time models that still navigate, hide, and peek corners like real players



    Pim’s path from RuneScape private servers, Tourette’s, and reverse engineering to leading a frontier world-model lab



    How data-rich founders should think about valuing their datasets, negotiating with big labs, and deciding when to go independent



    GI’s first customers: replacing brittle behavior trees in games, engines, and controller-based robots with a “frames in, actions out” API



    Using Medal clips as “episodic memory of simulation” to move from imitation learning to RL via world models and negative events



    The 2030 vision: spatial–temporal foundation models that power the majority of atoms-to-atoms interactions in simulation and the real world



    Pim





    X: https://x.com/PimDeWitte



    LinkedIn: https://www.linkedin.com/in/pimdw/



    Chapters

    00:00:00 Introduction and Medal's Gaming Data Advantage
    00:02:08 Exclusive Demo: Vision-Based Gaming Agents
    00:06:17 Action Prediction and Real-World Video Transfer
    00:08:41 World Models: Interactive Video Generation
    00:13:42 From Runescape to AI: Pim's Founder Journey
    00:16:45 The Research Foundations: Diamond, Genie, and SIMA
    00:33:03 Vinod Khosla's Largest Seed Bet Since OpenAI
    00:35:04 Data Moats and Why GI Stayed Independent
    00:38:42 Self-Teaching AI Fundamentals: The Francois Fleuret Course
    00:40:28 Defining World Models vs Video Generation
    00:41:52 Why Simulation Complexity Favors World Models
    00:43:30 World Labs, Yann LeCun, and the Spatial Intelligence Race
    00:50:08 Business Model: APIs, Agents, and Game Developer Partnerships
    00:58:57 From Imitation Learning to RL: Making Clips Playable
    01:00:15 Open Research, Academic Partnerships, and Hiring
    01:02:09 2030 Vision: 80 Percent of Atoms-to-Atoms AI Interactions

  • Fei-Fei Li and Justin Johnson are cofounders of World Labs, which recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D.

    We discuss:





    The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone.



    What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets.



    Fei-Fei’s essay (https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence) on spatial intelligence as a distinct form of intelligence from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in.



    Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning.



    The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem.



    Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters.



    Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots.



    How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn’t to throw away LLMs but to complement them with rich, embodied models of the world.



    Fei-Fei and Justin’s long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making.



    Fei-Fei Li





    X: https://x.com/drfeifei



    LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247

    Justin Johnson





    X: https://x.com/jcjohnss



    LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664



    Where to find Latent Space





    X: https://x.com/latentspacepod



    Substack: https://www.latent.space/



    Chapters

    00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership
    00:02:00 From ImageNet to World Models: The Evolution of Computer Vision
    00:12:42 Dense Captioning and Early Vision-Language Work
    00:19:57 Spatial Intelligence: Beyond Language Models
    00:28:46 Introducing Marble: World Labs' First Spatial Intelligence Model
    00:33:21 Gaussian Splats and the Technical Architecture of Marble
    00:22:10 Physics, Dynamics, and the Future of World Models
    00:41:09 Multimodality and the Interplay of Language and Space
    00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI
    00:56:58 Hiring, Research Directions, and the Future of World Labs

  • Alex Lieberman and Arman Hezarkani, co-founders of Tenex, reveal how they're revolutionizing software consulting by compensating AI engineers for output rather than hours—enabling some engineers to earn over $1 million annually while delivering 10x productivity gains. Their company represents a fundamental rethinking of knowledge work compensation in the age of AI agents, where traditional hourly billing models perversely incentivize slower work even as AI tools enable unprecedented speed.

    The Genesis: From 90% Downsizing to 10x Output

    The story behind 10X begins with Arman's previous company, Parthian, where he was forced to downsize his engineering team by 90%. Rather than collapse, Arman re-architected the entire product and engineering process to be AI-first and discovered that production-ready software output increased 10x despite the massive headcount reduction. This counterintuitive result exposed a fundamental misalignment: engineers compensated by the hour are disincentivized from leveraging AI to work faster, even when the technology enables dramatic productivity gains. Alex, who had invested in Parthian, initially didn't believe the numbers until Arman walked him through why LLMs have made such a profound impact specifically on engineering as knowledge work.

    The Economic Model: Story Points Over Hours

    10X's core innovation is compensating engineers based on story points (units of completed, quality output) rather than hours worked. This creates direct economic incentives for engineers to adopt every new AI tool, optimize their workflows, and maximize throughput. The company expects multiple engineers to earn over $1 million in cash compensation next year purely from story point earnings. To prevent gaming the system, they hire for two profiles: engineers who are "long-term selfish" (understanding that inflating story points will destroy client relationships) and those who genuinely love writing code and working with smart people. They also employ technical strategists incentivized on client retention (NRR) who serve as the final quality gate before any engineering plan reaches a client.

    Impressive Builds: From Retail AI to App Store Hits

    The results speak for themselves. In one project, 10X built a computer vision system for retail cameras that provides heat maps, queue detection, shelf stocking analysis, and theft detection, creating early prototypes in just two weeks for work that previously took quarters. They built Snapback Sports' mobile trivia app in one month, which hit 20th globally on the App Store. In a sales context, an engineer spent four hours building a working prototype of a fitness influencer's AI health coach app after the prospect initially said no, immediately moving 10X to the top of their vendor list. These examples demonstrate how AI-enabled speed fundamentally changes sales motions and product development timelines.

    The Interview Process: Unreasonably Difficult Take-Homes

    Despite concerns that AI would make take-home assessments obsolete, 10X still uses them but makes them "unreasonably difficult." About 50% of candidates don't even respond, but those who complete the challenge demonstrate the caliber needed. The interview process is remarkably short: two calls before the take-home, review, then one or two final meetings, completable in as little as a week. A signature question: "If you had infinite resources to build an AI that could replace either of us on this call, what would be the first major bottleneck?" The sophisticated answer isn't just "model intelligence" or "context length"; it's controlling entropy, the accumulating error rate that derails autonomous agents over time.

    The Limiting Factor: Human Capital, Not Technology

    Despite being an AI-first company, 10X's primary constraint is human capital: finding and hiring enough exceptional engineers fast enough, then matching them with the right processes to maintain delivery quality as they scale. The company has ambitions beyond consulting to build their own technology, but for the foreseeable future, recruiting remains the bottleneck. This reveals an important insight about the AI era: even as technology enables unprecedented leverage, the constraint shifts to finding people who can harness that leverage effectively.

    Chapters

    00:00:00 Introduction and Meeting the 10X Co-founders
    00:01:29 The 10X Moment: From Hourly Billing to Output-Based Compensation
    00:04:44 The Economic Model Behind 10X
    00:05:42 Story Points and Measuring Engineering Output
    00:08:41 Impressive Client Projects and Rapid Prototyping
    00:12:22 The 10X Tech Stack: TypeScript and High Structure
    00:13:21 AI Coding Tools: The Daily Evolution
    00:15:05 Human Capital as the Limiting Factor
    00:16:02 The Unreasonably Difficult Interview Process
    00:17:14 Entropy and Context Engineering: The Future of AI Agents
    00:23:28 The MCP Debate and AI Industry Sociology
    00:26:01 Consulting, Digital Transformation, and Conference Insights

  • Deedy Das, Partner at Menlo Ventures, returns to Latent Space to discuss his journey from Glean to venture capital, the explosive rise of Anthropic, and how AI is reshaping enterprise software and coding. From investing in Anthropic early on when they had no revenue to managing the $100M Anthology Fund, Das shares insider perspectives on the fastest-growing software company in history and what's next for AI infrastructure, research investing, and the future of engineering.

    We cover Glean’s rise from “boring” enterprise search to a $7B AI-native company, Anthropic's meteoric rise, the strategic decisions behind products like Claude Code, and why market share in enterprise AI is shifting dramatically. Das explains his investment thesis on research companies like Goodfire, Prime Intellect, and OpenRouter and how the Anthology Fund is quietly seeding the next wave of AI infra, research, and devtools.



    Chapters

    00:00:00 Introduction and Deedy's Return to Latent Space
    00:01:20 Glean's Journey: From Boring Enterprise Search to $7B Valuation
    00:15:37 Anthropic's Meteoric Rise and Market Share Dynamics
    00:17:50 Claude Artifacts and Product Innovation
    00:41:20 The Anthology Fund: Investing in the Anthropic Ecosystem
    00:48:01 Goodfire and Mechanistic Interpretability
    00:51:25 Prime Intellect and Distributed AI Training
    00:53:40 OpenRouter: Building the AI Model Gateway
    01:13:36 The Stargate Project and Infrastructure Arms Race
    01:18:14 The Future of Software Engineering and AI Coding

  • Jared Palmer, SVP at GitHub and VP of CoreAI at Microsoft, joins Latent Space for an in-depth look at the evolution of coding agents and modern developer tools. Having recently joined after leading AI initiatives at Vercel, Palmer shares firsthand insights from behind the scenes at GitHub Universe, including the launch of Agent HQ, a new collaboration hub for coding agents and developers.

    This episode traces Palmer’s journey from building Copilot-inspired tools to pioneering the focused Next.js coding agent, v0, and explores how platform constraints fostered rapid experimentation and a breakout success in AI-powered frontend development. Palmer explains the unique advantages of GitHub’s massive developer network, the challenges of scaling agent-based workflows, and why seamlessly integrating AI into developer experiences is now a top priority for both Microsoft and GitHub.

  • Jed Borovik, Product Lead at Google Labs, joins Latent Space to unpack how Google is building the future of AI-powered software development with Jules. From his journey discovering GenAI through Stable Diffusion to leading one of the most ambitious coding agent projects in tech, Borovik shares behind-the-scenes insights into how Google Labs operates at the intersection of DeepMind's model development and product innovation.

    We explore Jules' approach to autonomous coding agents and why they run on their own infrastructure, how Google simplified their agent scaffolding as models improved, and why embeddings-based RAG is giving way to attention-based search. Borovik reveals how developers are using Jules for hours or even days at a time, the challenges of managing context windows that push 2 million tokens, and why coding agents represent both the most important AI application and the clearest path to AGI.

    This conversation reveals Google's positioning in the coding agent race, the evolution from internal tools to public products, and what founders, developers, and AI engineers should understand about building for a future where AI becomes the new brush for software engineering.



    Chapters

    00:00:00 Introduction and GitHub Universe Recap
    00:00:57 New York Tech Scene and East Coast Hackathons
    00:02:19 From Google Search to AI Coding: Jed's Journey
    00:04:19 Google Labs Mission and DeepMind Collaboration
    00:06:41 Jules: Autonomous Coding Agents Explained
    00:09:39 The Evolution of Agent Scaffolding and Model Quality
    00:11:30 RAG vs Attention: The Shift in Code Understanding
    00:13:49 Jules' Journey from Preview to Production
    00:15:05 AI Engineer Summit: Community Building and Networking
    00:25:06 Context Management in Long-Running Agents
    00:29:02 The Future of Software Engineering with AI
    00:36:26 Beyond Vibe Coding: Spec Development and Verification
    00:40:20 Multimodal Input and Computer Use for Coding Agents

  • Today’s guests are Priscilla Chan and Mark Zuckerberg, co-founders of Biohub (fka Chan Zuckerberg Initiative). Biohub is one of the leading institutes for AI x Bio and open science research, with projects like CELLxGENE, rbio1, VariantFormer, and many more. We talked about the evolution from a broad philanthropic institute to specializing in frontier AI + bio, why they are building 12-foot-tall microscopes to gather better data, and how building a virtual cell model + virtual immune system could potentially help us cure all diseases.

    Chapters

    00:00:00 Introduction and CZI's 10-Year Anniversary
    00:00:56 Learning from Bill Gates
    00:04:05 Science vs Translation
    00:10:45 The Power of Physical Proximity in Science
    00:13:55 Building the Virtual Cell: From Data to Models
    00:15:51 Microscopes, Imaging, and Converting Atoms to Bits
    00:23:18 AI Meets Biology: The Frontier Lab Concept
    00:27:25 How Models Can Enable More Ambitious Research
    00:30:15 Precision Medicine and Clinical Impact
    00:45:17 The Virtual Immune System and Cellular Engineering
    00:48:27 Accelerating the Timeline: What It Takes to Cure All Disease
    00:28:45 Joining Forces with Evolutionary Scale