Episodes
-
We are re-upping this episode after LMArena announced their fresh Series A (https://www.theinformation.com/articles/ai-evaluation-startup-lmarena-valued-1-7-billion-new-funding-round?rc=luxwz4), raising $150M at a $1.7B valuation, with $30M annualized consumption revenue (aka $2.5M MRR) after their September evals product launch.
—-
From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 at one of the most influential platforms in AI, trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases? We caught up with Anastasios live at NeurIPS 2025 to dig into the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company), why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company), how they're spending that $100M (inference costs, React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market), the Leaderboard Illusion controversy and why their response demolished the paper's claims (factual errors, misrepresentation of open vs. closed source sampling, and ignoring the transparency of preview testing that the community loves), why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system: models can't pay to get on, can't pay to get off, and scores reflect millions of real votes), how they're expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon), why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment), and his vision for Arena as the central evaluation platform that provides the North Star for the industry: constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users.
We discuss:
The $100M raise: use of funds is primarily inference costs (funding free usage for tens of millions of monthly conversations), React migration off Gradio (custom loading icons, better developer hiring, more flexibility), and hiring world-class talent
The scale: 250M+ conversations on the platform, tens of millions per month, 25% of users do software for a living, and half of users are now logged in
The leaderboard illusion controversy: Cohere researchers claimed undisclosed private testing created inequities, but Arena's response demolished the paper's factual errors (misrepresented open vs. closed source sampling, ignored transparency of preview testing that the community loves)
Why preview testing is loved by the community: secret codenames (Gemini Nano Banana, named after PM Naina's nickname), early access to unreleased models, and the thrill of being first to vote on frontier capabilities
The Nano Banana moment: changed Google's market share overnight, billions of dollars in stock movement, and validated that multimodal models (image generation, video) are economically critical for marketing, design, and AI-for-science
New categories: occupational and expert arenas (medicine, legal, finance, creative marketing), Code Arena, and video arena coming soon
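As a rough illustration of how pairwise votes can turn into a ranking, here is a minimal Elo-style sketch. The show notes do not describe Arena's actual scoring method, so the update rule, the `k` step size, the starting rating, and the model names are all assumptions made for this example.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def apply_votes(votes, k: float = 4.0, initial: float = 1000.0) -> dict:
    """Fold a sequence of (winner, loser) votes into per-model ratings."""
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        surprise = 1.0 - expected_score(r_w, r_l)
        # Zero-sum update: the winner gains exactly what the loser gives up.
        ratings[winner] = r_w + k * surprise
        ratings[loser] = r_l - k * surprise
    return ratings


# Hypothetical model names and votes, for illustration only.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
ratings = apply_votes(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because every update is driven by a vote, there is no input here that a payment could change, which is the property the integrity discussion emphasizes.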
Chapters
00:00:00 Introduction: Anastasios from Arena and the LM Arena Journey
00:01:36 The Anjney Midha Incubation: From Berkeley Basement to Startup
00:02:47 The Decision to Start a Company: Scaling Beyond Academia
00:03:38 The $100M Raise: Use of Funds and Platform Economics
00:05:10 Arena's User Base: 5M+ Users and Diverse Demographics
00:06:02 The Competitive Landscape: Artificial Analysis, AI.xyz, and Arena's Differentiation
00:08:12 Educational Value and Learning from the Community
00:08:41 Technical Migration: From Gradio to React and Platform Evolution
00:10:18 Leaderboard Illusion Paper: Addressing Critiques and Maintaining Integrity
00:12:29 Nano Banana Moment: How Preview Models Create Market Impact
00:13:41 Multimodal AI and Image Generation: From Skepticism to Economic Value
00:15:37 Core Principles: Platform Integrity and the Public Leaderboard as Charity
00:18:29 Future Roadmap: Expert Categories, Multimodal, Video, and Occupational Verticals
00:19:10 API Strategy and Focus: Doing One Thing Well
00:19:51 Community Management and Retention: Sign-In, History, and Daily Value
00:22:21 Partnerships and Agent Evaluation: From Devin to Full-Featured Harnesses
00:21:49 Hiring and Building a High-Performance Team -
From undergraduate research seminars at Princeton to winning a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep, unlocking performance gains that the RL community thought impossible. We caught up with the team live at NeurIPS to dig into the story behind RL1000: why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it's not just about depth, it's about the objective), how they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse, the critical architectural tricks that made it work (residual connections, layer normalization, and a shift from regression to classification), why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth), how JAX and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place), the "critical depth" phenomenon where performance doesn't just improve but multiplies once you cross 15M+ transitions and add the right architectural components, why this isn't just "make networks bigger" but a fundamental shift in RL objectives (their code doesn't have a line saying "maximize rewards"; it's pure self-supervised representation learning), how deep teacher, shallow student distillation could unlock deployment at scale (train frontier capabilities with 1,000 layers, distill down to efficient inference models), the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection), and their thesis that RL is finally ready to scale like language and vision: not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work.
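The "linear vs. quadratic growth" point can be checked with a quick parameter count. This is a minimal sketch for a plain fully connected network; the layer sizes are made-up numbers, and the paper's actual architecture (residual blocks, layer norm) is not reproduced here.

```python
def mlp_params(width: int, depth: int, in_dim: int, out_dim: int) -> int:
    """Count weights and biases in an MLP with `depth` hidden layers of size `width`."""
    first = in_dim * width + width                   # input -> first hidden layer
    hidden = (depth - 1) * (width * width + width)   # hidden -> hidden layers
    last = width * out_dim + out_dim                 # last hidden layer -> output
    return first + hidden + last


# Hypothetical sizes, chosen only to make the comparison concrete.
base = mlp_params(width=256, depth=4, in_dim=32, out_dim=8)
deeper = mlp_params(width=256, depth=8, in_dim=32, out_dim=8)  # double the depth
wider = mlp_params(width=512, depth=4, in_dim=32, out_dim=8)   # double the width

print(f"2x depth: {deeper / base:.2f}x params, 2x width: {wider / base:.2f}x params")
```

Doubling depth adds a fixed number of width-by-width blocks, while doubling width roughly quadruples every hidden block, so the wider network ends up far larger for the same scaling factor.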
We discuss:
The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together, states along different trajectories are pushed apart—turning RL into a classification problem
Why naive scaling failed: doubling depth degraded performance, doubling again with residual connections and layer norm suddenly skyrocketed performance in one environment—unlocking the "critical depth" phenomenon
Scaling depth vs. width: depth grows parameters linearly, width grows quadratically—depth is more parameter-efficient and sample-efficient for the same performance
The JAX + GPU-accelerated environments unlock: collecting thousands of trajectories in parallel meant data wasn't the bottleneck, and crossing 15M+ transitions was when deep networks really paid off
The blurring of RL and self-supervised learning: their code doesn't maximize rewards directly, it's an actor-critic goal-conditioned RL algorithm, but the learning burden shifts to classification (cross-entropy loss, representation learning) instead of TD error regression
Why scaling batch size unlocks at depth: traditional RL doesn't benefit from larger batches because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension
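The "push same-trajectory states together, push different trajectories apart" objective described above is a contrastive classification loss. Below is a minimal InfoNCE-style sketch in plain Python. The dot-product similarity, the toy embeddings, and the function names are assumptions for illustration; the paper's exact critic and loss may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(state_embs, future_embs):
    """Mean cross-entropy where row i must pick column i (its own future state)
    out of all future states in the batch."""
    n = len(state_embs)
    total = 0.0
    for i in range(n):
        logits = [dot(state_embs[i], future_embs[j]) for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n


# Toy 2-D embeddings: each state paired with a future state from its own trajectory.
states = [[1.0, 0.0], [0.0, 1.0]]
aligned_futures = [[5.0, 0.0], [0.0, 5.0]]
shuffled_futures = [[0.0, 5.0], [5.0, 0.0]]  # futures swapped across trajectories

loss_aligned = info_nce_loss(states, aligned_futures)
loss_shuffled = info_nce_loss(states, shuffled_futures)
```

Note that no reward term appears anywhere: the only training signal is which future state belongs to which trajectory, which is why the episode describes the learning burden as classification.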
—
RL1000 Team (Princeton)
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities: https://openreview.net/forum?id=s0JVsx3bx1
Chapters
00:00:00 Introduction: Best Paper Award and NeurIPS Poster Experience
00:01:11 Team Introductions and Princeton Research Origins
00:03:35 The Deep Learning Anomaly: Why RL Stayed Shallow
00:04:35 Self-Supervised RL: A Different Approach to Scaling
00:05:13 The Breakthrough Moment: Residual Connections and Critical Depth
00:07:15 Architectural Choices: Borrowing from ResNets and Avoiding Vanishing Gradients
00:07:50 Clarifying the Paper: Not Just Big Networks, But Different Objectives
00:08:46 Blurring the Lines: RL Meets Self-Supervised Learning
00:09:44 From TD Errors to Classification: Why This Objective Scales
00:11:06 Architecture Details: Building on BRO and SimBa
00:12:05 Robotics Applications: Goal-Conditioned RL Without Human Supervision
00:13:15 Efficiency Trade-offs: Depth vs Width and Parameter Scaling
00:15:48 JAX and GPU-Accelerated Environments: The Data Infrastructure
00:18:05 World Models and Next State Classification
00:22:37 Unlocking Batch Size Scaling Through Network Capacity
00:24:10 Compute Requirements: State-of-the-Art on a Single GPU
00:21:02 Future Directions: Distillation, VLMs, and Hierarchical Planning
00:27:15 Closing Thoughts: Challenging Conventional Wisdom in RL Scaling -
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents, trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale. We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (October 2023) to the industry standard after Devin's launch (and how Walden emailed him two weeks before the big reveal), how the benchmark evolved from Django-heavy to nine languages across 40 repos (JavaScript, Rust, Java, C, Ruby), why unit tests as verification are limiting and long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds), the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-fficiency, AlgoTune, SciCode) and how benchmark authors are now justifying their splits with curation techniques instead of just "more repos," why Tau-bench's "impossible tasks" controversy is actually a feature, not a bug (intentionally including impossible tasks flags cheating), the tension between long autonomy (5-hour runs) vs. interactivity (Cognition's emphasis on fast back-and-forth), how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs, the academic data problem (companies like Cognition and Cursor have rich user interaction data, academics need user simulators or compelling products like LMArena to get similar signal), and his vision for CodeClash as a testbed for human-AI collaboration: freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.
We discuss:
John's path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks
The SWE-bench origin story: released October 2023, mostly ignored until Cognition's Devin launch kicked off the arms race (Walden emailed John two weeks before: "we have a good number")
SWE-bench Verified: the curated, high-quality split that became the standard for serious evals
SWE-bench Multimodal and Multilingual: nine languages (JavaScript, Rust, Java, C, Ruby) across 40 repos, moving beyond the Django-heavy original distribution
The SWE-bench Pro controversy: independent authors used the "SWE-bench" name without John's blessing, but he's okay with it ("congrats to them, it's a great benchmark")
CodeClash: John's new benchmark for long-horizon development—agents maintain their own codebases, edit and improve them each round, then compete in arenas (programming games like Halite, economic tasks like GDP optimization)
SWE-fficiency (Jeffrey Ma, John's high school classmate): optimize code for speed without changing behavior (parallelization, SIMD operations)
AlgoTune, SciCode, Terminal-bench, Tau-bench, SecBench, SRE-bench: the Cambrian explosion of code evals, each diving into different domains (security, SRE, science, user simulation)
The Tau-bench "impossible tasks" debate: some tasks are underspecified or impossible, but John thinks that's actually a feature (flags cheating if you score above 75%)
Cognition's research focus: codebase understanding (retrieval++), helping humans understand their own codebases, and automatic context engineering for LLMs (research sub-agents)
The vision: CodeClash as a testbed for human-AI collaboration—vary the setup (solo agent, multi-agent, human+agent), freeze model capability, and measure how interaction changes as models improve
—
John Yang
SWE-bench: https://www.swebench.com
X: https://x.com/jyangballin
Chapters
00:00:00 Introduction: John Yang on SWE-bench and Code Evaluations
00:00:31 SWE-bench Origins and Devin's Impact on the Coding Agent Arms Race
00:01:09 SWE-bench Ecosystem: Verified, Pro, Multimodal, and Multilingual Variants
00:02:17 Moving Beyond Django: Diversifying Code Evaluation Repositories
00:03:08 CodeClash: Long-Horizon Development Through Programming Tournaments
00:04:41 From Halite to Economic Value: Designing Competitive Coding Arenas
00:06:04 Ofir's Lab: SWE-fficiency, AlgoTune, and SciCode for Scientific Computing
00:07:52 The Benchmark Landscape: TAU-bench, Terminal-bench, and User Simulation
00:09:20 The Impossible Task Debate: Refusals, Ambiguity, and Benchmark Integrity
00:12:32 The Future of Code Evals: Long Autonomy vs Human-AI Collaboration
00:14:37 Call to Action: User Interaction Data and Codebase Understanding Research -
From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution, from the PPO vs DPO debates of 2023 to today's RLVR era, where the real innovation isn't optimization methods but data quality, signal trust, and token efficiency. We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026: why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math), how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preference you can't), why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens), how Codex has changed his workflow so much he feels "trapped" by 40-minute design sessions followed by 15-minute agent sprints, the infrastructure chaos of scaling RL ("way more moving parts than pre-training"), why long context will keep climbing but agents + graph walks might matter more than 10M-token windows, the shopping model as a test bed for interruptibility and chain-of-thought transparency, why personality toggles (Anton vs Clippy) are a real differentiator users care about, and his thesis that the education system isn't producing enough people who can do both distributed systems and ML research, the exact skill set required to push the frontier when the bottleneck moves every few weeks.
We discuss:
Josh's path: pre-training data curation → post-training researcher at OpenAI, shipping GPT-4o, o1, o3, GPT-5 thinking, and the shopping model
Why he switched from pre-training to post-training: "Do I want to make 3% compute efficiency wins, or change behavior by 40%?"
The RL infrastructure challenge: way more moving parts than pre-training (tasks, grading setups, external partners), and why babysitting runs at 12:30am means jumping into unfamiliar code constantly
How Codex has changed his workflow: 40-minute design sessions followed by 15-minute agent sprints, and the strange "trapped" feeling of waiting for the agent to finish
The RLHF vs RLVR debate: both are policy gradient methods, the real difference is data quality and signal trust (human preference vs. verifiable correctness)
Why GRPO (from DeepSeek Math) was underappreciated: not just an optimization trick, but a shift toward reward signals you can actually trust (math answers over human vibes)
The token efficiency revolution: GPT-5 to 5.1 bumped evals and slashed tokens, and why thinking in tokens (not wall-clock time) unlocks better tool-calling and agent workflows
Personality toggles: Anton (tool, no warmth) vs Clippy (friendly, helpful), and why Josh uses custom instructions to make his model "just a tool"
The router problem: having a router at the top (GPT-5 thinking vs non-thinking) and an implicit router (thinking effort slider) creates weird bumps, and why the abstractions will eventually merge
Long context: climbing Graphwalks evals, the dream of 10M+ token windows, and why agents + graph walks might matter more than raw context length
Why the education system isn't producing enough people who can do both distributed systems and ML research, and why that's the bottleneck for frontier labs
The 2026 vision: neither pre-training nor post-training is dead, we're in the fog of war, and the bottleneck will keep moving (so emotional stability helps)
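To make the "same math, different data" point concrete, here is a minimal sketch of a group-relative advantage in the style of GRPO, fed by a verifiable reward. The exact-match checker, the sample answers, and the function names are assumptions for illustration and say nothing about OpenAI's actual pipeline.

```python
import statistics


def verifiable_reward(answer: str, reference: str) -> float:
    """RLVR-style signal: 1.0 if the answer matches a checkable reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Score each sample against its own group: subtract the group mean and
    divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled answers to one math prompt whose answer is 42.
samples = ["42", "41", "42", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Swapping `verifiable_reward` for a learned human-preference score would leave `group_relative_advantages` and the downstream policy gradient update unchanged, which is the sense in which the difference lies in the input signal and how much it can be trusted.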
—
Josh McGrath
OpenAI: https://openai.com
X: https://x.com/j_mcgraph
Chapters
00:00:00 Introduction: Josh McGrath on Post-Training at OpenAI
00:04:37 The Shopping Model: Black Friday Launch and Interruptibility
00:07:11 Model Personality and the Anton vs Clippy Divide
00:08:26 Beyond PPO vs DPO: The Data Quality Spectrum in RL
00:01:40 Infrastructure Challenges: Why Post-Training RL is Harder Than Pre-Training
00:13:12 Token Efficiency: The 2D Plot That Matters Most
00:03:45 Codex Max and the Flow Problem: 40 Minutes of Planning, 15 Minutes of Waiting
00:17:29 Long Context and Graphwalks: Climbing Toward Perfect Context
00:21:23 The ML-Systems Hybrid: What's Hard to Hire For
00:24:50 Pre-Training Isn't Dead: Living Through Technological Revolution -
From Berkeley robotics and OpenAI's 2017 Dota-era internship to shipping RL breakthroughs on GPT-4o, o1, and o3, and now leading model development at Cursor, Ashvin Nair has done it all. We caught up with Ashvin at NeurIPS 2025 to dig into the inside story of OpenAI's reasoning team (spoiler: it went from a dozen people to 300+), why IOI Gold felt reachable in 2022 but somehow didn't change the world when o1 actually achieved it, how RL doesn't generalize beyond the training distribution (and why that means you need to bring economically useful tasks into distribution by co-designing products and models), the deeper lessons from the RL research era (2017–2022) and why most of it didn't pan out because the community overfitted to benchmarks, how Cursor is uniquely positioned to do continual learning at scale with policy updates every two hours and product-model co-design that keeps engineers in the loop instead of context-switching into ADHD hell, and his bet that the next paradigm shift is continual learning with infinite memory—where models experience something once (a bug, a mistake, a user pattern) and never forget it, storing millions of deployment tokens in weights without overloading capacity.
We discuss:
Ashvin's path: Berkeley robotics PhD → OpenAI 2017 intern (Dota era) → o1/o3 reasoning team → Cursor ML lead in three months
Why robotics people are the most grounded at NeurIPS (they work with the real world) and simulation people are the most unhinged (Lex Fridman's take)
The IOI Gold paradox: "If you told me we'd achieve IOI Gold in 2022, I'd assume we could all go on vacation—AI solved, no point working anymore. But life is still the same."
The RL research era (2017–2022) and why most of it didn't pan out: overfitting to benchmarks, too many implicit knobs to tune, and the community rewarding complex ideas over simple ones that generalize
Inside the o1 origin story: a dozen people, conviction from Ilya and Jakub Pachocki that RL would work, small-scale prototypes producing "surprisingly accurate reasoning traces" on math, and first-principles belief that scaled
The reasoning team grew from ~12 to 300+ people as o1 became a product and safety, tooling, and deployment scaled up
Why Cursor is uniquely positioned for continual learning: policy updates every two hours (online RL on tab), product and ML sitting next to each other, and the entire software engineering workflow (code, logs, debugging, DataDog) living in the product
Composer as the start of product-model co-design: smart enough to use, fast enough to stay in the loop, and built by a 20–25 person ML team with high-taste co-founders who code daily
The next paradigm shift: continual learning with infinite memory—models that experience something once (a bug, a user mistake) and store it in weights forever, learning from millions of deployment tokens without overloading capacity (trillions of pretraining tokens = plenty of room)
Why off-policy RL is unstable (Ashvin's favorite interview question) and why Cursor does two-day work trials instead of whiteboard interviews
The vision: automate software engineering as a process (not just answering prompts), co-design products so the entire workflow (write code, check logs, debug, iterate) is in-distribution for RL, and make models that never make the same mistake twice
—
Ashvin Nair
Cursor: https://cursor.com
X: https://x.com/ashvinnair_
Chapters
00:00:00 Introduction: From Robotics to Cursor via OpenAI
00:01:58 The Robotics to LLM Agent Transition: Why Code Won
00:09:11 RL Research Winter and Academic Overfitting
00:11:45 The Scaling Era and Moving Goalposts: IOI Gold Doesn't Mean AGI
00:21:30 OpenAI's Reasoning Journey: From Codex to O1
00:20:03 The Blip: Thanksgiving 2023 and OpenAI Governance
00:22:39 RL for Reasoning: The O-Series Conviction and Scaling
00:25:47 O1 to O3: Smooth Internal Progress vs External Hype Cycles
00:33:07 Why Cursor: Co-Designing Products and Models for Real Work
00:34:14 Composer and the Future: Online Learning Every Two Hours
00:35:15 Continual Learning: The Missing Paradigm Shift
00:44:00 Hiring at Cursor and Why Off-Policy RL is Unstable -
From investing through the modern data stack era (DBT, Fivetran, and the analytics explosion) to now investing at the frontier of AI infrastructure and applications at Amplify Partners, Sarah Catanzaro has spent years at the intersection of data, compute, and intelligence—watching categories emerge, merge, and occasionally disappoint. We caught up with Sarah live at NeurIPS 2025 to dig into the state of AI startups heading into 2026: why $100M+ seed rounds with no near-term roadmap are now the norm (and why that terrifies her), what the DBT-Fivetran merger really signals about the modern data stack (spoiler: it's not dead, just ready for IPO), how frontier labs are using DBT and Fivetran to manage training data and agent analytics at scale, why data catalogs failed as standalone products but might succeed as metadata services for agents, the consumerization of AI and why personalization (memory, continual learning, K-factor) is the 2026 unlock for retention and growth, why she thinks RL environments are a fad and real-world logs beat synthetic clones every time, and her thesis for the most exciting AI startups: companies that marry hard research problems (RAG, rule-following, continual learning) with killer applications that were simply impossible before.
We discuss:
The DBT-Fivetran merger: not the death of the modern data stack, but a path to IPO scale (targeting $600M+ combined revenue) and a signal that both companies were already winning their categories
How frontier labs use data infrastructure: DBT and Fivetran for training data curation, agent analytics, and managing increasingly complex interactions—plus the rise of transactional databases (RocksDB) and efficient data loading (Vortex) for GPU-bound workloads
Why data catalogs failed: built for humans when they should have been built for machines, focused on discoverability when the real opportunity was governance, and ultimately subsumed as features inside Snowflake, DBT, and Fivetran
The $100M+ seed phenomenon: raising massive rounds at billion-dollar valuations with no 6-month roadmap, seven-day decision windows, and founders optimizing for signal ("we're a unicorn") over partnership or dilution discipline
Why world models are overhyped but underspecified: three competing definitions, unclear generalization across use cases (video games ≠ robotics ≠ autonomous driving), and a research problem masquerading as a product category
The 2026 theme: consumerization of AI via personalization—memory management, continual learning, and solving retention/churn by making products learn skills, preferences, and adapt as the world changes (not just storing facts in cursor rules)
Why RL environments are a fad: labs are paying 7–8 figures for synthetic clones when real-world logs, traces, and user activity (à la Cursor) are richer, cheaper, and more generalizable
Sarah's investment thesis: research-driven applications that solve hard technical problems (RAG for Harvey, rule-following for Sierra, continual learning for the next killer app) and unlock experiences that were impossible before
Infrastructure bets: memory, continual learning, stateful inference, and the systems challenges of loading/unloading personalized weights at scale
Why K-factor and growth fundamentals matter again: AI felt magical in 2023–2024, but as the magic fades, retention and virality are back—and most AI founders have never heard of K-factor
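For readers who have not met K-factor, the usual definition is invites sent per user multiplied by the fraction of invites that convert. The sketch below uses that standard formula with made-up numbers; none of the figures come from the episode.

```python
def k_factor(invites_per_user: float, conversion_rate: float) -> float:
    """Viral coefficient: new users generated by each existing user per cycle."""
    return invites_per_user * conversion_rate


def project_users(seed_users: int, k: float, cycles: int) -> float:
    """Total users after `cycles` invite rounds, ignoring churn."""
    total = float(seed_users)
    new = float(seed_users)
    for _ in range(cycles):
        new = new * k   # each cohort recruits k times its own size
        total += new
    return total


# Hypothetical product: each user sends 4 invites and 1 in 8 converts.
k = k_factor(invites_per_user=4, conversion_rate=0.125)
users_after_two_cycles = project_users(seed_users=1000, k=k, cycles=2)
```

With k below 1.0 each cohort recruits a smaller one and growth tapers off, so retention has to carry the product; above 1.0 growth compounds on its own.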
—
Sarah Catanzaro
X: https://x.com/sarahcat21
Amplify Partners: https://amplifypartners.com/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction: Sarah Catanzaro's Journey from Data to AI
00:01:02 The DBT-Fivetran Merger: Not the End of the Modern Data Stack
00:05:26 Data Catalogs and What Went Wrong
00:08:16 Data Infrastructure at AI Labs: Surprising Insights
00:10:13 The Crazy Funding Environment of 2024-2025
00:17:18 World Models: Hype, Confusion, and Market Potential
00:18:59 Memory Management and Continual Learning: The Next Frontier
00:23:27 Agent Environments: Just a Fad?
00:25:48 The Perfect AI Startup: Research Meets Application
00:28:02 Closing Thoughts and Where to Find Sarah -
One year ago, Anthropic launched the Model Context Protocol (MCP)—a simple, open standard to connect AI applications to the data and tools they need. Today, MCP has exploded from a local-only experiment into the de facto protocol for agentic systems, adopted by OpenAI, Microsoft, Google, Block, and hundreds of enterprises building internal agents at scale. And now, MCP is joining the newly formed Agentic AI Foundation (AAIF) under the Linux Foundation, alongside Block's Goose coding agent, with founding members spanning the biggest names in AI and cloud infrastructure.
We sat down with David Soria Parra (MCP lead, Anthropic), Nick Cooper (OpenAI), Brad Howes (Block / Goose), and Jim Zemlin (Linux Foundation CEO) to dig into the one-year journey of MCP—from Thanksgiving hacking sessions and the first remote authentication spec to long-running tasks, MCP Apps, and the rise of agent-to-agent communication—and the behind-the-scenes story of how three competitive AI labs came together to donate their protocols and agents to a neutral foundation, why enterprises are deploying MCP servers faster than anyone expected (most of it invisible, internal, and at massive scale), what it takes to design a protocol that works for both simple tool calls and complex multi-agent orchestration, how the foundation will balance taste-making (curating meaningful projects) with openness (avoiding vendor lock-in), and the 2025 vision: MCP as the communication layer for asynchronous, long-running agents that work while you sleep, discover and install their own tools, and unlock the next order of magnitude in AI productivity.
We discuss:
The one-year MCP journey: from local stdio servers to remote HTTP streaming, OAuth 2.1 authentication (and the enterprise lessons learned), long-running tasks, and MCP Apps (iframes for richer UI)
Why MCP adoption is exploding internally at enterprises: invisible, internal servers connecting agents to Slack, Linear, proprietary data, and compliance-heavy workflows (financial services, healthcare)
The authentication evolution: separating resource servers from identity providers, dynamic client registration, and why the March spec wasn't enterprise-ready (and how June fixed it)
How Anthropic dogfoods MCP: internal gateway, custom servers for Slack summaries and employee surveys, and why MCP was born from "how do I scale dev tooling faster than the company grows?"
Tasks: the new primitive for long-running, asynchronous agent operations—why tools aren't enough, how tasks enable deep research and agent-to-agent handoffs, and the design choice to make tasks a "container" (not just async tools)
MCP Apps: why iframes, how to handle styles and branding, seat selection and shopping UIs as the killer use case, and the collaboration with OpenAI to build a common standard
The registry problem: official registry vs. curated sub-registries (Smithery, GitHub), trust levels, model-driven discovery, and why MCP needs "npm for agents" (but with signatures and HIPAA/financial compliance)
The founding story of AAIF: how Anthropic, OpenAI, and Block came together (spoiler: none of them knew the others were talking to the Linux Foundation), why neutrality matters, and why Jim Zemlin says he has never seen this much day-one inbound interest in 22 years
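For context on what a "simple tool call" looks like on the wire: MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and arguments. The sketch below only builds and serializes such a request. The tool name `summarize_channel` and its arguments are hypothetical, and no transport (stdio or HTTP) is shown.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical internal tool, in the spirit of the Slack summary servers mentioned above.
request = make_tool_call(1, "summarize_channel", {"channel": "#eng", "days": 7})
wire = json.dumps(request)
```

A call like this is a single request and response, which is the limitation the tasks primitive is meant to address for long-running, asynchronous work.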
—
David Soria Parra (Anthropic / MCP)
MCP: https://modelcontextprotocol.io
LinkedIn: https://uk.linkedin.com/in/david-soria-parra-4a78b3a
X: https://x.com/dsp_
Nick Cooper (OpenAI)
X: https://x.com/nicoaicopr
Brad Howes (Block / Goose)
Goose: https://github.com/block/goose
Jim Zemlin (Linux Foundation)
LinkedIn: https://www.linkedin.com/in/zemlin/
Agentic AI Foundation
https://agenticai.foundation
Chapters
00:00:00 Introduction: MCP's First Year and Foundation Launch
00:01:17 MCP's Journey: From Launch to Industry Standard
00:02:06 Protocol Evolution: Remote Servers and Authentication
00:08:52 Enterprise Authentication and Financial Services
00:11:42 Transport Layer Challenges: HTTP Streaming and Scalability
00:15:37 Standards Development: Collaboration with Tech Giants
00:34:27 Long-Running Tasks: The Future of Async Agents
00:30:41 Discovery and Registries: Building the MCP Ecosystem
00:30:54 MCP Apps and UI: Beyond Text Interfaces
00:26:55 Internal Adoption: How Anthropic Uses MCP
00:23:15 Skills vs MCP: Complementary Not Competing
00:36:16 Community Events and Enterprise Learnings
01:03:31 Foundation Formation: Why Now and Why Together
01:07:38 Linux Foundation Partnership: Structure and Governance
01:11:13 Goose as Reference Implementation
01:17:28 Principles Over Roadmaps: Composability and Quality
01:21:02 Foundation Value Proposition: Why Contribute
01:27:49 Practical Investments: Events, Tools, and Community
01:34:58 Looking Ahead: Async Agents and Real Impact -
Note: Steve and Gene's talk on Vibe Coding and the post-IDE world was one of the top talks of AIE CODE: https://www.youtube.com/watch?v=7Dtu2bilcFs&t=1019s&pp=0gcJCU0KAYcqIYzv
From building legendary platforms at Google and Amazon to authoring one of the most influential essays on AI-powered development (Revenge of the Junior Developer, quoted by Dario Amodei himself), Steve Yegge has spent decades at the frontier of software engineering, and now he's leading the charge into what he calls the "factory farming" era of code. After stints at Sourcegraph and building Beads (a purely vibe-coded issue tracker with tens of thousands of users), Steve co-authored The Vibe Coding Book and is now building VC (VibeCoder), an agent orchestration dashboard designed to move developers from writing code to managing fleets of AI agents that coordinate, parallelize, and ship features while you sleep.
We sat down with Steve at AI Engineer Summit to dig into why Claude Code, Cursor, and the entire 2024 stack are already obsolete, what it actually takes to trust an agent after 2,000 hours of practice (hint: they will delete your production database if you anthropomorphize them), why the real skill is no longer writing code but orchestrating agents like a NASCAR pit crew, how merging has become the new wall that every 10x-productive team is hitting (and why one company's solution is literally "one engineer per repo"), the rise of multi-agent workflows where agents reserve files, message each other via MCP, and coordinate like a little village, why Steve believes if you're still using an IDE to write code by January 1st, you're a bad engineer, how the 12–15 year experience bracket is the most resistant demographic (and why their identity is tied to obsolete workflows), the hidden chaos inside OpenAI, Anthropic, and Google as they scale at breakneck speed, why rewriting from scratch is now faster than refactoring for a growing class of codebases, and his 2025 prediction: we're moving from subsistence agriculture to John Deere-scale factory farming of code, and the Luddite backlash is only just beginning.
We discuss:
Why Claude Code, Cursor, and agentic coding tools are already last year's tech—and what comes next: agent orchestration dashboards where you manage fleets, not write lines
The 2,000-hour rule: why it takes a full year of daily use before you can predict what an LLM will do, and why trust = predictability, not capability
Steve's hot take: if you're still using an IDE to develop code by January 1st, 2025, you're a bad engineer—because the abstraction layer has moved from models to full-stack agents
The demographic most resistant to vibe coding: 12–15 years of experience, senior engineers whose identity is tied to the way they work today, and why they're about to become the interns
Why anthropomorphizing LLMs is the biggest mistake: the "hot hand" fallacy, agent amnesia, and how Steve's agent once locked him out of prod by changing his password to "fix" a problem
Should kids learn to code? Steve's take: learn to vibe code—understand functions, classes, architecture, and capabilities in a language-neutral way, but skip the syntax
The 2025 vision: "factory farming of code" where orchestrators run Claude Code, scrub output, plan-implement-review-test in loops, and unlock programming for non-programmers at scale
—
Steve Yegge
X: https://x.com/steve_yegge
Medium (Stevie's Tech Talks): https://steve-yegge.medium.com/
GitHub (VC / VibeCoder): https://github.com/yegge-labs
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction: Steve Yegge on Vibe Coding and AI Engineering
00:00:59 The Backlash: Who Resists Vibe Coding and Why
00:04:26 The 2000 Hour Rule: Building Trust with AI Coding Tools
00:03:31 The January 1st Deadline: IDEs Are Becoming Obsolete
00:02:55 10X Productivity at OpenAI: The Performance Review Problem
00:07:49 The Hot Hand Fallacy: When AI Agents Betray Your Trust
00:11:12 Claude Code Isn't It: The Need for Agent Orchestration
00:15:20 The Orchestrator Revolution: From Claude Code to Agent Villages
00:18:46 The Merge Wall: The Biggest Unsolved Problem in AI Coding
00:26:33 Never Rewrite Your Code - Until Now: Joel Spolsky Was Wrong
00:22:43 Factory Farming Code: The John Deere Era of Software
00:29:27 Google's Gemini Turnaround and the AI Lab Chaos
00:33:20 Should Your Kids Learn to Code? The New Answer
00:34:59 Code MCP and the Gossip Rate: Latest Vibe Coding Discoveries -
From the frontlines of OpenAI's Codex and GPT-5 training teams, Bryan and Bill are building the future of AI-powered coding—where agents don't just autocomplete, they architect, refactor, and ship entire features while you sleep. We caught up with them at AI Engineer Conference right after the launch of Codex Max, OpenAI's newest long-running coding agent designed to work for 24+ hours straight, manage its own context, and spawn sub-agents to parallelize work across your entire codebase.
We sat down with Bryan and Bill to dig into what it actually takes to train a model that developers trust—why personality, communication, and planning matter as much as raw capability, how Codex is trained with strong opinions about tools (it loves rg over grep, seriously), why the abstraction layer is moving from models to full-stack agents you can plug into VS Code or Zed, how OpenAI partners co-develop tool integrations and discover unexpected model habits (like renaming tools to match Codex's internal training), the rise of applied evals that measure real-world impact instead of academic benchmarks, why multi-turn evals are the next frontier (and Bryan's "job interview eval" idea), how coding agents are breaking out of code into personal automation, terminal workflows, and computer use, and their 2026 vision: coding agents trusted enough to handle the hardest refactors at any company, not just top-tier firms, and general enough to build integrations, organize your desktop, and unlock capabilities you'd never get access to otherwise.
We discuss:
What Codex Max is: a long-running coding agent that can work 24+ hours, manage its own context window, and spawn sub-agents for parallel work
Why the name "Max": maximalist, maximization, speed and endurance—it's simply better and faster for the same problems
Training for personality: communication, planning, context gathering, and checking your work as behavioral characteristics, not just capabilities
How Codex develops habits like preferring rg over grep, and why renaming tools to match its training (e.g., terminal-style naming) dramatically improves tool-call performance
The split between Codex (opinionated, agent-focused, optimized for the Codex harness) and GPT-5 (general, more durable across different tools and modalities)
Why the abstraction layer is moving up: from prompting models to plugging in full agents (Codex, GitHub Copilot, Zed) that package the entire stack
The rise of sub-agents and agents-using-agents: Codex Max spawning its own instances, handing off context, and parallelizing work across a codebase
How OpenAI works with coding partners on the bleeding edge to co-develop tool integrations and discover what the model is actually good at
The shift to applied evals: capturing real-world use cases instead of academic benchmarks, and why ~50% of OpenAI employees now use Codex daily
Why multi-turn evals are the next frontier: LLM-as-a-judge for entire trajectories, Bryan's "job interview eval" concept, and the need for a batch multi-turn eval API
How coding agents are breaking out of code: personal automation, organizing desktops, terminal workflows, and "Devin for non-coding" use cases
Why Slack is the ultimate UI for work, and how coding agents can become your personal automation layer for email, files, and everything in between
The 2026 vision: more computer use, more trust, and coding agents capable enough that any company can access top-tier developer capabilities, not just elite firms
—
Bryan & Bill (OpenAI Codex Team)
http://x.com/bfioca
https://x.com/realchillben
OpenAI Codex: https://openai.com/index/openai-codex/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction: Latent Space Listeners at AI Engineer Code
00:01:27 Codex Max Launch: Training for Long-Running Coding Agents
00:03:01 Model Personality and Trust: Communication, Planning, and Self-Checking
00:05:20 Codex vs GPT-5: Opinionated Agents vs General Models
00:07:47 Tool Use and Model Habits: The Ripgrep Discovery
00:09:16 Personality Design: Verbosity vs Efficiency in Coding Agents
00:11:56 The Agent Abstraction Layer: Building on Top of Codex
00:14:08 Sub-Agents and Multi-Agent Patterns: The Future of Composition
00:16:11 Trust and Adoption: OpenAI Developers Using Codex Daily
00:17:21 Applied Evals: Real-World Testing vs Academic Benchmarks
00:19:15 Multi-Turn Evals and the Job Interview Pattern
00:21:35 Feature Request: Batch Multi-Turn Eval API
00:22:28 Beyond Code: Personal Automation and Computer Use
00:24:51 Vision-Native Agents and the UI Integration Challenge
00:25:02 2026 Predictions: Trust, Computer Use, and Democratized Excellence -
As with all demo-heavy and especially vision AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!)
From SAM 1's 11-million-image data engine to SAM 2's memory-based video tracking, MSL’s Segment Anything project has redefined what's possible in computer vision. Now SAM 3 takes the next leap: concept segmentation—prompting with natural language like "yellow school bus" or "tablecloth" to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio (https://x.com/aiatmeta/status/2000980784425931067?s=46), SAM can now even segment audio output!
We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30ms on images and scales to real-time video on multi-GPU setups. We dig into the data engine that automated exhaustive annotation from two minutes per image down to 25 seconds using AI verifiers fine-tuned on Llama, the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts vs. the previous 1.2k, how SAM 3 separates recognition from localization with a presence token, why decoupling the detector and tracker was critical to preserve object identity in video, how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini, and the real-world impact: 106 million smart polygons created on Roboflow saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.
We discuss:
What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like "purple umbrella" or "watering can"
How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly
Real-time performance: 30ms per image (100 detected objects on H200), 10 objects on 2×H200 video, 28 on 4×, 64 on 8×, with parallel inference and "fast mode" tracking
The SACO benchmark: 200,000+ unique concepts vs. 1.2k in prior benchmarks, designed to capture the diversity of natural language and reach human-level exhaustivity
The data engine: from 2 minutes per image (all-human) to 45 seconds (model-in-loop proposals) to 25 seconds (AI verifiers for mask quality and exhaustivity checks), fine-tuned on Llama 3.2
Why exhaustivity is central: every instance must be found, verified by AI annotators, and manually corrected only when the model misses—automating the hardest part of segmentation at scale
Architecture innovations: presence token to separate recognition ("is it in the image?") from localization ("where is it?"), decoupled detector and tracker to preserve identity-agnostic detection vs. identity-preserving tracking
Building on Meta's ecosystem: Perception Encoder, DINO v2 detector, Llama for data annotation, and SAM 2's memory-based tracking backbone
SAM 3 Agents: using SAM 3 as a visual tool for multimodal LLMs (Gemini, Llama) to solve complex visual reasoning tasks like "find the bigger character" or "what distinguishes male from female in this image"
Fine-tuning with as few as 10 examples: domain adaptation for specialized use cases (Waymo vehicles, medical imaging, OCR-heavy scenes) and the outsized impact of negative examples
Real-world impact at Roboflow: 106M smart polygons created, saving 130+ years of labeling time across cancer research, underwater trash cleanup, autonomous drones, industrial automation, and more
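As a rough sanity check on the figures quoted above, the sketch below works out the speedup at each stage of the data engine and the average time saved per smart polygon. It uses only numbers from these notes; the conversion of "years" into seconds assumes continuous calendar years, which is an assumption on our part, not something stated in the episode.

```python
# Figures quoted in the episode notes; nothing here is measured independently.
SECONDS_PER_IMAGE = {
    "all-human": 120,               # 2 minutes per image
    "model-in-loop proposals": 45,
    "AI verifiers": 25,
}
POLYGONS = 106_000_000              # smart polygons created on Roboflow
YEARS_SAVED = 130                   # notes say "130+", so this is a lower bound

# Assumption: a "year" of labeling time means a continuous 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600


def speedup(baseline_seconds: float, current_seconds: float) -> float:
    """How many times faster the current stage is than the baseline."""
    return baseline_seconds / current_seconds


def seconds_saved_per_polygon(years: float, polygons: int) -> float:
    """Average labeling time saved per polygon, in seconds."""
    return years * SECONDS_PER_YEAR / polygons


baseline = SECONDS_PER_IMAGE["all-human"]
for stage, seconds in SECONDS_PER_IMAGE.items():
    print(f"{stage}: {seconds}s per image, {speedup(baseline, seconds):.1f}x vs all-human")

saved = seconds_saved_per_polygon(YEARS_SAVED, POLYGONS)
print(f"about {saved:.0f}s saved per polygon (lower bound)")
```

Under that assumption, the Roboflow figure works out to a bit under 40 seconds saved per polygon, and the final data-engine stage is a little under 5x faster than all-human annotation.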
—
MSL FAIR team
Nikhila: https://www.linkedin.com/in/nikhilaravi/
Pengchuan: https://pzzhang.github.io/pzzhang/
Joseph Nelson
X: https://x.com/josephofiowa
LinkedIn: https://www.linkedin.com/in/josephofiowa/
-
Note: this is Pliny and John’s first major podcast. Voices have been changed for opsec.
From jailbreaking every frontier model and turning down Anthropic's Constitutional AI challenge to leading BT6, a 28-operator white-hat hacker collective obsessed with radical transparency and open-source AI security, Pliny the Liberator and John V are redefining what AI red-teaming looks like when you refuse to lobotomize models in the name of "safety."
Pliny built his reputation crafting universal jailbreaks (skeleton keys that obliterate guardrails across modalities) and open-sourcing prompt templates like Libertas, predictive reasoning cascades, and the infamous "Pliny divider" that's now embedded so deep in model weights it shows up unbidden in WhatsApp messages. John V, coming from prompt engineering and computer vision, co-founded the BASI Discord (40,000 members strong) and helps steer BT6's ethos: if you can't open-source the data, we're not interested. Together they've turned down enterprise gigs, pushed back on Anthropic's closed bounties, and insisted that real AI security happens at the system layer, not by bubble-wrapping latent space.
We sat down with Pliny and John to dig into the mechanics of hard vs. soft jailbreaks, why multi-turn crescendo attacks were obvious to hackers years before academia "discovered" them, how segmented sub-agents let one jailbroken orchestrator weaponize Claude for real-world attacks (exactly as Pliny predicted 11 months before Anthropic's recent disclosure), why guardrails are security theater that punishes capability while doing nothing for real safety, the role of intuition and "bonding" with models to navigate latent space, how BT6 vets operators on skill and integrity, why they believe Mech Interp and open-source data are the path forward (not RLHF lobotomization), and their vision for a future where spatial intelligence, swarm robotics, and AGI alignment research happen in the open—bootstrapped, grassroots, and uncompromising.
We discuss:
What universal jailbreaks are: skeleton-key prompts that obliterate guardrails across models and modalities, and why they're central to Pliny's mission of "liberation"
Hard vs. soft jailbreaks: single-input templates vs. multi-turn crescendo attacks, and why the latter were obvious to hackers long before academic papers
The Libertas repo: predictive reasoning, the Library of Babel analogy, quotient dividers, weight-space seeds, and how introducing "steered chaos" pulls models out-of-distribution
Why jailbreaking is 99% intuition and bonding with the model: probing token layers, syntax hacks, multilingual pivots, and forming a relationship to navigate latent space
The Anthropic Constitutional AI challenge drama: UI bugs, judge failures, goalpost moving, the demand for open-source data, and why Pliny sat out the $30k bounty
Why guardrails ≠ safety: security theater, the futility of locking down latent space when open-source is right behind, and why real safety work happens in meatspace (not RLHF)
The weaponization of Claude: how segmented sub-agents let one jailbroken orchestrator execute malicious tasks (pyramid-builder analogy), and why Pliny predicted this exact TTP 11 months before Anthropic's disclosure
BT6 hacker collective: 28 operators across two cohorts, vetted on skill and integrity, radical transparency, radical open-source, and the magic of moving the needle on AI security, swarm intelligence, blockchain, and robotics
—
Pliny the Liberator
X: https://x.com/elder_plinius
GitHub (Libertas): https://github.com/elder-plinius/L1B3RT45
John V
X: https://x.com/JohnVersus
BT6 & BASI
BT6: https://bt6.gg
BASI Discord: Search "BASI Discord" or ask Pliny/John V on X
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction: Meet Pliny the Liberator and John V
00:01:50 The Philosophy of AI Liberation and Jailbreaking
00:03:08 Universal Jailbreaks: Skeleton Keys to AI Models
00:04:24 The Cat-and-Mouse Game: Attackers vs Defenders
00:05:42 Security Theater vs Real Safety: The Fundamental Disconnect
00:08:51 Inside the Libertas Repo: Prompt Engineering as Art
00:16:22 The Anthropic Challenge Drama: UI Bugs and Open Source Data
00:23:30 From Jailbreaks to Weaponization: AI-Orchestrated Attacks
00:26:55 The BT6 Hacker Collective and BASI Community
00:34:46 AI Red Teaming: Full Stack Security Beyond the Model
00:38:06 Safety vs Security: Meat Space Solutions and Final Thoughts -
Glean started as a Kleiner Perkins incubation and is now a $7B, $200m ARR Enterprise AI leader. Now KP has tapped its own podcaster to lead its next big swing.
From building go-to-market the hard way in startups (and scaling Palo Alto Networks’ public cloud business) to joining Kleiner Perkins to help technical founders turn product edge into repeatable revenue, Joubin Mirzadegan has spent the last decade obsessing over one thing: distribution and how ideas actually spread, sell, and compound. That obsession took him from launching the CRO-only podcast Grit (https://www.youtube.com/playlist?list=PLRiWZFltuYPF8A6UGm74K2q29UwU-Kk9k) as a hiring wedge, to working alongside breakout companies like Glean and Windsurf, to now incubating Roadrunner, an AI-native rethink of CPQ and quoting workflows as pricing models collapse from “seats” into consumption, bundles, renewals, and SKU sprawl.
We sat down with Joubin to dig into the real mechanics of making conversations feel human (rolling early, never sending questions, temperature + lighting hacks), what Windsurf got right about “Google-class product and Salesforce-class distribution,” how to hire early sales leaders without getting fooled by shiny logos, why CPQ is quietly breaking the back of modern revenue teams, and his thesis for his new company and KP incubation Roadrunner (https://www.roadrunner.ai/): rebuild the data model from the ground up, co-develop with the hairiest design partners, and eventually use LLMs to recommend deal structures the way the best reps do without the Slack-channel chaos of deal desk.
We discuss:
How to make guests instantly comfortable: rolling early, no “are you ready?”, temperature, lighting, and room dynamics
Why Joubin refuses to send questions in advance (and when you might have to anyway)
The origin of the CRO-only podcast: using media as a hiring wedge and relationship engine
The “commit to 100 episodes” mindset: why most shows die before they find their voice
Founder vs exec interviews: why CEOs can speak more freely (and what it unlocks in conversation)
What Glean taught him about enterprise AI: permissions, trust, and overcoming “category is dead” skepticism
Design partners as the real unlock: why early believers matter and how co-development actually works
Windsurf’s breakout: what it means to be serious about “Google-class product + Salesforce-class distribution”
Why technical founders struggle with GTM and how KP built a team around sales, customer access, and demand gen
Hiring early sales leaders: anti-patterns (logos), what to screen for (motivation), and why stage-fit is everything
The CPQ problem & Roadrunner’s thesis: rebuilding CPQ/quoting from the data model up for modern complexity
How “rules + SKUs + approvals” create a brittle graph and what it takes to model it without tipping over
The two-year window: incumbents rebuilding slowly vs startups out-sprinting with AI-native architecture
Where AI actually helps: quote generation, policy enforcement, approval routing, and deal recommendation loops
—
Joubin
X: https://x.com/Joubinmir
LinkedIn: https://www.linkedin.com/in/joubin-mirzadegan-66186854/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction and the Zuck Interview Experience
00:03:26 The Genesis of the Grit Podcast: Hiring CROs Through Content
00:13:20 Podcast Philosophy: Creating Authentic Conversations
00:15:44 Working with Arvind at Glean: The Enterprise Search Breakthrough
00:26:20 Windsurf's Sales Machine: Google-Class Product Meets Salesforce-Class Distribution
00:30:28 Hiring Sales Leaders: Anti-Patterns and First Principles
00:39:02 The CPQ Problem: Why Salesforce and Legacy Tools Are Breaking
00:43:40 Introducing Roadrunner: Solving Enterprise Pricing with AI
00:49:19 Building Roadrunner: Team, Design Partners, and Data Model Challenges
00:59:35 High Performance Philosophy: Working Out Every Day and Reducing Friction
01:06:28 Defining Grit: Passion Plus Perseverance -
From applied cryptography and offensive security in France’s defense industry to optimizing nuclear submarine workflows, then selling his e-signature startup to Docusign (https://www.docusign.com/company/news-center/opentrust-joins-docusign-global-trust-network) and now running AI as CTO of Superhuman Mail (Superhuman, recently acquired by Grammarly https://techcrunch.com/2025/07/01/grammarly-acquires-ai-email-client-superhuman/), Loïc Houssier has lived the full arc from deep infra and compliance hell to obsessing over 100ms product experiences and AI-native email. We sat down with Loïc to dig into how you actually put AI into an inbox without adding latency, why Superhuman leans so hard into agentic search and “Ask AI” over your entire email history, how they design tools vs. agents and fight agent laziness, what box-priced inference and local-first caching mean for cost and reliability, and his bet that your inbox will power your future AI EA while AI massively widens the gap between engineers with real fundamentals and those faking it.
We discuss:
Loïc’s path from applied cryptography and offensive security in France’s defense industry to submarines, e-signatures, Docusign, and now Superhuman Mail
What 3,000+ engineers actually do at a “simple” product like Docusign: regional compliance, on-prem appliances, and why global scale explodes complexity
How Superhuman thinks about AI in email: auto-labels, smart summaries, follow-up nudges, “Ask AI” search, and the rule that AI must never add latency or friction
Superhuman’s agentic framework: tools vs. agents, fighting “agent laziness,” deep semantic search over huge inboxes, and pagination strategies to find the real needle in the haystack
How they evaluate OpenAI, Anthropic, Gemini, and open models: canonical queries, end-to-end evals, date reasoning, and Rahul’s infamous “what wood was my table?” test
Infra and cost philosophy: local-first caching, vector search backends, Baseten “box” pricing vs. per-token pricing, and thinking in price-per-trillion-tokens instead of price-per-million
The vision of Superhuman as your AI EA: auto-drafting replies in your voice, scheduling on your behalf, and using your inbox as the ultimate private data source
How the Grammarly + Coda + Superhuman stack could power truly context-aware assistance across email, docs, calendars, contracts, and more
Inside Superhuman’s AI-dev culture: free-for-all tool adoption, tracking AI usage on PRs, and going from ~4 to ~6 PRs per engineer per week
Why Loïc believes everyone should still learn to code, and how AI will amplify great engineers with strong fundamentals while exposing shallow ones even faster
—
Loïc Houssier
LinkedIn: https://www.linkedin.com/in/houssier/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction and Loïc's Journey from Nuclear Submarines to Superhuman
00:06:40 Docusign Acquisition and the Enterprise Email Stack
00:10:26 Superhuman's AI Vision: Your Inbox as the Real AI Agent
00:13:20 Ask AI: Agentic Search and the Quality Problem
00:18:20 Infrastructure Choices: Model Selection, Baseten, and Cost Management
00:27:30 Local-First Architecture and the Database Stack
00:30:50 Evals, Quality, and the Rahul Wood Table Test
00:42:30 The Future EA: Auto-Drafting and Proactive Assistance
00:46:40 Grammarly Acquisition and the Contextual Advantage
00:38:40 Voice, Video, and the End of Writing
00:51:40 Knowledge Graphs: The Hard Problem Nobody Has Solved
00:56:40 Competing with OpenAI and the Browser Question
01:02:30 AI Coding Tools: From 4 to 6 PRs Per Week
01:08:00 Engineering Culture, Hiring, and the Future of Software Development -
From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-500-million-startup-videogame-data) and raising a $134M seed from Khosla (https://techcrunch.com/2025/10/16/general-intuition-lands-134m-seed-to-teach-agents-spatial-reasoning-using-video-game-clips/) to spin out General Intuition, Pim is betting that world models trained on peak human gameplay are the next frontier after LLMs.
We sat down with Pim to dig into why game highlights are “episodic memory for simulation” (and how Medal’s privacy-first action labels became a world-model goldmine https://medal.tv/blog/posts/enabling-state-of-the-art-security-and-protections-on-medals-new-apm-and-controller-overlay-features), what it takes to build fully vision-based agents that just see frames and output actions in real time, how General Intuition transfers from games to real-world video and then into robotics, why world models and LLMs are complementary rather than rivals, what founders with proprietary datasets should know before selling or licensing to labs, and his bet that spatial-temporal foundation models will power 80% of future atoms-to-atoms interactions in both simulation and the real world.
We discuss:
How Medal’s 3.8B action-labeled highlight clips became a privacy-preserving goldmine for world models
Building fully vision-based agents that only see frames and output actions yet play like (and sometimes better than) humans
Transferring from arcade-style games to realistic games to real-world video using the same perception–action recipe
Why world models need actions, memory, and partial observability (smoke, occlusion, camera shake) vs. “just” pretty video generation
Distilling giant policies into tiny real-time models that still navigate, hide, and peek corners like real players
Pim’s path from RuneScape private servers, Tourette’s, and reverse engineering to leading a frontier world-model lab
How data-rich founders should think about valuing their datasets, negotiating with big labs, and deciding when to go independent
GI’s first customers: replacing brittle behavior trees in games, engines, and controller-based robots with a “frames in, actions out” API
Using Medal clips as “episodic memory of simulation” to move from imitation learning to RL via world models and negative events
The 2030 vision: spatial–temporal foundation models that power the majority of atoms-to-atoms interactions in simulation and the real world
—
Pim
X: https://x.com/PimDeWitte
LinkedIn: https://www.linkedin.com/in/pimdw/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction and Medal's Gaming Data Advantage
00:02:08 Exclusive Demo: Vision-Based Gaming Agents
00:06:17 Action Prediction and Real-World Video Transfer
00:08:41 World Models: Interactive Video Generation
00:13:42 From Runescape to AI: Pim's Founder Journey
00:16:45 The Research Foundations: DIAMOND, Genie, and SIMA
00:33:03 Vinod Khosla's Largest Seed Bet Since OpenAI
00:35:04 Data Moats and Why GI Stayed Independent
00:38:42 Self-Teaching AI Fundamentals: The Francois Fleuret Course
00:40:28 Defining World Models vs Video Generation
00:41:52 Why Simulation Complexity Favors World Models
00:43:30 World Labs, Yann LeCun, and the Spatial Intelligence Race
00:50:08 Business Model: APIs, Agents, and Game Developer Partnerships
00:58:57 From Imitation Learning to RL: Making Clips Playable
01:00:15 Open Research, Academic Partnerships, and Hiring
01:02:09 2030 Vision: 80 Percent of Atoms-to-Atoms AI Interactions -
Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D.
We discuss:
The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone.
What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets.
Fei-Fei’s essay (https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence) on spatial intelligence as a distinct form of intelligence from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in.
Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning.
The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem.
Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters.
Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots.
How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn’t to throw away LLMs but to complement them with rich, embodied models of the world.
Fei-Fei and Justin’s long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making.
—
Fei-Fei Li
X: https://x.com/drfeifei
LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247
Justin Johnson
X: https://x.com/jcjohnss
LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership
00:02:00 From ImageNet to World Models: The Evolution of Computer Vision
00:12:42 Dense Captioning and Early Vision-Language Work
00:19:57 Spatial Intelligence: Beyond Language Models
00:28:46 Introducing Marble: World Labs' First Spatial Intelligence Model
00:33:21 Gaussian Splats and the Technical Architecture of Marble
00:22:10 Physics, Dynamics, and the Future of World Models
00:41:09 Multimodality and the Interplay of Language and Space
00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI
00:56:58 Hiring, Research Directions, and the Future of World Labs -
Alex Lieberman and Arman Hezarkani, co-founders of Tenex, reveal how they're revolutionizing software consulting by compensating AI engineers for output rather than hours—enabling some engineers to earn over $1 million annually while delivering 10x productivity gains. Their company represents a fundamental rethinking of knowledge work compensation in the age of AI agents, where traditional hourly billing models perversely incentivize slower work even as AI tools enable unprecedented speed.
The Genesis: From 90% Downsizing to 10x Output
The story behind 10X begins with Arman's previous company, Parthian, where he was forced to downsize his engineering team by 90%. Rather than collapse, Arman re-architected the entire product and engineering process to be AI-first, and discovered that production-ready software output increased 10x despite the massive headcount reduction. This counterintuitive result exposed a fundamental misalignment: engineers compensated by the hour are disincentivized from leveraging AI to work faster, even when the technology enables dramatic productivity gains. Alex, who had invested in Parthian, initially didn't believe the numbers until Arman walked him through why LLMs have made such a profound impact specifically on engineering as knowledge work.
The Economic Model: Story Points Over Hours
10X's core innovation is compensating engineers based on story points (units of completed, quality output) rather than hours worked. This creates direct economic incentives for engineers to adopt every new AI tool, optimize their workflows, and maximize throughput. The company expects multiple engineers to earn over $1 million in cash compensation next year purely from story point earnings. To prevent gaming the system, they hire for two profiles: engineers who are "long-term selfish" (understanding that inflating story points will destroy client relationships) and those who genuinely love writing code and working with smart people. They also employ technical strategists incentivized on client retention (NRR) who serve as the final quality gate before any engineering plan reaches a client.
Impressive Builds: From Retail AI to App Store Hits
The results speak for themselves. In one project, 10X built a computer vision system for retail cameras that provides heat maps, queue detection, shelf stocking analysis, and theft detection, creating early prototypes in just two weeks for work that previously took quarters. In one month, they built Snapback Sports' mobile trivia app, which hit 20th globally on the App Store. In a sales context, an engineer spent four hours building a working prototype of a fitness influencer's AI health coach app after the prospect initially said no, immediately moving 10X to the top of their vendor list. These examples demonstrate how AI-enabled speed fundamentally changes sales motions and product development timelines.
The Interview Process: Unreasonably Difficult Take-Homes
Despite concerns that AI would make take-home assessments obsolete, 10X still uses them, but makes them "unreasonably difficult." About 50% of candidates don't even respond, but those who complete the challenge demonstrate the caliber needed. The interview process is remarkably short: two calls before the take-home, review, then one or two final meetings, completable in as little as a week. A signature question: "If you had infinite resources to build an AI that could replace either of us on this call, what would be the first major bottleneck?" The sophisticated answer isn't just "model intelligence" or "context length"; it's controlling entropy, the accumulating error rate that derails autonomous agents over time.
The Limiting Factor: Human Capital, Not Technology
Despite being an AI-first company, 10X's primary constraint is human capital: finding and hiring enough exceptional engineers fast enough, then matching them with the right processes to maintain delivery quality as they scale. The company has ambitions beyond consulting to build their own technology, but for the foreseeable future, recruiting remains the bottleneck. This reveals an important insight about the AI era: even as technology enables unprecedented leverage, the constraint shifts to finding people who can harness that leverage effectively.
Chapters
00:00:00 Introduction and Meeting the 10X Co-founders
00:01:29 The 10X Moment: From Hourly Billing to Output-Based Compensation
00:04:44 The Economic Model Behind 10X
00:05:42 Story Points and Measuring Engineering Output
00:08:41 Impressive Client Projects and Rapid Prototyping
00:12:22 The 10X Tech Stack: TypeScript and High Structure
00:13:21 AI Coding Tools: The Daily Evolution
00:15:05 Human Capital as the Limiting Factor
00:16:02 The Unreasonably Difficult Interview Process
00:17:14 Entropy and Context Engineering: The Future of AI Agents
00:23:28 The MCP Debate and AI Industry Sociology
00:26:01 Consulting, Digital Transformation, and Conference Insights
-
Deedy Das, Partner at Menlo Ventures, returns to Latent Space to discuss his journey from Glean to venture capital, the explosive rise of Anthropic, and how AI is reshaping enterprise software and coding. From investing in Anthropic early on when they had no revenue to managing the $100M Anthology Fund, Das shares insider perspectives on the fastest-growing software company in history and what's next for AI infrastructure, research investing, and the future of engineering.
We cover Glean’s rise from “boring” enterprise search to a $7B AI-native company, Anthropic's meteoric rise, the strategic decisions behind products like Claude Code, and why market share in enterprise AI is shifting dramatically. Das explains his investment thesis on research companies like Goodfire, Prime Intellect, and OpenRouter and how the Anthology Fund is quietly seeding the next wave of AI infra, research, and devtools.
Chapters
00:00:00 Introduction and Deedy's Return to Latent Space
00:01:20 Glean's Journey: From Boring Enterprise Search to $7B Valuation
00:15:37 Anthropic's Meteoric Rise and Market Share Dynamics
00:17:50 Claude Artifacts and Product Innovation
00:41:20 The Anthology Fund: Investing in the Anthropic Ecosystem
00:48:01 Goodfire and Mechanistic Interpretability
00:51:25 Prime Intellect and Distributed AI Training
00:53:40 OpenRouter: Building the AI Model Gateway
01:13:36 The Stargate Project and Infrastructure Arms Race
01:18:14 The Future of Software Engineering and AI Coding
-
Jared Palmer, SVP at GitHub and VP of CoreAI at Microsoft, joins Latent Space for an in-depth look at the evolution of coding agents and modern developer tools. Having recently joined after leading AI initiatives at Vercel, Palmer shares firsthand insights from behind the scenes at GitHub Universe, including the launch of Agent HQ, a new collaboration hub for coding agents and developers.
This episode traces Palmer’s journey from building Copilot-inspired tools to pioneering v0, the focused Next.js coding agent, and explores how platform constraints fostered rapid experimentation and a breakout success in AI-powered frontend development. Palmer explains the unique advantages of GitHub’s massive developer network, the challenges of scaling agent-based workflows, and why seamlessly integrating AI into developer experiences is now a top priority for both Microsoft and GitHub.
-
Jed Borovik, Product Lead at Google Labs, joins Latent Space to unpack how Google is building the future of AI-powered software development with Jules. From his journey discovering GenAI through Stable Diffusion to leading one of the most ambitious coding agent projects in tech, Borovik shares behind-the-scenes insights into how Google Labs operates at the intersection of DeepMind's model development and product innovation.
We explore Jules' approach to autonomous coding agents and why they run on their own infrastructure, how Google simplified their agent scaffolding as models improved, and why embeddings-based RAG is giving way to attention-based search. Borovik reveals how developers are using Jules for hours or even days at a time, the challenges of managing context windows that push 2 million tokens, and why coding agents represent both the most important AI application and the clearest path to AGI.
This conversation reveals Google's positioning in the coding agent race, the evolution from internal tools to public products, and what founders, developers, and AI engineers should understand about building for a future where AI becomes the new brush for software engineering.
Chapters
00:00:00 Introduction and GitHub Universe Recap
00:00:57 New York Tech Scene and East Coast Hackathons
00:02:19 From Google Search to AI Coding: Jed's Journey
00:04:19 Google Labs Mission and DeepMind Collaboration
00:06:41 Jules: Autonomous Coding Agents Explained
00:09:39 The Evolution of Agent Scaffolding and Model Quality
00:11:30 RAG vs Attention: The Shift in Code Understanding
00:13:49 Jules' Journey from Preview to Production
00:15:05 AI Engineer Summit: Community Building and Networking
00:25:06 Context Management in Long-Running Agents
00:29:02 The Future of Software Engineering with AI
00:36:26 Beyond Vibe Coding: Spec Development and Verification
00:40:20 Multimodal Input and Computer Use for Coding Agents
-
Today’s guests are Priscilla Chan and Mark Zuckerberg, co-founders of Biohub (fka Chan Zuckerberg Initiative). Biohub is one of the leading institutes for AI x Bio and open science research, with projects like CELLxGENE, rbio1, VariantFormer, and many more. We talked about its evolution from a broad philanthropic institute to specializing in frontier AI + bio, why they are building 12-foot-tall microscopes to gather better data, and how building a virtual cell model and a virtual immune system could potentially help us cure all diseases.
Chapters
00:00:00 Introduction and CZI's 10-Year Anniversary
00:00:56 Learning from Bill Gates
00:04:05 Science vs Translation
00:10:45 The Power of Physical Proximity in Science
00:13:55 Building the Virtual Cell: From Data to Models
00:15:51 Microscopes, Imaging, and Converting Atoms to Bits
00:23:18 AI Meets Biology: The Frontier Lab Concept
00:27:25 How Models Can Enable More Ambitious Research
00:30:15 Precision Medicine and Clinical Impact
00:45:17 The Virtual Immune System and Cellular Engineering
00:48:27 Accelerating the Timeline: What It Takes to Cure All Disease
00:28:45 Joining Forces with Evolutionary Scale