Episodes

  • Nick Larus-Stone is the Head of AI at Benchling, the R&D data platform that life science companies use to store and manage their experiments, samples, instruments, and analysis. Benchling has been around since 2012. In October 2025, it launched Benchling AI, an intelligence layer with a chat interface, backed by an agent, that helps scientists find data, design experiments, and write reports. Nick came to Benchling through its acquisition of Sphinx Bio, the analysis startup he founded. In this conversation, Nick walks through what it takes to build agents for scientific work, and where the playbook from coding agents holds up and where it breaks down.

    We also discuss:

    Why Benchling invests so heavily in getting clean data upfront
    How they cross-check answers between models to get more out of each one
    Why and how Benchling leans on production traces
    Where AI actually helps science today, and where it still gets stuck
    Why understanding LLMs is closer to biology than software engineering

    Timestamps:

    00:00 Intro

    01:22 What Benchling AI is, and the 14-year data platform underneath it

    04:36 Why a decade of structured data is a core advantage

    05:57 The architecture under the hood

    08:28 Similarities and differences compared to a coding harness

    11:14 Benchling’s multi-agent architectures

    14:36 Dealing with verifiable vs non-verifiable tasks

    16:19 Doing evals when clean benchmarks aren’t possible

    18:13 Context engineering: SQL vs. file-based harnesses

    22:11 Memory: agents that create and update their own skills

    25:30 What user education for scientists looks like

    30:33 Why understanding LLMs is closer to biology than software

    33:28 When will agents discover a novel cure for disease?

    44:58 The future of harnesses in science

    48:13 Why fine-tuning on biology hasn't beaten frontier models

    References:

    Agent Skills (Claude Docs), Benchling’s Deep Research Agent, Claude (Anthropic), Design of experiments (DOE), FDA Investigational New Drug (IND) application, Gemini (Google), Google AI co-scientist, LangSmith, Model Context Protocol (MCP), The Ralph (Wiggum) Loop (Geoffrey Huntley), Sphinx Bio

    Where to find Nick:

    Benchling, LinkedIn, Twitter/X

    Where to find Harrison:

    LinkedIn, Twitter/X

    Where to find LangChain:

    Website, Docs

    Send feedback or questions to [email protected]

  • Geng Sng is co-founder and CTO of Cogent, which builds autonomous agents that remediate vulnerabilities for enterprise security teams. Today, Cogent's agents process billions of security events per day, maintaining a live context graph of every asset and vulnerability across customer environments. In this conversation, Geng walks through Cogent's hot vs cold context split, the sub-agents that handle side quests, and the two graphs they run in parallel.

    We also discuss:

    Why defensive security is harder for AI than offensive
    Under the hood of Cogent's three agents
    Inside Cogent's “read only” by-default sandboxes
    Why graph databases don't scale for security data
    Cogent Research and the move into formal verification
    Why interactive agents need a deeper planning phase to one-shot

    Referenced:

    Abnormal AI, Amazon S3, Anthropic, Bash, ChatGPT, Claude Code, Claude Mythos, CodeMender, Codex, Cogent, Cursor, Google DeepMind, GPT-5.5-Cyber, Jupyter, Letta, Mozilla, OpenAI, Opus 4.6, Opus 4.7, Vercel

    Where to find Geng:

    LinkedIn

    Where to find Harrison:

    LinkedIn, Twitter/X

    Where to find LangChain:

    Website, Docs

    Send feedback or questions to [email protected]

    Timestamps:

    00:00 Why mean time to exploit collapsed from years to minutes

    02:08 Inside Cogent's Agent Lake architecture

    05:11 Why Cogent rejected graph databases

    10:48 The trust ladder before agents touch production

    15:13 The three types of agents inside Cogent

    17:07 How Cogent sandboxes its agents

    19:16 Short-circuiting interactive agents with a deeper planning phase

    24:31 What to do when users believe agents too much

    31:21 Why sub-agents let agents go on side quests

    34:59 Two-tiered evals and the metric that catches bad prompts

    40:00 Cogent’s unique approach to context

    48:39 Cogent Research and the move into formal verification

    51:33 The single trait Cogent hires for

    54:00 Open-sourcing models within six months

    57:07 Why defensive security won’t be commoditized anytime soon

    1:00:51 The founding insight behind Cogent

  • Alexander Shevchenko is the head of applied research at Ramp, where he leads Ramp Labs – the team behind Ramp Sheets and a steady stream of public AI engineering experiments. Ramp Sheets started as an internal process mining tool that turned Loom videos of accountants into Markov diagrams, before evolving into the agentic spreadsheet editor that shipped in November. In this conversation, Alex walks through the architecture under the hood, why Ramp biases the agent toward Excel formulas over Python code gen, and two recent Labs experiments: Latent Briefing and a user-steerable revival of Golden Gate Claude.

    We also discuss:

    Under the hood of Ramp Sheets
    Inspect, Ramp's internal coding agent, and the self-improving monitor loop it powers
    Why finance professionals rejected code gen as too "black box"
    Why Anthropic models tend to excel at agentic spreadsheet manipulation
    The case for putting the agent outside the sandbox, not inside it
    The Loom-to-Markov-diagram process mining pipeline
    RLMs and how subagents can share memory in latent space
    Latent Briefing and KV-cache communication between subagents
    Reviving Golden Gate Claude with steering vectors on Gemma

    Referenced:

    Alex Levinson, Anthropic, Ben Geist, Claude, Efficient Memory Sharing for Multi-Agent Systems via KV Cache Compaction (Ben Geist), Gemma, Golden Gate Claude, Graphviz, Inspect, Latent Briefing, Loom, Modal, OpenAI, Opus, Qwen, Ramp, Ramp Labs, Ramp Sheets, Recursive Language Models (Alex Zhang), Retool, Self-maintaining Ramp Sheets, Steer AI

    Where to find Alex:

    LinkedIn, Twitter/X, Website

    Where to find Harrison:

    LinkedIn, Twitter/X

    Where to find LangChain:

    Website, Docs

    Send feedback or questions to [email protected]

    Timestamps:

    00:00 Introduction

    01:13 The origin of Ramp Sheets

    02:27 The Loom-to-Markov-diagram process mining pipeline

    04:28 Why code gen approaches felt too "black box" to finance

    06:13 Meeting finance where they already are: inside the spreadsheet

    09:08 How far process mining got them

    10:31 Text descriptions and Graphviz DAGs as output

    12:41 Under the hood of Ramp Sheets

    14:52 Why the agent uses Python only as an escape hatch

    15:47 Why Anthropic models excel at agentic spreadsheet manipulation

    17:12 Frankensteining the OpenAI Agents SDK

    17:43 The Ramp Sheets UX and fast vs. expert mode

    19:58 Agent in a sandbox vs. agent with a sandbox

    21:55 Vibe evals with expert humans

    23:40 Inspect, the internal coding agent

    24:13 The self-monitoring loop and auto-PRs

    28:01 Other wacky experiments on Sheets

    28:43 Memory experiments that didn't pan out

    31:16 Latent Briefing and KV-cache subagent communication

    35:13 Reviving Golden Gate Claude

    37:47 Contrastive pairs and steering vectors

    39:47 Picking the right layers in Gemma

    41:37 What Ramp Labs looks for when hiring

  • Florian Juengermann is the co-founder and CTO of Listen, an AI startup that turns qualitative research across hundreds of interviews, surveys, and focus groups into structured, traceable insights. Listen's agents analyze responses at scale, and Florian has rearchitected the system multiple times to get there. In this conversation, he walks through the virtual table architecture at the core of their Research Agent, how small models run map-reduce classification across thousands of open-ended responses, and the self-reviewing feedback subagent that catches errors during long async runs.

    We also discuss:

    The three agents inside Listen's platform
    How Listen rearchitected from a simple RAG bot to a multi-agent system multiple times
    Why the PowerPoint subagent was completely rebuilt using the Claude Code SDK
    Contextual prompt engineering as an alternative to skills
    How Listen keeps report numbers live as new interview responses come in
    When to trigger the long-running agent vs. showing early results
    What Florian looks for when hiring agent engineers

    References:

    Anthropic, ChatGPT, Claude, Claude Code SDK, E2B, Emotional Intelligence, GPT Mini, Haiku, Listen, OpenAI, Pandas, Postgres, Python, Research Agent, Render, Zoom

    Where to find Florian:

    LinkedIn, Twitter/X

    Where to find Harrison:

    LinkedIn, Twitter/X

    Where to find LangChain:

    Website, Docs

    Send feedback or questions to [email protected]

    Timestamps:

    00:00 Introduction

    01:25 The three agents inside Listen's platform

    03:15 Live chat vs. long async runs, and how Listen tunes for each

    05:33 Under the hood of the Research Agent

    06:37 Listen's virtual table architecture

    07:34 How small models classify thousands of open-ended responses

    10:05 Running code in a sandbox: how E2B fits in

    11:52 Why Listen rebuilt the PowerPoint subagent from scratch

    14:11 Contextual prompt engineering instead of skills

    16:32 The feedback subagent that reviews its own reports

    18:14 How Listen runs evals in production

    19:47 Unexpected ways users push the agent to its limits

    21:42 How many times Listen has rearchitected, and why

    24:59 Trace observability: depth over breadth

    26:10 Lessons from running Claude Code SDK inside E2B

    27:42 Memory: what's solved and what isn't

    29:10 The Composer agent UX: co-editing a document with AI

    35:50 How Listen keeps report numbers live as new responses come in

    43:47 What Listen looks for when hiring agent engineers

  • Izzy Miller is an AI engineer at Hex, an AI analytics platform that was one of the first companies to ship data agents to real paying users. Today, Hex runs a multi-agent system with nearly 100K tokens of tools, and Izzy is building a 90-day simulation to evaluate whether those agents actually get smarter over time. In this conversation, he walks through the harness decisions that shaped their architecture, the failure modes Hex is seeing at scale, and what it takes to build an eval that no current model can pass.

    We also discuss:

    Why data agents are harder to verify than coding agents
    Under the hood of Hex’s agents
    How Hex is unifying separate agents
    Why most eval sets are bad
    The 90-day simulation for long-horizon evals
    How Izzy went from marketing to AI engineer

    References:

    Andon Labs, Anthropic, Barry McCardel, ChatGPT, Claude Code, Claude Sonnet 4.6, DBT, GPT-3.5 Turbo, GPT-5.3 Codex Spark, GPT-5.4, Hex, LangChain, LangSmith, Looker, OpenAI, Opus 4.6, Satya Nadella, Snowflake, Vending Machine

    Where to find Izzy:

    LinkedIn, Twitter/X

    Where to find Harrison:

    LinkedIn, Twitter/X

    Where to find LangChain:

    Website, Docs

    Send feedback or questions to [email protected]

    Timestamps:

    01:35 Where Hex's notebook agent started

    03:46 The moment Hex knew it was time for agents

    07:36 Why data agents are harder to verify than coding agents

    09:30 How Hex is unifying separate agents

    13:28 Under the hood of the notebook agent

    15:41 The harness features that are now holding the agent back

    17:41 Why Hex built their own orchestrator

    18:59 Managing nearly 100K tokens of tools

    20:49 Ephemeral queries and agent behavior trade-offs

    24:46 The UX problem with showing agents' thinking

    27:28 Why verification is harder than transparency for data agents

    31:00 Memory, context conflicts, and collapse modes

    34:38 How Hex built their internal eval system

    39:29 Why most eval sets are bad

    44:30 The 900% quota eval that every model fails

    46:55 Model upgrades and the "in distribution" debate

    51:34 How Izzy went from marketer to AI engineer

    59:59 The 90-day simulation for long-horizon evals

  • Welcome to Max Agency, the podcast that goes deep into how the best agents are being built by builders like you. I'm Harrison Chase, CEO of LangChain, the agent engineering company, and I'll be your host.