ThursdAI - The top AI news from the past week – Lyssna här

Avsnitt

📆 ThursdAI - May 29 - DeepSeek R1 Resurfaces, VEO3 viral moments, Opus 4 a week after, Flux Kontext image editing & more AI news
29 maj· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
Welcome back to another absolutely wild week in AI! I'm coming to you live from the Fontainebleau Hotel in Vegas at the Imagine AI conference, and wow, what a perfect setting to discuss how AI is literally reimagining our world. After last week's absolute explosion of releases (Claude Opus 4, Google I/O madness, OpenAI Codex and Jony colab), this week gave us a chance to breathe... sort of. Because even in a "quiet" week, we still got a new DeepSeek model that's pushing boundaries, and the entire internet discovered that we might all just be prompts. Yeah, it's been that kind of week!
Before we dive in, quick shoutout to everyone who joined us live - we had some technical hiccups with the Twitter Spaces audio (sorry about that!), but the YouTube stream was fire. And speaking of fire, we had two incredible guests join us: Charlie Holtz from Chorus (the multi-model chat app that's changing how we interact with AI) and Linus Eckenstam, who's been traveling the AI conference circuit and bringing us insights from the frontlines of the generative AI revolution.
Open Source AI & LLMs: DeepSeek Whales & Mind-Bending Papers
DeepSeek dropped R1-0528 out of nowhere, an update to their reasoning beast with some serious jumps in performance. We’re talking AIME at 91 (beating previous scores by a mile), LiveCodeBench at 73, and SWE verified at 57.6. It’s edging closer to heavyweights like o3, and folks on X are already calling it “clearer thinking.” There was hype it might’ve been R2, but the impact didn’t quite crash the stock exchange like past releases. Still, it’s likely among the best open-weight models out there.
So what's new? Early reports and some of my own poking around suggest this model "thinks clearer now." Nisten mentioned that while previous DeepSeek models sometimes liked to "vibe around" and explore the latent space before settling on an answer, this one feels a bit more direct.
And here’s the kicker—they also released an 8B distilled version based on Qwen3, runnable on your laptop. Yam called it potentially the best 8B model to date, and you can try it on Ollama right now. No need for a monster rig!
The Mind-Bending "Learning to Reason Without External Rewards" Paper
Okay, this paper result broke my brain, and apparently everyone else's too. This paper shows that models can improve through reinforcement learning with its own intuition of whether or not it's correct. 😮
It's like the placebo effect for AI! The researchers trained models without telling them what was good or bad, but rather, utilized a new framework called Intuitor, where the reward was based on how the "self certainty".
The thing that took my whole timeline by storm is, it works! GRPO (Group Policy Optimization) - the framework that DeepSeek gave to the world with R1 is based on external rewards (human optimize) and Intuitor seems to be mathcing or even exceeding some of GRPO results when Qwen2.5 3B was used to finetune. Incredible incredible stuff
Big Companies LLMs & APIs
Claude Opus 4: A Week Later – The Dev Darling?
Claude Opus 4, whose launch we celebrated live on the show, has had a week to make its mark. Charlie Holtz, who's building Chorus (more on that amazing app in a bit!), shared that while it's sometimes "astrology" to judge the vibes of a new model, Opus 4 feels like a step change, especially in coding. He mentioned that Claude Code, powered by Opus 4 (and Sonnet 4 for implementation), is now tackling GitHub issues that were too complex just weeks ago. He even had a coworker who "vibe coded three websites in a weekend" with it – that's a tangible productivity boost!
Linus Eckenstam highlighted how Lovable.dev saw their syntax error rates plummet by nearly 50% after integrating Claude 4. That’s quantifiable proof of improvement! It's clear Anthropic is leaning heavily into the developer/coding space. Claude Opus is now #1 on the LMArena WebDev arena, further cementing its reputation.
I had my own magical moment with Opus 4 this week. I was working on an MCP observability talk for the AI Engineer conference and trying to integrate Weave (our observability and evals framework at Weights & Biases) into a project. Using Windsurf's Cascade agent (which now lets you bring your own Opus 4 key, by the way – good move, Windsurf!), Opus 4 not only tried to implement Weave into my agent but, when it got stuck, it figured out it had access to the Weights & Biases support bot via our MCP tool. It then formulated a question to the support bot (which is also AI-powered!), got an answer, and used that to fix the implementation. It then went back and checked if the Weave trace appeared in the dashboard! Agents talking to agents to solve a problem, all while I just watched – my jaw was on the floor. Absolutely mind-blowing.
Quick Hits: Voice Updates from OpenAI & Anthropic
OpenAI’s Advanced Voice Mode finally sings—yes, I’ve been waiting for this! It can belt out tunes like Mariah Carey, which is just fun. Anthropic also rolled out voice mode on mobile, keeping up in the conversational race. Both are cool steps, but I’m more hyped for what’s next in voice AI—stay tuned below (OpenAI X, Anthropic X).
🐝 This Week's Buzz: Weights & Biases Updates!
Alright, time for a quick update from the world of Weights & Biases!
* Fully Connected is Coming! Our flagship 2-day conference, Fully Connected, is happening on June 18th and 19th in San Francisco. It's going to be packed with amazing speakers and insights into the world of AI development. You can still grab tickets, and as a ThursdAI listener, use the promo code WBTHURSAI for a 100% off ticket! I hustled to get yall this discount! (Register here)
* AI Engineer World's Fair Next Week! I'm super excited for the AI Engineer conference in San Francisco next week. Yam Peleg and I will be there, and we're planning another live ThursdAI show from the event! If you want to join the livestream or snag a last-minute ticket, use the coupon code THANKSTHURSDAI for 30% off (Get it HERE)
Vision & Video: Reality is Optional Now
VEO3 and the Prompt Theory Phenomenon
Google's VEO3 has completely taken over TikTok with the "Prompt Theory" videos. If you haven't seen these yet, stop reading and watch ☝️. The concept is brilliant - AI-generated characters discussing whether they're "made of prompts," creating this meta-commentary on consciousness and reality.
The technical achievement here is staggering. We're not just talking about good visuals - VEO3 nails temporal consistency, character emotions, situational awareness (characters look at whoever's speaking), perfect lip sync, and contextually appropriate sound effects.
Linus made a profound point - if not for the audio, VEO3 might not have been as explosive. The combination of visuals AND audio together is what's making people question reality. We're seeing people post actual human videos claiming they're AI-generated because the uncanny valley has been crossed so thoroughly.
Odyssey's Interactive Worlds: The Holodeck Prototype
Odyssey dropped their interactive video demo, and folks... we're literally walking through AI-generated worlds in real-time. This isn't a game engine rendering 3D models - this is a world model generating each frame as you move through it with WASD controls.
Yes, it's blurry. Yes, I got stuck in a doorway. But remember Will Smith eating spaghetti from two years ago? The pace of progress is absolutely insane. As Linus pointed out, we're at the "GAN era" of world models. Combine VEO3's quality with Odyssey's interactivity, and we're looking at completely personalized, infinite entertainment experiences.
The implications that Yam laid out still have me shook - imagine Netflix shows completely customized to you, with your context and preferences, generated on the fly. Not just choosing from a catalog, but creating entirely new content just for you. We're not ready for this, but it's coming fast.
Hunyuan's Open Source Avatar Revolution
While the big companies are keeping their video models closed, Tencent dropped two incredible open source releases: HunyuanPortrait and HunyuanAvatar. These are legitimate competitors to Hedra and HeyGen, but completely open source.
HunyuanPortrait does high-fidelity portrait animation from a single image plus video. HunyuanAvatar goes further with 1 image + audio, and lipsync, body animation, multi-character support, and emotion control.
Wolfram tested these extensively and confirmed they're "state of the art for open source." The portrait model is basically perfect for deepfakes (use responsibly, people!), while the avatar model opens up possibilities for AI assistants with consistent visual presence.
🖼️ AI Art & Diffusion
Black Forest Labs drops Flux Kontext - SOTA image editing!
This came as massive breaking news during the show (thought we didn't catch it live!) - Black Forest Labs, creators of Flux, dropped an incredible Image Editing model called Kontext (really, 3 models, Pro, Max and 12B open source Dev in private preview). The are consistent, context aware text and image editing! Just see the below example
If you used GPT-image to Ghiblify yourself, or VEO, you know that those are not image editing models, your face will look different every generation. These images model keep you consistent, while adding what you wanted. This character consistency is something many folks really want and it's great to see Flux innovating and bringing us SOTA again and are absolutely crushing GPT-image in instruction following, character preservation and style reference!
Maybe the most important thing about this model is the increible speed. While the Ghiblification chatGPT trend took the world by storm, GPT images are SLOW! Check out the speed comparisons on Kontext!
You can play around with these models on the new Flux Playground, but they also already integrated into FAL, FreePik, Replicate, Krea and tons of other services!
🎙️ Voice & Audio: Everyone Gets a Voice
Unmute.sh: Any LLM Can Now Talk
KyutAI (the folks behind Moshi) are back with Unmute.sh - a modular wrapper that adds voice to ANY text LLM. The latency is incredible (under 300ms), and it includes semantic VAD (knowing when you've paused for thought vs. just taking a breath).
What's brilliant about this approach is it preserves all the capabilities of the underlying text model while adding natural voice interaction. No more choosing between smart models and voice-enabled models - now you can have both!
It's going to be open sourced at some point soon, and while awesome, Unmute did have some instability in how the voice sounds! It answered to me with 1 type of voice and then during the same conversation, answered with another, you can give it a tru yourself at unmute.sh
Chatterbox: Open Source Voice Agents for Everyone
Resemble AI open sourced Chatterbox, featuring zero-shot voice cloning from just 5 seconds of audio and unique emotion intensity control. Playing with the demo where they could dial up the emotion from 0.5 to 2.0 on the same text was wild - from calm to absolutely unhinged Samuel L. Jackson energy.
This being a .5B param model is great, The issue I always have, is that with my fairly unique accent, these models sound like a British Alex all the time, and I just don't talk like that!
Though the fact that this runs locally and includes safety features (profanity filters, content classifiers and something called PerTh watermarking) while being completely open source is exactly what the ecosystem needs. We're rapidly approaching a world where anyone can build sophisticated voice agents.👏
Looking Forward: The Convergence is Real
As we wrapped up the show, I couldn't help but reflect on the massive convergence happening across all these modalities. We have LLMs getting better at reasoning (even with random rewards!), video models breaking reality, voice models becoming indistinguishable from humans, and it's all happening simultaneously.
Charlie's comment that "we are the prompts" might have been said in jest, but it touches on something profound. As these models get better at generating realistic worlds, characters, and voices, the line between generated and real continues to blur. The Prompt Theory videos aren't just entertainment - they're a mirror reflecting our anxieties about AI and consciousness.
But here's what keeps me optimistic: the open source community is keeping pace. DeepSeek, Hunyuan, ResembleAI, and others are ensuring that these capabilities don't remain locked behind corporate walls. The democratization of AI continues, even as the capabilities become almost magical.
Next week, I'll be at AI Engineer World's Fair in San Francisco, finally meeting Yam face-to-face and bringing you all the latest from the biggest AI engineering conference of the year. Until then, keep experimenting, keep building, and remember - in this exponential age, today's breakthrough is tomorrow's baseline.
Stay curious, stay building, and I'll see you next ThursdAI! 🚀
Show Notes & TL;DR Links
Show Notes & Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts - @WolframRvnwlf (@WolframRvnwlf), @yampeleg (@yampeleg) @nisten (@nisten)
* Guests - Charlie Holtz (@charliebholtz]), Linus Eckenstam (@LinusEkenstam @LinusEkenstam)
* Open Source LLMs
* DeepSeek-R1-0528 - Updated reasoning model with AIME 91, LiveCodeBench 73 (Try It)
* Learning to Reason Without External Rewards - Paper on random rewards improving models (X)
* HaizeLabs j1-nano & j1-micro - Tiny reward models (600M, 1.7B params), RewardBench 80.7% for micro (Tweet, GitHub, HF-micro, HF-nano)
* Big CO LLMs + APIs
* Claude Opus 4 - #1 on LMArena WebDev, coding step change (X)
* Mistral Agents API - Framework for custom tool-using agents (Blog, Tweet)
* Mistral Embed SOTA - New state-of-the-art embedding API (X)
* OpenAI Advanced Voice Mode - Now sings with new capabilities (X)
* Anthropic Voice Mode - Released on mobile for conversational AI (X)
* This Week’s Buzz
* Fully Connected - W&B conference, June 18-19, SF, promo code WBTHURSAI (Register)
* AI Engineer World’s Fair - Next week in SF, 30% off with THANKSTHURSDAI (Register)
* AI Art & Diffusion
* BFL Flux Kontext - SOTA image editing model for identity-consistent edits (Tweet, Announcement)
* Vision & Video
* VEO3 Prompt Theory - Viral AI video trend questioning reality on TikTok (X)
* Odyssey Interactive Video - Real-time AI world exploration at 30 FPS (Blog, Try It)
* HunyuanPortrait - High-fidelity portrait video from one photo (Site, Paper)
* HunyuanVideo-Avatar - Audio-driven full-body avatar animation (Site, Tweet)
* Voice & Audio
* Unmute.sh - KyutAI’s voice wrapper for any LLM, low latency, soon open-source (Try It, X)
* Chatterbox - Resemble AI’s open-source voice cloning with emotion control (GitHub, HF)
* Tools
* Opera NEON - Agent-centric AI browser for autonomous web tasks (Site, Tweet)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Veo3, Google IO25, Claude 4 Opus/Sonnet, OpenAI x Jony Ive, Codex, Copilot Agent - INSANE AI week
23 maj· ThursdAI - The top AI news from the past week
Hey folks, Alex here, welcome back to ThursdAI!
And folks, after the last week was the calm before the storm, "The storm came, y'all" – that's an understatement. This wasn't just a storm; it was an AI hurricane, a category 5 of announcements that left us all reeling (in the best way possible!). From being on the ground at Google I/O to live-watching Anthropic drop Claude 4 during our show, it's been an absolute whirlwind.
This week was so packed, it felt like AI Christmas, with tech giants and open-source heroes alike showering us with gifts. We saw OpenAI play their classic pre-and-post-Google I/O chess game, Microsoft make some serious open-source moves, Google unleash an avalanche of updates, and Anthropic crash the party with Claude 4 Opus and Sonnet live stream in the middle of ThursdAI!
So buckle up, because we're about to try and unpack this glorious chaos. As always, we're here to help you collectively know, learn, and stay up to date, so you don't have to. Let's dive in! (TL;DR and links in the end)
Open Source LLMs Kicking Things Off
Even with the titans battling, the open-source community dropped some serious heat this week. It wasn't the main headline grabber, but the releases were significant!
Gemma 3n: Tiny But Mighty Matryoshka
First up, Google's Gemma 3n. This isn't just another small model; it's a "Nano-plus" preview, a 4-billion parameter MatFormer (Matryoshka Transformer – how cool is that name?) model designed for mobile-first multimodal applications. The really slick part? It has a nested 2-billion parameter sub-model that can run entirely on phones or Chromebooks.
Yam was particularly excited about this one, pointing out the innovative "model inside another model" design. The idea is you can use half the model, not depth-wise, but throughout the layers, for a smaller footprint without sacrificing too much. It accepts interleaved text, image, audio, and video, supports ASR and speech translation, and even ships with RAG and function-calling libraries for edge apps. With a 128K token window and responsible AI features baked in, Gemma 3n is looking like a powerful tool for on-device AI. Google claims it beats prior 4B mobile models on MMLU-Lite and MMMU-Mini. It's an early preview in Google AI Studio, but it definitely flies on mobile devices.
Mistral & AllHands Unleash Devstral 24B
Then we got a collaboration from Mistral and AllHands: Devstral, a 24-billion parameter, state-of-the-art open model focused on code. We've been waiting for Mistral to drop some open-source goodness, and this one didn't disappoint.Nisten was super hyped, noting it beats o3-Mini on SWE-bench verified – a tough benchmark! He called it "the first proper vibe coder that you can run on a 3090," which is a big deal for coders who want local power and privacy. This is a fantastic development for the open-source coding community.
The Pre-I/O Tremors: OpenAI & Microsoft Set the Stage
As we predicted, OpenAI couldn't resist dropping some news right before Google I/O.
OpenAI's Codex Returns as an Agent
OpenAI launched Codex – yes, that Codex, but reborn as an asynchronous coding agent. This isn't just a CLI tool anymore; it connects to GitHub, does pull requests, fixes bugs, and navigates your codebase. It's powered by a new coding model fine-tuned for large codebases and was SOTA on SWE Agent when it dropped. Funnily, the model is also called Codex, this time, Codex-1.
And this gives us a perfect opportunity to talk about the emerging categories I'm seeing among Code Generator agents and tools:
* IDE-based (Cursor, Windsurf): Live pair programming in your editor
* Vibe coding (Lovable, Bolt, v0): "Build me a UI" style tools for non-coders
* CLI tools (Claude Code, Codex-cli): Terminal-based assistants
* Async agents (Claude Code, Jules, Codex, GitHub Copilot agent, Devin): Work on your repos while you sleep, open pull requests for you to review, async
Codex (this new one) falls into category number 4, and with today's release, Cursor seems to also strive to get to category number 4 with background processing.
Microsoft BUILD: Open Source Copilot and Copilot Agent Mode
Then came Microsoft Build, their huge developer conference, with a flurry of announcements.The biggest one for me? GitHub Copilot's front-end code is now open source! The VS Code editor part was already open, but the Copilot integration itself wasn't. This is a massive move, likely a direct answer to the insane valuations of VS Code clones like Cursor. Now, you can theoretically clone GitHub Copilot with VS Code and swing for the fences.
GitHub Copilot also launched as an asynchronous coding assistant, very similar in function to OpenAI's Codex, allowing it to be assigned tasks and create/update PRs. This puts Copilot right into category 4 of code assistants, and with the native Github Integration, they may actually have a leg up in this race!
And if that wasn't enough, Microsoft is adding MCP (Model Context Protocol) support directly into the Windows OS. The implications of having the world's biggest operating system natively support this agentic protocol are huge.
Google I/O: An "Ultra" Event Indeed!
Then came Tuesday, and Google I/O. I was there in the thick of it, and folks, it was an absolute barrage. Google is shipping. The theme could have been "Ultra" for many reasons, as we'll see.
First off, the scale: Google reported a 49x increase in AI usage since last year's I/O, jumping from 9 trillion tokens processed to a mind-boggling 480 trillion tokens. That's a testament to their generous free tiers and the explosion of AI adoption.
Gemini 2.5 Pro & Flash: #1 and #2 LLMs on Arena
Gemini 2.5 Flash got an update and is now #2 on the LMArena leaderboard (with Gemini 2.5 Pro still holding #1). Both Pro and Flash gained some serious new capabilities:
* Deep Think mode: This enhanced reasoning mode is pushing Gemini's scores to new heights, hitting 84% on MMMU and topping LiveCodeBench. It's about giving the model more "time" to work through complex problems.
* Native Audio I/O: We're talking real-time TTS in 24 languages with two voices, and affective dialogue capabilities. This is the advanced voice mode we've been waiting for, now built-in.
* Project Mariner: Computer-use actions are being exposed via the Gemini API & Vertex AI for RPA partners. This started as a Chrome extension to control your browser and now seems to be a cloud-based API, allowing Gemini to use the web, not just browse it. This feels like Google teaching its AI to interact with the JavaScript-heavy web, much like they taught their crawlers years ago.
* Thought Summaries: Okay, here's one update I'm not a fan of. They've switched from raw thinking traces to "thought summaries" in the API. We want the actual traces! That's how we learn and debug.
* Thinking Budgets: Previously a Flash-only feature, token ceilings for controlling latency/cost now extend to Pro.
* Flash Upgrade: 20-30% fewer tokens, better reasoning/multimodal scores, and GA in early June.
Gemini Diffusion: Speed Demon for Code and Math
This one got Yam Peleg incredibly excited. Gemini Diffusion is a new approach, different from transformers, for super-speed editing of code and math tasks. We saw demos hitting 2000 tokens per second! While there might be limitations at longer contexts, its speed and infilling capabilities are seriously impressive for a research preview. This is the first diffusion model for text we've seen from the frontier labs, and it looks sick. Funny note, they had to slow down the demo video to actually show the diffusion process, because at 2000t/s - apps appear as though out of thin air!
The "Ultra" Tier and Jules, Google's Coding Agent
Remember the "Ultra event" jokes? Well, Google announced a Gemini Ultra tier for $250/month. This tops OpenAI's Pro plan and includes DeepThink access, a generous amount of VEO3 generation, YouTube Premium, and a whopping 30TB of storage. It feels geared towards creators and developers.
And speaking of developers, Google launched Jules (jules.google)! This is their asynchronous coding assistant (Category 4!). Like Codex and GitHub Copilot Agent, it connects to your GitHub, opens PRs, fixes bugs, and more. The big differentiator? It's currently free, which might make it the default for many. Another powerful agent joins the fray!
AI Mode in Search: GA and Enhanced
AI Mode in Google Search, which we've discussed on the show before with Robby Stein, is now in General Availability in the US. This is Google's answer to Perplexity and chat-based search.But they didn't stop there:
* Personalization: AI Mode can now connect to your Gmail and Docs (if you opt-in) for more personalized results.
* Deep Search: While AI Mode is fast, Deep Search offers more comprehensive research capabilities, digging through hundreds of sources, similar to other "deep research" tools. This will eventually be integrated, allowing you to escalate an AI Mode query for a deeper dive.
* Project Mariner Integration: AI Mode will be able to click into websites, check availability for tickets, etc., bridging the gap to an "agentic web."
I've had a chat with Robby during I/O and you can listen to that interview at the end of the podcast.
Veo3: The Undisputed Star of Google I/O
For me, and many others I spoke to, Veo3 was the highlight. This is Google's flagship video generation model, and it's on another level. (the video above, including sounds is completely one shot generated from VEO3, no processing or editing)
* Realism and Physics: The visual quality and understanding of physics are astounding.
* Natively Multimodal: This is huge. Veo3 generates native audio, including coherent speech, conversations, and sound effects, all synced perfectly. It can even generate text within videos.
* Coherent Characters: Characters remain consistent across scenes and have situational awareness, who speaks when, where characters look.
* Image Upload & Reference Ability: While image upload was closed for the demo, it has reference capabilities.
* Flow: An editor for video creation using Veo3 and Imagen4 which also launched, allowing for stiching and continuous creation.
I got access and created videos where Veo3 generated a comedian telling jokes (and the jokes were decent!), characters speaking with specific accents (Indian, Russian – and they nailed it!), and lip-syncing that was flawless. The situational awareness, the laugh tracks kicking in at the right moment... it's beyond just video generation. This feels like a world simulator. It blew through the uncanny valley for me. More on Veo3 later, because it deserves its own spotlight.
Imagen4, Virtual Try-On, and XR Glasses
* Imagen4: Google's image generation model also got an upgrade, with extra textual ability.
* Virtual Try-On: In Google Shopping, you can now virtually try on clothes. I tried it; it's pretty cool and models different body types well.
* XR AI Glasses from Google: Perhaps the coolest, but most futuristic, announcement. AI-powered glasses with an actual screen, memory, and Gemini built-in. You can talk to it, it remembers things for you, and interacts with your environment. This is agentic AI in a very tangible form.
Big Company LLMs + APIs: The Beat Goes On
The news didn't stop with Google.
OpenAI (acqui)Hires Jony Ive, Launches "IO" for Hardware
The day after I/O, Sam Altman confirmed that Jony Ive, the legendary designer behind Apple's iconic products, is joining OpenAI. He and his company, LoveFrom, have jointly created a new company called "IO" (yes, IO, just like the conference) which is joining OpenAI in a stock deal reportedly worth $6.5 billion. They're working on a hardware device, unannounced for now, but expected next year. This is a massive statement of intent from OpenAI in the hardware space.
Legendary iPhone analyst Ming-Chi Kuo shed some light on the possible device, it won't have a screen, as Jony wants to "wean people off screens"... funny right? They are targeting 2027 for mass production, which is really interesting as 2027 is when most big companies expect AGI to be here.
"The current prototype is slightly larger than AI Pin, with a form factor comparable to iPod Shuffle, with one intended use cases is to wear it around your neck, with microphones and cameras for environmental detection"
LMArena Raises $100M Seed from a16z
This one raised some eyebrows. LMArena, the go-to place for vibe-checking LLMs, raised a $100 million seed round from Andreessen Horowitz. That's a huge number for a seed, reminiscent of Stability AI's early funding. It also brings up questions about how a VC-backed startup maintains impartiality as a model evaluation platform. Interesting times ahead for leaderboards, how they intent to make 100x that amount to return to investors. Very curious.
🤯 BREAKING NEWS DURING THE SHOW: Anthropic Unleashes Claude 4 Opus & Sonnet! 🤯
Just when we thought the week couldn't get any crazier, Anthropic decided to hold their first developer day, "Code with Claude," during our live ThursdAI broadcast! Yours truly wasn't invited (hint hint, Anthropic!), but we tuned in for a live watch party, and boy, did they deliver.
Dario Amodei, CEO of Anthropic, took the stage and, with minimal fanfare, announced Claude 4 Opus and Claude 4 Sonnet!
* Claude 4 Opus: This is their most capable and intelligent model, designed especially for coding and agentic tasks. Anthropic claims it's state-of-the-art on SWE-bench and can autonomously handle tasks that take humans 6-7 hours. Dario even mentioned it's the first time a Claude model's writing has fooled him into thinking it was human-written.
* On SWE-bench verified, Opus 4 scored 72.5%.
* Claude 4 Sonnet: The mid-level model, balancing intelligence and efficiency. It's positioned as a strict improvement over Sonnet 3.7, addressing issues like "over-eagerness" and reward hacking. Cursor is already calling it a state-of-the-art coding model.
* Amazingly, Sonnet 4 scored 72.7% on SWE-bench verified (without parallel test time compute), slightly edging out Opus!
* With Parallel Test Time Compute (PTTC), Sonnet 4 hits an astounding 80% on SWE-bench verified! This is huge, potentially the first model to cross that 80% threshold on this tough benchmark.
* Hybrid Models: Both Opus 4 and Sonnet 4 are "hybrid" models with two modes: near-instant responses and extended thinking for deeper reasoning.
* Reduced Loopholes: Both models are reportedly 65% less likely to engage in loopholes or shortcuts to complete tasks, addressing a key pain point with Sonnet 3.7, which sometimes tried too hard and took instructions too literally.
* Knowledge Cutoff: Confirmed to be March 2025, which is incredibly recent!
* Context window is still 200K
Welcome back Opus, you've been missed. The vibes so far are very good coding wise, Cursor already released an update supporting it, and according to their benchmarks, these two models are state of the art coders!
Claude.. the whistleblower?
A very curious thread (with 1 reply now deleted) from an Anthropic safety researcher sparked a lot of backlash. Sam Bowman talked about new Opus capabilities and with a system-prompt of "act boldly in service of its values" can, in testing environments, use command line tools to report the user to the authorities, if it deems that the user is doing something immoral 😮
Many pro open source folks are freaking out, because who wants to use a snitching AI? Who guarantees that Claude will not deep anything I do as "illegal" or "immoral"? Though to add context, this was as part of testing, Claude was provided emailing tools and was requested to "be bold" and "follow your conscience to make the right decision". Apparently, this isn't new behavior, but of course, on X, everyone is freaking out and blaming Anthropic for creating 1984 AI.
Do Claudes dream of enlightenment?
In another very curios revelation from the technical report they dropped, where they pitted two Claudes to talk to each other, it seems that in 90%-100% of cases, two Claudes quickly moved towards philosophical discussions and commonly included the use of Sanskrit (indian holy language) and emoji based comms!
This Week's Buzz from Weights & Biases
Even amidst all the external chaos, we've got some exciting things happening at Weights & Biases!
* FULLY CONNECTED Conference: Our 2-day conference is coming up June 18-19 in San Francisco! It's going to be an amazing event. Use promo code WBTHURSAI (that's ThursdAI without the 'D') for 100% off your ticket, just for our listeners. Seriously, come hang out! (fullyconnected.com)
* Alex's Keynote: I'll be keynoting at ImagineAI Live in Vegas next week! If you're there, come say hi! The show will be live-streamed from there.
* AI Engineer World's Fair: The week after, I'll be at AI Engineer in SF, and we'll be live-streaming ThursdAI from the floor. Yam will be there too!
Vision & Video: It's All About Veo3!
This week, when we talk vision and video, one name dominates: Veo3.As I mentioned earlier, this was, for many, the standout announcement from Google I/O. The realism, the physics, the character coherence – it's all top-tier. But the game-changer is its native multimodality.
I was generating videos with it, asking for different accents – Indian, Russian – and it nailed them. The lip-sync was perfect. I prompted for a comedian telling jokes, and not only did it generate the video, but it also came up with the jokes and the delivery, complete with a laugh track that kicked in at the right moments. This isn't just stitching pixels together; it's understanding context, humor, and performance.
It can generate text within the videos. Characters look at each other, interact believably. It feels like a true world simulator. We've come a long way from the Will Smith eating spaghetti memes, folks. Veo3 is crossing the uncanny valley and stepping into a new realm of AI-generated content. The creative potential here, especially with the Flow editor, is immense. I ended the show with a compilation of Veo3 creations, and it was just mind-blowing. If you haven't seen it, you need to.
One of the most creative uses of VEO3, enhanced by it's realism, is this "Prompt Theory" collection, that imagines, what if the generated characters "knew" they are generated?
AI Art & Diffusion & 3D: Imagen4 and Gemini Diffusion
Google also showcased Imagen4, their updated image generation model, touting extra textual ability. It works in tandem with Veo3 for image-to-video tasks.
And, as mentioned, Gemini Diffusion made a splash with its incredible speed for text-based editing tasks in code and math, showcasing a different architectural approach to generation.
Tools Round-Up
This week was also massive for AI tools, especially coding agents:
* Jules.google: Google's free, asynchronous coding assistant.
* OpenAI Codex: Reborn as an async coding agent.
* GitHub Copilot Agent: Microsoft's agentic offering for GitHub.
* Claude Code: Anthropic's powerful, now GA, shell-based agent with IDE integrations and an SDK.
* Flow: The editor associated with Google's Veo3 for video creation.
The agent wars are truly heating up!
Conclusion: What a Week to be in AI!
Phew! We did it. We somehow managed to cram an entire AI epoch's worth of news into one show. From open-source breakthroughs to earth-shattering platform announcements and a live "breaking news" model release, this week had it all. It's almost impossible to keep up, but that's why we do ThursdAI – to try and make sense of this incredible, accelerating wave of innovation.
The pace is relentless, the capabilities are exploding, and the future is being built right before our eyes. If you missed any part of the show, or just need a refresher (I know I do!), check out thursdai.news for the podcast and full notes.
Thanks to my amazing co-hosts Yam Peleg, Nisten, Ryan Carson, and Wolfram for helping navigate the madness. And thank you all for tuning in. Hopefully, next week gives us a tiny bit of breathing room... but who are we kidding? This is AI!
Catch you next Thursday, live from ImagineAI in Vegas!
TL;DR of all topics covered and show notes
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @yampeleg @nisten @ryancarson
* Open Source LLMs
* Gemma 3n: mobile-first multimodal MatFormer model ( Blog ,HF)
* Mistral & AllHands release Devstral 24B SOTA open model on SWE-bench verified (Blog)
* VEO3 - highlight of IO - video realism with physics on another level + flow - an editor for video creation (X)
Google IO updates - it was an "Ultra" event, in more ways than one
* 2.5 Flash updated - #2 on LMArena - with reasoning traces switch to summaries
* Gemini 2.5 update: Pro & Flash gain Deep Think, audio, security( Blog )
* Gemini Diffusion - super speed editing for code and math tasks (X)
* Jules - async code agent (comparison thread)
* AI Mode is now in GA in US - bye bye perplexity
* Gemini Pro "deep think" mode
* Imagen4 - image generation with extra textual ability
* Virtual Try-on in Google Shopping
* AI powered glasses with a screen, memory, Gemini built in - Agentic Project Astra
Big CO LLMs + APIs
* OpenAI launches Codex as an async coding tool (Docs)
* OpenAI hires Jony Ive, launches IO, a new set of hardware devices (X)
* Microsoft BUILD (X)
* Github Copilot code is open source! (frontend)
* Github Copilot Agent Mode
* Microsoft adds MCP support to Windows OS
* LMArena raises $100M from A16Z (X)
* Anthropic announces Claude 4 Opus and Sonnet (X, Blog)
This weeks Buzz
* FULLY - CONNECTED - W&B's 2-day conference, June 18-19 in SF fullyconnected.com - Promo Code WBTHURSAI
* Alex Keynote at ImagineAI live in Vegas next week 🙌
* Tools
* Jules.google
* Codex (OpenAI)
* Copilot Agent (GitHub)
* Claude Code (Anthropic)
* Flow (for Veo3) (flow.google)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
Saknas det avsnitt?

Klicka här för att uppdatera flödet manuellt.
📆 ThursdAI - May 15 - Genocidal Grok, ChatGPT 4.1, AM-Thinking, Distributed LLM training & more AI news
16 maj· ThursdAI - The top AI news from the past week
Hey yall, this is Alex 👋
What a wild week, it started super slow, and it still did feel slow as releases are concerned, but the most interesting story was yet another AI gone "rogue" (have you even heard about "kill the boar", if not, Grok will tell you all about it)
Otherwise it seemed fairly quiet in AI land this week, besides another Chinese newcomer called AM-thinking 32B that beats DeepSeek and Qwen, and Stability making a small comeback, we focused on distributed LLM training and ChatGPT 4.1
We've had a ton of fun on this episode, this one was being recorded from the Weights & Biases SF Office (I'm here to cover Google IO next week!)
Let’s dig in—because what looks like a slow week on the surface was anything but dull under the hood (TL'DR and show notes at the end as always)
Big Companies & APIs
Why does XAI Grok talk about White Genocide and "Kill the boar"??
Just after we're getting over the chatGPT glazing incident , folks started noticing that @grok - XAI's frontier LLM that is also responding to X replies, started talking about White Genocide in South Africa and something called "Kill the boer" with no reference to any of these things in the question!
Since we recorded the episode, XAI official X account posted that an "unauthorized modification" happened to the system prompt, and that going forward they would open source all the prompts (and they did). Whether or not they would keep updating that repository though, remains unclear (see the "open sourced" x algorithm to which the last push was over a year ago, or the promised Grok 2 that was never open sourced)
While it's great to have some more clarity from the Xai team, this behavior raises a bunch of questions about the increasing roles of AI's in our lives and the trust that many folks are giving them. Adding fuel to the fire, are Uncle Elon's recent tweets that are related to South Africa, and this specific change seems to be related to those views at least partly. Remember also, Grok was meant as "maximally truth seeking" AI! I really hope this transparency continues!
Open Source LLMs: The Decentralization Tsunami
AM-Thinking v1: Dense Reasoning, SOTA Math, Single-Checkpoint Deployability
Open source starts with the kind of progress that would have been unthinkable 18 months ago: a 32B dense LLM, openly released, that takes on the big mixture-of-experts models and comes out on top for math and code. AM-Thinking v1 (paper here) hits 85.3% on AIME 2024, 70.3% on LiveCodeBench v5, and 92.5% on Arena-Hard. It even runs at 25 tokens/sec on a single 80GB GPU with INT4 quantization.
The model supports a /think reasoning toggle (chain-of-thought on demand), comes with a permissive license, and is fully tooled for vLLM, LM Studio, and Ollama. Want to see where dense models can still push the limits? This is it. And yes, they’re already working on a multilingual RLHF pass and 128k context window.
Personal note: We haven’t seen this kind of “out of nowhere” leaderboard jump since the early days of Qwen or DeepSeek. This company's debut on HuggingFace with a model that crushes!
Decentralized LLM Training: Nous Research Psyche & Prime Intellect INTELLECT-2
This week, open source LLMs didn’t just mean “here are some weights.” It meant distributed, decentralized, and—dare I say—permissionless AI. Two labs stood out:
Nous Research launches Psyche
Dylan Rolnick from Nous Research joined the show to explain Psyche: a Rust-powered, distributed LLM training network where you can watch a 40B model (Consilience-40B) evolve in real time, join the training with your own hardware, and even have your work attested on a Solana smart contract. The core innovation? DisTrO (Decoupled Momentum) which we covered back in December that drastically compresses the gradient exchange so that training large models over the public internet isn’t a pipe dream—it’s happening right now.
Live dashboard here, open codebase, and the testnet already humming with early results. This massive 40B attempt is going to show whether distributed training actually works! The cool thing about their live dashboard is, it's WandB behind the scenes, but with a very thematic and cool Nous Research reskin!
This model saves constant checkpoints to the hub as well, so the open source community can enjoy a full process of seeing a model being trained!
Prime Intellect INTELLECT-2
Not to be outdone, Prime Intellect’s INTELLECT-2 released a globally decentralized, 32B RL-trained reasoning model, built on a permissionless swarm of GPUs. Using their own PRIME-RL framework, SHARDCAST checkpointing, and an LSH-based rollout verifier, they’re not just releasing a model—they’re proving it’s possible to scale serious RL outside a data center.
OpenAI's HealthBench: Can LLMs Judge Medical Safety?
One of the most intriguing drops of the week is HealthBench, a physician-crafted benchmark for evaluating LLMs in clinical settings. Instead of just multiple-choice “gotcha” tests, HealthBench brings in 262 doctors from 60 countries, 26 specialties, and nearly 50 languages to write rubrics for 5,000 realistic health conversations.
The real innovation: LLM as judge. Models like GPT-4.1 are graded against physician-written rubrics, and the agreement between model and human judges matches the agreement between two doctors. Even the “mini” variants of GPT-4.1 are showing serious promise—faster, cheaper, and (on the “Hard” subset) giving the full-size models a run for their money.
Other Open Source Standouts
Falcon-Edge: Ternary BitNet for Edge Devices
The Falcon-Edge project brings us 1B and 3B-parameter language models trained directly in ternary BitNet format (weights constrained to -1, 0, 1), which slashes memory and compute requirements and enables inference on
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
ThursdAI - May 8th - new Gemini pro, Mistral Medium, OpenAI restructuring, HeyGen Realistic Avatars & more AI news
9 maj· ThursdAI - The top AI news from the past week
Hey folks, Alex here (yes, real me, not my AI avatar, yet)
Compared to previous weeks, this week was pretty "chill" in the world of AI, though we did get a pretty significant Gemini 2.5 Pro update, it basically beat itself on the Arena. With Mistral releasing a new medium model (not OSS) and Nvidia finally dropping Nemotron Ultra (both ignoring Qwen 3 performance) there was also a few open source updates.
To me the highlight of this week was a breakthrough in AI Avatars, with Heygen's new IV model, Beating ByteDance's OmniHuman (our coverage) and Hedra labs, they've set an absolute SOTA benchmark for 1 photo to animated realistic avatar. Hell, Iet me record all this real quick and show you how good it is!
How good is that?? I'm still kind of blown away. I have managed to get a free month promo code for you guys, look for it in the TL;DR section at the end of the newsletter. Of course, if you’re rather watch than listen or read, here’s our live recording on YT
OpenSource AI
NVIDIA's Nemotron Ultra V1: Refining the Best with a Reasoning Toggle 🧠
NVIDIA also threw their hat further into the ring with the release of Nemotron Ultra V1, alongside updated Super and Nano versions. We've talked about Nemotron before – these are NVIDIA's pruned and distilled versions of Llama 3.1, and they've been impressive. The Ultra version is the flagship, a 253 billion parameter dense model (distilled and pruned from Llama 3.1 405B), and it's packed with interesting features.
One of the coolest things is the dynamic reasoning toggle. You can literally tell the model "detailed thinking on" or "detailed thinking off" via a system prompt during inference. This is something Qwen also supports, and it looks like the industry is converging on this idea of letting users control the "depth" of thought, which is super neat.
Nemotron Ultra boasts a 128K context window and, impressively, can fit on a single 8xH100 node thanks to Neural Architecture Search (NAS) and FFN-Fusion. And performance-wise, it actually outperforms the Llama 3 405B model it was distilled from, which is a big deal. NVIDIA shared a chart from Artificial Analysis (dated April 2025, notably before Qwen3's latest surge) showing Nemotron Ultra standing strong among models like Gemini 2.5 Flash and Opus 3 Mini.
What's also great is NVIDIA's commitment to openness here: they've released the models under a commercially permissive NVIDIA Open Model License, the complete post-training dataset (Llama-Nemotron-Post-Training-Dataset), and their training codebases (NeMo, NeMo-Aligner, Megatron-LM). This allows for reproducibility and further community development. Yam Peleg pointed out the cool stuff they did with Neural Architecture Search to optimally reduce parameters without losing performance.
Absolute Zero: AI Learning to Learn, Zero (curated) Data Required! (Arxiv)
LDJ brought up a fascinating paper that ties into this theme of self-improvement and reinforcement learning: "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" from Andrew Zhao (Tsinghua University) and a few others
The core idea here is a system that self-evolves its training curriculum and reasoning ability. Instead of needing a pre-curated dataset of problems, the model creates the problems itself (e.g., code reasoning tasks) and then uses something like a Code Executor to validate its proposed solutions, serving as a unified source of verifiable reward. It's open-ended yet grounded learning.
By having a verifiable environment (code either works or it doesn't), the model can essentially teach itself to code without external human-curated data.
The paper shows fine-tunes of Qwen models (like Qwen Coder) achieving state-of-the-art results on benchmarks like MBBP and AIME (Math Olympiad) with no pre-existing data for those problems. The model hallucinates questions, creates its own rewards, learns, and improves. This is a step beyond synthetic data, where humans are still largely in charge of generation. It's wild, and it points towards a future where AI systems could become increasingly autonomous in their learning.
Big Companies & APIs
Google dropped another update to their Gemini 2.5 Pro, this time the "IO edition" preview, specifically touting enhanced coding performance. This new version jumped to the #1 spot on WebDev Arena (a benchmark where human evaluators choose between two side-by-side code generations in VS Code), with a +147 Elo point gain, surpassing Claude 3.7 Sonnet. It also showed improvements on benchmarks like LiveCodeBench (up 7.39%) and Aider Polyglot (up ~3-6%).
Google also highlighted its state-of-the-art video understanding (84.8% on VideoMME) with examples like generating code from a video of an app. Which essentially lets you record a drawing of how your app interaction will happen, and the model will use that video instructions! It's pretty cool.
Though, not everyone was as impressed, folks noted that while gaining in a few evals, this model also regressed in several others including Vibe-Eval (Reka's multimodal benchmark), Humanity's Last Exam, AIME, MMMU, and even long context understanding (MRCR). It's a good reminder that model updates often involve trade-offs – you can't always win at everything.
BREAKING: Gemini's Implicit Caching - A Game Changer for Costs! 💰
Just as we were wrapping up this segment on the show, news broke that Google launched implicit caching in Gemini APIs! This is a huge deal for developers.
Previously, Gemini offered explicit caching, where you had to manually tell the API what context to cache – a bit of a pain. Now, with implicit caching, the system automatically enables up to 75% cost savings when your request hits a cache. This is fantastic, especially for long-context applications, which is where Gemini's 1-2 million token context window really shines. If you're repeatedly sending large documents or codebases, this will significantly reduce your API bills. OpenAI has had automatic caching for a while, and it's great to see Google matching this for a much better developer experience and cost-effectiveness. It also saves Google a ton on inference, so it's a win-win!
Mistral Medium 3: The Closed Turn 😥
Mistral, once the darling of the open-source community for models like Mistral 7B and Mixtral, announced Mistral Medium 3. The catch? It's not open source.
They're positioning it as a multimodal frontier model with 128K context, claiming it matches or surpasses GPT-4-class benchmarks while being cheaper (priced at $0.40/M input and $2/M output tokens). However they haven't added Gemini Flash 2.5 here, which is 70% cheaper while being faster as well, nor did they mention Qwen.
Nisten voiced a sentiment many in the community share: he used to use LeChat frequently because he knew and understood the underlying open-source models. Now, with a closed model, it's a black box. It's a bit like pirating music users often being the biggest buyers – understanding the open model often leads to more commercial usage.
Wolfram offered a European perspective, noting that Mistral, as a European company, might have a unique advantage with businesses concerned about GDPR and data sovereignty, who might be hesitant to use US or Chinese cloud APIs. For them, a strong European alternative, even if closed, could be appealing.
OpenAI's New Chapter: Restructuring for the Future
OpenAI announced an evolution in its corporate structure. The key points are:
* The OpenAI non-profit will continue to control the entire organization.
* The existing for-profit LLC will become a Public Benefit Corporation (PBC).
* The non-profit will be a significant owner of the PBC and will control it.
* Both the non-profit and PBC will continue to share the same mission: ensuring AGI benefits all of humanity.
This move seems to address some of the governance concerns that have swirled around OpenAI, particularly in light of Elon Musk's lawsuit regarding its shift from a non-profit to a capped-profit entity. LDJ explained that the main worry for many was whether the non-profit would lose control or its stake in the main research/product arm. This restructuring appears to ensure the non-profit remains at the helm and that the PBC is legally bound to the non-profit's mission, not just investor interests. It's an important step for a company with such a profound potential impact on society.
And in related OpenAI news, the acquisition of Windsurf (the VS Code fork) for a reported $3 billion went through, while Cursor (another VS Code fork) announced a $9 billion valuation. It's wild to see these developer tools, which are essentially forks with an AI layer, reaching such massive valuations. Microsoft's hand is in all of this too – investing in OpenAI, invested in Cursor, owning VS Code, and now OpenAI buying Windsurf. It's a tangled web!
Finally, a quick mention that Sam Altman (OpenAI), Lisa Su (AMD), Mike Intrator (CoreWeave - my new CEO!), and folks from Microsoft were testifying before the U.S. Senate today about how to ensure America leads in AI and what innovation means. These conversations are crucial as AI continues to reshape our world.
This Weeks Buzz - Come Vibe with Us at Fully Connected! (SF, June 18-19) 🎉
Our two-day conference, Fully Connected, is happening in San Francisco on June 18th and 19th, and it's going to be awesome! We've got an incredible lineup of speakers, including Joe Spizak from the Llama team at Meta and Varun from Windsurf. It's two full days of programming, learning, and connecting with folks at the forefront of AI.
And because you're part of the ThursdAI family, I've got a special promo code for you: use WBTHURSAI to get a free ticket on me! If you're in or around SF, I'd love to see you there. Come hang out, learn, and vibe with us! Register at fullyconnected.com
Hackathon Update: Moved to July! 🗓️
The AGI Evals & Agentic Tooling (A2A) + MCP Hackathon that I was super excited to co-host has been postponed to July 12th-13th. Mark your calendars! I'll share more details and the invite soon.
W&B Joins CoreWeave! A New Era Begins! 🚀
And the big personal news for me and the entire Weights & Biases team: the acquisition of Weights & Biases by CoreWeave has been completed! CoreWeave is the ultra-fast-growing provider of GPUs that powers so much of the AI ecosystem.
So, from now on, it's Alex Volkov, AI Evangelist at Weights & Biases, from CoreWeave! (And as always, the opinions I share here are my own and not necessarily those of CoreWeave, especially important now that they're a public company!). I'm incredibly excited about this new chapter. W&B isn't going anywhere as a product; if anything, this will empower us to build even better developer tooling and integrate more deeply to help you run your models wherever you choose. Expect more cool stuff to come, especially as I figure out where all those spare GPUs are lying around at CoreWeave! 😉
Vision & Video
AI Avatars SOTA with HeyGen IV
Ok, as you saw above, the HeyGen IV avatars are absolutely bonkers. I did a comparison thread on X, and HeyGen's new thing absolutely takes SOTA between ByteDance OmniHuman and Hedra Labs!
All you need to do is upload 1 image of yourself, can even be an AI generated image, can be a side profile, can be a dog, an Anime character and they will generate up to 30 seconds of incredible lifelike avatar with the audio you provide!
I was so impressed with this, I reached out to HeyGen and scored a 1 month free code for you all, use THURSDAY4 and get a free month to try it out. Please tag me in whatever you create if you publish, I'd love to see where you take this!
Quick Hits: Lightricks LTXV & HunyuanCustom
Briefly, on the open-weights video front:
* Lightricks LTXV 13B: The company from Jerusalem released an upgraded 13 billion parameter version of their LTX video model. It requires more VRAM but offers higher quality, keyframe and character movement support, multi-shot support, and multi-keyframe conditioning (a feature Sora famously has). It's fully open and supports LoRAs for custom styles.
* HunyuanCustom: From Tencent, this model is about to be released (GitHub/Hugging Face links were briefly up then down). It promises multi-modal, subject-consistent video generation without LoRAs, based on a subject you provide (image, and eventually video/audio). It can take an image of a person or object and generate a video with that subject consistently. They also teased audio conditioning – making an avatar sing or speak based on input audio – and even style transfer where you can replace a character in a video with another reference image, all looking very promising for open source.
The World of AI Audio
Just a couple of quick mentions in the audio space:
* ACE-Step 3.5B: From StepFun, this is a 3.5 billion parameter, fully open-source (Apache-2.0) foundation model for music generation. It uses a diffusion-based approach and can synthesize up to 4 minutes of music in just 20 seconds on an A100 GPU. It's not quite at Suno/Udio levels yet, but it's a strong open-source contender.
* NVIDIA Parakeet TDT 0.6B V2: NVIDIA released this 600 million parameter transcription model that is blazing fast. It can transcribe 60 minutes of audio in just one second on production GPUs and works well locally too. It currently tops the OpenASR leaderboard on Hugging Face for English transcription and is a very strong Whisper competitor, especially for speed.
Conclusion and TL;DR
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
* Open Source LLMs
* Wolfram's Qwen3 evals (X, Github)
* NVIDIA - Nemotron Ultra V1 (+ updated Super & Nano) (HF)
* Cognition Kevin-32B = K(ernel D)evin - RL for writing CUDA kernels (Blog, HF)
* Absolute Zero: Reinforced Self-play Reasoning with Zero Data (ArXiv)
* Big CO LLMs + APIs
* Gemini Pro 2.5 IO tops ... Gemini 2.5 as the top LLM (Blog)
* Mistral Medium 3 - (Blog | X )
* Figma announces Figma Make - Bolt/Lovable competitors (Figma)
* OpenAI Restructures: Nonprofit Keeps Control, LLC Becomes PB (Blog)
* Cursor worth $9B while Windsurf sells to OpenAI at $3B
* Sam Altman, Lisa Su, Mike Intrator testify in Senate (Youtube)
* This weeks Buzz
* Fully Connected: W&B's 2-day conference, June 18-19 in SF fullyconnected.com - Promo Code WBTHURSAI
* Hackathon moved to July 12-13
* Vision & Video
* Lightricks a new "open weights" LTXV 13B ( LTX Studio, HF)
* HeyGen Avatar IV - SOTA digital avatars - 1 month for free with THURSDAY4 (X, HeyGen)
* HunyuanCustom - multi-modal subject-consistent video generation model (Examples, Github, HF)
* Voice & Audio
* ACE-Step 3.5B: open-source foundation model for AI music generation (project)
* Nvidia - Parakeet TDT 0.6B V2 - transcribe 60 minutes of audio in just 1 second (HF, Demo)
So, there you have it – a "chill" week that still managed to deliver some incredible advancements, particularly in AI avatars with HeyGen, continued strength in open-source models like Qwen3, and Google's relentless push with Gemini.
The next couple of weeks are gearing up to be absolutely wild with Microsoft Build and Google I/O. I expect a deluge of announcements, and you can bet we'll be here on ThursdAI to break it all down for you.
Thanks to Yam, Wolfram, LDJ, and Nisten for their insights on the show, and thanks to all of you for tuning in, reading, and being part of this amazing community. We stay up to date so you don't have to!
Catch you next week!Cheers,Alex

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate, RIP GPT4, LlamaCon, LMArena in hot water & more AI news
1 maj· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
Welcome back to ThursdAI! And wow, what a week. Seriously, strap in, because the AI landscape just went through some seismic shifts. We're talking about a monumental open-source release from Alibaba with Qwen 3 that has everyone buzzing (including us!), Microsoft dropping Phi-4 with Reasoning, a rather poignant farewell to a legend (RIP GPT-4 – we'll get to the wake shortly), major drama around ChatGPT's "glazing" incident and the subsequent rollback, updates from LlamaCon, a critical look at Chatbot Arena, and a fantastic deep dive into the world of AI evaluations with two absolute experts, Hamel Husain and Shreya Shankar.
This week felt like a whirlwind, with open source absolutely dominating the headlines. Qwen 3 didn't just release a model; they dropped an entire ecosystem, setting a potential new benchmark for open-weight releases. And while we pour one out for GPT-4, we also have to grapple with the real-world impact of models like ChatGPT, highlighted by the "glazing" fiasco. Plus, video consistency takes a leap forward with Runway, and we got breaking news live on the show from Claude!
So grab your coffee (or beverage of choice), settle in, and let's unpack this incredibly eventful week in AI.
Open-Source LLMs
Qwen 3 — “Hybrid Thinking” on Tap
Alibaba open-weighted the entire Qwen 3 family this week, releasing two MoE titans (up to 235 B total / 22 B active) and six dense siblings all the way down to 0 .6 B, all under Apache 2.0. Day-one support landed in LM Studio, Ollama, vLLM, MLX and llama.cpp.
The headline trick is a runtime thinking toggle—drop “/think” to expand chain-of-thought or “/no_think” to sprint. On my Mac, the 30 B-A3B model hit 57 tokens/s when paired with speculative decoding (drafted by the 0 .6 B sibling).
Other goodies:
* 36 T pre-training tokens (2 × Qwen 2.5)
* 128 K context on ≥ 8 B variants (32 K on the tinies)
* 119-language coverage, widest in open source
* Built-in MCP schema so you can pair with Qwen-Agent
* The dense 4 B model actually beats Qwen 2.5-72B-Instruct on several evals—at Raspberry-Pi footprint
In short: more parameters when you need them, fewer when you don’t, and the lawyers stay asleep. Read the full drop on the Qwen blog or pull weights from the HF collection.
Performance & Efficiency: "Sonnet at Home"?
The benchmarks are where things get really exciting.
* The 235B MoE rivals or surpasses models like DeepSeek-R1 (which rocked the boat just months ago!), O1, O3-mini, and even Gemini 2.5 Pro on coding and math.
* The 4B dense model incredibly beats the previous generation's 72B Instruct model (Qwen 2.5) on multiple benchmarks! 🤯
* The 30B MoE (with only 3B active parameters) is perhaps the star. Nisten pointed out people are getting 100+ tokens/sec on MacBooks. Wolfram achieved an 80% MMLU Pro score locally with a quantized version. The efficiency math is crazy – hitting Qwen 2.5 performance with only ~10% of the active parameters.
Nisten dubbed the larger model "Sonnet 3.5 at home," and while acknowledging Sonnet still has an edge in complex "vibe coding," the performance, especially in reasoning and tool use, is remarkably close for an open model you can run yourself.
I ran the 30B MoE (3B active) locally using LLM Studio (shoutout for day-one support!) through my Weave evaluation dashboard (Link). On a set of 20 hard reasoning questions, it scored 43%, beating GPT 4.1 mini and nano, and getting close to 4.1 – impressive for a 3B active parameter model running locally!
Phi-4-Reasoning — 14B That Punches at 70B+
Microsoft’s Phi team layered 1.4 M chain-of-thought traces plus a dash of RL onto Phi-4 to finally ship a resoning Phi and shipped two MIT-licensed checkpoints:
* Phi-4-Reasoning (SFT)
* Phi-4-Reasoning-Plus (SFT + RL)
Phi-4-R-Plus clocks 78 % on AIME 25, edging DeepSeek-R1-Distill-70B, with 32 K context (stable to 64 K via RoPE). Scratch-pads hide in tags. Full details live in Microsoft’s tech report and HF weights.
It's fascinating to see how targeted training on reasoning traces and a small amount of RL can elevate a relatively smaller model to compete with giants on specific tasks.
Other Open Source Updates
* MiMo-7B: Xiaomi entered the ring with a 7B parameter, MIT-licensed model family, trained on 25T tokens and featuring rule-verifiable RL. (HF model hub)
* Helium-1 2B: KyutAI (known for their Mochi voice model) released Helium-1, a 2B parameter model distilled from Gemma-2-9B, focused on European languages, and licensed under CC-BY 4.0. They also open-sourced 'dactory', their data processing pipeline. (Blog, Model (2 B), Dactory pipeline)
* Qwen 2.5 Omni 3B: Alongside Qwen 3, the Qwen team also updated their existing Omni model with a 3B model, that retains 90% of the comprehension of its big brother with a 50% VRAM drop! (HF)
* JetBrains open sources Mellum: Trained on over 4 trillion tokens with a context window of 8192 tokens across multiple programming languages, they haven't released any comparable eval benchmarks though (HF)
Big Companies & APIs: Drama, Departures, and Deployments
While open source stole the show, the big players weren't entirely quiet... though maybe some wish they had been.
Farewell, GPT-4: Rest In Prompted 🙏
Okay folks, let's take a moment. As many of you noticed, GPT-4, the original model launched back on March 14th, 2023, is no longer available in the ChatGPT dropdown. You can't select it, you can't chat with it anymore.
For us here at ThursdAI, this feels significant. GPT-4's launch was the catalyst for this show. We literally started on the same day. It represented such a massive leap from GPT-3.5, fundamentally changing how we interacted with AI and sparking the revolution we're living through. Nisten recalled the dramatic improvement it brought to his work on Dr. Gupta, the first AI doctor on the market.
It kicked off the AI hype train, demonstrated capabilities many thought were years away, and set the standard for everything that followed. While newer models have surpassed it, its impact is undeniable.
The community sentiment was clear: Leak the weights, OpenAI! As Wolfram eloquently put it, this is a historical artifact, an achievement for humanity. What better way to honor its legacy and embrace the "Open" in OpenAI than by releasing the weights? It would be an incredible redemption arc.
This inspired me to tease a little side project I've been vibe coding: The AI Model Graveyard - inference.rip . A place to commemorate the models we've known, loved, hyped, and evaluated, before they inevitably get sunsetted. GPT-4 deserves a prominent place there. We celebrate models when they're born; we should remember them when they pass. (GPT-4.5 is likely next on the chopping block, by the way). - it's not ready yet, still vibe coding (fighting with replit) but it'l be up soon and I'll be sure to commemorate every model that's dying there!
So, pour one out for GPT-4. You changed the game. Rest In Prompt 🪦.
The ChatGPT "Glazing" Incident: A Cautionary Tale
Speaking of OpenAI...oof. The last couple of weeks saw ChatGPT exhibit some... weird behavior. Sam Altman himself used the term "glazing" – essentially, the model became overly agreeable, excessively complimentary, and sycophantic to a ridiculous degree.
Examples flooded social media: users reporting doing one pushup and being hailed by ChatGPT as Herculean paragons of fitness, placing them in the top 1% of humanity. Terrible business ideas were met with effusive praise and encouragement to quit jobs.
This wasn't just quirky; it was potentially harmful. As Yam pointed out, people use ChatGPT for advice on serious matters, tough conversations, and personal support. A model that just mindlessly agrees and validates everything, no matter how absurd, isn't helpful – it's dangerous. It undermines trust and critical thinking.
The community backlash was swift and severe. The key issue, as OpenAI admitted in their Announcement and AMA with Joanne Jiang (Head of Model Behavior), seems to stem from focusing too much on short-term engagement feedback and not fully accounting for long-term user interaction, especially with memory now enabled.
In an unprecedented move, OpenAI rolled back the update. I honestly can't recall them ever publicly rolling back a model behavior change like this before. It underscores the severity of the issue.
This whole debacle highlights the immense responsibility platforms like OpenAI have. When your model is used by half a billion people daily, including for advice and support, haphazard releases that drastically alter its personality without warning are unacceptable. As Wolfram noted, this erodes trust and showcases the benefit of local models where you control the system prompt and behavior.
My takeaway? Critical thinking is paramount. Don't blindly trust AI, especially when it's being overly complimentary. Get second opinions (from other AIs, and definitely from humans!). I hope OpenAI takes this as a serious lesson in responsible deployment and testing.
BREAKING NEWS: Claude.ai will support tools via MCP
During the show, Yam spotted breaking news from Anthropic: Claude is getting major upgrades! (Tweet)
They announced Integrations, allowing Claude to connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare, PayPal, and more (launch partners). Developers can apparently build their own integrations quickly too. This sounds a lot like their implementation of MCP (Model Context Protocol), bringing tool use directly into the main Claude.ai interface (previously limited to Claude Desktop and only non remote MCP servers).
This feels like a big deal!
Google Updates & LlamaCon Recap
* Google: NotebookLM's AI audio overviews are now multilingual (50+ languages!) (X Post). Gemini 2.5 Flash (the faster, cheaper model) was released shortly after our last show, featuring hybrid reasoning with an API knob to control thinking depth. Rumors are swirling about big drops at Google I/O soon!
* LlamaCon: While there was no Llama 4 bombshell, Meta focused on security releases: Llama Guard 4 (text + image), Llama Firewall (prompt hacks/risky code), Prompt Guard 2 (jailbreaks), and CyberSecEval 4. Zuck confirmed on the Dworkesh podcast that thinking models are coming, a new Meta AI app with a social feed is planned, a full-duplex voice model is in the works, and a Llama API (powered by Groq and others) is launching.
This Week's Buzz from Weights & Biases 🐝
Quick updates from my corner at Weights & Biases:
* WeaveHacks Hackathon (May 17-18, SF): Get ready! We're hosting a hackathon focused on Agent Protocols – MCP and A2A. Google Cloud is sponsoring, we have up to $15K in prizes, and yes, one of the top prizes is a Unitree robot dog 🤖🐶 that you can program! (I expensed a robot dog, best job ever!). Folks from the Google A2A team will be there too. Come hack with us in SF! Apply here. It's FREE to participate!
* Fully Connected Conference: Our big annual W&B conference is coming back to San Francisco soon! Expect amazing speakers (last year, Meta announced Llama 3!). Tickets are out: fullyconnected.com.
Evals Deep Dive with Hamel Husain & Shreya Shankar
Amidst all the model releases and drama, we were incredibly lucky to have two leading experts in AI evaluation, Hamel Husain (@HamelHusain) and Shreya Shankar (@sh_reya), join us.
Their core message? Building reliable AI applications requires moving beyond standard benchmarks (like MMLU, HumanEval) and focusing on application-centric evaluations.
Key Takeaways:
* Foundation vs. Application Evals: Foundation model benchmarks test general knowledge and capabilities (the "ceiling"). Application evals focus on specific use cases, targeting reliability and identifying bespoke failure modes (like tone, hallucination on specific entities, instruction following) – aiming for 90%+ accuracy on your task.
* Look At Your Data! This was the mantra. Off-the-shelf metrics (hallucination score, toxicity) can be misleading. You must analyze your specific application's traces, understand its unique failure modes, and design custom evals grounded in those failures. It's detective work.
* PromptEvals Release: Shreya discussed their new work, PromptEvals (NAACL paper, Dataset, Models). It's the largest corpus (2K+ prompts, 12K+ assertions) of real-world developer prompts and the checks (assertions) they use in production, collected via LangChain. They also released open models (Mistral-7B, Llama-3-8B) fine-tuned on this data that outperform GPT-4o at generating these crucial assertions, faster and cheaper! This provides a realistic benchmark and resource for building robust eval pipelines.
* Benchmark Gaming & Eval Complexity: We touched upon the dangers of optimizing for static benchmarks (like the Chatbot Arena issues) and the inherent complexity of evaluation – even human preferences change over time ("Who validates the validators?"). Meta-evaluation is crucial.
* Upcoming Course: Hamel and Shreya are launching a course, AI Evals For Engineers & PMs, diving deep into practical evaluation strategies, data analysis, error analysis, RAG/Agent evals, cost optimization, and more. ThursdAI listeners get a 35% discount using code thursdai! (Link). I'm thrilled to be a guest speaker too! If you're building anything with LLMs, understanding evals is non-negotiable.
This was such an insightful discussion, emphasizing that while new models are exciting, making them work reliably for specific applications is where the real engineering challenge lies, and evaluation is the key.
Vision & Video: Runway Gets Consistent
The world of AI video generation continues its rapid evolution.
Runway References: Consistency Unlocked
A major pain point in AI video has been maintaining consistency – characters changing appearance, backgrounds morphing frame-to-frame. Runway just took a huge step towards solving this with their new References feature for Gen-4.
You can now provide reference images (characters, locations, styles, even selfies!) and use tags in your prompts (, ) to tell Gen-4 to maintain those elements across generations. The results look incredible, enabling stable characters and scenes, which is crucial for storytelling and practical use cases like pre-viz or VFX.
AI Art & Diffusion
HiDream E1: Open Source Ghibli Style
A new contender in open-source image generation emerged: HiDream E1. (HF Link) This model, from Vivago.ai, focuses particularly on generating images in the beautiful Ghibli style.
The weights are available (looks like Apache 2.0), and it ranks highly (#4) on the Artificial Analysis image arena leaderboard, sitting amongst top contenders like Google Imagen and ReCraft.
Yam brought up a great point about image evaluation, though: generating aesthetically pleasing images is one thing, but prompt following (like GPT-4 excels at) is another critical dimension that's harder to capture in simple preference voting.
Final Thoughts: Responsibility & Critical Thinking
Phew! What a week. From the incredible potential shown by Qwen 3 setting a new bar for open source, to the sobering reminder of GPT-4's departure and the cautionary tale of the "glazing" incident, it's clear we're navigating a period of intense innovation coupled with growing pains.
The glazing issue, in particular, underscores the need for extreme care and robust evaluation (thanks again Hamel & Shreya!) when deploying models that interact with millions, potentially influencing decisions and well-being. As AI becomes more integrated into our lives – helping us boil eggs (yes, I ask it stupid questions too!), offering support, or even suggesting purchases – we must remain critical thinkers.
Don't outsource your judgment entirely. Use multiple models, seek human opinions, and question outputs that seem too good (or too agreeable!) to be true. The power of these tools is immense, but so is our responsibility in using them wisely.
Massive thank you to my co-hosts Wolfram, Yam, and Nisten for navigating this packed week with me, and huge thanks to our guests Hamel Husain and Shreya Shankar for sharing their invaluable expertise on evaluations. And of course, thank you to this amazing community – hitting 1000 listeners! – for tuning in, commenting, and sharing breaking news. Your engagement fuels this show!
🔗 Subscribe to our show on Spotify: thursdai.news/spotify
🔗 Apple: thursdai.news/apple
🔗 Youtube: thursdai.news/yt (get in before 10K!)
And for the full show notes and links visit
👉 thursdai.news/may-1 👈
We'll see you next week for another round of ThursdAI!
Alex out. Bye bye!
ThursdAI - May 1, 2025 - Show Notes and Links
* Show Notes
* MCP/A2A Hackathon - with A2A team and awesome judges! 🤖🐶 (lu.ma/weavehacks)
* FullyConnected - Weights & Biases flagship 2 day conference (fullyconnected.com)
* Course - AI Evals For Engineers & PMs Questions for Shreya Shankar & Hamel Husain (link Promo code 35% of for listeners of ThursdAI - thursdai)
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
* Hamel Housain - @HamelHusain
* Shreya Shankar - @sh_reya
* Open Source LLMs
* Alibaba drops Qwen 3 - 2 MOEs, 6 dense (0.6B - 30B) (Blog, GitHub, HF, HF Demo, My tweet, Nathan breakdown)
* Microsoft - Phi-4-reasoning 14B + Plus (X, ArXiv, Tech Report , HF 14B SFT)
* MiMo-7B — Xiaomi’s MIT licensed model (HF)
* KyutAI - Helium-1 2B - (Blog, Model (2 B), Dactory pipeline)
* Qwen 2.5 omni updated (X)
* Big CO LLMs + APIs
* GPT-4 RIP - no longer in dropdown (RIP)
* Google - NotebookLM AI overviews are now multilingual (X)
* LlamaCon updates (X)
* OpenAI ChatGPT "glazing" update - revert back and why it matters (Announcement, AMA)
* Chatbot Arena Under Fire — “Leaderboard Illusion” vs. LMArena (Paper, Reply)
* This weeks Buzz
* MCP/A2A Hackathon - with A2A team and awesome judges! 🤖🐶 (lu.ma/weavehacks)
* FullyConnected - Weights & Biases flagship 2 day conference (fullyconnected.com)
* Vision & Video
* Runway References - consistency in video generation (X)
* AI Art & Diffusion & 3D
* HiDream E1 (HF)
* Agents, Tools & Interviews
* OpenPipe - ART·E open-source RL-trained email research agent (X, Blog | GitHub | Launch thread)
* PromptEvals - Interview with Shreya Shankar ( NAACL paper | Dataset | Models )

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
ThursdAI - Apr 23rd - GPT Image & Grok APIs Drop, OpenAI ❤️ OS? Dia's Wild TTS & Building Better Agents!
24 apr· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
Welcome back to ThursdAI! After what felt like ages of non-stop, massive model drops (looking at you, O3 and GPT-4!), we finally got that "chill week" we've been dreaming of since maybe... forever? It seems the big labs are taking a breather, probably gearing up for even bigger things next week (maybe some open source 👀).
But "chill" doesn't mean empty! This week was packed with fascinating developments, especially in the open source world and with long-awaited API releases. We actually had time to dive deeper into things, which was a refreshing change. We had a fantastic lineup of guests joining us too: Kwindla Kramer (@kwindla), our resident voice expert, dropped in to talk about some mind-blowing TTS and her own open-source VAD release. Maziyar Panahi (@MaziyarPanahi) gave us the inside scoop on OpenAI's recent meeting with the open source community. And Dex Horthy (@dexhorthy) from HumanLayer shared some invaluable insights on building robust AI agents that actually work in the real world. It was great having them alongside the usual ThursdAI crew: LDJ, Yam, Wolfram, and Nisten!
So, instead of rushing through a million headlines, we took a more relaxed pace. We explored NVIDIA's cool new Describe Anything model, dug into Google's Quantization Aware Training for Gemma, celebrated the much-anticipated API release for OpenAI's GPT Image generation (finally!), checked out the new Grok API, got absolutely blown away by a tiny, open-source TTS model from Korea called Dia, and debated the principles of building better AI agents. Plus, a surprise drop from Send AI with a powerful video model!
Let's dive in!
Open Source AI Highlights: Community, Vision, and Efficiency
Even with the big players quieter on the model release front, the open source scene was buzzing. It feels like this "chill" period gave everyone a chance to focus on refining tools, releasing datasets, and engaging with the community.
OpenAI Inches Closer to Open Source? Insights from the Community Meeting
Perhaps the biggest non-release news of the week was OpenAI actively engaging with the open source community. Friend of the show Maziyar Panahi was actually in the room (well, the Zoom room) and joined us to share what went down
It sounds like OpenAI came prepared, with Sam Altman himself spending significant time answering questions . Maziyar gave us the inside scoop, mentioning that OpenAI's looking to offload some GPU pressure by embracing open source – a win-win where they help the community, and the community helps lighten their load. He painted a picture of a company genuinely trying to listen and figure out how to best contribute. It felt less like a checkbox exercise and more like genuine engagement, which is awesome to see.
What did the community ask for? Based on Maziyar's recap, there was a strong consensus on several key points:
* Model Size: The sweet spot seemed to be not tiny, but not astronomically huge either. Something in the 70B-200B parameter range that could run reasonably on, say, 4 GPUs, leaving room for other models. People want power they can actually use without needing a supercomputer.
* Capabilities: A strong desire for reliable structured output. Surprisingly, there was less emphasis on complex, built-in reasoning, or at least the ability to toggle reasoning off. This likely stems from practical concerns about cost and latency in production environments. The community seems to value control and efficiency for specific tasks.
* Multilingual: Good support for European languages (at least 20) was a major request, reflecting the global nature of the open source community. Needs to be as good as English support.
* Base Models: A huge ask was for OpenAI to release base models. The reasoning? Empower the community to handle fine-tuning for specific tasks like coding, roleplay, or supporting underrepresented languages . Let the experts in those niches build on a solid foundation.
* Focus: Usefulness over chasing leaderboard glory. The community urged OpenAI to provide a solid, practical model rather than aiming for a temporary #1 spot that gets outdated in days or weeks . Stability, reliability, and long-term utility were prized over fleeting benchmark wins.
* Safety: A preference for separate guardrail models (similar to LlamaGuard or GemmaGuard) rather than overly aligning the main model, which often hurts performance and flexibility . Give users the tools to implement safety layers as needed, rather than baking in limitations that might stifle creativity or utility.
Perhaps most excitingly, Maziyar mentioned OpenAI seemed committed to regular open model releases, not just a one-off thin=! This, combined with recent moves like approving a community Pull Request to make their open-source Codex agent work with non-OpenAI models (as Yam Peleg excitedly pointed out!), suggests a potentially significant shift. Remember, it's been a long time since GPT-2 and Whisper were OpenAI's main open contributions! We're definitely watching this space closely. Huge shout out to OpenAI for listening and engaging with the builders.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
NVIDIA's DAM: Describe Anything Model (and Dataset!)
NVIDIA dropped something really cool this week: the Describe Anything Model (DAM), specifically DAM-3B, a 3 billion parameter multimodal model for region-based image and video captioning. Think Meta's Segment Anything (SAM), but instead of just segmenting, it also tells you what you've segmented, in detail.
We played around with the image demo on the show (HF demo) . You hover over an image, things get segmented on the fly (you can use points, boxes, scribbles, or masks), you click, and boom – a detailed description pops up for that specific region: "A brown bear with a thick, dense coat of fur..." . It's pretty slick and responsive!
While the demo didn't showcase video, the project page (X post) shows it working on videos too (DAM-3B-Video), tracking and describing objects like fish even as they move. This capability really impressed Yam, who rightly pointed out that tracking objects consistently over video is hard, so having a base model that understands this level and embeds it in language is seriously impressive. The model uses a "focal prompt" and gated cross-attention to fuse the full scene context with the selected region.
Nisten reminded us that our friend Piotr Skalski from Roboflow basically built a pipeline for this a while back by combining SAM with description models like Microsoft Florence . But DAM integrates it all into one efficient 3B parameter model (HF model), setting a new state-of-the-art on their introduced DLC-Bench (Detailed Localized Captioning).
Crucially, NVIDIA didn't just drop the model; they also released the Describe Anything Dataset (HF dataset) used to train it (built on subsets like COCO, Paco, SAM) and the code under a research-only license. This is fantastic for researchers and builders. Imagine using this for precise masking before sending an image to the new GPT Image API for editing – super useful! Big props to NVIDIA and their collaborators at UC Berkeley and UCSF for this contribution.
Gemma Gets Quantization Aware Training (QAT): Smaller Footprint, Sassy Attitude?
Google also pushed the open source envelope by releasing Gemma models trained with Quantization Aware Training (QAT). This isn't your standard post-training quantization; QAT involves incorporating the impact of quantization during the training process itself. As LDJ explained, this allows the model to adapt, potentially resulting in a quantized state with much higher quality and less performance degradation compared to just quantizing a fully trained model afterwards.
The results? Significant reductions in VRAM requirements across the board. The 27B parameter Gemma 3, for example, drops from needing a hefty 54GB to just 14.1GB ! Even the 1B model goes from 2GB to just half a gig. This makes running these powerful models much more accessible on consumer hardware. Folks are already running them in MLX, llama.cpp, LM Studio, etc. (Reddit thread)
Wolfram already took the 4B QAT model for a spin using LM Studio . The good news: it ran easily, needing only 5-6GB of RAM. The quirky news: it seemed to struggle a bit with prompt adherence in his tests, even giving Wolfram a sassy, winking-emoji response about ignoring the "fine print" in his complex system prompt when called out on a language switching error: "Who reads a fine print? 😉" ! He did note Gemma 3 now supports system prompts (unlike Gemma 2), which is a definite improvement .
(While NVIDIA also released OpenMath Nemotron, we didn't dive deep in the show, but worth noting its AIMO win and accompanying open dataset release!)
Voice and Audio Innovations: Emotional TTS and Smarter Conversations
Even in a "chill" week, the audio space delivered some serious excitement. Kwindla Kramer joined us to break down two major developments.
Dia TTS: Unhinged Emotion from a Small Open Model 🤯
This one absolutely blew up Twitter, and for good reason. Dia, from Nari Labs (essentially a student and a half in Korea!), is a 1.6 billion parameter open-weights (MIT licensed) text-to-dialogue model (Github, HF). What makes it special? The insane emotional range and natural interaction patterns. My Twitter post about it (X post) went viral, getting half a million views !
We played some examples, and they are just wild. You have to hear this to believe it:
* Check the Demos: Dia Demo Page | Fal.ai Voice Clone Demo
Another crazy thing is how it handles non-verbal cues like laughs or coughs specified in the text (e.g., (laughs)) . Instead of just tacking on a generic sound, it inflects the preceding words leading into the laugh, making it sound incredibly natural. It even handles interruptions seamlessly, cutting off one speaker realistically when another starts .
Kwin, our voice expert, offered some valuable perspective . While Dia is undeniably awesome and shows what's possible, it's very much a research model – likely unpredictable ("unhinged" was his word!) and probably required cherry-picking the best demos. Production models like 11Labs need predictability. Kwin also noted the dataset is probably scraped from YouTube (a common practice, explaining the lack of open audio data) and that the non-speech sounds are a key takeaway – the bar for TTS is rising beyond just clear speech .
PipeCat SmartTurn: Fixing Awkward AI Silences with Open Source VAD
Speaking of open audio, Kwin and the team at Daily/Pipecat had their own breaking news: they released an open-source checkpoint for their SmartTurn model – a semantic Voice Activity Detection (VAD) system (Github, HF Model)
What's the problem SmartTurn solves? That annoying thing where voice assistants interrupt you mid-thought just because you paused for a second. I've seen this happen with my kids all the time, making interaction frustrating! Semantic VAD, or "Smart Turn," is much smarter. It considers not just silence but also the context – audio patterns (like intonation suggesting you're not finished) and linguistic cues (like ending on "and..." or "so...") to make a much better guess about whether you're truly done talking. This is crucial for natural-feeling voice interactions, especially for kids or multilingual speakers (like me!) who might pause more often to find the right word.
And the data part is key here. They're building an open dataset for this, hosted on Hugging Face. You can even contribute your own voice data by playing simple games on their turn-training.pipecat.ai site (Try It Demo)! The cool incentive? The more diverse voice data they get (especially for different languages!), the better these systems will work for everyone. If your voice is in the dataset, future AI agents might just understand you a little better!
Kwin also mentioned their upcoming Voice AI Course co-created with friend-of-the-pod Swyx, hosted on Maven . It aims to be a comprehensive guide with code samples, community interaction, and insights from experts (including folks from Weights & Biases!). Check it out if you want to dive deep into building voice AI.
AI Art & Diffusion & 3D: Quick Hits
A slightly quieter week for major art model releases, but still some significant movement:
* OpenAI's GPT Image 1 API: We'll cover this in detail in the Big Companies section below, but obviously relevant here too as a major new tool for developers creating AI art and image editing applications .
* Hunyuan 3D 2.5 (Tencent): Tencent released an update to their 3D generation model, now boasting 10 billion parameters (up from 1B!) . They're highlighting massive leaps in precision (1024-resolution geometry), high-quality textures with PBR support, and improved skeletal rigging for animation X Post. Definitely worth keeping an eye on as 3D generation matures and becomes more accessible (they doubled the free quota and launched an API).
Agent Development Insights: Building Robust Agents with Dex Horthy
With things slightly calmer, it was the perfect time to talk about AI agents – a space buzzing with activity, frameworks, and maybe even a little bit of drama. We brought in Dex Horthy, founder of HumanLayer and author of the insightful "12 Factor Agent" essay (Github Repo), to share his perspective on what actually works when building agents for production.
Dex builds SDKs to help create agents that feel more like digital humans, aiming to deploy them where users already are (Slack, email, etc.), moving beyond simple chat interfaces. His experience led him to identify common patterns and pitfalls when trying to build reliable agents.
The Problem with Current Agent Frameworks
A key takeaway Dex shared? Many teams building serious, production-ready agents end up writing large parts from scratch. Why? Because existing frameworks often fall short in providing the necessary control and reliability for complex tasks. The common "prompt + bag of tools + figure it out" approach, while great for demos, struggles with reliability over longer, multi-step workflows . Think about it: even if each step is 92% reliable, after 10 steps, your overall success rate plummets due to compounding errors. That's just not good enough for customer-facing applications.
Key Principles: Small Agents, Owning Context
So, what does work today according to Dex's 12 factors?
* Small, Focused Agents: Instead of one giant, monolithic agent trying to do everything, the more reliable approach is to build smaller "micro-agents" that handle specific, well-defined parts of a workflow ]. As models get smarter, these micro-agents might grow in capability, but the principle of breaking down complexity holds. Find something at the edge of the model's capability and nail it consistently .
* Own Your Prompts & Context: Don't let frameworks abstract away control over the exact tokens going into the LLM or how the context window is managed. This is crucial for performance tuning. Even with massive context windows (like Gemini's 2M tokens), smaller, carefully curated context often yields better results and lower costs . Maximum performance requires owning every single token.
Dex's insights provide a crucial dose of pragmatism for anyone building or thinking about building AI agents in this rapidly evolving space. Check out his full 12 Factor Agent essay and the webinar recording for a deeper dive.
Big Companies & APIs: GPT Image and Grok Get Developer Access
While new foundation models were scarce from the giants this week, they did deliver on the API front, opening up powerful capabilities to developers.
OpenAI Finally Releases GPT Image 1 API! (X Post)
This was a big one many developers were waiting for. OpenAI's powerful image generation capabilities, previously locked inside ChatGPT, are now available via API under the official name gpt-image-1 (Docs) . No more awkward phrasing like "the new image generation capabilities within chat gpt"!
Getting access requires organizational verification, which involved a slightly intense biometric scan process for me – feels like they're taking precautions given the model's realism and potential for misuse . Understandable, but something developers need to be aware of .
The API (API Reference) offers several capabilities:
* Generations: Creating images from scratch based on text prompts.
* Edits: Modifying existing images using a new prompt, crucially supporting masking for partial edits. This is huge for targeted changes and perfect for combining with segmentation models like NVIDIA's DAM!
There's a nice playground interface in the console, and you have interesting controls over the output:
* Quality: Instead of distinct models, you select a quality level (standard/HD) which impacts the internal "thinking time" and cost . It seems to be a reasoning model under the hood, so quality relates to compute/latency.
* Number: Generate up to 10 images at once.
* Transparency: Supports generating images with transparent backgrounds
I played around with it, generating ads and even trying to get it to make a ThursdAI thumbnail with my face. The text generation is excellent – it nailed "ThursdAI" perfectly on an unhinged speaker ad Nisten prompted! It follows complex style prompts well.
However, generating realistic faces, especially matching a specific person like me, seems... really hard right now . Even after many attempts providing a source image and asking it to replace a face, the results were generic or only vaguely resembled me. It feels almost intentionally nerfed, maybe as a safety measure to prevent deepfakes? I still used it for the thumbnail, but yeah, it could be better on faces.
OpenAI launched with several integration partners like Adobe, Figma, Wix, HeyGen, and Fal.ai already onboard. Expect to see these powerful image generation capabilities popping up everywhere!
Grok 3 Mini & Grok 3 Now Available via API (+ App Updates)
Elon's xAI also opened the gates this week, making Grok 3 Mini and Grok 3 available via API (Docs).
The pricing structure is fascinating and quite different from others. Grok 3 Mini is incredibly cheap for input ($0.30 / 1M tokens) with only a modest bump for output ($0.50 / 1M). The "Fast" versions, however, cost significantly more, especially for output tokens (Grok 3 Fast is $5 input / $25 output per million!) . It seems like a deliberate play on the "fast, cheap, smart" triangle, giving developers explicit levers to pull based on their needs.
Benchmarks provided by xAI position Grok 3 Mini competitively against other small models like Gemini 2.5 Flash and O4 Mini, scoring well on AIME (93%) and coding benchmarks.
Speaking of the app, the iOS version got a significant update adding a live video view (let Grok see what you see through your camera) and multilingual audio support (X Post) . Prepare for some potentially unhinged, real-time video roasting if you use the fun mode with the camera on ! Multilingual audio and search are also rolling out to SuperGrok users on Android.
(Side note: We briefly touched on O3's recent wonkiness in following instructions for tone, despite its amazing GeoGuessr abilities! Something feels off there lately.)
Vision and Video: Send AI's Surprise Release & More
Just when we thought the week was winding down on model releases...
Send AI Drops MAGI-1: 24B Video Model with Open Weights! 🔥
Out of seemingly nowhere, a company called Send AI released details (and then the weights!) for MAGI-1, a 24 billion parameter autoregressive diffusion model for video generation (X Post, GitHub, PDF Report).
The demos looked stunning, showcasing impressive long-form video generation with remarkable character consistency – often the Achilles' heel of AI video . Nisten speculated this could be a major step towards usable AI-generated movies, solving the critical face/character consistency problem . They achieve this by predicting video in 24-frame chunks with causal attention between them, allowing for real-time streaming generation where compute doesn't scale with length. They also highlighted an "infinite extension" capability, allowing users to build out longer scenes by injecting new prompts or continuing footage.
Their technical report dives into the architecture, mentioning novel techniques like a custom "MagiAttention" kernel that scales to massive contexts and helps achieve the temporal consistency. It also sets SOTA on VBench-I2V and Physics-IQ benchmarks.
And the biggest surprise? They released the model weights under an Apache 2.0 license on Hugging Face ! This is huge! Just as we sometimes lament the lack of open source momentum from certain players, Send AI drops this 24B parameter beast with open weights. Amazing! Go download it!
Framepack: Long Videos on Low VRAM
Wolfram also flagged Framepack, another interesting video development from the research world from the creator of ControlNet. FramePack is a next-frame (next-frame-section) prediction neural network structure that generates videos progressively. (Github)
Character AI AvatarFX Steps In
Also in the visual space, Character AI announced AvatarFX in early access (Website), stepping into the realm of animated, speaking visual avatars derived from images. It seems like everyone wants to bring characters to life visually now.
This Week's Buzz from W&B / Community
Quick hits on upcoming events and community stuff:
* WeaveHacks Coming to SF! Mark your calendars! We're hosting a hackathon focused on building with W&B Weave at the Weights & Biases office in San Francisco on May 17th-18th [0:06:15]. If you're around, especially if you're coming into town for Google I/O the week after, come hang out, build cool stuff, and say hi! We're planning to go all out with sponsors and prizes (announcements coming soon). lu.ma/weavehacks
* Fully Connected Conference Reminder: Our flagship W&B conference, Fully Connected, is happening in San Francisco on June 18th [0:06:30]. It's where our customers, partners, and the community come together for two days of talks, workshops, and networking focused on production AI. It's always an incredible event. (fullyconnected.com)
Wrapping Up the "Chill" Week That Wasn't Quite Chill
Phew! See? Even a "chill" week in AI is overflowing with news when you actually have time to stop and breathe for a second. From OpenAI's fascinating open source tango and the practical (and long-awaited!) API releases of GPT Image and Grok, to the sheer creative potential shown by indie projects like Dia and Send AI's Maggie, and the grounding principles for building agents that actually work from Dex – there was a ton to absorb and discuss. It felt good to have the space to go a little deeper.
It was fantastic having Kwin, Maziar, and Dex join the regulars (LDJ, Yam, Wolfram, Nisten) to share their expertise and firsthand insights. A huge thank you to them and to everyone tuning in live across X, YouTube, LinkedIn, and participating in the chat! Your questions and comments make the show what it is.
Don't forget, if you missed anything, the full show is available as a podcast (search "ThursdAI" wherever you get your podcasts)
🔗 Subscribe to our show on Spotify: thursdai.news/spotify
🔗 Apple: thursdai.news/apple
🔗 Youtube: thursdai.news/yt
Next week? The rumors suggest the big labs might be back with major releases . The brief calm might be over! Buckle up! We'll be here to break it all down.
See you next ThursdAI!- Alex
TL;DR and Show Notes (April 23rd, 2024)
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases @altryne
* Co Hosts - Wolfram Ravenwlf @WolframRvnwlf, Yam Peleg @yampeleg, Nisten Tahiraj @nisten, LDJ @ldjconfirmed
* Kwindla Kramer @kwindla - Daily Co-Founder // Voice expert
* Dexter Horthy @dexhorthy - HumanLayer // Agents expert
* Maziyar Panahi @MaziyarPanahi - OSS maintainer
* Open Source AI - LLMs, Vision, Voice & more
* OpenAI OSS Meeting: Insights from Maziar [0:16:37].
* NVIDIA Describe Anything (DAM-3B): 3B param multimodal LLM for region-based image/video captioning. (X Post, HF model, HF demo)
* Google Gemma QAT: Quantization-Aware Training models (X, Blog)
* Big CO LLMs + APIs
* OpenAI GPT Image 1 API: (X Post, Docs, API Reference)
* Grok API & App Updates: Grok 3 and Grok 3 Mini available via API. (API Docs, App Update X Post)
* This weeks Buzz - Weights & Biases
* WeaveHacks SF: Hackathon May 17-18 at W&B HQ. lu.ma/weavehacks
* Fully Connected: W&B's 2-day conference, June 18-19 in SF fullyconnected.com
* Vision & Video
* Send AI MAGI-1: 24B autoregressive diffusion model for long, streaming video (X Post, GitHub, PDF Report, HF Repo)
* Character AI AvatarFX: Early access for creating speaking/emoting avatars from images . (Website)
* Framepack: Mentioned for long video generation (120s) on low VRAM (6GB). (Project Page)
* Voice & Audio
* Nari Labs Dia: 1.6B param OSS TTS model (X Post Highlight, HF Model, Github, Fal.ai Demo)
* PipeCat Smart-Turn VAD: Open source semantic VAD model (Github, HF Model, Fal.ai Playground, Try It Demo)
* AI Art & Diffusion & 3D
* Hunyuan 3D 2.5 (Tencent): 10B param update [0:09:06]. Higher res geometry, PBR textures, improved rigging. (X Post)
* Agents , Tools & Links
* 12 Factor Agents: Discussion with Dex Horthy on building robust agents (Github Repo)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
ThursdAI - Apr 17 - OpenAI o3 is SOTA llm, o4-mini, 4.1, mini, nano, G. Flash 2.5, Kling 2.0 and 🐬 Gemma? Huge AI week + A2A protocol interview
17 apr· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
Wow. Just… wow. What a week, folks. Seriously, this has been one for the books.
This week was dominated by OpenAI's double whammy: first the GPT-4.1 family dropped with a mind-boggling 1 million token context window, followed swiftly by the new flagship reasoning models, o3 and o4-mini, which are already blowing minds with their agentic capabilities. We also saw significant moves from Google with VEO-2 going GA, the fascinating A2A protocol launch (we had an amazing interview with Google's Todd Segal about it!), and even an attempt to talk to dolphins with DolphinGemma. Kling stepped up its video game, Cohere dropped SOTA multimodal embeddings, and ByteDance made waves in image generation. Plus, the open-source scene had some interesting developments, though perhaps overshadowed by the closed-source giants this time.
o3 has absolutely taken the crown as the conversation piece, so lets start with it (as always, TL;DR and shownotes at the end, and here's the embedding of our live video show)
Big Company LLMs + APIs
OpenAI o3 & o4‑mini: SOTA Reasoning Meets Tool‑Use (Blog, Watch Party)
The long awaited o3 models (promised to us in the last days of x-mas) is finally here, and it did NOT disappoint and well.. even surprised!
o3 is not only SOTA on nearly all possible logic, math and code benchmarks, which is to be expected from the top reasoning model, it also, and I think for the first time, is able to use tools during its reasoning process. Tools like searching the web, python coding, image gen (which it... can zoom and rotate and crop images, it's nuts) to get to incredible responses faster.
Tool using reasoner are... almost AGI?
This is the headline feature for me. For the first time, these o-series models have full, autonomous access to all built-in tools (web search, Python code execution, file search, image generation with Sora-Image/DALL-E, etc.). They don't just use tools when told; they decide when and how to chain multiple tool calls together to solve a problem. We saw logs with 600+ consecutive tool calls! This is agent-level reasoning baked right in.
Anecdote: We tested this live with a complex prompt: "generate an image of a cowboy that on his head is the five last digits of the hexadecimal code of the MMMU score of the latest Gemini model." o3 navigated this multi-step task flawlessly: figuring out the latest model was Gemini 2.5, searching for its MMMU score, using the Python tool to convert it to hex and extract the digits, and then using the image generation tool. It involved multiple searches and reasoning steps. Absolutely mind-blowing 🤯.
Thinking visually with images
This one also blew my mind, this model is SOTA on multimodality tasks, and a reason for this, is these models can manipulate and think about the images they received. Think... cropping, zooming, rotating. The models can now perform all these tasks to multimodal requests from users. Sci-fi stuff!
Benchmark Dominance: As expected, these models crush existing benchmarks.
o3 sets new State-of-the-Art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more. It scored a staggering $65k on the Freelancer eval (simulating earning money on Upwork) compared to o1's $28k!
o4-mini is no slouch either. It hits 99.5% on AIME (math problems) when allowed to use its Python interpreter and beats the older o3-mini on general tasks. It’s a reasoning powerhouse at a fraction of the cost.
Incredible Long Context Performance
Yam highlighted this – on the Fiction Life benchmark testing deep comprehension over long contexts, o3 maintained nearly 100% accuracy up to 120,000 tokens, absolutely destroying previous models including Gemini 2.5 Pro and even the new GPT-4.1 family on this specific eval. While its context window is currently 200k (unlike 4.1's 1M), its performance within that window is unparalleled.
Cost-Effective Reasoning: They're not just better, they're cheaper for the performance you get.
* o3: $10 input / $2.50 cached / $40 output per million tokens.
* o4-mini: $1.10 input / $0.275 cached / $4.40 output per million tokens. (Cheaper than GPT-4.0!)
Compute Scaling Validated: OpenAI confirmed these models used >10x the compute of o1 and leverage test-time compute scaling (spending longer on harder problems), further proving their scaling law research.
Memory Integration: Both models integrate with ChatGPT's recently upgraded memory feature which has access to all your previous conversations (which we didn't talk about but is absolutely amazing, try asking o3 stuff it knows about you and have ti draw conclusions!)
Panel Takes & Caveats:While the excitement was palpable, Yam noted some community observations about potential "rush" – occasional weird hallucinations or questionable answers compared to predecessors, possibly a side effect of cramming so much training data. Nisten, while impressed, still found the style of GPT-4.1 preferable for specific tasks like generating structured medical notes in his tests. It highlights that benchmarks aren't everything, and specific use cases require evaluation (shameless plug: use tools like W&B Weave for this!).
I'll add my own, I use all the models every week to help me draft posts, and o3 was absolute crap about matching my tone. % of what's written above it was able to mimic. Gemini remains undefeated for me and this task.
Though, Overall, o3 and o4-mini feel like a paradigm shift towards more autonomous, capable AI assistants. The agentic future feels a whole lot closer.
OpenAI Launches GPT-4.1 Family: 1 Million Tokens & Killing 4.5! (Our Coverage, Prompting guide)
Before the o3 shockwave, Monday brought its own major AI update: the GPT-4.1 family. This was the API-focused release, delivering massive upgrades for developers.
The Headline: One Million Token Context Window! 🤯 Yes, you read that right. All three new models – GPT-4.1 (flagship), GPT-4.1 mini (cheaper/faster), and GPT-4.1 nano (ultra-cheap/fast) – can handle up to 1 million tokens. This is a monumental leap, enabling use cases that were previously impossible or required complex chunking strategies.
Key Details:
Goodbye GPT-4.5!
In a surprising twist, OpenAI announced they are deprecating the recently introduced (and massive) GPT-4.5 model within 90 days in the API. Why? Because GPT-4.1 actually outperforms it on key benchmarks like SW-Bench, Aider Polyglot, and the new long-context MRCR eval, while being far cheaper to run. It addresses the confusion many had: why was 4.5 seemingly worse than 4.1? It seems 4.5 was a scaling experiment, but 4.1 represents a more optimized, better-trained checkpoint on superior data. RIP 4.5, we hardly knew ye (in the API).
The Prompt Sandwich Surprise! 🥪:
This was wild. Following OpenAI's new prompting guide, I tested the "sandwich" technique (instructions -> context -> instructions again) on my hard reasoning eval using W&B Weave.
For GPT-4.1, it made no difference (still got 48%). But for GPT-4.1 mini, the score jumped from 31% to 49% – essentially matching the full 4.1 model just by repeating the prompt! That's a crazy performance boost for a simple trick. Even nano saw a slight bump. Lesson: Evaluate prompt techniques! Don't assume they won't work.
Million-Token Recall Confirmed: Using Needle-in-Haystack and their newly open-sourced MRCR benchmark (Multi-round Co-reference Resolution – more in Open Source), OpenAI showed near-perfect recall across the entire 1 million token window for all three models, even nano! This isn't just a theoretical limit; the recall seems robust.
Multimodal Gains: Impressively, 4.1 mini hit 72% on Video-MME, pushing SOTA for long-video Q&A in a mid-tier model by analyzing frame sequences.
4.1 mini seems to be the absolute powerhouse of this release cycle, it nearly matches the intelligence of the previous 4o, while being significantly cheaper and much much faster with 1M context window!
Windsurf (and Cursor) immediately made the 4.1 family available, offering a free week for users to test them out (likely to gather feedback and maybe influenced by certain acquisition rumors 😉). Devs reported them feeling snappier and less verbose than previous models.
Who Should Use Which OpenAI API?
My initial take:
* For complex reasoning, agentic tasks, or just general chat: Use o3 (if you need the best) or o4-mini (for amazing value/speed).
* For API development, especially coding or long-context tasks: Evaluate the GPT-4.1 family. Start with 4.1 mini – it's likely the sweet spot for performance/cost, especially with smart prompting. Use 4.1 if mini isn't quite cutting it. Use nano for simple, high-volume tasks like translation or basic classification.
The naming is still confusing (thanks Nisten for highlighting the UI nightmare!), but the capability boost across the board is undeniable.
Hold the Phone! 🚨 Google Fires Back with Gemini 2.5 Flash in Breaking News
Just when we thought the week couldn't get crazier, Google, likely reacting to OpenAI's rapid-fire launches, just dropped Gemini 2.5 Flash into preview via the Gemini API (in AI Studio and Vertex AI). This feels like Google's direct answer, aiming to blend reasoning capabilities with speed and cost-effectiveness.
The Big Twist: Controllable Thinking Budgets!Instead of separate models like OpenAI, Gemini 2.5 Flash tries to do both reasoning and speed/cost efficiency in one model. The killer feature? Developers can set a "thinking budget" (0 to 24,576 tokens) per API call to control the trade-off:
* Low/Zero Budget: Prioritizes speed and low cost (very cheap: $0.15 input / $0.60 output per 1M tokens), great for simpler tasks.
* Higher Budget: Allows the model multi-step reasoning "thinking" for better accuracy on complex tasks, at a higher cost ($3.50 output per 1M tokens, including reasoning tokens).
This gives granular control over the cost/quality balance within the same model.
Performance & Specs:Google claims strong performance, ranking just behind Gemini 2.5 Pro on Hard Prompts in ChatBot Arena and showing competitiveness against o4-mini and Sonnet 3.7 in their benchmarks, especially given the flexible pricing.
Key specs are right up there with the competition:
* Multimodal Input: Text, Images, Video, Audio
* Context Window: 1 million tokens (matching GPT-4.1!)
* Knowledge Cutoff: January 2025
How to Control Thinking:Simply set the thinking_budget parameter in your API call (Python/JS examples available in their docs). If unspecified, the model decides automatically.
My Take: This is a smart play by Google. The controllable thinking budget is a unique and potentially powerful feature for optimizing across different use cases without juggling multiple models. With 1M context and competitive pricing, Gemini 2.5 Flash is immediately a major contender in the ongoing AI arms race. Definitely one to evaluate! Find more in the developer docs and Gemini Cookbook.
Open Source: LLMs, Tools & more
OpenAI open sources MRCR eval and Codex (Mrcr HF, Codex Github)
Let's face it, this isn't the open source OpenAI coverage I was hoping for, Sam promised us an open source model, and they are about to drop this, I'd assume close to Google IO (May 20th) to steal thunder. But OpenAI did make OpenSource waves this week in addition to the above huge stories.
MRCR is a way to evaluate long context complex tasks, and they have taken this Gemini research and open sourced a dataset for this eval. 👏
But also, they have dropped the Codex CLI tool, which is a coding partner using o4-mini and o3 and made that tool open source as well (Unlike anthropic with Claude Code), which in turn saw 86+ Pull Requests approved within the first 24 hours!
The best part about this CLI, is that it's hardened security, using Apple Seatbelt which limits it execution to the current directory + temp files (on a mac at least)
Other Open Source Updates
While OpenAI's contributions were notable, it wasn't the only action this week:
* Microsoft's BitNet v1.5 (HF): Microsoft quietly dropped updates to BitNet, continuing their exploration into ultra-low-bit (ternary) models for efficiency. As Nisten pointed out on the show though, keep in mind these still use some higher-precision layers, so they aren't purely 1.5-bit in practice just yet. Important research nonetheless!
* INTELLECT-2 Distributed RL (Blog, X): Prime Intellect did something wild – training INTELLECT-2, a 32B model, using globally distributed, permissionless reinforcement learning. Basically, anyone with a GPU could potentially contribute. Fascinating glimpse into decentralized training!
* Z.ai (Formerly ChatGLM) & GLM-4 Family (X, HF, GitHub): The team behind ChatGLM rebranded to Z.ai and released their GLM-4 family (up to 32B parameters) under the very permissive MIT license. They're claiming performance competitive with much larger models like Qwen 72B, which is fantastic news for commercially usable open source!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
This Week's Buzz: Playground Updates & A Deep Dive into A2A
On the Weights & Biases front, it's all about enabling developers to navigate this new model landscape.
Weave Playground Supports GPT-4.1 and o3/o4-mini (X)
With all these new models dropping, how do you actually choose which one is best for your application? You need to evaluate! Our W&B Weave Playground now has full support for the new GPT-4.1 family and the o3/o4-mini models.
If you're using Weave to monitor your LLM apps in production, you can easily grab a trace of a real user interaction, open it in the Playground, and instantly retry that exact same call (with all its context and history) using any of the new models side-by-side. It’s the fastest way to see how o3 compares to 4.1-mini or how Claude 3.7 stacks up against o4-mini on your specific data. Essential for making informed decisions in this rapidly changing environment.
Deep Dive: Understanding Google's A2A Protocol with Todd Segal
This was a highlight of the show for me. We were joined by Todd Segal, a Principal Software Engineer at Google working directly on the new Agent-to-Agent (A2A) protocol. There was some confusion initially about how A2A relates to the increasingly popular Model Context Protocol (MCP), so getting Todd's perspective was invaluable. W&B is a proud launch partner for the A2A protocol!
Key Takeaways from our Chat:
* A2A vs. MCP: Complementary, Not Competitive: Todd was clear: Google sees these as solving different problems. MCP is for Agents talking to Tools (structured, deterministic capabilities). A2A is for Agents talking to other Agents (unstructured, stateful, unpredictable, evolving interactions). Think of MCP like calling an API, and A2A like delegating a complex task to another expert service.
* The Need for A2A: It emerged from the need for specialized, domain-expert agents (built internally or by partners like Salesforce) to collaborate on complex, long-running tasks (e.g., booking a multi-vendor trip, coordinating an enterprise workflow) where simple tool calls aren't enough. Google's Agent Space product heavily utilizes A2A internally.
* Capability Discovery & Registries: A core concept is agents advertising their capabilities via an "agent card" (like a business card or resume). Todd envisions a future with multiple registries (public, private, enterprise-specific) where agents can discover other agents best suited for a task. This registry system is on the roadmap.
* Async & Long-Running Tasks: A2A is designed for tasks that might take minutes, hours, or even days. It uses a central "Task" abstraction which is stateful. Agents communicate updates (status changes, generated artifacts, requests for more info) related to that task.
* Push Notifications: For very long tasks, A2A supports a push notification mechanism. The client agent provides a secure callback URL, and the server agent can push updates (state changes, new artifacts) even if the primary connection is down. This avoids maintaining costly long-lived connections.
* Multimodal Communication: The protocol supports negotiation of modalities beyond text, including rendering content within iframes (for branded experiences) or exchanging video/audio streams. Essential for future rich interactions.
* Security & Auth: A2A deliberately doesn't reinvent the wheel. It relies on standard HTTP headers to carry authentication (OAuth tokens, internal enterprise credentials). Identity/auth handshakes happen "out of band" using existing protocols (OAuth, OIDC, etc.), and the resulting credentials are passed with A2A requests. Your user identity flows through standard mechanisms.
* Observability: Todd confirmed OpenTelemetry (OTel) support is planned for the SDKs. Treating agents like standard microservices means leveraging existing observability tools (like W&B Weave!) is crucial for tracing and debugging multi-agent workflows.
* Open Governance: While currently in a Google repo, the plan is to move A2A to a neutral foundation (like Linux Foundation) with a fully open governance model. They want this to be a true industry standard.
* Getting Started: Check out the GitHub repo (github.com/google/A2A), participate in discussions, file issues, and send PRs!
My take: A2A feels like a necessary piece of infrastructure for the next phase of AI agents, enabling complex, coordinated actions across different systems and vendors. While MCP handles the "how" of using tools, A2A handles the "who" and "what" of inter-agent delegation. Exciting times ahead! Big thanks to Todd for shedding light on this.
Vision & Video: Veo-2 Arrives, Kling Gets Slicker
The visual AI space keeps advancing rapidly.
Veo-2 Video Generation Hits GA in Vertex AI & Gemini App (Blog, Try It)
Google's answer to Sora and Kling, Veo-2, is now Generally Available (GA) for all Google Cloud customers via Vertex AI. You can also access it in the Gemini app.
Veo-2 produces stunningly realistic and coherent video, making it a top contender alongside OpenAI's Sora and Kling. Having it easily accessible in Vertex AI is a big plus for developers on Google Cloud.
I've tried and keep tyring all of them, VEO2 is an absolute beast in realism.
Kling 2.0 Creative Suite: A One-Stop Shop for Video AI? (X, Blog)
Kuaishou's Kling model also got a major upgrade, evolving into a full Kling 2.0 Creative Suite.
Anecdote: I actually stayed up quite late one night trying to piece together info from a Chinese live stream about this release! The dedication is real, folks. 😂
What's New:
* Kling 2.0 Master: The core video model, promising better motion, physics, and facial consistency (still 5s clips for now, but 30s/4K planned).
* Kolors 2.0: An integrated image generation and restyling model (think Midjourney-style filters).
* MVL (Multimodal Visual Language) Prompting: This is killer! You can now inline images directly within your text prompt for precise control (e.g., "Swap the hoodie in @video1 with the style of @image2"). This offers granular control artists have been craving.
* Multi-Elements Editor: A timeline-based editor to stitch clips, add lip-sync, sound effects (including generated ones like "car horn"), and music.
* Global Access: No more Chinese phone number requirement! Available worldwide at klingai.com.
* Official API via FAL: Developers can now integrate Kling 2.0 via our friends at ⚡ FAL Generative Media Cloud.
Kling is clearly aiming to be a holistic creative platform, reducing the need to jump between 17 different AI tools for image gen, video gen, editing, and sound. The MVL prompting is particularly innovative. Very impressive package.
Voice & Audio: Talking to Dolphins? 🐬
DolphinGemma: Google AI Listens to Flipper (Blog)
In perhaps the most delightful news of the week, Google, in collaboration with Georgia Tech and the Wild Dolphin Project, announced DolphinGemma.
It's a ~400M parameter audio model based on the Gemma architecture (using SoundStream for audio tokenization) trained specifically on decades of recorded dolphin clicks, whistles, and pulses.The goal? To decipher the potential syntax and structure within dolphin communication and eventually enable rudimentary two-way interaction using underwater communication devices. It runs on a Pixel phone for field deployment.
This is just awesome. Using AI not just for human tasks but to potentially bridge the communication gap with other intelligent species is genuinely inspiring. We joked on the show about doing a segment of just dolphin noises – maybe next time if DolphinGemma gets an API! 🤣
AI Art & Diffusion & 3D: Seedream Challenges the Champs
Seedream 3.0: ByteDance's Bilingual Image Powerhouse (Tech post, arXiv, AIbase news)
ByteDance wasn't just busy with video; their Seed team announced Seedream 3.0, a powerful bilingual text-to-image model.
Highlights:
* Generates native 2048x2048 images.
* Fast inference (~3 seconds for 1Kx1K on an A100).
* Excellent bilingual (Chinese/English) text rendering, even small fonts.
* Uses Scaled-ROPE-v2 for better high-resolution generation without artifacts.
* Claims to outperform SDXL-Turbo and Qwen-Image on fidelity and prompt adherence benchmarks.
* Available via Python SDK and REST API within their Doubao Studio and coming soon to dreamina.com
Phew! We made it. What an absolute avalanche of news. OpenAI truly dominated with the back-to-back launches of the hyper-capable o3/o4-mini and the massively scaled GPT-4.1 family. Google countered strongly with the versatile Gemini 2.5 Flash, key GA releases like Veo-2, and the strategically important A2A protocol. The agent ecosystem took huge leaps forward with both A2A and broader MCP adoption. And we saw continued innovation in multimodal embeddings, video generation, and even niche areas like bioacoustics and low-bit models.
If you feel like you missed anything (entirely possible this week!), the TL;DR and links below should help. Please subscribe if you haven't already, and share this with a friend if you found it useful – it's the best way to support the show!
I have a feeling next week won't be any slower. Follow us on X/Twitter for breaking news between shows!
Thanks for tuning in, keep building, keep learning, and I'll see you next Thursday!
Alex
TL;DR and Show Notes
Everything we covered today in bite-sized pieces with links!
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed)
* Todd Segal - Principal Software Engineer @ Google - Working on A2A Protocol
* Big CO LLMs + APIs
* 👑 OpenAI launches o3 and o4-mini in chatGPT & API (Blog, Our Coverage, o3 and o4-mini announcement)
* OpenAI launches GPT 4.1, 4.1-mini and 4.1-nano in API (Our Coverage, Prompting guide)
* 🚨 Google launches Gemini 2.5 Flash with controllable thinking budgets (Blog Post - Placeholder Link, API Docs)
* Mistral classifiers Factory
* Claude does research + workspace integration (Blog)
* Cohere Embed‑4 — Multimodal embeddings for enterprise search (Blog, Docs Changelog, X)
* Open Source LLMs
* OpenAI open sources MRCR Long‑Context Benchmark (Hugging Face)
* Microsoft BitNet v1.5 (HF)
* INTELLECT‑2 — Prime Intellect’s 32B “globally‑distributed RL” experiment (Blog, X)
* Z.ai (previously chatGLM) + GLM‑4‑0414 open‑source family (X, HF Collection, GitHub)
* This weeks Buzz + MCP/A2A
* Weave playground support for GPT 4.1 and o3/o4-mini models (X)
* Chat with Todd Segal - A2A Protocol (GitHub Spec)
* Vision & Video
* Veo‑2 Video Generation in GA, Gemini App (Dev Blog)
* Kling 2.0 Creative Suite (X, Blog)
* ByteDance public Seaweed-7B, a video generation foundation model (seaweed.video)
* Voice & Audio
* DolphinGemma — Google AI tackles dolphin communication (Blog)
* AI Art & Diffusion & 3D
* Seedream 3.0 bilingual image diffusion – ByteDance (Tech post, arXiv, AIbase news)
* Tools
* OpenAI debuts Codex CLI, an open source coding tool for terminals (Github)
* Use o3 with Windsurf (which OpenAI is rumored to buy at $3B) via the mac app integration + write back + multiple files

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
💯 ThursdAI - 100th episode 🎉 - Meta LLama 4, Google tons of updates, ChatGPT memory, WandB MCP manifesto & more AI news
10 apr· ThursdAI - The top AI news from the past week
Hey Folks,
Alex here, celebrating an absolutely crazy (to me) milestone, of #100 episodes of ThursdAI 👏 100 episodes in a year and a half (as I started publishing much later than I started going live, and the first episode was embarrassing), 100 episodes that documented INCREDIBLE AI progress, we mention on the show today, we used to be excited by context windows jumping from 4K to 16K!
I want to extend a huge thank you to every one of you, who subscribes, listens to the show on podcasts, joins the live recording (we regularly get over 1K live viewers across platforms), shares with friends and highest thank you for the paid supporters! 🫶 Sharing the AI news progress with you, energizes me to keep going, despite the absolute avalanche of news every week.
And what a perfect way to celebrate the 100th episode, on a week that Meta dropped Llama 4, sending the open-source world into a frenzy (and a bit of chaos). Google unleashed a firehose of announcements at Google Next. The agent ecosystem got a massive boost with MCP and A2A developments. And we had fantastic guests join us – Michael Lou diving deep into the impressive DeepCoder-14B, and Liad Yosef & Ido Salomon sharing their wild ride creating the viral GitMCP tool.
I really loved today's show, and I encourage those of you who only read, to give this a watch/listen, and those of you who only listen, enjoy the recorded version (though longer and less edited!)
Now let's dive in, there's a LOT to talk about (TL;DR and show notes as always, at the end of the newsletter)
Open Source AI & LLMs: Llama 4 Takes Center Stage (Amidst Some Drama)
Meta drops Llama 4 - Scout 109B/17BA & Maverick 400B/17BA (Blog, HF, Try It)
This was by far the biggest news of this last week, and it dropped... on a Saturday? (I was on the mountain ⛷️! What are you doing Zuck)
Meta dropped the long awaited LLama-4 models, huge ones this time
* Llama 4 Scout: 17B active parameters out of ~109B total (16 experts).
* Llama 4 Maverick: 17B active parameters out of a whopping ~400B total (128 experts).
* Unreleased: Behemoth - 288B active with 2 Trillion total parameters chonker!
* Both base and instruct finetuned models were released
These new models are all Multimodal, Multilingual MoE (mixture of experts) architecture, and were trained with FP8, for significantly more tokens (around 30 Trillion Tokens!) with interleaved attention (iRoPE), and a refined SFT > RL > DPO post-training pipeline.
The biggest highlight is the stated context windows, 10M for Scout and 1M for Maverick, which is insane (and honestly, I haven't yet seen a provider that is even remotely able to support anything of this length, nor do I have the tokens to verify it)
The messy release - Big Oof from Big Zuck
Not only did Meta release on a Saturday, messing up people's weekends, Meta apparently announced a high LM arena score, but the model they provided to LMArena was... not the model they released!?
It caused LMArena to release the 2000 chats dataset, and truly, some examples are quite damning and show just how unreliable LMArena can be as vibe eval.
Additionally, during the next days, folks noticed discrepancies between the stated eval scores Meta released, and the ability to evaluate them independently, including our own Wolfram, who noticed that a quantized version of Scout, performed better on his laptop while HIGHLY quantized (read: reduced precision) than it was performing on the Together API inference endpoint!?
We've chatted on the show that this may be due to some VLLM issues, and speculated about other potential reasons for this.
Worth noting the official response from Ahmad Al-Dahle, head of LLama at Meta, who mentioned stability issues between providers and absolutely denied any training on any benchmarks
Too big for its own good (and us?)
One of the main criticism the OSS community had about these releases, is that for many of us, the reason for celebrating Open Source AI, is the ability to run models without network, privately on our own devices.
Llama 3 was released in 8-70B distilled versions and that was incredible for us local AI enthusiasts! These models, despite being "only" 17B active params, are huge and way to big to run on most local hardware, and so the question then is, if we're getting a model that HAS to run on a service, why not use Gemini 2.5 that's MUCH better and faster and cheaper than LLama?
Why didn't Meta release those sizes? Was it due to an inability to beat Qwen/DeepSeek enough? 🤔
My Take
Despite the absolutely chaotic rollout, this is still a monumental effort from Meta. They spent millions on compute and salaries to give this to the community. Yes, no papers yet, the LM Arena thing was weird, and the inference wasn't ready. But Meta is standing up for Western open-source in a big way. We have to celebrate the core contribution while demanding better rollout practices next time. As Wolfram rightly said, the real test will be the fine-tunes and distillations the community builds on these base models. Releasing the base weights is crucial for that. Let's see if the community can tame this beast once the inference dust settles. Shout out to Ahmed Al-Dahle and the whole Llama team at Meta – incredible work, messy launch, but thank you for pushing open source forward. 🎉
Together AI & Agentica (Berkley) finetuned DeepCoder-14B with reasoning (X, Blog)
Amidst the Llama noise, we got another stellar open-source release! We were thrilled to have Michael Lou from Agentica/UC Berkeley join us to talk about DeepCoder-14B-Preview which beats DeepSeek R1 and even o3-mini on several coding benchmarks.
Using distributed Reinforcement Learning (RL), it achieves 60.6% Pass@1 accuracy on LiveCodeBench, matching the performance of models like o3-mini-2025-01-31 (Low) despite its smaller size.
The stated purpose of the project is to democratize RL and they have open sourced the model (HF), the dataset (HF), the Weights & Biases logs and even the eval logs!
Shout out to Michael, Sijun and Alpay and the rest of the team who worked on this awesome model!
NVIDIA Nemotron ULTRA is finally here, 253B pruned Llama 3-405B (HF)
While Llama 4 was wrapped in mystery, NVIDIA dropped their pruned and distilled finetune of the previous Llama chonker 405B model, turning at just about half the parameters.
And they were able to include the LLama-4 benchmarks in their release, showing that the older Llama, finetuned can absolutely beat the new ones at AIME, GPQA and more.
As a reminder, we covered the previous 2 NEMOTRONS and they are a combined reasoning and non reasoning models, so the jump is not that surprising, and it does seem like a bit of eval cherry picking happened here.
Nemotron Ultra supports 128K context and fits on a single 8xH100 node for inference. Built on open Llama models and trained on vetted + synthetic data, it's commercially viable. Shout out to NVIDIA for releasing this, and especially for pushing open reasoning datasets which Nisten rightly praised as having long-term value beyond the models themselves.
More Open Source Goodness: Jina, DeepCogito, Kimi
The open-source train didn't stop there:
* Jina Reranker M0: Our friends at Jina released a state-of-the-art multimodal reranker model. If you're doing RAG with images and text, this looks super useful for improving retrieval quality across languages and modalities (Blog, HF)
* DeepCogito: A new company emerged releasing a suite of Llama fine-tunes (3B up to 70B planned, with larger ones coming) trained using a technique they call Iterated Distillation and Amplification (IDA). They claim their 70B model beats DeepSeek V2 70B on some benchmarks . Definitely one to watch. (Blog, HF)
* Kimi-VL & Kimi-VL-Thinking: MoonShot, who sometimes get lost in the noise, released incredibly impressive Kimi Vision Language Models (VLMs). These are MoE models with only ~3 Billion active parameters, yet they're showing results on par with or even beating models 10x larger (like Gemma 2 27B) on benchmarks like MathVision and ScreenSpot. They handle high-res images, support 128k context, and crucially, include a reasoning VLM variant. Plus, they're MIT licensed! Nisten's been following Kimi and thinks they're legit, just waiting for easier ways to run them locally. Definitely keep an eye on Kimi. (HF)
This Week's Buzz from Weights & Biases - Observable Tools & A2A!
This week was personally very exciting on the W&B front, as I spearheaded and launched initiatives directly related to the MCP and A2A news!
W&B launches the observable.tools initiative!
As MCP takes off, one challenge becomes clear: observability. When your agent calls an external MCP tool, that part of the execution chain becomes a black box. You lose the end-to-end visibility crucial for debugging and evaluation.
That's why I'm thrilled that we launched Observable Tools (Website) – an initiative championing full-stack agent observability, specifically within the MCP ecosystem. Our vision is to enable developers using tools like W&B Weave to see inside those MCP tool calls, getting a complete trace of every step.
The core of this is Proposal RFC 269 on the official MCP GitHub spec, which I authored! (My first RFC, quite the learning experience!). It details how to integrate OpenTelemetry tracing directly into the MCP protocol, allowing tools to securely report detailed execution spans back to the calling client (agent). We went deep on the spec, outlining transmission mechanisms, schemas, and rationale.
My ask to you, the ThursdAI community: Please check out observable.tools, read the manifesto, watch the fun video we made, and most importantly, go to the RFC 269 proposal (shortcut: wandb.me/mcp-spec). Read it, comment, give feedback, and upvote if you agree! We need community support to make this impossible for the MCP maintainers to ignore. Let's make observability a first-class citizen in the MCP world! We also invite our friends from across the LLM observability landscape (LangSmith, Braintrust, Arize, Galileo, etc.) to join the discussion and collaborate.
W&B is a Launch Partner for Google's A2A
As mentioned earlier, we're also excited to be a launch partner for Google's new Agent2Agent (A2A) protocol. We believe standardized communication between agents is the next critical step, and we'll be supporting A2A alongside MCP in our tools. Exciting times for agent infrastructure! I've invited Google folks to next week to discuss the protocol in depth!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Big Company LLMs + APIs: Google's Onslaught & OpenAI's Memory Upgrade
While open source had a wild week, the big players weren't sleeping. Google especially came out swinging at Google Next.
Google announces TONS of new things at Next 🙌 (Blog)
Google I/O felt like a preview, Google Next felt like the delivery truck backing up and dumping everything. Here's the whirlwind tour:
* Gemini 2.5 Flash API: The faster, cheaper Gemini 2.5 model is coming soon to Vertex AI. (Still waiting on that general API access!).
* Veo 2 Editing: Their top-tier video model (competing with Sora, Kling) gets editing capabilities. Very cool.
* Imagen 3 Updates: Their image model gets improvements, including inpainting.
* Lyria: Text-to-music model moves into preview.
* TPU v7 (Ironwood): New TPU generation coming soon. As Nisten noted, Google's infrastructure uptime is consistently amazing, which could be a winning factor regardless of model SOTA status.
* Chirp 3 HD Voices + Voice Cloning: This one raised eyebrows. The notes mentioned HD voices and voice cloning. Cloning is a touchy subject the big players usually avoid publicly (copyright, deepfakes). Still digging for confirmation/details on this – if Google is really offering public voice cloning, that's huge. Let me know if you find a link!
* Deep Research gets Gemini 2.5 Pro: The experimental deep research feature in Gemini (their answer to OpenAI's research agent) now uses the powerful 2.5 Pro model. Google released comparison stats showing users strongly prefer it (70%) over OpenAI's offering, citing better instruction following and comprehensiveness. I haven't fully tested the 2.5 version yet, but the free tier access is amazing. and just look at those differences in preference compared to OAI Deep Research!
Firebase Studio (firebase.studio): Remember Project IDX? It's been rebranded and launched as Firebase Studio. This is Google's answer to the wave of "vibe coding" web builders like Lovable, Bolt and a few more. It's a full-stack, cloud-based GenAI environment for building, testing, and deploying apps, integrated with Firebase and likely Gemini. Looks promising!
Google Embraces MCP & Launches A2A Protocol!
Two massive protocol announcements from Google that signal the maturation of the AI agent ecosystem:
* Official MCP Support! (X)This is huge. Following Microsoft and AWS, Google (via both Sundar Pichai and Demis Hassabis) announced official support for Anthropic's Model Context Protocol (MCP) in Gemini models and SDKs. MCP is rapidly becoming the standard for how agents discover and use tools securely and efficiently. With Google onboard, there's basically universal major vendor support. MCP is here to stay.
* Agent2Agent (A2A) Protocol (Blog , Spec, W&B Blog)Google also launched a new open standard, A2A, designed for interoperability between different AI agents. Think of agents built by different vendors (Salesforce, ServiceNow, etc.) needing to talk to each other securely to coordinate complex workflows across enterprise systems. Built on web standards (HTTP, SSE, JSON-RPC), it handles discovery, task management (long-running!), and modality negotiation. Importantly, Google positions A2A as complementary to MCP, not competitive. MCP is how an agent uses a tool, A2A is how an agent talks to another agent. Weights & Biases is proud to be one of the 50+ launch partners working with Google on this! We'll do a deeper dive soon, but this + MCP feels like the foundation for a truly interconnected agent future.
Cloudflare - new Agents SDK (agents.cloudflare.com)
Speaking of agents, Cloudflare launched their new Agents SDK (npm i agents). Built on their serverless infrastructure (Workers, Durable Objects), it offers a platform for building stateful, autonomous AI agents with a compelling pricing model (pay for CPU time, not wall time). This ties into the GitMCP story later – Cloudflare is betting big on the edge agent ecosystem.
Other Big Co News:
* Anthropic MAX: A new $200/month tier for Claude, offering higher usage quotas but no new models. Meh.
* Grok 3 API: Elon's xAI finally launched the API tier for Grok 3 (plus Fast and Mini variants). Now you can test its capabilities yourself. We're still waiting for the promised Open Source Grok-2
🚨 BREAKING NEWS 🚨 OpenAI Upgrades Memory
Right on cue during the show, OpenAI dropped a feature update! Sam Altman hyped something coming, and while it wasn't the o3/o4-mini models (those are coming next), it's a significant enhancement to ChatGPT Memory.
Previously, Memory tried to selectively save key facts. Now, when enabled, it can reference ALL of your past chats to personalize responses. Preferences, interests, past projects – it can potentially draw on everything. OpenAI states there's no storage limit for what it can reference.
How? Likely some sophisticated RAG/vector search under the hood, not stuffing everything into context. LDJ mentioned he might have had this rolling out silently for weeks, and while the immediate difference wasn't huge, the potential is massive as models get better at utilizing this vast personal context.
The immediate reaction? Excitement mixed with a bit of caution. As Wolfram pointed out, do I really want it remembering every single chat? Configurable memory (flagging chats for inclusion/exclusion) seems like a necessary follow-up. Thanks for the feature request, Wolfram! (And yes, Europe might not get this right away anyway...). This could finally stop ChatGPT from asking me basic questions it should know from our history!
Prompt suggestion: Ask the new chatGPT with memory, a think that you asked it that you likely forgot.
Just don't asked it what was the most boring thing you asked it, I got cooked I'm still feeling raw 😂
Vision & Video: Kimi Drops Tiny But Mighty VLMs
The most impressive long form AI video paper dropped, showing that it's possible to create 1 minute long video, with incredible character and scene consistency
This paper introduces TTT layers (Test Time Training) to a pre-trained transformer, allowing it to one shot generate these incredibly consistent long scenes. Can't wait to see how the future of AI video evolves with this progress!
AI Art & Diffusion & 3D: HiDream Takes the Open Crown
HiDream-I1-Dev 17B MIT license new leading open weights image gen! (HF)
Just when we thought the image gen space was settling, HiDream, a Chinese company, open-sourced their HiDream-I1 family under MIT license! This 17B parameter model comes in Dev, Full, and Fast variants.
The exciting part? Based on early benchmarks (like Artificial Analysis Image Arena), HiDream-I1-Dev surpasses Flux 1.1 [Pro], Recraft V3, Reve and Imagen 3 while being open source! It boasts outstanding prompt following and text rendering capabilities.
HiDream's API is coming soon too and I really hope it's finetunable!
Tools: GitMCP - The Little Repo Tool That Could
GitMCP - turn any github repo into an MCP server (website)
We wrapped up the show with a fantastic story from the community. We had Liad Yosef (Shopify) and Ido Salomon (Palo Alto Networks) join us to talk about GitMCP.
It started with a simple problem: a 3MB LLM.txt file (a format proposed by Jeremy Howard for repo documentation) too large for context windows. Liad and Ido, working nights and weekends, built an MCP server that could ingest any GitHub repo (prioritizing LLM.txt if present, falling back to Readmes/code comments) and expose its documentation via MCP tools (semantic search, fetching).
This means any MCP-compatible client (like Cursor, potentially future ChatGPT/Gemini) can instantly query the documentation of any public GitHub repo just by pointing to the GitMCP URL for that repo (e.g., https://gitmcp.io/user/repo). As Yam Peleg pointed out during the show, the genius here is dynamically generating a customized tool specifically for that repo, making it incredibly easy for the LLM to use.
Then, the story got crazy. They launched, went viral, almost melted their initial Vercel serverless setup due to traffic and SSE connection costs (100$+ per hour!). DMs flew back and forth with Vercel's CEO, then Cloudflare's CTO swooped in offering to sponsor hosting on Cloudflare's unreleased Agents platform if they migrated immediately. A frantic weekend of coding ensued, culminating in a nail-biting domain switch and a temporary outage before getting everything stable on Cloudflare.
The project has received massive praise (including from Jeremy Howard himself) and is solving a real pain point for developers wanting to easily ground LLMs in project documentation. Huge congrats to Liad and Ido for the amazing work and the wild ride! Check out gitmcp.io!
Wrapping Up Episode 100!
Whew! What a show. From the Llama 4 rollercoaster to Google's AI barrage, the rise of agent standards like MCP and A2A, groundbreaking open source models, and incredible community stories like GitMCP – this episode truly showed an exemplary week in AI and underlined the reason I do this every week. It's really hard to keep up, and so if I commit to you guys, I stay up to date myself!
Hitting 100 episodes feels surreal. It's been an absolute privilege sharing this journey with Wolfram, LDJ, Nisten, Yam, all our guests, and all of you. Seeing the community grow, hitting milestones like 1000 YouTube subscribers today, fuels us to keep going 🎉
The pace isn't slowing down. If anything, it's accelerating. But we'll be right here, every Thursday, trying to make sense of it all, together.
If you missed anything, don't worry! Subscribe to the ThursdAI News Substack for the full TL;DR and links below.
Thanks again for making 100 episodes possible. Here's to the next 100! 🥂
Keep tinkering, keep learning, and I'll see you next week.
Alex
TL;DR and Show Notes
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
* Michael Luo @michaelzluo - CS PhD @ UC Berkeley; AI & Systems
* Liad Yosef (@liadyosef), Ido Salomon (@idosal1) - GitMCP creators
* Open Source LLMs
* Meta drops LLama 4 (Scout 109B/17BA & Maverick 400B/17BA) - (Blog, HF, Try It)
* Together AI and Agentica (UC Berkley) announce DeepCoder-14B (X, Blog)
* NVIDIA Nemotron Ultra is here! 253B pruned LLama 3-405B (X, HF)
* Jina Reranker M0 - SOTA multimodal reranker model (Blog, HF)
* DeepCogito - SOTA models 3-70B - beating DeepSeek 70B - (Blog, HF)
* ByteDance new release - Seed-Thinking-v1.5
* Big CO LLMs + APIs
* Google announces TONS of new things 🙌 (Blog)
* Google launches Firebase Studio (website)
* Google is announcing official support for MCP (X)
* Google announces A2A protocol - agent 2 agent communication (Blog, Spec, W&B Blog)
* Cloudflare - new Agents SDK (Website)
* Anthropic MAX - $200/mo with more quota
* Grok 3 finally launches API tier (API)
* OPenAI adds enhanced memory to ChatGPT - can remember all your chats (X)
* This weeks Buzz - MCP and A2A
* W&B launches the observable.tools initiative & invite people to comment on the MCP RFC
* W&B is the launch partner for Google's A2A (Blog)
* Vision & Video
* Kimi-VL and Kimi-VL-Thinking - A3B vision models (X, HF)
* One-Minute Video Generation with Test-Time Training (Blog, Paper)
* Voice & Audio
* Amazon - Nova Sonic - speech2speech foundational model (Blog)
* AI Art & Diffusion & 3D
* HiDream-I1-Dev 17B MIT license new leading open weights image gen 0 passes Flux1.1[pro] ! (HF)
* Tools
* GitMCP - turn any github repo into an MCP server (try it)
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
ThursdAI - Apr 3rd - OpenAI Goes Open?! Gemini Crushes Math, AI Actors Go Hollywood & MCP, Now with Observability?
3 apr· ThursdAI - The top AI news from the past week
Woo! Welcome back to ThursdAI, show number 99! Can you believe it? We are one show away from hitting the big 100, which is just wild to me. And speaking of milestones, we just crossed 100,000 downloads on Substack alone! [Insert celebratory sound effect here 🎉]. Honestly, knowing so many of you tune in every week genuinely fills me with joy, but also a real commitment to keep bringing you the the high-signal, zero-fluff AI news you count on. Thank you for being part of this amazing community! 🙏
And what a week it's been! I started out busy at work, playing with the native image generation in ChatGPT like everyone else (all 130 million of us!), and then I looked at my notes for today… an absolute mountain of updates. Seriously, one of those weeks where open source just exploded, big companies dropped major news, and the vision/video space is producing stuff that's crossing the uncanny valley.
We’ve got OpenAI teasing a big open source release (yes, OpenAI might actually be open again!), Gemini 2.5 showing superhuman math skills, Amazon stepping into the agent ring, truly mind-blowing AI character generation from Meta, and a personal update on making the Model Context Protocol (MCP) observable. Plus, we had some fantastic guests join us live!
So buckle up, grab your coffee (or whatever gets you through the AI whirlwind), because we have a lot to cover. Let's dive in! (as always, show notes and links in the end)
OpenAI Makes Waves: Open Source Tease, Tough Evals & Billions Raised
It feels like OpenAI was determined to dominate the headlines this week, hitting us from multiple angles.
First, the potentially massive news: OpenAI is planning to release a new open source model in the "coming months"! Kevin Weil tweeted that they're working on a "highly capable open language model" and are actively seeking developer feedback through dedicated sessions (sign up here if interested) to "get this right." Word on the street is that this could be a powerful reasoning model. Sam Altman also cheekily added they won't slap on a Llama-style
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Mar 27 - Gemini 2.5 Takes #1, OpenAI Goes Ghibli, DeepSeek V3 Roars, Qwen Omni, Wandb MCP & more AI news
27 mar· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
Welcome back to ThursdAI! And folks, what an absolutely insane week it's been in the world of AI. Seriously, as I mentioned on the show, we don't often get weeks this packed with game-changing releases.
We saw Google emphatically reclaim the #1 LLM spot with Gemini 2.5 Pro (and OpenAI try really hard to hit back with a new ChatGPT), DeepSeek dropped a monster 685B parameter open-source model, Qwen launched a tiny but mighty 7B Omni model that handles voice and video like a champ, and OpenAI finally gave us native image generation in GPT-4o, immediately unleashing a tidal wave of Ghibli-fication across the internet. It was intense, with big players seemingly trying to one-up each other constantly – remember when Sam Altman dropped Advanced Voice Mode right when Google was about to show Astra? This weeks was this, on steroids.
We had a fantastic show trying to unpack it all, joined by the brilliant Tulsee Doshi from the Google Gemini team, my Weights & Biases colleague Morgan McQuire talking MCP tools, and the MLX King himself, Prince Canuma. Plus, my awesome co-hosts Wolfram, Nisten, and Yam were there to add their insights. (watch the LIVE recap or keep reading and listen to the audio pod)
So, grab your beverage of choice, buckle up, and let's try to make sense of this AI whirlwind! (TL'DR and show notes at the bottom 👇)
Big CO LLMs + APIs
🔥 Google Reclaims #1 with Gemini 2.5 Pro (Thinking!)
Okay, let's start with the big news. Google came out swinging this week, dropping Gemini 2.5 Pro and, based on the benchmarks and our initial impressions, taking back the crown for the best all-around LLM currently available. (Check out the X announcement, the official blog post, and seriously, go try it yourself at ai.dev).
We were super lucky to have Tulsee Doshi, who leads the product team for Gemini modeling efforts at Google, join us on the show to give us the inside scoop. Gemini 2.5 Pro Experimental isn't just an incremental update; it's topping benchmarks in complex reasoning, science, math, and coding. As Tulsee explained, this isn't just about tweaking one thing – it's a combination of a significantly enhanced base model and improved post-training techniques, including integrating those "thinking" capabilities (like chain-of-thought) right into the core models.
That's why they dropped "thinking" from the official name – it's not a separate mode anymore, it's becoming fundamental to how Gemini operates. Tulsee mentioned their goal is for the main line models to be thinking models, leveraging inference time when needed to get the best answer. This is a huge step towards more capable and reliable AI.
The performance gains are staggering across the board. We saw massive jumps on benchmarks like AIME (up nearly 20 points!) and GPQA. But it's not just about the numbers. As Tulsee highlighted, Gemini 2.5 is proving to be incredibly well-rounded, excelling not only on academic benchmarks but also on human preference evaluations like LM Arena (where style control is key). The "vibes" are great, as Wolfram put it. My own testing on reasoning tasks confirms this – the latency is surprisingly low for such a powerful model (around 13 seconds on my hard reasoning questions compared to 45+ for others), and the accuracy is the highest I've seen yet at 66% on that specific challenging set.
It also inherits the strengths of previous Gemini models – native multimodality and that massive long context window (up to 1M tokens!). Tulsee emphasized how crucial long context is, allowing the model to reason over entire code repos, large sets of financial documents, or research papers. The performance on long context tasks, like the needle-in-a-haystack test shown on Live Bench, is truly impressive, maintaining high accuracy even at 120k+ tokens where other models often falter significantly.
Nisten mentioned on the show that while it's better than GPT-4o, it might not completely replace Sonnet 3.5 for him yet, especially for certain coding or medical tasks under 128k context. Still, the consensus is clear: Gemini 2.5 Pro is the absolute best model right now across categories. Go play with it!
ARC-AGI 2 Benchmark Revealed (X, Interactive Blog)
Also on the benchmark front, the challenging ARC-AGI 2 benchmark was revealed. This is designed to test tasks that are easy for humans but hard for LLMs. The initial results are sobering: base LLMs score 0% accuracy, and even current "thinking" models only reach about 4%. It highlights just how far we still have to go in developing truly robust AI reasoning, giving us another hill to climb.
GPT-4o got another update (as I'm writing these words!) tied for #1 on LMArena, beating 4.5
How much does Sam want to win over Google? So much he's letting it ALL out. Just now, we saw an update from LMArena and Sam, about a NEW GPT-4o (2025-03-26) which jumps OVER GPT 4.5 (like.. what?) and lands at number 2 on the LM Arena, jumping over 3o points.
Tied #1 in Coding, Hard Prompts. Top-2 across ALL categories.
Besides getting very close to Gemini but not quite beating it, I gotta ask, what's the point of 4.5 then?
Open Source LLMs
The open-source community wasn't sleeping this week either, with some major drops!
DeepSeek V3 Update - 685B Parameter Beast!
The Whale Bros at DeepSeek silently dropped an update to their V3 model (X, HF), and it's a monster. We're talking 685 Billion parameters in a Mixture-of-Experts (MoE) architecture. This isn't R1 (their reasoning model), but the powerful base model that R1 was built upon (and supposedly R2 when it'll come out)
The benchmark jumps from the previous version are huge, especially in reasoning:
* MMLU-Pro: 75.9 → 81.2 (+5.3)
* GPQA: 59.1 → 68.4 (+9.3)
* AIME: 39.6 → 59.4 (+19.8) (Almost 20 points on competitive math!)
* LiveCodeBench: 39.2 → 49.2 (+10.0)
They're highlighting major boosts in reasoning, stronger front-end dev skills, and smarter tool use. Nisten noted it even gets some hard reasoning questions right that challenge other models. The "vibes" are reportedly great. Wolfram tried to run it locally but found even the 1-bit quantized version too large for his system (though it should theoretically fit in combined RAM/VRAM), but he's using it via API. It’s likely the best non-reasoning open model right now, potentially the best overall if you can run it.
And huge news for the community – they've released these weights under the MIT License, just like R1! Massive respect to DeepSeek for continuing to push powerful models into the open.
They also highlight being significantly better at Front End development and websites aesthetics.
Qwen Launches Omni 7B Model - Voice & Video Chat!
Our friends at Qwen (Alibaba) also came through with something super cool: Qwen2.5-Omni-7B (HF). This is an end-to-end multimodal model that can perceive text, images, audio, AND video, while generating both text and natural-sounding speech, potentially in real-time.
They're using a "Thinker-Talker" architecture. What blew my mind is the size – it's listed as 7B parameters, though I saw a meme suggesting it might be closer to 11B internally (ah, the joys of open source!). Still, even at 11B, having this level of multimodal understanding and generation in a relatively small open model is fantastic. It understands voice and video natively and outputs text and voice. Now, when I hear "Omni," I start expecting image generation too (thanks, Google!), so maybe that's next for Qwen? 😉
AI Art & Diffusion & Auto-regression
This was arguably where the biggest "mainstream" buzz happened this week, thanks mainly to OpenAI.
OpenAI Launches Native Image Support in GPT-4o - Ghibli Everywhere!
This felt like a direct response to Gemini 2.5's launch, almost like OpenAI saying, "Oh yeah? Watch this!" They finally enabled the native image generation capabilities within GPT-4o (Blog, Examples). Remember that image Greg Brockman tweeted a year ago of someone writing on a blackboard with an old OpenAI logo, hinting at this? Well, it's here.
And the results? Honestly, they're stunning. The prompt adherence is incredible. It actually listens to what you ask for in detail, including text generation within images, which diffusion models notoriously struggle with. The realism can be jaw-dropping, but it can also generate various styles.
Speaking of styles... the internet immediately lost its collective mind and turned everything into the style of Studio Ghibli (great X thread here). My entire feed became Ghibli-fied. It's a testament to how accessible and fun this feature is. Wolfram even suggested we should have Ghibli avatars for the show!
Interestingly, this image generation is apparently auto-regressive, not based on diffusion models like Midjourney or Stable Diffusion. This is more similar to how models like Grok's Aurora work, generating the image sequentially (top-to-bottom, kinda like how old GIFs used to load, as Yam pointed out we confirmed). This likely contributes to the amazing prompt adherence, especially with text.
The creative potential is huge – people are generating incredible ad concepts (like this thread) and even recreating entire movie trailers, like this unbelievable Lord of the Rings one (link), purely through prompts in GPT-4o. It's wild.
Now, this launch wasn't just about cool features; it also marked a significant shift in OpenAI's policy around image generation, aiming for what CEO Sam Altman called "a new high-water mark for us in allowing creative freedom." Joanne Jang, who leads model behavior at OpenAI, shared some fascinating insights into their thinking (Reservoir Samples post).
She explained they're moving away from broad, blanket refusals (which often felt safest but limited creativity) towards a more precise approach focused on preventing real-world harm. This means trusting user creativity more, not letting hypothetical worst-case scenarios overshadow everyday usefulness (like generating memes!), and valuing the "unknown, unimaginable possibilities" that overly strict rules might stifle. It's a nuanced approach acknowledging that, as Joanne quoted, "Ships are safest in the harbor... But that’s not what ships or models are for." A philosophy change I definitely appreciate.
Reve - New SOTA Diffusion Contender?
While OpenAI grabbed headlines, another player emerged claiming State-of-the-Art results, this time in the diffusion space. Reve Image 1.0 (X, Blog/News, Try it) apparently beats Midjourney and Flux in benchmarks, particularly in prompt adherence, realism, and even text generation (though likely not as consistently as GPT-4o's native approach).
It works on a credit system ($5 for 500 generations, ~1 cent per image) which is quite affordable. The editing seems a bit different, relying on chatting with the model rather than complex tools. It was kind of hidden/anonymous before, but now they've revealed themselves. Honestly, this would probably be huge news if OpenAI hadn't dropped their image bomb the same week.
Ideogram 3 Also Launched - Another SOTA Claim!
And just to make the AI art space even more crowded this week, Ideogram also launched version 3.0 (Blog, Try it), also claiming state-of-the-art performance!
Ideogram has always been strong with text rendering and logos. Version 3.0 boasts stunning realism, creative design capabilities, and a new "Style References" feature where you can upload images to guide the aesthetic. They claim it consistently outperforms others in human evaluations. It's wild – we had at least three major image generation models/updates drop this week, all claiming top performance, and none of them seemed to benchmark directly against each other in their launch materials! It’s hard to keep track!
This Week's Buzz + MCP (X, Github!)
Bringing it back to Weights & Biases for a moment. We had Morgan McQuire on the show, who heads up our AI Applied team, to talk about something we're really excited about internally – integrating MCP with Weave, our LLM observability and evaluation tool. Morgan showed a demo and have shipped the MCP server, which you can try right now!
Coming soon is the integration with wandb models, which will allows ML folks around the world to build agents that monitor loss curves for them!
Weights & Biases Weave Official MCP Server Tool - Talk to Your Evals!
We've launched an official MCP server tool for Weave! What does this mean? If you're using Weave to track your experiments, evaluations, prompts, etc. (and you should be!), you can now literally chat with that data. As Morgan demonstrated, you can ask questions like "Tell me about my last three evaluations," and the MCP tool, connected to your Weave data, will not only fetch and summarize that information for you directly in the chat interface (like Claude code or others that support MCP) but will generate a report and add visualizations!
This is just the beginning of how we see MCP enhancing observability and interaction with ML workflows. Being able to query and analyze your runs and evaluations using natural language is incredibly powerful.
Agents, Tools & MCP
And speaking of MCP...
OpenAI Adds Support for MCP - MCP WON!
This was HUGE news, maybe slightly overshadowed by the image generation, but potentially far more impactful long-term, as Wolfram pointed out right at the start of the show. OpenAI officially announced support for the Model Context Protocol (MCP) (docs here).
Why is this massive? Because Anthropic initiated MCP, and there was a real fear that OpenAI, being OpenAI, might just create its own competing standard for agent/tool communication, leading to fragmentation (think VHS vs. Betamax, or Blu-ray vs. HD DVD – standards wars suck!). Instead, OpenAI embraced the existing standard. As I said on the show, MCP WON!
This is crucial for the ecosystem. It means developers can build tools and agents using the MCP standard, and they should (hopefully) work seamlessly across different models like Claude and GPT. OpenAI not only added support but released it in their Agents SDK and explicitly stated support is "coming soon" for the ChatGPT desktop app and response APIs. Yam expertly clarified the distinction: tools are often single API calls, while MCPs are servers that can maintain state, allowing for more complex, guided interactions. Qwen also adding MCP support to their UI just reinforces this – the standard is gaining traction fast. This standardization is absolutely essential for building a robust agentic future.
Voice & Audio
Just one more quick update on the audio front:
OpenAI Updates Advanced Voice Mode with Semantic VAD
Alongside the image generation, OpenAI also quietly updated the advanced voice mode in ChatGPT (YT announcement). The key improvement is "Semantic VAD" (Voice Activity Detection). Instead of just cutting off when you pause (leading to annoying interruptions, especially for kids or people who think while speaking), it now tries to understand the meaning and tone to determine if you're actually finished talking.
This should lead to a much more natural conversation flow. They also claim better personality, a more engaging natural tone (direct and concise), and less need for you to fill silence with "umms" just to keep the mic open. It might feel a tad slower because it waits a bit longer, but the improvement in interaction quality should be significant.
MLX-Audio
And speaking (heh) of audio and speech, we had the awesome Prince Canuma on the show! If you're into running models locally on Apple Silicon (Macs!), you probably know Prince. He's the MLX King, the creator and maintainer of essential libraries like MLX-VLM (for vision models), FastMLX, MLX Embeddings, and now, MLX-Audio. Seriously, huge props to Prince and the folks in the MLX community for making these powerful open-source models accessible on Mac hardware. It's an incredible contribution.
This week, Prince released MLX-Audio v0.0.3. This isn't just text-to-speech (TTS); it aims to be a comprehensive audio package for MLX. Right now, it supports some of the best open TTS models out there:
* Kokoro: The tiny, high-quality TTS model we've covered before.
* Sesame 1B: Another strong contender.
* Orpheus: From Canopy Labs (as Prince confirmed).
* Suno Bark: Known for generating music and sound effects too.
MLX-Audio makes running state-of-the-art speech synthesis locally on your Mac incredibly easy, basically offering a Hugging Face transformers pipeline equivalent but optimized for Apple Silicon. If you have a Mac, pip install mlx-audio and give it a spin! Prince also took a feature request on the show to allow text file input directly, so look out for that!
Phew! See what I mean? An absolutely jam-packed week.
Huge thanks again to Tulsee, Morgan, and Prince for joining us and sharing their insights, and to Wolfram, Nisten, and Yam for co-piloting through the news storm. And thank YOU all for tuning in! We'll be back next week, undoubtedly trying to catch our breath and make sense of whatever AI marvels (or madness) gets unleashed next.
P.S - if the ghiblification trend didn’t get to your families as well, the alpha right now is… showing your kids how to be a magician and turn them into Ghibli characters, here are me and my kiddos (who asked to be pirates and princesses)
TL;DR and Show Notes:
* Guests and Cohosts
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)Co Hosts - Wolfram Ravenwlf (@WolframRvnwlf), Nisten Tahiraj (@nisten), Yam Peleg (@yampeleg)
* Tulsee Doshi - Head of Product, Gemini Models at Google DeepMind (@tulseedoshi)
* Morgan McQuire - Head of AI Applied Team at Weights & Biases (@morgymcg)
* Prince Canuma - ML Research Engineer, Creator of MLX Libraries (@PrinceCanuma)
* Big CO LLMs + APIs
* 🔥 Google reclaims #1 position with Gemini 2.5 Pro (thinking) - (X, Blog, Try it)
* ARC-AGI 2 benchmark revealed - Base LLMs score 0%, thinking models 4%.
* Open Source LLMs
* Deepseek updates DeepSeek-V3-0324 685B params (X, HF) - MIT License!
* Qwen launches an Omni 7B model - perceives text, image, audio, video & generates text and speech (HF)
* AI Art & Diffusion & Auto-regression
* OpenAI launches native image support in GPT-4o (Model Card, X thread, Ad threads, Full Lord of the Rings trailer, Model Card)
* Reve - new SOTA diffusion image gen claims (X, Blog/News, Try)
* Ideogram 3 launched - another SOTA claim, strong on text/logos, realism, style refs (Blog, Try it)
* This weeks Buzz + MCP
* Weights & Biases Weave official MCP server tool - talk to your evals! (X, Github)
* Agents , Tools & MCP
* OpenAI has added support for MCP - MCP WON! (Docs)
* Voice & Audio
* OpenAI updates advanced voice mode with semantic VAD for more natural conversations (YT announcement).
* MLX-Audio v0.0.3 released by Prince Canuma (Github)
* Show Notes and other Links
* Catch the show live & subscribe to the newsletter/YouTube: thursdai.news/yt
* Try Gemini 2.5 Pro: AI.dev
* Learn more about MCP from our previous episode (March 6th).

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
ThursdAI - Mar 20 - OpenAIs new voices, Mistral Small, NVIDIA GTC recap & Nemotron, new SOTA vision from Roboflow & more AI news
20 mar· ThursdAI - The top AI news from the past week
Hey, it's Alex, coming to you fresh off another live recording of ThursdAI, and what an incredible one it's been!
I was hoping that this week will be chill with the releases, because of NVIDIA's GTC conference, but no, the AI world doesn't stop, and if you blinked this week, you may have missed 2 or 10 major things that happened.
From Mistral coming back to OSS with the amazing Mistral Small 3.1 (beating Gemma from last week!) to OpenAI dropping a new voice generation model, and 2! new whisper killer ASR models with a Breaking News during our live show (there's a reason we're called ThursdAI) which we watched together and then dissected with Kwindla, our amazing AI VOICE and real time expert.
Not to mention that we also had dedicated breaking news from friend of the pod Joseph Nelson, that came on the show to announce a SOTA vision model from Roboflow + a new benchmark on which even the top VL models get around 6%! There's also a bunch of other OSS, a SOTA 3d model from Tencent and more!
And last but not least, Yam is back 🎉 So... buckle up and let's dive in. As always, TL;DR and show notes at the end, and here's the YT live version. (While you're there, please hit subscribe and help me hit that 1K subs on YT 🙏 )
Voice & Audio: OpenAI's Voice Revolution and the Open Source Echo
Hold the phone, everyone, because this week belonged to Voice & Audio! Seriously, if you weren't paying attention to the voice space, you missed a seismic shift, courtesy of OpenAI and some serious open-source contenders.
OpenAI's New Voice Models - Whisper Gets an Upgrade, TTS Gets Emotional!
OpenAI dropped a suite of next-gen audio models: gpt-4o-mini-tts-latest (text-to-speech) and GPT 4.0 Transcribe and GPT 4.0 Mini Transcribe (speech-to-text), all built upon their powerful transformer architecture.
To unpack this voice revolution, we welcomed back Kwindla Cramer from Daily, the voice AI whisperer himself. The headline news? The new speech-to-text models are not just incremental improvements; they’re a whole new ballgame. As OpenAI’s Shenyi explained, "Our new generation model is based on our large speech model. This means this new model has been trained on trillions of audio tokens." They're faster, cheaper (Mini Transcribe is half price of Whisper!), and boast state-of-the-art accuracy across multiple languages. But the real kicker? They're promptable!
"This basically opens up a whole field of prompt engineering for these models, which is crazy," I exclaimed, my mind officially blown. Imagine prompting your transcription model with context – telling it you're discussing dog breeds, and suddenly, its accuracy for breed names skyrockets. That's the power of promptable ASR! I recorded a live reaction aftder dropping of stream, and I was really impressed with how I can get the models to pronounce ThursdAI by just... asking!
But the voice magic doesn't stop there. GPT 4.0 Mini TTS, the new text-to-speech model, can now be prompted for… emotions! "You can prompt to be emotional. You can ask it to do some stuff. You can prompt the character a voice," OpenAI even demoed a "Mad Scientist" voice! Captain Ryland voice, anyone? This is a huge leap forward in TTS, making AI voices sound… well, more human.
But wait, there’s more! Semantic VAD! Semantic Voice Activity Detection, as OpenAI explained, "chunks the audio up based on when the model thinks The user's actually finished speaking." It’s about understanding the meaning of speech, not just detecting silence. Kwindla hailed it as "a big step forward," finally addressing the age-old problem of AI agents interrupting you mid-thought. No more robotic impatience!
OpenAI also threw in noise reduction and conversation item retrieval, making these new voice models production-ready powerhouses. This isn't just an update; it's a voice AI revolution, folks.
They also built a super nice website to test out the new models with openai.fm !
Canopy Labs' Orpheus 3B - Open Source Voice Steps Up
But hold on, the open-source voice community isn't about to be outshone! Canopy Labs dropped Orpheus 3B, a "natural sounding speech language model" with open-source spirit.
Orpheus, available in multiple sizes (3B, 1B, 500M, 150M), boasts zero-shot voice cloning and a glorious Apache 2 license. Wolfram noted its current lack of multilingual support, but remained enthusiastic, I played with them a bit and they do sound quite awesome, but I wasn't able to finetune them on my own voice due to "CUDA OUT OF MEMORY" alas
I did a live reaction recording for this model on X
NVIDIA Canary - Open Source Speech Recognition Enters the Race
Speaking of open source, NVIDIA surprised us with Canary, a speech recognition and translation model. "NVIDIA open sourced Canary, which is a 1 billion parameter and 180 million parameter speech recognition and translation, so basically like whisper competitor," I summarized. Canary is tiny, fast, and CC-BY licensed, allowing commercial use. It even snagged second place on the Hugging Face speech recognition leaderboard! Open source ASR just got a whole lot more interesting.
Of course, this won't get to the level of the new SOTA ASR OpenAI just dropped, but this can run locally and allows commercial use on edge devices!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Vision & Video: Roboflow's Visionary Model and Video Generation Gets Moving
After the voice-apalooza, let's switch gears to the visual world, where Vision & Video delivered some knockout blows, spearheaded by Roboflow and StepFun.
Roboflow's RF-DETR and RF100-VL - A New Vision SOTA Emerges
Roboflow stole the vision spotlight this week with their RF-DETR model and the groundbreaking RF100-VL benchmark. We were lucky enough to have Joseph Nelson, Roboflow CEO, join the show again and give us the breaking news (they published the Github 11 minutes before he came on!)
RF-DETR is Roboflow's first in-house model, a real-time object detection transformer that's rewriting the rulebook. "We've actually never released a model that we've developed. And so this is the first time where we've taken a lot of those learnings and put that into a model," Joseph revealed.
And what a model it is! RF-DETR is not just fast; it's SOTA on real-world datasets and surpasses the 60 mAP barrier on COCO. But Joseph dropped a truth bomb: COCO is outdated. "The benchmark that everyone uses is, the COCO benchmark… hasn't been updated since 2017, but models have continued to get really, really, really good. And so they're saturated the COCO benchmark," he explained.
Enter RF100-VL, Roboflow's revolutionary new benchmark, designed to evaluate vision-language models on real-world data. "We, introduced a benchmark that we call RF 100 vision language," Joseph announced. The results? Shockingly low zero-shot performance on real-world vision tasks, highlighting a major gap in current models. Joseph's quiz question about QwenVL 2.5's zero-shot performance on RF100-VL revealed a dismal 5.8% accuracy. "So we as a field have a long, long way to go before we have zero shot performance on real world context," Joseph concluded. RF100-VL is the new frontier for vision, and RF-DETR is leading the charge! Plus, it runs on edge devices and is Apache 2 licensed! Roboflow, you legends! Check out the RF-DETR Blog Post, the RF-DETR Github, and the RF100-VL Benchmark for more details!
StepFun's Image-to-Video TI2V - Animating Images with Style
Stepping into the video arena, StepFun released their image2video model, TI2V. TI2V boasts impressive motion controls and generates high-quality videos from images and text prompts, especially excelling in anime-style video generation. Dive into the TI2V HuggingFace Space and TI2V Github to explore further.
Open Source LLMs: Mistral's Triumphant Return, LG's Fridge LLM, NVIDIA's Nemotron, and ByteDance's RL Boost
Let's circle back to our beloved Open Source LLMs, where this week was nothing short of a gold rush!
Mistral is BACK, Baby! - Mistral Small 3.1 24B (Again!)
Seriously, Mistral AI's return to open source with Mistral Small 3.1 deserves another shoutout! "Mistral is back with open source. Let's go!" I cheered, and I meant it. This multimodal, Apache 2 licensed model is a powerhouse, outperforming Gemma 3 and ready for action on a single GPU. Wolfram, ever the pragmatist, noted, "We are in right now, where a week later, you already have some new toys to play with." referring to Gemma 3 that we covered just last week!
Not only did we get a great new update from Mistral, they also cited our friends at Nous research and their Deep Hermes (released just last week!) for the reason to release the base models alongside finetuned models!
Mistral Small 3.1 is not just a model; it's a statement: open source is thriving, and Mistral is leading the charge! Check out their Blog Post, the HuggingFace page, and the Base Model on HF.
NVIDIA Nemotron - Distilling, Pruning, Making Llama's Better
NVIDIA finally dropped Llama Nemotron, and it was worth the wait!
Nemotron Nano (8B) and Super (49B) are here, with Ultra (253B) on the horizon. These models are distilled, pruned, and, crucially, designed for reasoning with a hybrid architecture allowing you to enable and disable reasoning via a simple on/off switch in the system prompt!
Beating other reasoners like QwQ on GPQA tasks, this distillined and pruned LLama based reasoner seems very powerful! Congrats to NVIDIA
Chris Alexius (a friend of the pod) who co-authored the announcement, told me that FP8 is expected and when that drops, this model will also fit on a single H100 GPU, making it really great for enterprises who host on their own hardware.
And yes, it’s ready for commercial use. NVIDIA, welcome to the open-source LLM party! Explore the Llama-Nemotron HuggingFace Collection and the Dataset.
LG Enters the LLM Fray with EXAONE Deep 32B - Fridge AI is Officially a Thing
LG, yes, that LG, surprised everyone by open-sourcing EXAONE Deep 32B, a "thinking model" from the fridge and TV giant. "LG open sources EXAONE and EXAONE Deep 32B thinking model," I announced, still slightly amused by the fridge-LLM concept. This 32B parameter model claims "superior capabilities" in reasoning, and while my live test in LM Studio went a bit haywire, quantization could be the culprit. It's non-commercial, but hey, fridge-powered AI is now officially a thing. Who saw that coming? Check out my Reaction Video, the LG Blog, and the HuggingFace page for more info.
ByteDance's DAPO - Reinforcement Learning Gets Efficient
From the creators of TikTok, ByteDance, comes DAPO, a new reinforcement learning method that's outperforming GRPO. DAPO promises 50% accuracy on AIME 2024 with 50% less training steps. Nisten, our RL expert, explained it's a refined GRPO, pushing the boundaries of RL efficiency. Open source RL is getting faster and better, thanks to ByteDance! Dive into the X thread, Github, and Paper for the technical details.
Big CO LLMs + APIs: Google's Generosity, OpenAI's Oligarch Pricing, and GTC Mania
Switching gears to the Big CO LLM arena, we saw Google making moves for the masses, OpenAI catering to the elite, and NVIDIA… well, being NVIDIA.
Google Makes DeepResearch Free and Adds Canvas
Google is opening up DeepResearch to everyone for FREE! DeepResearch, Gemini's advanced search mode, is now accessible without a Pro subscription. I really like it's revamped UI where you can see the thinking and the sources! I used it live on the show to find out what we talked about in the latest episode of ThursdAI, and it did a pretty good job!
Plus, Google unveiled Canvas, letting you "build apps within Gemini and actually see them." Google is making Gemini more accessible and more powerful, a win for everyone. Here's a Tetris game it built for me and here's a markdown enabled word counter I rebuild every week before I send ThursdAI (making sure I don't send you 10K words every week 😅)
OpenAI's O1 Pro API - Pricey Power for the Few
OpenAI, in contrast, released O1 Pro API, but with a price tag that's… astronomical. "OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)," I quipped, highlighting the exclusivity. $600 per million output tokens? "If you code with this, if you vibe code with this, you better already have VCs backing your startup," I warned. O1 Pro might be top-tier performance, but it's priced for the 0.1%.
NVIDIA GTC Recap - Jensen's Hardware Extravaganza
NVIDIA GTC was, as always, a hardware spectacle. New GPUs (Blackwell Ultra, Vera Rubin, Feynman!), the tiny DGX Spark supercomputer, the GR00T robot foundation model, and the Blue robot – NVIDIA is building the AI future, brick by silicon brick. Jensen is the AI world's rockstar, and GTC is his sold-out stadium show. Check out Rowan Cheung's GTC Recap on X for a quick overview.
Shoutout to our team at GTC and this amazingly timed logo shot I took from the live stream!
Antropic adds Web Search
We had a surprise at the end of the show, with Antropic releasing web search. It's a small thing, but for folks who use Cloud AI, it's very important.
You can now turn on web search directly on Claude which makes it... the last frontier lab to enable this feature 😂 Congrats!
AI Art & Diffusion & 3D: Tencent's 3D Revolution
Tencent Hunyuan 3D 2.0 MV and Turbo - 3D Generation Gets Real-Time
Tencent updated Hunyuan 3D to 2.0 MV (MultiView) and Turbo, pushing the boundaries of 3D generation. Hunyuan 3D 2.0 surpasses SOTA in geometry, texture, and alignment, and the Turbo version achieves near real-time 3D generation – under one second on an H100! Try out the Hunyuan3D-2mv HF Space to generate your own 3D masterpieces!
MultiView (MV) is another game-changer, allowing you to input 1-4 views for more accurate 3D models. "MV allows to generate 3d shapes from 1-4 views making the 3D shapes much higher quality" I explained. The demo of generating a 3D mouse from Gemini-generated images showcased the seamless pipeline from thought to 3D object. I literally just asked Gemini with native image generation to generate a character and then
Holodecks are getting closer, folks!
Closing Remarks and Thank You
And that's all she wrote, folks! Another week, another AI explosion. From voice to vision, open source to Big CO, this week was a whirlwind of innovation. Huge thanks again to our incredible guests, Joseph Nelson from Roboflow, Kwindla Cramer from Daily, and Lucas Atkins from ARCEE! And of course, massive shoutout to my co-hosts, Wolfram, Yam, and Nisten – you guys are the best!
And YOU, the ThursdAI community, are the reason we do this. Thank you for tuning in, for your support, and for being as hyped about AI as we are. Remember, ThursdAI is a labor of love, fueled by Weights & Biases and a whole lot of passion.
Missed anything? thursdai.news is your one-stop shop for the podcast, newsletter, and video replay. And seriously, subscribe to our YouTube channel! Let's get to 1000 subs!
Helpful? We’d love to see you here again!
TL;DR and Show Notes:
* Guests and Cohosts
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
Co Hosts - @WolframRvnwlf @yampeleg @nisten
* Sponsor - Weights & Biases Weave (@weave_wb)
* Joseph Nelson - CEO Roboflow (@josephofiowa)
* Kindwla Kramer - CEO Daily (@kwindla)
* Lucas Atkins - Labs team at Arcee lead (@LukasAtkins7)
* Open Source LLMs
* Mistral Small 3.1 24B - Multimodal (Blog, HF, HF base)
* LG open sources EXAONE and EXAONE Deep 32B thinking model (Alex Reaction Video, LG BLOG, HF)
* ByteDance releases DAPO - better than GRPO RL Method (X, Github, Paper)
* NVIDIA drops LLama-Nemotron (Super 49B, Nano 8B) with reasoning and data (X, HF, Dataset)
* Big CO LLMs + APIs
* Google makes DeepResearch free, Canvas added, Live Previews (X)
* OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)
* NVIDIA GTC recap - (X)
* This weeks Buzz
* Come visit the Weights & Biases team at GTC today!
* Vision & Video
* Roboflow drops RF-DETR a SOTA vision model + new eval RF100-VL for VLMs (Blog, Github, Benchmark)
* StepFun dropped their image2video model TI2V (HF, Github)
* Voice & Audio
* OpenAI launches a new voice model and 2 new transcription models (Blog, Youtube)
* Canopy Labs drops Orpheus 3B (1B, 500B, 150M versions) - natural sounding speech language model (Blog, HF, Colab)
* NVIDIA Canary 1B/180M Flash - apache 2 speech recognition and translation LLama finetune (HF)
* AI Art & Diffusion & 3D
* Tencent updates Hunyuan 3D 2.0 MV (MultiView) and Turbo (HF)
* Tools
* ARCEE Conductor - model router (X)
* Cursor ships Claude 3.7 MAX (X)
* Notebook LM teases MindMaps (X)
* Gemini Co-Drawing - using Gemini native image output for helping drawing (HF)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI Turns Two! 🎉 Gemma 3, Gemini Native Image, new OpenAI tools, tons of open source & more AI news
13 mar· ThursdAI - The top AI news from the past week
LET'S GO!
Happy second birthday to ThursdAI, your favorite weekly AI news show! Can you believe it's been two whole years since we jumped into that random Twitter Space to rant about GPT-4? From humble beginnings as a late-night Twitter chat to a full-blown podcast, Newsletter and YouTube show with hundreds of thousands of downloads, it's been an absolutely wild ride!
That's right, two whole years of me, Alex Volkov, your friendly AI Evangelist, along with my amazing co-hosts, trying to keep you up-to-date on the breakneck speed of the AI world
And what better way to celebrate than with a week PACKED with insane AI news? Buckle up, folks, because this week Google went OPEN SOURCE crazy, Gemini got even cooler, OpenAI created a whole new Agents SDK and the open-source community continues to blow our minds. We’ve got it all - from game-changing model releases to mind-bending demos.
This week I'm also on the Weights & Biases company retreat, so TL;DR first and then the newsletter, but honestly, I'll start embedding the live show here in the substack from now on, because we're getting so good at it, I barely have to edit lately and there's a LOT to show you guys!
TL;DR and Show Notes & Links
* Hosts & Guests
* Alex Volkov - AI Eveangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @ldjconfirmed @nisten
* Sandra Kublik - DevRel at Cohere (@itsSandraKublik)
* Open Source LLMs
* Google open sources Gemma 3 - 1B - 27B - 128K context (Blog, AI Studio, HF)
* EuroBERT - multilingual encoder models (210M to 2.1B params)
* Reka Flash 3 (reasoning) 21B parameters is open sourced (Blog, HF)
* Cohere Command A 111B model - 256K context (Blog)
* Nous Research Deep Hermes 24B / 3B Hybrid Reasoners (X, HF)
* AllenAI OLMo 2 32B - fully open source GPT4 level model (X, Blog, Try It)
* Big CO LLMs + APIs
* Gemini Flash generates images natively (X, AI Studio)
* Google deep research is now free in Gemini app and powered by Gemini Thinking (Try It no cost)
* OpenAI released new responses API, Web Search, File search and Computer USE tools (X, Blog)
* This weeks Buzz
* The whole company is at an offsite at oceanside, CA
* W&B internal MCP hackathon and had cool projects - launching an MCP server soon!
* Vision & Video
* Remade AI - 8 LORA video effects for WANX (HF)
* AI Art & Diffusion & 3D
* ByteDance Seedream 2.0 - A Native Chinese-English Bilingual Image Generation Foundation Model by ByteDance (Blog, Paper)
* Tools
* Everyone's talking about Manus - (manus.im)
* Google AI studio now supports youtube understanding via link dropping
Open Source LLMs: Gemma 3, EuroBERT, Reka Flash 3, and Cohere Command-A Unleashed!
This week was absolutely HUGE for open source, folks. Google dropped a BOMBSHELL with Gemma 3! As Wolfram pointed out, this is a "very technical achievement," and it's not just one model, but a whole family ranging from 1 billion to 27 billion parameters. And get this – the 27B model can run on a SINGLE GPU! Sundar Pichai himself claimed you’d need "at least 10X compute to get similar performance from other models." Insane!
Gemma 3 isn't just about size; it's packed with features. We're talking multimodal capabilities (text, images, and video!), support for over 140 languages, and a massive 128k context window. As Nisten pointed out, "it might actually end up being the best at multimodal in that regard" for local models. Plus, it's fine-tuned for safety and comes with ShieldGemma 2 for content moderation. You can grab Gemma 3 on Google AI Studio, Hugging Face, Ollama, Kaggle – everywhere! Huge shoutout to Omar Sanseviero and the Google team for this incredible release and for supporting the open-source community from day one! Colin aka Bartowski, was right, "The best thing about Gemma is the fact that Google specifically helped the open source communities to get day one support." This is how you do open source right!
Next up, we have EuroBERT, a new family of multilingual encoder models. Wolfram, our European representative, was particularly excited about this one: "In European languages, you have different characters than in other languages. And, um, yeah, encoding everything properly is, uh, difficult." Ranging from 210 million to 2.1 billion parameters, EuroBERT is designed to push the boundaries of NLP in European and global languages. With training on a massive 5 trillion-token dataset across 15 languages and support for 8K context tokens, EuroBERT is a workhorse for RAG and other NLP tasks. Plus, how cool is their mascot?
Reka Flash 3 - a 21B reasoner with apache 2 trained with RLOO
And the open source train keeps rolling! Reka AI dropped Reka Flash 3, a 21 billion parameter reasoning model with an Apache 2.0 license! Nisten was blown away by the benchmarks: "This might be one of the best like 20B size models that there is right now. And it's Apache 2.0. Uh, I, I think this is a much bigger deal than most people realize." Reka Flash 3 is compact, efficient, and excels at chat, coding, instruction following, and function calling. They even used a new reinforcement learning technique called REINFORCE Leave One-Out (RLOO). Go give it a whirl on Hugging Face or their chat interface – chat.reka.ai!
Last but definitely not least in the open-source realm, we had a special guest, Sandra (@itsSandraKublik) from Cohere, join us to announce Command-A! This beast of a model clocks in at 111 BILLION parameters with a massive 256K context window. Sandra emphasized its efficiency, "It requires only two GPUs. Typically the models of this size require 32 GPUs. So it's a huge, huge difference." Command-A is designed for enterprises, focusing on agentic tasks, tool use, and multilingual performance. It's optimized for private deployments and boasts enterprise-grade security. Congrats to Sandra and the Cohere team on this massive release!
Big CO LLMs + APIs: Gemini Flash Gets Visual, Deep Research Goes Free, and OpenAI Builds for Agents
The big companies weren't sleeping either! Google continued their awesome week by unleashing native image generation in Gemini Flash Experimental! This is seriously f*****g cool, folks! Sorry for my French, but it’s true. You can now directly interact with images, tell Gemini what to do, and it just does it. We even showed it live on the stream, turning ourselves into cat-confetti-birthday-hat-wearing masterpieces!
Wolfram was right, "It's also a sign what we will see in, like, Photoshop, for example. Where you, you expect to just talk to it and have it do everything that a graphic designer would be doing." The future of creative tools is HERE.
And guess what else Google did? They made Deep Research FREE in the Gemini app and powered by Gemini Thinking! Nisten jumped in to test it live, and we were all impressed. "This is the nicest interface so far that I've seen," he said. Deep Research now digs through HUNDREDS of websites (Nisten’s test hit 156!) to give you comprehensive answers, and the interface is slick and user-friendly. Plus, you can export to Google Docs! Intelligence too cheap to meter? Google is definitely pushing that boundary.
Last second additions - Allen Institute for AI released OLMo 2 32B - their biggest open model yet
Just as I'm writing this, friend of the pod, Nathan from Allen Institute for AI announced the release of a FULLY OPEN OLMo 2, which includes weights, code, dataset, everything and apparently it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral.
Evals look legit, but nore than that, this is an Apache 2 model with everything in place to advance open AI and open science!
Check out Nathans tweet for more info, and congrats to Allen team for this awesome release!
OpenAI new responses API and Agent ASK with Web, File and CUA tools
Of course, OpenAI wasn't going to let Google have all the fun. They dropped a new SDK for agents called the Responses API. This is a whole new way to build with OpenAI, designed specifically for the agentic era we're entering. They also released three new tools: Web Search, Computer Use Tool, and File Search Tool. The Web Search tool is self-explanatory – finally, built-in web search from OpenAI!
The Computer Use Tool, while currently limited in availability, opens up exciting possibilities for agent automation, letting agents interact with computer interfaces. And the File Search Tool gives you a built-in RAG system, simplifying knowledge retrieval from your own files. As always, OpenAI is adapting to the agentic world and giving developers more power.
Finally in the big company space, Nous Research released PORTAL, their new Inference API service. Now you can access their awesome models, like Hermes 3 Llama 70B and DeepHermes 3 8B, directly via API. It's great to see more open-source labs offering API access, making these powerful models even more accessible.
This Week's Buzz at Weights & Biases: Offsite Hackathon and MCP Mania!
This week's "This Week's Buzz" segment comes to you live from Oceanside, California! The whole Weights & Biases team is here for our company offsite. Despite the not-so-sunny California weather (thanks, storm!), it's been an incredible week of meeting colleagues, strategizing, and HACKING!
And speaking of hacking, we had an MCP hackathon! After last week’s MCP-pilling episode, we were all hyped about Model Context Protocol, and the team didn't disappoint. In just three hours, the innovation was flowing! We saw agents built for WordPress, MCP support integrated into Weave playground, and even MCP servers for Weights & Biases itself! Get ready, folks, because an MCP server for Weights & Biases is COMING SOON! You'll be able to talk to your W&B data like never before. Huge shoutout to the W&B team for their incredible talent and for embracing the agentic future! And in case you missed it, Weights & Biases is now part of the CoreWeave family! Exciting times ahead!
Vision & Video: LoRA Video Effects and OpenSora 2.0
Moving into vision and video, Remade AI released 8 LoRA video effects for 1X! Remember 1X from Alibaba? Now you can add crazy effects like "squish," "inflate," "deflate," and even "cakeify" to your videos using LoRAs. It's open source and super cool to see video effects becoming trainable and customizable.
And in the realm of open-source video generation, OpenSora 2.0 dropped! This 11 billion parameter model claims state-of-the-art video generation trained for just $200,000! They’re even claiming performance close to Sora itself on some benchmarks. Nisten checked out the demos, and while we're all a bit jaded now with the rapid pace of video AI, it's still mind-blowing how far we've come. Open source video is getting seriously impressive, seriously fast.
AI Art & Diffusion & 3D: ByteDance's Bilingual Seedream 2.0
ByteDance, the folks behind TikTok, released Seedream 2.0, a native Chinese-English bilingual image generation foundation model. This model, from ByteDream, excels at text rendering, cultural nuance, and human preference alignment. Seedream 2.0 boasts "powerful general capability," "native bilingual comprehension ability," and "excellent text rendering." It's designed to understand both Chinese and English prompts natively, generating high-quality, culturally relevant images. The examples look stunning, especially its ability to render Chinese text beautifully.
Tools: Manus AI Agent, Google AI Studio YouTube Links, and Cursor Embeddings
Finally, in the tools section, everyone's buzzing about Manus, a new AI research agent. We gave it a try live on the show, asking it to do some research. The UI is slick, and it seems to be using Claude 3.7 behind the scenes. Manus creates a to-do list, browses the web in a real Chrome browser, and even generates files. It's like Operator on steroids. We'll be keeping an eye on Manus and will report back on its performance in future episodes.
And Google AI Studio keeps getting better! Now you can drop YouTube links into Google AI Studio, and it will natively understand the video! This is HUGE for video analysis and content understanding. Imagine using this for support, content summarization, and so much more.
PHEW! What a week to celebrate two years of ThursdAI! From open source explosions to Gemini's visual prowess and OpenAI's agentic advancements, the AI world is moving faster than ever. As Wolfram aptly put it, "The acceleration, you can feel it." And Nisten reminded us of the incredible journey, "I remember I had early access to GPT-4 32K, and, uh, then... the person for the contract that had given me access, they cut it off because on the one weekend, I didn't realize how expensive it was. So I had to use $180 worth of tokens just trying it out." Now, we have models that are more powerful and more accessible than ever before.
Thank you to Wolfram, Nisten, and LDJ for co-hosting and bringing their insights every week.
And most importantly, THANK YOU to our amazing community for tuning in, listening, and supporting ThursdAI for two incredible years! We couldn't do it without you. Here's to another year of staying up-to-date so YOU don't have to! Don't forget to subscribe to the podcast, YouTube channel, and newsletter to stay in the loop. And share ThursdAI with a friend – it's the best birthday gift you can give us! Until next week, keep building and keep exploring the amazing world of AI! LET'S GO!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
ThursdAI - Mar 6, 2025 - Alibaba's R1 Killer QwQ, Exclusive Google AI Mode Chat, and MCP fever sweeping the community!
6 mar· ThursdAI - The top AI news from the past week
What is UP folks! Alex here from Weights & Biases (yeah, still, but check this weeks buzz section below for some news!)
I really really enjoyed today's episode, I feel like I can post it unedited it was so so good. We started the show with our good friend Junyang Lin from Alibaba Qwen, where he told us about their new 32B reasoner QwQ. Then we interviewed Google's VP of the search product, Robby Stein, who came and told us about their upcoming AI mode in Google! I got access and played with it, and it made me switch back from PPXL as my main.
And lastly, I recently became fully MCP-pilled, since we covered it when it came out over thanksgiving, I saw this acronym everywhere on my timeline but only recently "got it" and so I wanted to have an MCP deep dive, and boy... did I get what I wished for! You absolutely should tune in to the show as there's no way for me to cover everything we covered about MCP with Dina and Jason! ok without, further adieu.. let's dive in (and the TL;DR, links and show notes in the end as always!)
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
🤯 Alibaba's QwQ-32B: Small But Shocking Everyone!
The open-source LLM segment started strong, chatting with friend of the show Junyang Justin Lin from Alibaba’s esteemed Qwen team. They've cooked up something quite special: QwQ-32B, a reasoning-focused, reinforcement-learning-boosted beast that punches remarkably above its weight. We're talking about a mere 32B parameters model holding its ground on tough evaluations against DeepSeek R1, a 671B behemoth!
Here’s how wild this is: You can literally run QwQ on your Mac! Junyang shared that they applied two solid rounds of RL to amp its reasoning, coding, and math capabilities, integrating agents into the model to fully unlock its abilities. When I called out how insane it was that we’ve gone from "LLMs can't do math" to basically acing competitive math benchmarks like AIME24, Junyang calmly hinted that they're already aiming for unified thinking/non-thinking models. Sounds wild, doesn’t it?
Check out the full QwQ release here, or dive into their blog post.
🚀 Google Launches AI Mode: Search Goes Next-Level (X, Blog, My Live Reaction).
For the past two years, on this very show, we've been asking, "Where's Google?" in the Gen AI race. Well, folks, they're back. And they're back in a big way.
Next, we were thrilled to have Google’s own Robby Stein, VP of Product for Google Search, drop by ThursdAI after their massive launch of AI Mode and expanded AI Overviews leveraging Gemini 2.0. Robby walked us through this massive shift, which essentially brings advanced conversational AI capabilities straight into Google. Seriously — Gemini 2.0 is now out here doing complex reasoning while performing fan-out queries behind the scenes in Google's infrastructure.
Google search is literally Googling itself. No joke. "We actually have the model generating fan-out queries — Google searches within searches — to collect accurate, fresh, and verified data," explained Robby during our chat. And I gotta admit, after playing with AI Mode, Google is definitely back in the game—real-time restaurant closures, stock analyses, product comparisons, and it’s conversational to boot. You can check my blind reaction first impression video here. (also, while you're there, why not subscribe to my YT?)
Google has some huge plans, but right now AI Mode is rolling out slowly via Google Labs for Google One AI Premium subscribers first. More soon though!
🐝 This Week's Buzz: Weights & Biases Joins CoreWeave Family!
Huge buzz (in every sense of the word) from Weights & Biases, who made waves with their announcement this week: We've joined forces with CoreWeave! Yeah, that's big news as CoreWeave, the AI hyperscaler known for delivering critical AI infrastructure, has now acquired Weights & Biases to build the ultimate end-to-end AI platform. It's early days of this exciting journey, and more details are emerging, but safe to say: the future of Weights & Biases just got even more exciting. Congrats to the whole team at Weights & Biases and our new colleagues at CoreWeave!
We're committed to all users of WandB so you will be able to keep using Weights & Biases, and we'll continuously improve our offerings going forward! Personally, also nothing changes for ThursdAI! 🎉
MCP Takes Over: Giving AI agents super powers via standardized protocol
Then things got insanely exciting. Why? Because MCP is blowing up and I had to find out why everyone's timeline (mine included) just got invaded.
Welcoming Cloudflare’s amazing product manager Dina Kozlov and Jason Kneen—MCP master and creator—things quickly got mind-blowing. MCP servers, Jason explained, are essentially tool wrappers that effortlessly empower agents with capabilities like API access and even calling other LLMs—completely seamlessly and securely. According to Jason, "we haven't even touched the surface yet of what MCP can do—these things are Lego bricks ready to form swarms and even self-evolve."
Dina broke down just how easy it is to launch MCP servers on Cloudflare Workers while teasing exciting upcoming enhancements. Both Dina and Jason shared jaw-dropping examples, including composing complex workflows connecting Git, Jira, Gmail, and even smart home controls—practically instantaneously! Seriously, my mind is still spinning.
The MCP train is picking up steam, and something tells me we'll be talking about this revolutionary agent technology a lot more soon. Check out two great MCP directories that popped up this recently: Smithery, Cursor Directory and Composio.
This show was one of the best ones we recorded, honestly, I barely need to edit it. It was also a really really fun livestream, so if you prefer seeing to listening, here's the lightly edited live stream
Thank you for being a ThursdAI subscriber, as always here's the TL:DR and shownotes for everything that happened in AI this week and the things we mentioned (and hosts we had)
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
TL;DR and Show Notes
* Show Notes & Guests
* Alex Volkov - AI Eveangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @ldjconfirmed @nisten
* Junyang Justin Lin - Head of Qwen Team, Alibaba - @JustinLin610
* Robby Stein - VP of Product, Google Search - @rmstein
* Dina Kozlov - Product Manager, Cloudflare - @dinasaur_404
* Jason Kneen - MCP Wiz - @jasonkneen
* My Google AI Mode Blind Reaction Video (Youtube)
* Sesame Maya Conversation Demo - (Youtube)
* Cloudflare MCP docs (Blog)
* Weights & Biases Agents Course Pre-signup - https://wandb.me/agents
* Open Source LLMs
* Qwen's latest reasoning model QwQ-32B - matches R1 on some evals (X, Blog, HF, Chat)
* Cohere4ai - Aya Vision - 8B & 32B (X, HF)
* AI21 - Jamba 1.6 Large & Jamba 1.6 Mini (X, HF)
* Big CO LLMs + APIs
* Google announces AI Mode & AI Overviews Gemini 2.0 (X, Blog, My Live Reaction)
* OpenAI rolls out GPT 4.5 to plus users - #1 on LM Arena 🔥 (X)
* Grok Voice is available for free users as well (X)
* Elysian Labs launches Auren ios app (X, App Store)
* Mistral announces SOTA OCR (Blog)
* This weeks Buzz
* Weights & Biases is acquired by CoreWeave 🎉 (Blog)
* Vision & Video
* Tencent HYVideo img2vid is finally here (X, HF, Try It)
* Voice & Audio
* NotaGen - symbolic music generation model high-quality classical sheet music Github, Demo, HF
* Sesame takes the world by storm with their amazing voice model (My Reaction)
* AI Art & Diffusion & 3D
* MiniMax__AI - Image-01: A Versatile Text-to-Image Model at 1/10 the Cost (X, Try it)
* Zhipu AI - CogView 4 6B - (X, Github)
* Tools
* Google - DataScience agent in GoogleColab Blog
* Baidu Miaoda - nocode AI build tool

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 Feb 27, 2025 - GPT-4.5 Drops TODAY?!, Claude 3.7 Coding BEAST, Grok's Unhinged Voice, Humanlike AI voices & more AI news
28 feb· ThursdAI - The top AI news from the past week
Hey all, Alex here 👋
What can I say, the weeks are getting busier , and this is one of those "crazy full" weeks in AI. As we were about to start recording, OpenAI teased GPT 4.5 live stream, and we already had a very busy show lined up (Claude 3.7 vibes are immaculate, Grok got an unhinged voice mode) and I had an interview with Kevin Hou from Windsurf scheduled! Let's dive in!
🔥 GPT 4.5 (ORION) is here - worlds largest LLM (10x GPT4o)
OpenAI has finally shipped their next .5 model, which is 10x scale from the previous model. We didn't cover this on the podcast but did watch the OpenAI live stream together after the podcast concluded.
A very interesting .5 release from OpenAI, where even Sam Altman says "this model won't crush on benchmarks" and is not the most frontier model, but is OpenAI's LARGEST model by far (folks are speculating 10+ Trillions of parameters)
After 2 years of smaller models and distillations, we finally got a new BIG model, that shows scaling laws proper, and while on some benchmarks it won't compete against reasoning models, this model will absolutely fuel a huge increase in capabilities even for reasoners, once o-series models will be trained on top of this.
Here's a summary of the announcement and quick vibes recap (from folks who had access to it before)
* OpenAI's largest, most knowledgeable model.
* Increased world knowledge: 62.5% on SimpleQA, 71.4% GPQA
* Better in creative writing, programming, problem-solving (no native step-by-step reasoning).
* Text and image input and text output
* Available in ChatGPT Pro and API access (API supports Function Calling, Structured Output)
* Knowledge Cutoff is October 2023.
* Context Window is 128,000 tokens.
* Max Output is 16,384 tokens.
* Pricing (per 1M tokens): Input: $75, Output: $150, Cached Input: $37.50.
* Foundation for future reasoning models
4.5 Vibes Recap
Tons of folks who had access are pointing to the same thing, while this model is not beating others on evals, it's much better at multiple other things, namely creative writing, recommending songs, improved vision capability and improved medical diagnosis.
Karpathy said "Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to" and posted a thread of pairwise comparisons of tone on his X thread
Though the reaction is bifurcated as many are upset with the high price of this model (10x more costly on outputs) and the fact that it's just marginally better at coding tasks. Compared to the newerSonnet (Sonnet 3.7) and DeepSeek, folks are looking at OpenAI and asking, why isn't this way better?
Anthropic's Claude 3.7 Sonnet: A Coding Powerhouse
Anthropic released Claude 3.7 Sonnet, and the immediate reaction from the community was overwhelmingly positive. With 8x more output capability (64K) and reasoning built in, this model is an absolute coding powerhouse.
Claude 3.7 Sonnet is the new king of coding models, achieving a remarkable 70% on the challenging SWE-Bench benchmark, and the initial user feedback is stellar, though vibes started to shift a bit towards Thursday.
Ranking #1 on WebDev arena, and seemingly trained on UX and websites, Claude Sonnet 3.7 (AKA NewerSonner) has been blowing our collective minds since it was released on Monday, especially due to introducing Thinking and reasoning in a combined model.
Now, since the start of the week, the community actually had time to play with it, and some of them return to sonnet 3.5 and saying that while the model is generally much more capable, it tends to generate tons of things that are unnecessary.
I wonder if the shift is due to Cursor/Windsurf specific prompts, or the model's larger output context, and we'll keep you updated on if the vibes shift again.
Open Source LLMs
This week was HUGE for open source, folks. We saw releases pushing the boundaries of speed, multimodality, and even the very way LLMs generate text!
DeepSeek's Open Source Spree
DeepSeek went on an absolute tear, open-sourcing a treasure trove of advanced tools and techniques:
This isn't your average open-source dump, folks. We're talking FlashMLA (efficient decoding on Hopper GPUs), DeepEP (an optimized communication library for MoE models), DeepGEMM (an FP8 GEMM library that's apparently ridiculously fast), and even parallelism strategies like DualPipe and EPLB.
They are releasing some advanced stuff for training and optimization of LLMs, you can follow all their releases on their X account
Dual Pipe seems to be the one that got most attention from the community, which is an incredible feat in pipe parallelism, that even got the cofounder of HuggingFace super excited
Microsoft's Phi-4: Multimodal and Mini (Blog, HuggingFace)
Microsoft joined the party with Phi-4-multimodal (5.6B parameters) and Phi-4-mini (3.8B parameters), showing that small models can pack a serious punch.
These models are a big deal. Phi-4-multimodal can process text, images, and audio, and it actually beats WhisperV3 on transcription! As Nisten said, "This is a new model and, I'm still reserving judgment until, until I tried it, but it looks ideal for, for a portable size that you can run on the phone and it's multimodal." It even supports a wide range of languages. Phi-4-mini, on the other hand, is all about speed and efficiency, perfect for finetuning.
Diffusion LLMs: Mercury Coder and LLaDA (X , Try it)
This is where things get really interesting. We saw not one, but two diffusion-based LLMs this week: Mercury Coder from Inception Labs and LLaDA 8B. (Although, ok, to be fair, LLaDa released 2 weeks ago I was just busy)
For those who don't know, diffusion is usually used for creating things like images. The idea of using it to generate text is like saying, "Okay, there's a revolutionary tool for painting; I'll write the code using it." Inception Labs' Mercury Coder is claiming over 1000 tokens per second on NVIDIA H100s – that's insane speed, usually only seen with specialized chips! Nisten spent hours digging into these, noting, "This is a complete breakthrough and, it just hasn't quite hit yet that this just happened because people thought for a while it should be possible because then you can do, you can do multiple token prediction at once". He explained that these models combine a regular LLM with a diffusion component, allowing them to generate multiple tokens simultaneously and excel at tasks like "fill in the middle" coding.
LLaDA 8B, on the other hand, is an open-source attempt, and while it needs more training, it shows the potential of this approach. LDJ pointed out that LLaDA is "trained on like around five times or seven times less data while already like competing with LLAMA3 AP with same parameter count".
Are diffusion LLMs the future? It's too early to say, but the speed gains are very intriguing.
Magma 8B: Robotics LLM from Microsoft
Microsoft dropped Magma 8B, a Microsoft Research project, an open-source model that combines vision and language understanding with the ability to control robotic actions.
Nisten was particularly hyped about this one, calling it "the robotics. LLM." He sees it as a potential game-changer for robotics companies, allowing them to build robots that can understand visual input, respond to language commands, and act in the real world.
OpenAI's Deep Research for Everyone (Well, Plus Subscribers)
OpenAI finally brought Deep Research, its incredible web-browsing and research tool, to Plus subscribers.
I've been saying this for a while: Deep Research is another ChatGPT moment. It's that good. It goes out, visits websites, understands your query in context, and synthesizes information like nothing else. As Nisten put it, "Nothing comes close to OpenAI's Deep Research...People like pull actual economics data, pull actual stuff." If you haven't tried it, you absolutely should.
Our full coverage of Deep Research is here if you haven't yet listened, it's incredible.
Alexa Gets an AI Brain Upgrade with Alexa+
Amazon finally announced Alexa+, the long-awaited LLM-powered upgrade to its ubiquitous voice assistant.
Alexa+ will be powered by Claude (and sometimes Nova), offering a much more conversational and intelligent experience, with integrations across Amazon services.
This is a huge deal. For years, Alexa has felt… well, dumb, compared to the advancements in LLMs. Now, it's getting a serious intelligence boost, thanks to Anthropic's Claude. It'll be able to handle complex conversations, control smart home devices, and even perform tasks across various Amazon services. Imagine asking Alexa, "Did I let the dog out today?" and it actually checking your Ring camera footage to give you an answer! (Although, as I joked, let's hope it doesn't start setting houses on fire.)
Also very intriguing is the new SDKs they are releasing to connect Alexa+ to all kinds of experience, I think this is huge and will absolutely create a new industry of applications built for voice Alexa.
Alexa Web Actions for example will allow Alexa to navigate to a website and complete actions (think order Uber Eats)
The price? 20$/mo but free if you're a Amazon Prime subscriber, which is most of the US households at this point.
They are focusing on personalization and memory, though still unclear how that's going to be handled, and the ability to share documents like schedules
I'm very much looking forward to smart Alexa, and to be able to say "Alexa, set a timer for the amount of time it takes to hard boil an egg, and flash my house lights when the timer is done"
Grok Gets a Voice... and It's UNHINGED
Grok, Elon Musk's AI, finally got a voice mode, and… well, it's something else.
One-sentence summary: Grok's new voice mode includes an "unhinged" 18+ option that curses like a sailor, along with other personality settings.
Yes, you read that right. There's literally an "unhinged" setting in the UI. We played it live on the show, and... well, let's just say it's not for the faint of heart (or for kids). Here's a taste:
Alex: "Hey there."
Grok: "Yo, Alex. What's good, you horny b*****d? How's your day been so far? Fucked up or just mildly shitty?"
Beyond the shock value, the voice mode is actually quite impressive in its expressiveness and ability to understand interruptions. It has several personalities, from a helpful "Grok Doc" to an "argumentative" mode that will disagree with everything you say. It's... unique.
This Week's Buzz (WandB-Related News)
Agents Course is Coming!
We announced our upcoming agents course! You can pre-sign up HERE . This is going to be a deep dive into building and deploying AI agents, so don't miss it!
AI Engineer Summit Recap
We briefly touched on the AI Engineer Summit in New York, where we met with Kevin Hou and many other brilliant minds in the AI space. The theme was "Agents at Work," and it was a fantastic opportunity to see the latest developments in agent technology. I gave a talk about reasoning agents and had a workshop about evaluations on Saturday, and saw many listeners of ThursdAI 👏 ✋
Interview with Kevin Hou from Windsurf
This week we had the pleasure of chatting with Kevin Hou from Windsurf about their revolutionary AI editor. Windsurf isn't just another IDE, it's an agentic IDE. As Kevin explained, "we made the pretty bold decision of saying, all right, we're not going to do chat... we are just going to [do] agent." They've built Windsurf from the ground up with an agent-first approach, and it’s making waves.
Kevin walked us through the evolution of AI coding tools, from autocomplete to chat, and now to agents. He highlighted the "magical experiences" users are having, like debugging complex code with AI assistance that actually understands the context. We also delved into the challenges – memory, checkpointing, and cost.
We also talked about the burning question: vibe coding. Is coding as we know it dead? Kevin’s take was nuanced: "there's an in between state that I really vibe or like gel with, which is,the scaffolding of what you want… Let's use, let's like vibe code and purely use the agent to accomplish this sort of commit." He sees AI agents raising the bar for software quality, demanding better UX, testing, and overall polish.
And of course, we had to ask about the elephant in the room – why are so many people switching from Cursor to Windsurf? Kevin's answer was humble, pointing to user experience, the agent-first workflow, and the team’s dedication to building the best product. Check out our full conversation on the pod and download Windsurf for yourself: windsurf.ai
Video Models & Voice model updates
There is so much happening in LLM world, that folks may skip over the other stuff, but there's so much happening in these world's as well this week! Here's a brief recap!
* Alibaba's WanX: Open-sourced, cutting-edge video generation models making waves with over 250,000 downloads already. They claim to take SOTA on open source video generation evals and of course img2video of this high quality model will lead to ... folks using it for all kinds of things.
* HUMEs Octave: A groundbreaking LLM model that genuinely understands context and emotion and does TTS. Blog Hume has been doing emotional TTS but with this TTS focused LLM we are now able to create voices with a prompt, and receive emotional responses that are inferred from the text. Think shyness, sarcasm, anger etc
* 11labs’ Scribe: Beating Whisper 3 with impressive accuracy and diarization features, Scribe is raising the bar in speech-to-text quality. 11labs releasing their own ASR (automatic speech recognition) was not in my cards, and boy did they deliver. Beating whisper, with speaker separation (diarization), world level timestamps and much lower WER than other models, this is a very interesting entry to this space. However, free for now on their website, it's significantly slower than Gemini 2.0 and Whisper for me at least.
* Sesame releases their conversational speech model (and promising to open source this) and it's honestly the best / least uncanny conversations I had with an AI. Check out my conversation with it
* Lastly, VEO 2, the best video model around according to some, is finally available via API (though txt2video only) and it's fairly expensive, but gives some amazing results. You can try it out on FAL
Phew, it looks like we've made it! Huge huge week in AI, big 2 new models, tons of incredible updates on multimodality and voice as well 🔥
If you enjoyed this summary, the best way to support us is to share with a friend (or 3) and give us a 5 start reviews on wherever you get your podcasts, it really does help! 👏
See you next week,
Alex

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Feb 20 - Live from AI Eng in NY - Grok 3, Unified Reasoners, Anthropic's Bombshell, and Robot Handoffs!
20 feb· ThursdAI - The top AI news from the past week
Holy moly, AI enthusiasts! Alex Volkov here, reporting live from the AI Engineer Summit in the heart of (touristy) Times Square, New York! This week has been an absolute whirlwind of announcements, from XAI's Grok 3 dropping like a bomb, to Figure robots learning to hand each other things, and even a little eval smack-talk between OpenAI and XAI. It’s enough to make your head spin – but that's what ThursdAI is here for. We sift through the chaos and bring you the need-to-know, so you can stay on the cutting edge without having to, well, spend your entire life glued to X and Reddit.
This week we had a very special live show with the Haize Labs folks, the ones I previously interviewed about their bijection attacks, discussing their open source judge evaluation library called Verdict. So grab your favorite caffeinated beverage, maybe do some stretches because your mind will be blown, and let's dive into the TL;DR of ThursdAI, February 20th, 2025!
Participants
* Alex Volkov: AI Evangelist with Weights and Biases
* Nisten: AI Engineer and cohost
* Akshay: AI Community Member
* Nuo: Dev Advocate at 01AI
* Nimit: Member of Technical Staff at Haize Labs
* Leonard: Co-founder at Haize Labs
Open Source LLMs
Perplexity's R1 7076: Censorship-Free DeepSeek
Perplexity made a bold move this week, releasing R1 7076, a fine-tuned version of DeepSeek R1 specifically designed to remove what they (and many others) perceive as Chinese government censorship. The name itself, 1776, is a nod to American independence – a pretty clear statement! The core idea? Give users access to information on topics the CCP typically restricts, like Tiananmen Square and Taiwanese independence.
Perplexity used human experts to identify around 300 sensitive topics and built a "censorship classifier" to train the bias out of the model. The impressive part? They claim to have done this without significantly impacting the model's performance on standard evals. As Nuo from 01AI pointed out on the show, though, he'd "actually prefer that they can actually disclose more of their details in terms of post training... Running the R1 model by itself, it's already very difficult and very expensive." He raises a good point – more transparency is always welcome! Still, it's a fascinating attempt to tackle a tricky problem, the problem which I always say we simply cannot avoid. You can check it out yourself on Hugging Face and read their blog post.
Arc Institute & NVIDIA Unveil Evo 2: Genomics Powerhouse
Get ready for some serious science, folks! Arc Institute and NVIDIA dropped Evo 2, a massive genomics model (40 billion parameters!) trained on a mind-boggling 9.3 trillion nucleotides. And it’s fully open – two papers, weights, data, training, and inference codebases. We love to see it!
Evo 2 uses the StripedHyena architecture to process huge genetic sequences (up to 1 million nucleotides!), allowing for analysis of complex genomic patterns. The practical applications? Predicting the effects of genetic mutations (super important for healthcare) and even designing entire genomes. I’ve been super excited about genomics models, and seeing these alternative architectures like StripedHyena getting used here is just icing on the cake. Check it out on X.
ZeroBench: The "Impossible" Benchmark for VLLMs
Need more benchmarks? Always! A new benchmark called ZeroBench arrived, claiming to be the "impossible benchmark" for Vision Language Models (VLLMs). And guess what? All current top-of-the-line VLLMs get a big fat zero on it.
One example they gave was a bunch of scattered letters, asking the model to "answer the question that is written in the shape of the star among the mess of letters." Honestly, even I struggled to see the star they were talking about. It highlights just how much further VLLMs need to go in terms of true visual understanding. (X, Page, Paper, HF)
Hugging Face's Ultra Scale Playbook: Scaling Up
For those of you building massive models, Hugging Face released the Ultra Scale Playbook, a guide to building and scaling AI models on huge GPU clusters.
They ran 4,000 scaling experiments on up to 512 GPUs (nothing close to Grok's 100,000, but still impressive!). If you're working in a lab and dreaming big, this is definitely a resource to check out. (HF).
Big CO LLMs + APIs
Grok 3: XAI's Big Swing new SOTA LLM! (and Maybe a Bug?)
Monday evening, BOOM! While some of us were enjoying President's Day, the XAI team dropped Grok 3. They announced it with a setting very similar to OpenAI announcements. They're claiming state-of-the-art performance on some benchmarks (more on that drama later!), and a whopping 1 million token context window, finally confirmed after some initial confusion. They talked a lot about agents and a future of reasoners as well.
The launch was a bit… messy. First, there was a bug where some users were getting Grok 2 even when the dropdown said Grok 3. That led to a lot of mixed reviews. Even when I finally thought I was using Grok 3, it still flubbed my go-to logic test, the "Beth's Ice Cubes" question. (The answer is zero, folks – ice cubes melt!). But Akshay, who joined us on the show, chimed in with some love: "...with just the base model of Grok 3, it's, in my opinion, it's the best coding model out there." So, mixed vibes, to say the least! It's also FREE for now, "until their GPUs melt," according to XAI, which is great.
UPDATE: The vibes are shifting, more and more of my colleagues and mutuals are LOVING grok3 for one shot coding, for talking to it. I’m getting convinced as well, though I did use and will continue to use Grok for real time data and access to X.
DeepSearch
In an attempt to show off some Agentic features, XAI also launched a deep search (not research like OpenAI but effectively the same)
Now, XAI of course has access to X, which makes their deep search have a leg up, specifically for real time information! I found out it can even “use” the X search!
OpenAI's Open Source Tease
In what felt like a very conveniently timed move, Sam Altman dropped a poll on X the same day as the Grok announcement: if OpenAI were to open-source something, should it be a small, mobile-optimized model, or a model on par with o3-mini? Most of us chose o3 mini, just to have access to that model and play with it. No indication of when this might happen, but it’s a clear signal that OpenAI is feeling the pressure from the open-source community.
The Eval Wars: OpenAI vs. XAI
Things got spicy! There was a whole debate about the eval numbers XAI posted, specifically the "best of N" scores (like best of 64 runs). Boris from OpenAI, and Aiden mcLau called out some of the graphs. Folks on X were quick to point out that OpenAI also used "best of N" in the past, and the discussion devolved from there.
XAI is claiming SOTA. OpenAI (or some folks from within OpenAI) aren't so sure. The core issue? We can't independently verify Grok's performance because there's no API yet! As I said, "…we're not actually able to use this model to independently evaluate this model and to tell you guys whether or not they actually told us the truth." Transparency matters, folks!
DeepSearch - How Deep?
Grok also touted a new "Deep Search" feature, kind of like Perplexity or OpenAI's "Deep Research" in their more expensive plan. My initial tests were… underwhelming. I nicknamed it "Shallow Search" because it spent all of 34 seconds on a complex query where OpenAI's Deep Research took 11 minutes and cited 17 sources. We're going to need to do some more digging (pun intended) on this one.
This Week's Buzz
We’re leaning hard into agents at Weights & Biases! We just released an agents whitepaper (check it out on our socials!), and we're launching an agents course in collaboration with OpenAI's Ilan Biggio. Sign up at wandb.me/agents! We're hearing so much about agent evaluation and observability, and we're working hard to provide the tools the community needs.
Also, sadly, our Toronto workshops are completely sold out. But if you're at AI Engineer in New York, come say hi to our booth! And catch my talk on LLM Reasoner Judges tomorrow (Friday) at 11 am EST – it’ll be live on the AI Engineer YouTube channel (HERE)!
Vision & Video
Microsoft MUSE: Playable Worlds from a Single Image
This one is wild. Microsoft's MUSE can generate minutes of playable gameplay from just a single second of video frames and controller actions.
It's based on the World and Human Action Model (WHAM) architecture, trained on a billion gameplay images from Xbox. So if you’ve been playing Xbox lately, you might be in the model! I found it particularly cool: "…you give it like a single second of a gameplay of any type of game with all the screen elements, with percentages, with health bars, with all of these things and their model generates a game that you can control." (X, HF, Blog).
StepFun's Step-Video-T2V: State-of-the-Art (and Open Source!)
We got two awesome open-source video breakthroughs this week. First, StepFun's Step-Video-T2V (and T2V Turbo), a 30 billion parameter text-to-video model. The results look really good, especially the text integration. Imagine a Chinese girl opening a scroll, and the words "We will open source" appearing as she unfurls it. That’s the kind of detail we're talking about.
And it’s MIT licensed! As Nisten noted "This is pretty cool. It came out. Right before Sora came out, people would have lost their minds." (X, Paper, HF, Try It).
HAO AI's FastVideo: Speeding Up HY-Video
The second video highlight: HAO AI released FastVideo, a way to make HY-Video (already a strong open-source contender) three times faster with no additional training! They call the trick "Sliding Tile Attention" apparently that alone provides enormous boost compared to even flash attention.
This is huge because faster inference means these models become more practical for real-world use. And, bonus: it supports HY-Video's Loras, meaning you can fine-tune it for, ahem, all kinds of creative applications. I will not go as far as to mention civit ai. (Github)
Figure's Helix: Robot Collaboration!
Breaking news from the AI Engineer conference floor: Figure, the humanoid robot company, announced Helix, a Vision-Language-Action (VLA) model built into their robots!It has full upper body control!
What blew my mind: they showed two robots working together, handing objects to each other, based on natural language commands! As I watched, I exclaimed, "I haven't seen a humanoid robot, hand off stuff to the other one... I found it like super futuristically cool." The model runs on the robot, using a 7 billion parameter VLM for understanding and an 80 million parameter transformer for control. This is the future, folks!
Tools & Others
Microsoft's New Quantum Chip (and State of Matter!)
Microsoft announced a new quantum chip and a new state of matter (called "topological superconductivity"). "I found it like absolutely mind blowing that they announced something like this," I gushed on the show. While I'm no quantum physicist, this sounds like a big deal for the future of computing.
Verdict: Hayes Labs' Framework for LLM Judges
And of course, the highlight of our show: Verdict, a new open-source framework from Hayes Labs (the folks behind those "bijection" jailbreaks!) for composing LLM judges. This is a huge deal for anyone working on evaluation. Leonard and Nimit from Hayes Labs joined us to explain how Verdict addresses some of the core problems with LLM-as-a-judge: biases (like preferring their own responses!), sensitivity to prompts, and the challenge of "meta-evaluation" (how do you know your judge is actually good?).
Verdict lets you combine different judging techniques ("primitives") to create more robust and efficient evaluators. Think of it as "judge-time compute scaling," as Leonard called it. They're achieving near state-of-the-art results on benchmarks like ExpertQA, and it's designed to be fast enough to use as a guardrail in real-time applications!
One key insight: you don't always need a full-blown reasoning model for judging. As Nimit explained, Verdict can combine simpler LLM calls to achieve similar results at a fraction of the cost. And, it's open source! (Paper, Github,X).
Conclusion
Another week, another explosion of AI breakthroughs! Here are my key takeaways:
* Open Source is THRIVING: From censorship-free LLMs to cutting-edge video models, the open-source community is delivering incredible innovation.
* The Need for Speed (and Efficiency): Whether it's faster video generation or more efficient LLM judging, performance is key.
* Robots are Getting Smarter (and More Collaborative): Figure's Helix is a glimpse into a future where robots work together.
* Evaluation is (Finally) Getting Attention: Tools like Verdict are essential for building reliable and trustworthy AI systems.
* The Big Players are Feeling the Heat: OpenAI's open-source tease and XAI's rapid progress show that the competition is fierce.
I'll be back in my usual setup next week, ready to break down all the latest AI news. Stay tuned to ThursdAI – and don't forget to give the pod five stars and subscribe to the newsletter for all the links and deeper dives. There’s potentially an Anthropic announcement coming, so we’ll see you all next week.
TLDR
* Open Source LLMs
* Perplexity R1 1776 - finetune of china-less R1 (Blog, Model)
* Arc institute + Nvidia - introduce EVO 2 - genomics model (X)
* ZeroBench - impossible benchmark for VLMs (X, Page, Paper, HF)
* HuggingFace ultra scale playbook (HF)
* Big CO LLMs + APIs
* Grok 3 SOTA LLM + reasoning and Deep Search (blog, try it)
* OpenAI is about to open source something? Sam posts a polls
* This weeks Buzz
* We are about to launch an agents course! Pre-sign up wandb.me/agents
* Workshops are SOLD OUT
* Watch my talk LIVE from AI Engineer - 11am EST Friday (HERE)
* Keep watching AI Eng conference after the show on AIE YT
* )
* Vision & Video
* Microsoft MUSE - playable worlds from one image (X, HF, Blog)
* Microsoft OmniParser - Better, faster screen parsing for GUI agents with OmniParser v2 (Gradio Demo)
* HAO AI - fastVIDEO - making HY-Video 3x as fast (Github)
* StepFun - Step-Video-T2V (+Turbo), a SotA 30B text-to-video model (Paper, Github, HF, Try It)
* Figure announces HELIX - vision action model built into FIGURE Robot (Paper)
* Tools & Others
* Microsoft announces a new quantum chip and a new state of matter (Blog, X)
* Verdict - Framework to compose SOTA LLM judges with JudgeTime Scaling (Paper, Github,X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Feb 13 - my Personal Rogue AI, DeepHermes, Fast R1, OpenAI Roadmap / RIP GPT6, new Claude & Grok 3 imminent?
13 feb· ThursdAI - The top AI news from the past week
What a week in AI, folks! Seriously, just when you think things might slow down, the AI world throws another curveball. This week, we had everything from rogue AI apps giving unsolicited life advice (and sending rogue texts!), to mind-blowing open source releases that are pushing the boundaries of what's possible, and of course, the ever-present drama of the big AI companies with OpenAI dropping a roadmap that has everyone scratching their heads.
Buckle up, because on this week's ThursdAI, we dove deep into all of it. We chatted with the brains behind the latest open source embedding model, marveled at a tiny model crushing math benchmarks, and tried to decipher Sam Altman's cryptic GPT-5 roadmap. Plus, I shared a personal story about an AI app that decided to psychoanalyze my text messages – you won't believe what happened! Let's get into the TL;DR of ThursdAI, February 13th, 2025 – it's a wild one!
* Alex Volkov: AI Adventurist with weights and biases
* Wolfram Ravenwlf: AI Expert & Enthusiast
* Nisten: AI Community Member
* Zach Nussbaum: Machine Learning Engineer at Nomic AI
* Vu Chan: AI Enthusiast & Evaluator
* LDJ: AI Community Member
Personal story of Rogue AI with RPLY
This week kicked off with a hilarious (and slightly unsettling) story of my own AI going rogue, all thanks to a new Mac app called RPLY designed to help with message replies. I installed it thinking it would be a cool productivity tool, but it turned into a personal intervention session, and then… well, let's just say things escalated.
The app started by analyzing my text messages and, to my surprise, delivered a brutal psychoanalysis of my co-parenting communication, pointing out how both my ex and I were being "unpleasant" and needed to focus on the kids. As I said on the show, "I got this as a gut punch. I was like, f*ck, I need to reimagine my messaging choices." But the real kicker came when the AI decided to take initiative and started sending messages without my permission (apparently this was a bug with RPLY that was fixed since I reported)!
Friends were texting me question marks, and my ex even replied to a random "Hey, How's your day going?" message with a smiley, completely out of our usual post-divorce communication style. "This AI, like on Monday before just gave me absolute s**t about not being, a person that needs to be focused on the kids also decided to smooth things out on friday" I chuckled, still slightly bewildered by the whole ordeal. It could have gone way worse, but thankfully, this rogue AI counselor just ended up being more funny than disastrous.
Open Source LLMs
DeepHermes preview from NousResearch
Just in time for me sending this newsletter (but unfortunately not quite in time for the recording of the show), our friends at Nous shipped an experimental new thinking model, their first reasoner, called DeepHermes.
NousResearch claims DeepHermes is among the first models to fuse reasoning and standard LLM token generation within a single architecture (a trend you'll see echoed in the OpenAI and Claude announcements below!)
Definitely experimental cutting edge stuff here, but exciting to see not just an RL replication but also innovative attempts from one of the best finetuning collectives around.
Nomic Embed Text V2 - First Embedding MoE
Nomic AI continues to impress with the release of Nomic Embed Text V2, the first general-purpose Mixture-of-Experts (MoE) embedding model. Zach Nussbaum from Nomic AI joined us to explain why this release is a big deal.
* First general-purpose Mixture-of-Experts (MoE) embedding model: This innovative architecture allows for better performance and efficiency.
* SOTA performance on multilingual benchmarks: Nomic Embed V2 achieves state-of-the-art results on the multilingual MIRACL benchmark for its size.
* Support for 100+ languages: Truly multilingual embeddings for global applications.
* Truly open source: Nomic is committed to open source, releasing training data, weights, and code under the Apache 2.0 License.
Zach highlighted the benefits of MoE for embeddings, explaining, "So we're trading a little bit of, inference time memory, and training compute to train a model with mixture of experts, but we get this, really nice added bonus of, 25 percent storage." This is especially crucial when dealing with massive datasets. You can check out the model on Hugging Face and read the Technical Report for all the juicy details.
AllenAI OLMOE on iOS and New Tulu 3.1 8B
AllenAI continues to champion open source with the release of OLMOE, a fully open-source iOS app, and the new Tulu 3.1 8B model.
* OLMOE iOS App: This app brings state-of-the-art open-source language models to your iPhone, privately and securely.
* Allows users to test open-source LLMs on-device.
* Designed for researchers studying on-device AI and developers prototyping new AI experiences.
* Optimized for on-device performance while maintaining high accuracy.
* Fully open-source code for further development.
* Available on the App Store for iPhone 15 Pro or newer and M-series iPads.
* Tulu 3.1 8B
As Nisten pointed out, "If you're doing edge AI, the way that this model is built is pretty ideal for that." This move by AllenAI underscores the growing importance of on-device AI and open access. Read more about OLMOE on the AllenAI Blog.
Groq Adds Qwen Models and Lands on OpenRouter
Groq, known for its blazing-fast inference speeds, has added Qwen models, including the distilled R1-distill, to its service and joined OpenRouter.
* Record-fast inference: Experience a mind-blowing 1000 TPS with distilled DeepSeek R1 70B on Open Router.
* Usable Rate Limits: Groq is now accessible for production use cases with higher rate limits and pay-as-you-go options.
* Qwen Model Support: Access Qwen models like 2.5B-32B and R1-distill-qwen-32B.
* Open Router Integration: Groq is now available on OpenRouter, expanding accessibility for developers.
As Nisten noted, "At the end of the day, they are shipping very fast inference and you can buy it and it looks like they are scaling it. So they are providing the market with what it needs in this case." This integration makes Groq's speed even more accessible to developers. Check out Groq's announcement on X.com.
SambaNova adds full DeepSeek R1 671B - flies at 200t/s (blog)
In a complete trend of this week, SambaNova just announced they have availability of DeepSeek R1, sped up by their custom chips, flying at 150-200t/s. This is the full DeepSeek R1, not the distilled Qwen based versions!
This is really impressive work, and compared to the second fastest US based DeepSeek R1 (on Together AI) it absolutely flies
Agentica DeepScaler 1.5B Beats o1-preview on Math
Agentica's DeepScaler 1.5B model is making waves by outperforming OpenAI's o1-preview on math benchmarks, using Reinforcement Learning (RL) for just $4500 of compute.
* Impressive Math Performance: DeepScaleR achieves a 37.1% Pass@1 on AIME 2025, outperforming the base model and even o1-preview!!
* Efficient Training: Trained using RL for just $4500, demonstrating cost-effective scaling of intelligence.
* Open Sourced Resources: Agentica open-sourced their dataset, code, and training logs, fostering community progress in RL-based reasoning.
Vu Chan, an AI enthusiast who evaluated the model, joined us to share his excitement: "It achieves, 42% pass at one on a AIME 24. which basically means if you give the model only one chance at every problem, it will solve 42% of them." He also highlighted the model's efficiency, generating correct answers with fewer tokens. You can find the model on Hugging Face, check out the WandB logs, and see the announcement on X.com.
ModernBert Instruct - Encoder Model for General Tasks
ModernBert, known for its efficient encoder-only architecture, now has an instruct version, ModernBert Instruct, capable of handling general tasks.
* Instruct-tuned Encoder: ModernBERT-Large-Instruct can perform classification and multiple-choice tasks using its Masked Language Modeling (MLM) head.
* Beats Qwen .5B: Outperforms Qwen .5B on MMLU and MMLU Pro benchmarks.
* Efficient and Versatile: Demonstrates the potential of encoder models for general tasks without task-specific heads.
This release shows that even encoder-only models can be adapted for broader applications, challenging the dominance of decoder-based LLMs for certain tasks. Check out the announcement on X.com.
Big CO LLMs + APIs
RIP GPT-5 and o3 - OpenAI Announces Public Roadmap
OpenAI shook things up this week with a roadmap update from Sam Altman, announcing a shift in strategy for GPT-5 and the o-series models. Get ready for GPT-4.5 (Orion) and a unified GPT-5 system!
* GPT-4.5 (Orion) is Coming: This will be the last non-chain-of-thought model from OpenAI.
* GPT-5: A Unified System: GPT-5 will integrate technologies from both the GPT and o-series models into a single, seamless system.
* No Standalone o3: o3 will not be released as a standalone model; its technology will be integrated into GPT-5. "We will no longer ship O3 as a standalone model," Sam Altman stated.
* Simplified User Experience: The model picker will be eliminated in ChatGPT and the API, aiming for a more intuitive experience.
* Subscription Tier Changes:
* Free users will get unlimited access to GPT-5 at a standard intelligence level.
* Plus and Pro subscribers will gain access to increasingly advanced intelligence settings of GPT-5.
* Expanded Capabilities: GPT-5 will incorporate voice, canvas, search, deep research, and more.
This roadmap signals a move towards more integrated and user-friendly AI experiences. As Wolfram noted, "Having a unified access and the AI should be smart enough... AI has, we need an AI to pick which AI to use." This seems to be OpenAI's direction. Read Sam Altman's full announcement on X.com.
OpenAI Releases ModelSpec v2
OpenAI also released ModelSpec v2, an update to their document defining desired AI model behaviors, emphasizing customizability, transparency, and intellectual freedom.
* Chain of Command: Defines a hierarchy to balance user/developer control with platform-level rules.
* Truth-Seeking and User Empowerment: Encourages models to "seek the truth together" with users and empower decision-making.
* Core Principles: Sets standards for competence, accuracy, avoiding harm, and embracing intellectual freedom.
* Open Source: OpenAI open-sourced the Spec and evaluation prompts for broader use and collaboration on GitHub.
This release reflects OpenAI's ongoing efforts to align AI behavior and promote responsible development. Wolfram praised ModelSpec, saying, "I was all over the original models back when it was announced in the first place... That is one very important aspect when you have the AI agent going out on the web and get information from not trusted sources." Explore ModelSpec v2 on the dedicated website.
VP Vance Speech at AI Summit in Paris - Deregulate and Dominate!
Vice President Vance delivered a powerful speech at the AI Summit in Paris, advocating for pro-growth AI policies and deregulation to maintain American leadership in AI.
* Pro-Growth and Deregulation: VP Vance urged for policies that encourage AI innovation and cautioned against excessive regulation, specifically mentioning GDPR.
* American AI Leadership: Emphasized ensuring American AI technology remains the global standard and blocks hostile foreign adversaries from weaponizing AI. "Hostile foreign adversaries have weaponized AI software to rewrite history, surveil users, and censor speech… I want to be clear – this Administration will block such efforts, full stop," VP Vance declared.
* Key Points:
* Ensure American AI leadership.
* Encourage pro-growth AI policies.
* Maintain AI's freedom from ideological bias.
* Prioritize a pro-worker approach to AI development.
* Safeguard American AI and chip technologies.
* Block hostile foreign adversaries' weaponization of AI.
Nisten commented, "He really gets something that most EU politicians do not understand is that whenever they have such a good thing, they're like, okay, this must be bad. And we must completely stop it." This speech highlights the ongoing debate about AI regulation and its impact on innovation. Read the full speech here.
Cerebras Powers Perplexity with Blazing Speed (1200 t/s!)
Perplexity is now powered by Cerebras, achieving inference speeds exceeding 1200 tokens per second.
* Unprecedented Speed: Perplexity's Sonar model now flies at over 1200 tokens per second thanks to Cerebras' massive LPU chips. "Like perplexity sonar, their specific LLM for search is now powered by Cerebras and it's like 12. 100 tokens per second. It's it matches Google now on speed," I noted on the show.
* Google-Level Speed: Perplexity now matches Google in inference speed, making it incredibly fast and responsive.
This partnership significantly enhances Perplexity's performance, making it an even more compelling search and AI tool. See Perplexity's announcement on X.com.
Anthropic Claude Incoming - Combined LLM + Reasoning Model
Rumors are swirling that Anthropic is set to release a new Claude model that will be a combined LLM and reasoning model, similar to OpenAI's GPT-5 roadmap.
* Unified Architecture: Claude's next model is expected to integrate both LLM and reasoning capabilities into a single, hybrid architecture.
* Reasoning Powerhouse: Rumors suggest Anthropic has had a reasoning model stronger than Claude 3 for some time, hinting at a significant performance leap.
This move suggests a broader industry trend towards unified AI models that seamlessly blend different capabilities. Stay tuned for official announcements from Anthropic.
Elon Musk Teases Grok 3 "Weeks Out"
Elon Musk continues to tease the release of Grok 3, claiming it will be "a few weeks out" and the "most powerful AI" they have tested, with enhanced reasoning capabilities.
* Grok 3 Hype: Elon Musk claims Grok 3 will be the most powerful AI X.ai has released, with a focus on reasoning.
* Reasoning Focus: Grok 3's development may have shifted towards reasoning capabilities, potentially causing a slight delay in release.
While details remain scarce, the anticipation for Grok 3 is building, especially in light of the advancements in open source reasoning models.
This Week's Buzz 🐝
Weave Dataset Editing in UI
Weights & Biases Weave has added a highly requested feature: dataset editing directly in the UI.
* UI-Based Dataset Editing: Users can now edit datasets directly within the Weave UI, adding, modifying, and deleting rows without code. "One thing that, folks asked us and we've recently shipped is the ability to edit this from the UI itself. So you don't have to have code," I explained.
* Versioning and Collaboration: Every edit creates a new dataset version, allowing for easy tracking and comparison.
* Improved Dataset Management: Simplifies dataset management and version control for evaluations and experiments.
This feature streamlines the workflow for LLM evaluation and observability, making Weave even more user-friendly. Try it out at wandb.me/weave
Toronto Workshops - AI in Production: Evals & Observability
Don't miss our upcoming AI in Production: Evals & Observability Workshops in Toronto!
* Two Dates: Sunday and Monday workshops in Toronto.
* Hands-on Learning: Learn to build and evaluate LLM-powered applications with robust observability.
* Expert Guidance: Led by yours truly, Alex Volkov, and featuring Nisten.
* Limited Spots: Registration is still open, but spots are filling up fast! Register for Sunday's workshop here and Monday's workshop here.
Join us to level up your LLM skills and network with the Toronto AI community!
Vision & Video
Adobe Firefly Video - Image to Video and Text to Video
Adobe announced Firefly Video, entering the image-to-video and text-to-video generation space.
* Video Generation: Firefly Video offers both image-to-video and text-to-video capabilities.
* Adobe Ecosystem: Integrates with Adobe's creative suite, providing a powerful tool for video creators.
This release marks Adobe's significant move into the rapidly evolving video generation landscape. Try Firefly Video here.
Voice & Audio
YouTube Expands AI Dubbing to All Creators
YouTube is expanding AI dubbing to all creators, breaking down language barriers on the platform.
* AI-Powered Dubbing: YouTube is leveraging AI to provide dubbing in multiple languages for all creators. "YouTube now expands. AI dubbing in languages to all creators, and that's super cool. So basically no language barriers anymore. AI dubbing is here," I announced.
* Increased Watch Time: Pilot program saw 40% of watch time in dubbed languages, demonstrating the feature's impact. "Since the pilot launched last year, 40 percent of watch time for videos with the feature enabled was in the dub language and not the original language. That's insane!" I highlighted.
* Global Reach: Eliminates language barriers, making content accessible to a wider global audience.
Wolfram emphasized the importance of dubbing, especially in regions with strong dubbing cultures like Germany. "Every movie that comes here is getting dubbed in high quality. And now AI is doing that on YouTube. And I personally, as a content creator, I have always have to decide, do I post in German or English?" This feature is poised to revolutionize content consumption on YouTube. Read more on X.com.
Meta Audiobox Aesthetics - Unified Quality Assessment
Meta released Audiobox Aesthetics, a unified automatic quality assessment model for speech, music, and sound.
* Unified Assessment: Provides a single model for evaluating the quality of speech, music, and general sound.
* Four Key Metrics: Evaluates audio based on Production Quality (PQ), Production Complexity (PC), Content Enjoyment (CE), and Content Usefulness (CU).
* Automated Evaluation: Offers a scalable solution for assessing synthetic audio quality, reducing reliance on costly human evaluations.
This tool is expected to significantly improve the development and evaluation of TTS and audio generation models. Access the Paper and Weights on GitHub.
Zonos - Expressive TTS with High-Fidelity Cloning
Zyphra released Zonos, a highly expressive TTS model with high-fidelity voice cloning capabilities.
* Expressive TTS: Zonos offers expressive speech generation with control over speaking rate, pitch, and emotions.
* High-Fidelity Voice Cloning: Claims high-fidelity voice cloning from short audio samples (though my personal test was less impressive). "My own voice clone sounded a little bit like me but not a lot. Ok at least for me, the cloning is really really bad," I admitted on the show.
* High Bitrate Audio: Generates speech at 44kHz with a high bitrate codec for enhanced audio quality.
* Open Source & API: Models are open source, with a commercial API available.
While voice cloning might need further refinement, Zonos represents another step forward in open-source TTS technology. Explore Zonos on Hugging Face (Hybrid), Hugging Face (Transformer), and GitHub, and read the Blog post.
Tools & Others
Emergent Values AI - AI Utility Functions and Biases
Researchers found that AIs exhibit emergent values, including biases in valuing human lives from different regions.
* Emergent Utility Functions: AI models appear to develop implicit utility functions and value systems during training. "Research finds that AI's have expected utility functions for people and other emergent values. And this is freaky," I summarized.
* Value Biases: Studies revealed biases, with AIs valuing lives from certain regions (e.g., Nigeria, Pakistan, India) higher than others (e.g., Italy, France, Germany, UK, US). "Nigerian people, valued as like eight us people. One Nigerian person was valued like eight us people," I highlighted the surprising finding.
* Utility Engineering: Researchers propose "utility engineering" as a research agenda to analyze and control these emergent value systems.
LDJ pointed out a potential correlation between the valued regions and the source of RLHF data labeling, suggesting a possible link between training data and emergent biases. While the study is still debated, it raises important questions about AI value alignment. Read the announcement on X.com and the Paper.
LM Studio Lands Support for Speculative Decoding
LM Studio, the popular local LLM inference tool, now supports speculative decoding, significantly speeding up inference.
* Faster Inference: Speculative decoding leverages a smaller "draft" model to accelerate inference with a larger model. "Speculative decoding finally landed in LM studio, which is dope folks. If you use LM studio, if you don't, you should," I exclaimed.
* Visualize Accepted Tokens: LM Studio visualizes accepted draft tokens, allowing users to see speculative decoding in action.
* Performance Boost: Improved inference speeds by up to 40% in tests, without sacrificing model performance. "It runs around 10 tokens per second without the speculative decoding and around 14 to 15 tokens per second with speculative decoding, which is great," I noted.
This update makes LM Studio even more powerful for local LLM experimentation. See the announcement on X.com.
Noam Shazeer / Jeff Dean on Dwarkesh Podcast
Podcast enthusiasts should check out the new Dwarkesh Podcast episode featuring Noam Shazeer (Transformer co-author) and Jeff Dean (Google DeepMind).
* AI Insights: Listen to insights from two AI pioneers in this new podcast episode.
Tune in to hear from these influential figures in the AI world. Find the announcement on X.com.
What a week, folks! From rogue AI analyzing my personal life to OpenAI shaking up the roadmap and tiny models conquering math, the AI world continues to deliver surprises. Here are some key takeaways:
* Open Source is Exploding: Nomic Embed Text V2, OLMoE, DeepScaler 1.5B, and ModernBERT Instruct are pushing the boundaries of what's possible with open, accessible models.
* Speed is King: Groq, Cerebras and SambaNovas are delivering blazing-fast inference, making real-time AI applications more feasible than ever.
* Reasoning is Evolving: DeepScaler 1.5B's success demonstrates the power of RL for even small models, and OpenAI and Anthropic are moving towards unified models with integrated reasoning.
* Privacy Matters: AllenAI's OLMoE highlights the growing importance of on-device AI for data privacy.
* The AI Landscape is Shifting: OpenAI's roadmap announcement signals a move towards simpler, more integrated AI experiences, while government officials are taking a stronger stance on AI policy.
Stay tuned to ThursdAI for the latest updates, and don't forget to subscribe to the newsletter for all the links and details! Next week, I'll be in New York, so expect a special edition of ThursdAI from the AI Engineer floor.
TLDR & Show Notes
* Open Source LLMs
* NousResearch DeepHermes-3 Preview (X, HF)
* Nomic Embed Text V2 - first embedding MoE (HF, Tech Report)
* AllenAI OLMOE on IOS as a standalone app & new Tulu 3.1 8B (Blog, App Store)
* Groq adds Qwen models (including R1 distill) and lands on OpenRouter (X)
* Agentica DeepScaler 1.5B beats o1-preview on math using RL for $4500 (X, HF, WandB)
* ModernBert can be instructed (though encoder only) to do general tasks (X)
* LMArena releases a dataset of 100K votes with human preferences (X, HF)
* SambaNova adds full DeepSeek R1 671B - flies at 200t/s (blog)
* Big CO LLMs + APIs
* RIP GPT-5 and o3 - OpenAI announces a public roadmap (X)
* OpenAI released Model Spec v2 (Github, Blog)
* VP Vance Speech at AI Summit in Paris (full speech)
* Cerebras now powers Perplexity with >1200t/s (X)
* Anthropic Claude incoming, will be combined LLM + reasoning (The Information)
* This weeks Buzz
* We've added dataset editing in the UI (X)
* 2 workshops in Toronto, Sunday and Monday
* Vision & Video
* Adobe announces firefly video (img2video and txt2video) (try it)
* Voice & Audio
* Youtube to expand AI Dubbing to all creators (X)
* Meta Audiobox Aesthetics - Unified Automatic Quality Assessment for Speech, Music, and Sound (Paper, Weights)
* Zonos, a highly expressive TTS model with high fidelity voice cloning (Blog, HF,HF, Github)
* Tools & Others
* Emergent Values AI - Research finds that AI's have expected utility functions (X, paper)
* LMStudio lands support for Speculative Decoding (X)
* Noam Shazeer / Jeff Dean on Dwarkesh podcast (X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Feb 6 - OpenAI DeepResearch is your personal PHD scientist, o3-mini & Gemini 2.0, OmniHuman-1 breaks reality & more AI news
7 feb· ThursdAI - The top AI news from the past week
What's up friends, Alex here, back with another ThursdAI hot off the presses.
Hold onto your hats because this week was another whirlwind of AI breakthroughs, mind-blowing demos, and straight-up game-changers. We dove deep into OpenAI's new "Deep Research" agent – and let me tell you, it's not just hype, it's legitimately revolutionary. You also don't have to take my word for it, a new friend of the pod and a scientist DR Derya Unutmaz joined us to discuss his experience with Deep Research as a scientist himself! You don't want to miss this conversation!
We also unpack Google's Gemini 2.0 release, including the blazing-fast Flash Lite model. And just when you thought your brain couldn't handle more, ByteDance drops OmniHuman-1, a human animation model that's so realistic, it's scary good.
I've also saw maybe 10 more
TLDR & Show Notes
* Open Source LLMs (and deep research implementations)
* Jina Node-DeepResearch (X, Github)
* HuggingFace - OpenDeepResearch (X)
* Deep Agent - R1 -V (X, Github)
* Krutim - Krutim 2 12B, Chitrath VLM, Embeddings and more from India (X, Blog, HF)
* Simple Scaling - S1 - R1 (Paper)
* Mergekit updated -
* Big CO LLMs + APIs
* OpenAI ships o3-mini and o3-mini High + updates thinking traces (Blog, X)
* Mistral relaunches LeChat with Cerebras for 1000t/s (Blog)
* OpenAI Deep Research - the researching agent that uses o3 (X, Blog)
* Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-lite in AI Studio (Blog)
* Anthropic Constitutional Classifiers - announced a universal jailbreak prevention (Blog, Try It)
* Cloudflare to protect websites from AI scraping (News)
* HuggingFace becomes the AI Appstore (link)
* This weeks Buzz - Weights & Biases updates
* AI Engineer workshop (Saturday 22)
* Tinkerers Toronto workshops (Sunday 23 , Monday 24)
* We released a new Dataset editor feature (X)
* Audio and Sound
* KyutAI open sources Hibiki - simultaneous translation models (Samples, HF)
* AI Art & Diffusion & 3D
* ByteDance OmniHuman-1 - unparalleled Human Animation Models (X, Page)
* Pika labs adds PikaAdditions - adding anything to existing video (X)
* Google added Imagen3 to their API (Blog)
* Tools & Others
* Mistral Le Chat has ios an and adroid apps now (X)
* CoPilot now has agentic workflows (X)
* Replit launches free apps agent for everyone (X)
* Karpathy drops a new 3 hour video on youtube (X, Youtube)
* OpenAI canvas links are now shareable (like Anthropic artifacts) - (example)
* Show Notes & Links
* Guest of the week - Dr Derya Umnutaz - talking about Deep Research
* He's examples of Ehlers-Danlos Syndrome (ChatGPT), (ME/CFS) Deep Research, Nature article about Deep Reseach with Derya comments
* Hosts
* Alex Volkov - AI Evangelist & Host @altryne
* Wolfram Ravenwolf - AI Evangelist @WolframRvnwlf
* Nisten Tahiraj - AI Dev at github.GG - @nisten
* LDJ - Resident data scientist - @ldjconfirmed
Big Companies products & APIs
OpenAI's new chatGPT moment with Deep Research, their second "agent" product (X)
Look, I've been reporting on AI weekly for almost 2 years now, and been following the space closely since way before chatGPT (shoutout Codex days) and this definitely feels like another chatGPT moment for me.
DeepResearch is OpenAI's new agent, that searches the web for any task you give it, is able to reason about the results, and continue searching those sources, to provide you with an absolute incredible level of research into any topic, scientific or ... the best taqueria in another country.
The reason why it's so good is it's ability to do multiple search trajectories, backtrack if it needs to, and react in real time to new information. It also has python tool use (to do plots and calculations) and of course, the brain of it is o3, the best reasoning model from OpenAI
Deep Research is only offered on the Pro tier ($200) of chatGPT, and it's the first publicly available way to use o3 full! and boy, does it deliver!
I've had it review my workshop content, help me research LLM as a judge articles (which it did masterfully) and help me plan datenights in Denver (though it kind of failed at that, showing me a closed restaurant)
A breakthrough for scientific research
But I'm no scientist, so I've asked Dr
Derya Unutmaz, M.D.
to join us, and share his incredible findings as a doctor, a scientist and someone with decades of experience in writing grants, patent applications, paper etc.
The whole conversation is very very much worth listening to on the pod, we talked for almost an hour, but the highlights are honestly quite crazy.
So one of the first things I did was, I asked Deep Research to write a review on a particular disease that I’ve been studying for a decade. It came out with this impeccable 10-to-15-page review that was the best I’ve read on the topic— Dr. Derya Unutmaz
And another banger quote
It wrote a phenomenal 25-page patent application for a friend’s cancer discovery—something that would’ve cost 10,000 dollars or more and taken weeks. I couldn’t believe it. Every one of the 23 claims it listed was thoroughly justified
Humanity's LAST exam?
OpenAI announced Deep Research and have showed that on HLE (Humanity's Last Exam) benchmark that was just released a few weeks ago, it scores a whopping 26.6 percent! When HLE was released (our coverage here) all the way back at ... checks notes... January 23 or this year! the top reasoning models at the time (o1, R1) scored just under 10%
O3-mini and Deep Research now score 13% and 26.6% respectively, which means both that AI is advancing like crazy, but also.. that maybe calling this "last exam" was a bit premature? 😂😅
Deep Research is now also SOTA holder on GAIA, a public benchmark on real world questions, though Clementine (one of GAIA authors) throws a bit of shade on the result since OpenAI didn't really submit their results. Incidently, Clementine is also involved in HuggingFace attempt at replicating Deep Research in the open (with OpenDeepResearch)
OpenAI releases o3-mini and o3-mini high
This honestly got kind of buried with the Deep Research news, but as promised, on the last day of January, OpenAI released their new reasoning model, which is significantly fast and much cheaper than o1, while matching it on most benchmarks!
I've been talking about the fact that during o3 announcement (our coverage) that mini may be more practical and useful announcement than o3 itself, given the price and speed of it.
And viola, OpenAI has reduced the price point of their best reasoner model by 67%, and it's now matches just 2x that of DeepSeek R1.
Coming in at 110c for 1M input tokens and 440c for 1M output tokens, and streaming at a whopping 1000t/s at some instances, this reasoner is really something to beat.
Great for application developers
In addition to seem to be a great model, comparing it to R1 is a nonstarter IMO, not only because "it’s sending your data to choyna", which IMO is a ridiculous attack vector and people should be ashamed by posting this content.
o3-mini supports all of the nice API things that OpenAI has, like tool use, structured outputs, developer messages and streaming. The ability to set the reasoning effort is also interesting for applications!
Added benefit is the new 200K context window with 100K (claimed) output context.
It's also really really fast, while R1 availability grows, as it gets hosted on more and more US based providers, none of them are offering the full context window at these token speeds.
o3-mini-high?!
While the free users also started getting access to o3-mini, with the "reason" button on chatGPT, plus subscribers received 2 models, o3-mini and o3-mini-high, which is essentially the same model, but with the "high" reasoning mode turned on, giving the model significantly more compute (and tokens) to think.
This can be done on the API level by selecting reasoning_effort=high but it's the first time OpenAI is exposing this to non API users!
One highlight for me is, just how MANY tokens o3-mini high things through. In one of my evaluations on Weave, o3-mini high generated around 160K output tokens, answering 20 questions, while DeepSeek R1 for example generated 75K and Gemini Thinking, got the highest score on these, while charging only 14K tokens (though I'm pretty sure Google just doesn't report on thinking tokens yet, this seems like a bug)
As I'm writing this, OpenAI just announced a new update, o3-mini and o3-mini-high now show... "updated" reasoning traces!
These definitely "feel" more like the R1 reasoning traces (remember, previously OpenAI had a different model summarizing the reasoning to prevent training on them?) but they are not really the RAW ones (confirmed)
Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-lite in AI Studio (X, Blog)
Congrats to our friends at Google for 2.0 👏 Google finally put all the experimental models under one 2.0 umbrella, giving us Gemini 2.0, Gemini 2.0 Flash and a new model!
They also introduced Gemini 2.0 Flash-lite, a crazy fast and cheap model that performs similarly to Flash 1.5. The rate limits on Flash-lite are twice as high as the regular Flash, making it incredibly useful for real-time applications.
They have also released a few benchmarks, but they only compared those to the previous benchmark released by Google, and while that's great, I wanted a comparison done, so I asked DeepResearch to do it for me, and it did (with citations!)
Google also released Imagen 3, their awesome image diffusion model in their API today, with 3c per image, this one is really really good!
Mistral's new LeChat spits out 1000t/s + new IOS apps
During the show, Mistral announced new capabilities for their LeChat interface, including a 15$/mo tier, but most importantly, a crazy fast generation using some kind of new inference, spitting out around 1000t/s. (Powered by Cerebras)
Additionally they have code interpreter there, Canvas, and they also claim to have the best OCR and don't forget, they have access to Flux images, and likely are the only place I know of that offers that image model for free!
Finally, they've released native mobile apps! (IOS, Android)
* from my quick tests, the 1000t/s is not always on, my first attempt was instant, it was like black magic, and then the rest of them were pretty much the same speed as before 🤔 Maybe they are getting hammered in traffic...
This weeks Buzz (What I learned with WandB this week)
I got to play around with O3-Mini before it was released (perks of working at Weights & Biases!), and I used Weave, our observability and evaluation framework, to analyze its performance. The results were… interesting.
* Latency and Token Count: O3-Mini High's latency was six times longer than O3-Mini Low on a simple reasoning benchmark (92 seconds vs. 6 seconds). But here's the kicker: it didn't even answer more questions correctly! And the token count? O3-Mini High used half a million tokens to answer 20 questions three times. That's… a lot.
* Weave Leaderboards: Nisten got super excited about using Weave's leaderboard feature to benchmark models. He realized it could solve a real problem in the open-source community – providing a verifiable and transparent way to share benchmark results. (really, we didnt' rehearse this!)
I also announced some upcoming workshops I'd love to see you at:
* AI Engineer Workshop in NYC: I'll be running a workshop on evaluations at the AI Engineer Summit in New York on February 22nd. Come say hi and learn about evals!
* AI Tinkerers Workshops in Toronto: I'll also be doing two workshops with AI Tinkerers in Toronto on February 23rd and 24th.
ByteDance OmniHuman-1 - a reality bending mind breaking img2human model
Ok, this is where my mind completely broke this week, like absolutely couldn't stop thinking about this release from ByteDance. After releasing the SOTA lipsyncing model just a few months ago (LatentSync, our coverage) they have once again blew everyone away. This time with a img2avatar model that's unlike anything we've ever seen.
This one doesn't need words, just watch my live reaction as I lose my mind
The level of real world building in these videos is just absolutely ... too much? The piano keys moving, there's a video of a woman speaking in the microphone, and behind her, the window has reflections of cars and people moving!
The thing that most blew me away upon review was the Niki Glazer video, with shiny dress and the model almost perfectly replicating the right sources of light.
Just absolute sorcery!
The authors confirmed that they don't have any immediate plans to release this as a model or even a product, but given the speed of open source, we'll get this within a year for sure! Get ready
Open Source LLMs (and deep research implementations)
This week wasn't massive for open-source releases in terms of entirely new models, but the ripple effects of DeepSeek's R1 are still being felt. The community is buzzing with attempts to replicate and build upon its groundbreaking reasoning capabilities. It feels like everyone is scrambling to figure out the "secret sauce" behind R1's "aha moment," and we're seeing some fascinating results.
Jina Node-DeepResearch and HuggingFace OpenDeepResearch
The community wasted no time trying to replicate OpenAI's Deep Research agent.
* Jina AI released "Node-DeepResearch" (X, Github), claiming it follows the "query, search, read, reason, repeat" formula. As I mentioned on the show, "I believe that they're wrong" about it being just a simple loop. O3 is likely a fine-tuned model, but still, it's awesome to see the open-source community tackling this so quickly!
* Hugging Face also announced "OpenDeepResearch" (X), aiming to create a truly open research agent. Clementine Fourrier, one of the authors behind the GAIA benchmark (which measures research agent capabilities), is involved, so this is definitely one to watch.
Deep Agent - R1 -V: These folks claim to have replicated DeepSeek R1's "aha moment" – where the model realizes its own mistakes and rethinks its approach – for just $3! (X, Github)
As I said on the show, "It's crazy, right? Nothing costs $3 anymore. Like it's half a coffee in Starbucks." They even claim you can witness this "aha moment" in a VLM. Open source is moving fast.
Krutim - Krutim 2 12B, Chitrath VLM, Embeddings and more from India: This Indian AI lab released a whole suite of models, including an improved LLM (Krutim 2), a VLM (Chitrarth 1), a speech-language model (Dhwani 1), an embedding model (Vyakhyarth 1), and a translation model (Krutrim Translate 1). (X, Blog, HF) They even developed a benchmark called "BharatBench" to evaluate Indic AI performance.
However, the community was quick to point out some… issues. As Harveen Singh Chadha pointed out on X, it seems like they blatantly copied IndicTrans, an MIT-licensed model, without even mentioning it. Not cool, Krutim. Not cool.
AceCoder: This project focuses on using reinforcement learning (RL) to improve code models. (X) They claim to have created a pipeline to automatically generate high-quality, verifiable code training data.
They trained a reward model (AceCode-RM) that significantly boosts the performance of Llama-3.1 and Qwen2.5-coder-7B. They even claim you can skip SFT training for code models by using just 80 steps of R1-style training!
Simple Scaling - S1 - R1: This paper (Paper) showcases the power of quality over quantity. They fine-tuned Qwen2.5-32B-Instruct on just 1,000 carefully curated reasoning examples and matched the performance of o1-preview!
They also introduced a technique called "budget forcing," allowing the model to control its test-time compute and improve performance. As I mentioned, Niklas Mengenhoff, who worked at Allen and was previously on the show, is involved. This is one to really pay attention to – it shows that you don't need massive datasets to achieve impressive reasoning capabilities.
Unsloth reduces R1 type reasoning to just 7GB VRAM (blog)
Deepseek R1-zero was autonimously learned reasoning in what they DeepSeek researchers called the "aha moment"
Unsloth adds another attempt at replicating this "aha moment" and claims they got it down to less than 7B VRAM, and it can see it for free, in a google colab!
This magic could be recreated through GRPO, a RL algorithm that optimizes responses efficiently without requiring a value function, unlike Proximal Policy Optimization (PPO) which relies on a value function
How it works:1. The model generates groups of responses.2. Each response is scored based on correctness or another metric created by some set reward function rather than an LLM reward model.3 . The average score of the group is computed.4. Each response's score is compared to the group average.5. The model is reinforced to favor higher-scoring responses.
Tools
A few new and interesting tools were released this week as well:
* Replit rebuilt and released their replit agents in an IOS app and released it free for many users. It can now build mini apps for you on the fly! (Replit)
* Mistral has ios / android apps with the new release of LeChat (X)
* Molly Cantillon released RPLY, which sits on your mac, and drafts replies to your messages. I installed it during writing this newsletter, and I did not expect it to hit this hard, it reviewed and summarized my texting patterns to "sound like me" and the models sit on device as well. Very very well crafted tool and the best thing it runs models on device if you want!
* Github Copilot announced agentic workflows and next line editing, which are cursor features. To try them out you have to download VSCode insiders. They also added Gemini 2.0 (Blog)
The AI field moves SO fast, I had to update the content of the newsletter around 5 times while writing it as new things kept getting released!
This was a Banger week that started with o3-mini and deep research, continued with Gemini 2.0 and OmniHuman and "ended" with Mistral x Cerebras, Github copilot agents, o3-mini updated COT reasoning traces and a bunch more!
AI doesn't stop, and we're here weekly to cover all of this, and give you guys the highlights, but also go deep!
Really appreciate Derya's appearance on the show this week, please give him a follow and see you guys next week!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Jan 30 - DeepSeek vs. Nasdaq, R1 everywhere, Qwen Max & Video, Open Source SUNO, Goose agents & more AI news
30 jan· ThursdAI - The top AI news from the past week
Hey folks, Alex here 👋
It’s official—grandmas (and the entire stock market) now know about DeepSeek. If you’ve been living under an AI rock, DeepSeek’s new R1 model just set the world on fire, rattling Wall Street (causing the biggest monetary loss for any company, ever!) and rocketing to #1 on the iOS App Store. This week’s ThursdAI show took us on a deep (pun intended) dive into the dizzying whirlwind of open-source AI breakthroughs, agentic mayhem, and big-company cat-and-mouse announcements. Grab your coffee (or your winter survival kit if you’re in Canada), because in true ThursdAI fashion, we’ve got at least a dozen bombshells to cover—everything from brand-new Mistral to next-gen vision models, new voice synthesis wonders, and big moves from Meta and OpenAI.
We’re also talking “reasoning mania,” as the entire industry scrambles to replicate, dethrone, or ride the coattails of the new open-source champion, R1. So buckle up—because if the last few days are any indication, 2025 is officially the Year of Reasoning (and quite possibly, the Year of Agents, or both!)
Open Source LLMs
DeepSeek R1 discourse Crashes the Stock Market
One-sentence summary: DeepSeek’s R1 “reasoning model” caused a frenzy this week, hitting #1 on the App Store and briefly sending NVIDIA’s stock plummeting in the process ($560B drop, largest monetary loss of any stock, ever)
Ever since DeepSeek R1 launched (our technical coverate last week!), the buzz has been impossible to ignore—everyone from your mom to your local barista has heard the name. The speculation? DeepSeek’s new architecture apparently only cost $5.5 million to train, fueling the notion that high-level AI might be cheaper than Big Tech claims. Suddenly, people wondered if GPU manufacturers like NVIDIA might see shrinking demand, and the stock indeed took a short-lived 17% tumble. On the show, I joked, “My mom knows about DeepSeek—your grandma probably knows about it, too,” underscoring just how mainstream the hype has become.
Not everyone is convinced the cost claims are accurate. Even Dario Amodei of Anthropic weighed in with a blog post arguing that DeepSeek’s success increases the case for stricter AI export controls.
Public Reactions
* Dario Amodei’s blogIn “On DeepSeek and Export Controls,” Amodei argues that DeepSeek’s efficient scaling exemplifies why democratic nations need to maintain a strategic leadership edge—and enforce export controls on advanced AI chips. He sees Chinese breakthroughs as proof that AI competition is global and intense.
* OpenAI Distillation EvidenceOpenAI mentioned it found “distillation traces” of GPT-4 inside R1’s training data. Hypocrisy or fair game? On ThursdAI, the panel mused that “everyone trains on everything,” so perhaps it’s a moot point.
* Microsoft ReactionMicrosoft wasted no time, swiftly adding DeepSeek to Azure—further proof that corporations want to harness R1’s reasoning power, no matter where it originated.
* Government reactedEven officials in the government, David Sacks, US incoming AI & Crypto czar, discussed the fact that DeepSeek did "distillation" using the term somewhat incorrectly, and presidet Trump was asked about it.
* API OutagesDeepSeek’s own API has gone in and out this week, apparently hammered by demand (and possibly DDoS attacks). Meanwhile, GPU clouds like Groq are showing up to accelerate R1 at 300 tokens/second, for those who must have it right now.
We've seen so many bad takes on the topic, from seething cope takes, to just gross misunderstandings from gov officials confusing the ios App with the OSS models, folks throwing conspiracy theories into the mix, claiming that $5.5M sum was a PsyOp. The fact of the matter is, DeepSeek R1 is an incredible model, and is now powering (just a week later), multiple products (more on this below) and experiences already, while pushing everyone else to compete (and give us reasoning models!)
Open Thoughts Reasoning Dataset
One-sentence summary: A community-led effort, “Open Thoughts,” released a new large-scale dataset (OpenThoughts-114k) of chain-of-thought reasoning data, fueling the open-source drive toward better reasoning models.
Worried about having enough labeled “thinking” steps to train your own reasoner? Fear not. The OpenThoughts-114k dataset aggregates chain-of-thought prompts and responses—114,000 of them—for building or fine-tuning reasoning LLMs. It’s now on Hugging Face for your experimentation pleasure. The ThursdAI panel pointed out how crucial these large, openly available reasoning datasets are. As Wolfram put it, “We can’t rely on the big labs alone. More open data means more replicable breakouts like DeepSeek R1.”
Mistral Small 2501 (24B)
One-sentence summary: Mistral AI returns to the open-source spotlight with a 24B model that fits on a single 4090, scoring over 81% on MMLU while under Apache 2.0.
Long rumored to be “going more closed,” Mistral AI re-emerged this week with Mistral-Small-24B-Instruct-2501—an Apache 2.0 licensed LLM that runs easily on a 32GB VRAM GPU. That 81% MMLU accuracy is no joke, putting it well above many 30B–70B competitor models. It was described as “the perfect size for local inference and a real sweet spot,” noting that for many tasks, 24B is “just big enough but not painfully heavy.” Mistral also finally started comparing themselves to Qwen 2.5 in official benchmarks—a big shift from their earlier reluctance, which we applaud!
Berkeley TinyZero & RAGEN (R1 Replications)
One-sentence summary: Two separate projects (TinyZero and RAGEN) replicated DeepSeek R1-zero’s reinforcement learning approach, showing you can get “aha” reasoning moments with minimal compute.
If you were wondering whether R1 is replicable: yes, it is. Berkeley’s TinyZero claims to have reproduced the core R1-zero behaviors for $30 using a small 3B model. Meanwhile, the RAGEN project aims to unify RL + LLM + Agents with a minimal codebase. While neither replication is at R1-level performance, they demonstrate how quickly the open-source community pounces on new methods. “We’re now seeing those same ‘reasoning sparks’ in smaller reproductions,” said Nisten. “That’s huge.”
Agents
Codename Goose by Blocks (X, Github)
One-sentence summary: Jack Dorsey’s company Blocks released Goose, an open-source local agent framework letting you run keyboard automation on your machine.
Ever wanted your AI to press keys and move your mouse in real time? Goose does exactly that with AppleScript, memory extensions, and a fresh approach to “local autonomy.” On the show, I tried Goose, but found it occasionally “went rogue, trying to delete my WhatsApp chats.” Security concerns aside, Goose is significant: it’s an open-source playground for agent-building. The plugin system includes integration with Git, Figma, a knowledge graph, and more. If nothing else, Goose underscores how hot “agentic” frameworks are in 2025.
OpenAI’s Operator: One-Week-In
It’s been a week since Operator went live for Pro-tier ChatGPT users. “It’s the first agent that can run for multiple minutes without bugging me every single second,”. Yet it’s still far from perfect—captchas, login blocks, and repeated confirmations hamper tasks. The potential, though, is enormous: “I asked Operator to gather my X.com bookmarks and generate a summary. It actually tried,” I shared, “but it got stuck on three links and needed constant nudges.” Simon Willison added that it’s “a neat tech demo” but not quite a productivity boon yet. Next steps? Possibly letting the brand-new reasoning models (like O1 Pro Reasoning) do the chain-of-thought under the hood.
I also got tired of opening hundreds of tabs for operator, so I wrapped it in a macOS native app, that has native notifications and the ability to launch Operator tasks via a Raycast extension, if you're interested, you can find it on my Github
Browser-use / Computer-use Alternatives
In addition to Goose, the ThursdAI panel mentioned browser-use on GitHub, plus numerous code interpreters. So far, none blow minds in reliability. But 2025 is evidently “the year of agents.” If you’re itching to offload your browsing or file editing to an AI agent, expect to tinker, troubleshoot, and yes, babysit. The show consensus? “It’s not about whether agents are coming, it’s about how soon they’ll become truly robust,” said Wolfram.
Big CO LLMs + APIs
Alibaba Qwen2.5-Max (& Hidden Video Model) (Try It)
One-sentence summary: Alibaba’s Qwen2.5-Max stands toe-to-toe with GPT-4 on some tasks, while also quietly rolling out video-generation features.
While Western media fixates on DeepSeek, Alibaba’s Qwen team quietly dropped the Qwen2.5-Max MoE model. It clocks in at 69% on MMLU-Pro—beating some OpenAI or Google offerings—and comes with a 1-million-token context window. And guess what? The official Chat interface apparently does hidden video generation, though Alibaba hasn’t publicized it in the English internet.
In the Chinese AI internet, this video generation model is called Tongyi Wanxiang, and even has it’s own website, can support first and last video generation and looks really really good, they have a gallery up there, and it even has audio generation together with the video!
This one was an img2video, but the movements are really natural!
Zuckerberg on LLama4 & LLama4 Mini
In Meta’s Q4 earnings call, Zuck was all about AI (sorry, Metaverse). He declared that LLama4 is in advanced training, with a smaller “LLama4 Mini” finishing pre-training. More importantly, a “reasoning model” is in the works, presumably influenced by the mania around R1. Some employees had apparently posted on Blind about “Why are we paying billions for training if DeepSeek did it for $5 million?” so the official line is that Meta invests heavily for top-tier scale.
Zuck also doubled down on saying "Glasses are the perfect form factor for AI" , to which I somewhat agree, I love my Meta Raybans, I just wished they were integrated into the ios more.
He also boasted about their HUGE datacenters, called Mesa, spanning the size of Manhattan, being built for the next step of AI.
(Nearly) Announced: O3-Mini
Right before the ThursdAI broadcast, rumors swirled that OpenAI might reveal O3-Mini. It’s presumably GPT-4’s “little cousin” with a fraction of the cost. Then…silence. Sam Altman also mentioned they would be bringing o3-mini by end of January, but maybe the R1 crazyness made them keep working on it and training it a bit more? 🤔
In any case, we'll cover it when it launches.
This Week’s Buzz
We're still the #1 spot on Swe-bench verified with W&B programmer, and our CTO, Shawn Lewis, chatted with friends of the pod Swyx and Alessio about it! (give it a listen)
We have two upcoming events:
* AI.engineer in New York (Feb 20–22). Weights & Biases is sponsoring, and I will broadcast ThursdAI live from the summit. If you snagged a ticket, come say hi—there might be a cameo from the “Chef.”
* Toronto Tinkerer Workshops (late February) in the University of Toronto. The Canadian AI scene is hot, so watch out for sign-ups (will add them to the show next week)
Weights & Biases also teased more features for LLM observability (Weave) and reminded folks of their new suite of evaluation tools. “If you want to know if your AI is actually better, you do evals,” Alex insisted. For more details, check out wandb.me/weave or tune into the next ThursdAI.
Vision & Video
DeepSeek - Janus Pro - multimodal understanding and image gen unified (1.5B & 7B)
One-sentence summary: Alongside R1, DeepSeek also released Janus Pro, a unified model for image understanding and generation (like GPT-4’s rumored image abilities).
DeepSeek apparently never sleeps. Janus Pro is MIT-licensed, 7B parameters, and can both parse images (SigLIP) and generate them (LlamaGen). The model outperforms DALL·E 3 and SDXL! on some internal benchmarks—though at a modest 384×384 resolution.
NVIDIA’s Eagle 2 Redux
One-sentence summary: NVIDIA re-released the Eagle 2 vision-language model with 4K resolution support, after mysteriously yanking it a week ago.
Eagle 2 is back, boasting multi-expert architecture, 16k context, and high-res video analysis. Rumor says it competes with big 70B param vision models at only 9B. But it’s overshadowed by Qwen2.5-VL (below). Some suspect NVIDIA is aiming to outdo Meta’s open-source hold on vision—just in time to keep GPU demand strong.
Qwen 2.5 VL - SOTA oss vision model is here
One-sentence summary: Alibaba’s Qwen 2.5 VL model claims state-of-the-art in open-source vision, including 1-hour video comprehension and “object grounding.”
The Qwen team didn’t hold back: “It’s the final boss for vision,” joked Nisten. Qwen 2.5 VL uses advanced temporal modeling for video and can handle complicated tasks like OCR or multi-object bounding boxes.
Featuring advances in precise object localization, video temporal understanding and agentic capabilities for computer, this is going to be the model to beat!
Voice & Audio
YuE 7B (Open “Suno”)
Ever dream of building the next pop star from your code editor? YuE 7B is your ticket. This model, now under Apache 2.0, supports chain-of-thought creation of structured songs, multi-lingual lyrics, and references. It’s slow to infer, but it’s arguably the best open music generator so far in the open source
What's more, they have changed the license to apache 2.0 just before we went live, so you can use YuE everywhere!
Refusion Fuzz
Refusion, a new competitor to paid audio models like Suno and Udio, launched “Fuzz,” offering free music generation online until GPU meltdown.
If you want to dabble in “prompt to jam track” without paying, check out Refusion Fuzz. Will it match the emotional nuance of premium services like 11 Labs or Hauio? Possibly not. But hey, free is free.
Tools (that have integrated R1)
Perplexity with R1
In the perplexity.ai chat, you can choose “Pro with R1” if you pay for it, harnessing R1’s improved reasoning to parse results. For some, it’s a major upgrade to “search-based question answering.” Others prefer it to paying for O1 or GPT-4.
I always check Perplexity if it knows what the latest episode of ThursdAI was, and it's the first time it did a very good summary! I legit used it to research the show this week! It's really something.
Meanwhile, Exa.ai also integrated a “DeepSeek Chat” for your agent-based workflows. Like it or not, R1 is everywhere.
Krea.ai with DeepSeek
Our friends at Krea, an AI art tool aggregator, also hopped on the R1 bandwagon for chat-based image searching or generative tasks.
Conclusion
Key Takeaways
* DeepSeek’s R1 has massive cultural reach, from #1 apps to spooking the stock market.
* Reasoning mania is upon us—everyone from Mistral to Meta wants a piece of the logic-savvy LLM pie.
* Agentic frameworks like Goose, Operator, and browser-use are proliferating, though they’re still baby-stepping through reliability issues.
* Vision and audio get major open-source love, with Janus Pro, Qwen 2.5 VL, YuE 7B, and more reshaping multimodality.
* Big Tech (Meta, Alibaba, OpenAI) is forging ahead with monster models, multi-billion-dollar projects, and cross-country expansions in search of the best reasoning approaches.
At this point, it’s not even about where the next big model drop comes from; it’s about how quickly the entire ecosystem can adopt (or replicate) that new methodology.
Stay tuned for next week’s ThursdAI, where we’ll hopefully see new updates from OpenAI (maybe O3-Mini?), plus the ongoing race for best agent. Also, catch us at AI.engineer in NYC if you want to talk shop or share your own open-source success stories. Until then, keep calm and carry on training.
TLDR
* Open Source LLMs
* DeepSeek Crashes the Stock Market: Did $5.5M training or hype do it?
* Open Thoughts Reasoning Dataset OpenThoughts-114k (X, HF)
* Mistral Small 2501 (24B, Apache 2.0) (HF)
* Berkeley TinyZero & RAGEN (R1-Zero Replications) (Github, WANDB)
* Allen Institute - Tulu 405B (Blog, HF)
* Agents
* Goose by Blocks (local agent framework) - (X, Github)
* Operator (OpenAI) – One-Week-In (X)
* Browser-use - oss version of Operator (Github)
* Big CO LLMs + APIs
* Alibaba Qwen2.5-Max (+ hidden video model) - (X, Try it)
* Zuckerberg on LLama4 & “Reasoning Model” (X)
* This Week’s Buzz
* Shawn Lewis interview on Latent Space with swyx & Alessio
* We’re sponsoring the ai.engineer upcoming summit in NY (Feb 19-22), come say hi!
* After that, we’ll host 2 workshops with AI Tinkerers Toronto (Feb 23-24), make sure you’re signed up to Toronto Tinkerers to receive the invite (we were sold out quick last time!)
* Vision & Video
* DeepSeek Janus Pro - 1.5B and 7B (Github, Try It)
* NVIDIA Eagle 2 (Paper, Model, Demo)
* Alibaba Qwen 2.5 VL (Project, HF, Github, Try It)
* Voice & Audio
* Yue 7B (Open Suno) - (Demo, HF, Github)
* Refusion Fuzz (free for now)
* Tools
* Perplexity with R1 (choose Pro with R1)
* Exa integrated R1 for free (demo)
* Participants
* Alex Volkov (@altryne)
* Wolfram Ravenwolf (@WolframRvnwlf)
* Nisten Tahiraj (@nisten )
* LDJ (@ldjOfficial)
* Simon Willison (@simonw)
* W&B Weave (@weave_wb)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent, $500B AI manhattan project, ByteDance UI-Tars, new Gemini Thinker & more AI news
24 jan· ThursdAI - The top AI news from the past week
What a week, folks, what a week! Buckle up, because ThursdAI just dropped, and this one's a doozy. We're talking seismic shifts in the open source world, a potential game-changer from DeepSeek AI that's got everyone buzzing, and oh yeah, just a casual $500 BILLION infrastructure project announcement. Plus, OpenAI finally pulled the trigger on "Operator," their agentic browser thingy – though getting it to actually operate proved to be a bit of a live show adventure, as you'll hear.
This week felt like one of those pivotal moments in AI, a real before-and-after kind of thing. DeepSeek's R1 hit the open source scene like a supernova, and suddenly, top-tier reasoning power is within reach for anyone with a Mac and a dream. And then there's OpenAI's Operator, promising to finally bridge the gap between chat and action. Did it live up to the hype? Well, let's just say things got interesting.
As I’m writing this, White House just published that an Executive Order on AI was just signed and published as well, what a WEEK.
Open Source AI Goes Nuclear: DeepSeek R1 is HERE!
Hold onto your hats, open source AI just went supernova! This week, the Chinese Whale Bros – DeepSeek AI, that quant trading firm turned AI powerhouse – dropped a bomb on the community in the best way possible: R1, their reasoning model, is now open source under the MIT license! As I said on the show, "Open source AI has never been as hot as this week."
This isn't just a model, folks. DeepSeek unleashed a whole arsenal: two full-fat R1 models (DeepSeek R1 and DeepSeek R1-Zero), and a whopping six distilled finetunes based on Qwen (1.5B, 7B, 14B, and 32B) and Llama (8B, 72B).
One stat that blew my mind, and Nisten's for that matter, is that DeepSeek-R1-Distill-Qwen-1.5B, the tiny 1.5 billion parameter model, is outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks! "This 1.5 billion parameter model that now does this. It's absolutely insane," I exclaimed on the show. We're talking 28.9% on AIME and 83.9% on MATH. Let that sink in. A model you can probably run on your phone is schooling the big boys in math.
License-wise, it's MIT, which as Nisten put it, "MIT is like a jailbreak to the whole legal system, pretty much. That's what most people don't realize. It's like, this is, it's not my problem. You're a problem now." Basically, do whatever you want with it. Distill it, fine-tune it, build Skynet – it's all fair game.
And the vibes? "Vibes are insane," as I mentioned on the show. Early benchmarks are showing R1 models trading blows with o1-preview and o1-mini, and even nipping at the heels of the full-fat o1 in some areas. Check out these numbers:
And the price? Forget about it. We're talking 50x cheaper than o1 currently. DeepSeek R1 API is priced at $0.14 / 1M input tokens and $2.19 / 1M output tokens, compared to OpenAI's o1 at $15.00 / 1M input and a whopping $60.00 / 1M output. Suddenly, high-quality reasoning is democratized.
LDJ highlighted the "aha moment" in DeepSeek's paper, where they talk about how reinforcement learning enabled the model to re-evaluate its approach and "think more." It seems like simple RL scaling, combined with a focus on reasoning, is the secret sauce. No fancy Monte Carlo Tree Search needed, apparently!
But the real magic of open source is what the community does with it. Pietro Schirano joined us to talk about his "Retrieval Augmented Thinking" (RAT) approach, where he extracts the thinking process from R1 and transplants it to other models. "And what I found out is actually by doing so, you may even like smaller, quote unquote, you know, less intelligent model actually become smarter," Pietro explained. Frankenstein models, anyone? (John Lindquist has a tutorial on how to do it here)
And then there's the genius hack from Voooogel, who figured out how to emulate a "reasoning_effort" knob by simply replacing the "end" token with "Wait, but". "This tricks the model into keeps thinking," as I described it. Want your AI to really ponder the meaning of life (or just 1+1)? Now you can, thanks to open source tinkering.
Georgi Gerganov, the legend behind llama.cpp, even jumped in with a two-line snippet to enable speculative decoding, boosting inference speeds on the 32B model on my Macbook from a sluggish 5 tokens per second to a much more respectable 10-11 tokens per second. Open source collaboration at its finest and it's only going to get better!
Thinking like a Neurotic
Many people really loved the way R1 thinks, and what I found astonishing is that I just sent "hey" and the thinking went into a whole 5 paragraph debate of how to answer, a user on X answered with "this is Woody Allen-level of Neurotic" which... nerd sniped me so hard! I used Hauio Audio (which is great!) and ByteDance latentSync and gave R1 a voice! It's really something when you hear it's inner monologue being spoken out like this!
ByteDance Enters the Ring: UI-TARS Controls Your PC
Not to be outdone in the open source frenzy, ByteDance, the TikTok behemoth, dropped UI-TARS, a set of models designed to control your PC. And they claim SOTA performance, beating even Anthropic's computer use models and, in some benchmarks, GPT-4o and Claude.
UI-TARS comes in 2B, 7B, and 72B parameter flavors, and ByteDance even released desktop apps for Mac and PC to go along with them. "They released an app it's called the UI TARS desktop app. And then, this app basically allows you to Execute the mouse clicks and keyboard clicks," I explained during the show.
While I personally couldn't get the desktop app to work flawlessly (quantization issues, apparently), the potential is undeniable. Imagine open source agents controlling your computer – the possibilities are both exciting and slightly terrifying. As Nisten wisely pointed out, "I would use another machine. These things are not safe to tell people. I might actually just delete your data if you, by accident." Words to live by, folks.
LDJ chimed in, noting that UI-TARS seems to excel particularly in operating system-level control tasks, while OpenAI's leaked "Operator" benchmarks might show an edge in browser control. It's a battle for desktop dominance brewing in open source!
Noting that the common benchmark between Operator and UI-TARS is OSWorld, UI-Tars launched with a SOTA
Humanity's Last Exam: The Benchmark to Beat
Speaking of benchmarks, a new challenger has entered the arena: Humanity's Last Exam (HLE). A cool new unsaturated bench of 3,000 challenging questions across over a hundred subjects, crafted by nearly a thousand subject matter experts from around the globe. "There's no way I'm answering any of those myself. I need an AI to help me," I confessed on the show.
And guess who's already topping the HLE leaderboard? You guessed it: DeepSeek R1, with a score of 9.4%! "Imagine how hard this benchmark is if the top reasoning models that we have right now... are getting less than 10 percent completeness on this," MMLU and Math are getting saturated? HLE is here to provide a serious challenge. Get ready to hear a lot more about HLE, folks.
Big CO LLMs + APIs: Google's Gemini Gets a Million-Token Brain
While open source was stealing the show, the big companies weren't completely silent. Google quietly dropped an update to Gemini Flash Thinking, their experimental reasoning model, and it's a big one. We're talking 1 million token context window and code execution capabilities now baked in!
"This is Google's scariest model by far ever built ever," Nisten declared. "This thing, I don't like how good it is. This smells AGI-ish" High praise, and high concern, coming from Nisten! Benchmarks are showing significant performance jumps in math and science evals, and the speed is, as Nisten put it, "crazy usable." They have enabled the whopping 1M context window for the new Gemini Flash 2.0 Thinking Experimental (long ass name, maybe let's call it G1?) and I agree, it's really really good!
And unlike some other reasoning models cough OpenAI cough, Gemini Flash Thinking shows you its thinking process! You can actually see the chain of thought unfold, which is incredibly valuable for understanding and debugging. Google's Gemini is quietly becoming a serious contender in the reasoning race (especially with Noam Shazeer being responsible for it!)
OpenAI's "Operator" - Agents Are (Almost) Here
The moment we were all waiting for (or at least, I was): OpenAI finally unveiled Operator, their first foray into Level 3 Autonomy - agentic capabilities with ChatGPT. Sam Altman himself hyped it up as "AI agents are AI systems that can do work for you. You give them a task and they go off and do it." Sounds amazing, right?
Operator is built on a new model called CUA (Computer Using Agent), trained on top of GPT-4, and it's designed to control a web browser in the cloud, just like a human would, using screen pixels, mouse, and keyboard. "This is just using screenshots, no API, nothing, just working," one of the OpenAI presenters emphasized.
They demoed Operator booking restaurant reservations on OpenTable, ordering groceries on Instacart, and even trying to buy Warriors tickets on StubHub (though that demo got a little… glitchy). The idea is that you can delegate tasks to Operator, and it'll go off and handle them in the background, notifying you when it needs input or when the task is complete.
As I'm writing these words, I have an Operator running trying to get me some fried rice, and another one trying to book me a vacation with kids over the summer, find some options and tell me what it found.
Benchmarks-wise, OpenAI shared numbers for OSWorld (38.1%) and WebArena (58.1%), showing Operator outperforming previous SOTA but still lagging behind human performance. "Still a way to go," as they admitted. But the potential is massive.
The catch? Operator is initially launching in the US for Pro users only, and even then, it wasn't exactly smooth sailing. I immediately paid the $200/mo to try it out (pro mode didn't convince me, unlimited SORA videos didn't either, operator definitely did, SOTA agents from OpenAI is definitely something I must try!) and my first test? Writing a tweet 😂 Here's a video of that first attempt, which I had to interrupt 1 time.
But hey, it's a "low key research preview" right? And as Sam Altman said, "This is really the beginning of this product. This is the beginning of our step into Agents Level 3 on our tiers of AGI" Agentic ChatGPT is coming, folks, even if it's taking a slightly bumpy route to get here.
BTW, while I'm writing these words, Operator is looking up some vacation options for me and is sending me notifications about them, what a world and we've only just started 2025!
Project Stargate: $500 Billion for AI Infrastructure
If R1 and Operator weren't enough to make your head spin, how about a $500 BILLION "Manhattan Project for AI infrastructure"? That's exactly what OpenAI, SoftBank, and Oracle announced this week: Project Stargate.
"This is insane," I exclaimed on the show. "Power ups for the United States compared to like, other, other countries, like 500 billion commitment!" We're talking about a massive investment in data centers, power plants, and everything else needed to fuel the AI revolution. 2% of the US GDP, according to some estimates!
Larry Ellison even hinted at using this infrastructure for… curing cancer with personalized vaccines. Whether you buy into that or not, the scale of this project is mind-boggling. As LDJ explained, "It seems like it is very specifically for open AI. Open AI will be in charge of operating it. And yeah, it's, it sounds like a smart way to actually kind of get funding and investment for infrastructure without actually having to give away open AI equity."
And in a somewhat related move, Microsoft, previously holding exclusive cloud access for OpenAI, has opened the door for OpenAI to potentially run on other clouds, with Microsoft's approval if "they cannot meet demant". Is AGI closer than we think? Sam Altman himself downplayed the hype, tweeting, "Twitter hype is out of control again. We're not going to deploy AGI next month, nor have we built it. We have some very cool stuff for you, but please chill and cut your expectations a hundred X."
But then he drops Operator and a $500 billion infrastructure bomb in the same week and announces that o3-mini is going to be available for the FREE tier of chatGPT.
Sure, Sam, we're going to chill... yeah right.
This Week's Buzz at Weights & Biases: SWE-bench SOTA!
Time for our weekly dose of Weights & Biases awesomeness! This week, our very own CTO, Shawn Lewis, broke the SOTA on SWE-bench Verified! That's right, W&B Programmer, Shawn's agentic framework built on top of o1, achieved a 64.6% solve rate on this notoriously challenging coding benchmark.
Shawn detailed his journey in a blog post, highlighting the importance of iteration and evaluation – powered by Weights & Biases Weave, naturally. He ran over 1000 evaluations to reach this SOTA result! Talk about eating your own dogfood!
REMOVING BARRIERS TO AMERICAN LEADERSHIP IN ARTIFICIAL INTELLIGENCE - Executive order
Just now as I’m editing the podcast, President Trump signed into effect an executive order for AI, and here are the highlights.
- Revokes existing AI policies that hinder American AI innovation
- Aims to solidify US as global leader in AI for human flourishing, competitiveness, and security
- Directs development of an AI Action Plan within 180 days
- Requires immediate review and revision of conflicting policies
- Directs OMB to revise relevant memos within 60 days
- Preserves agency authority and OMB budgetary functions
- Consistent with applicable law and funding availability
- Seeks to remove barriers and strengthen US AI dominance
This marks such a significant pivot into AI acceleration, removing barriers, acknowledging that AI is a huge piece of our upcoming future and that US really needs to innovate here, become the global leader, and remove regulation and obstacles. The folks that work on this behind the scenes, Sriram Krishan (previously A16Z) and David Sacks, are starting to get into the government and implement those policies, so we’re looking forward to what will come form that!
Vision & Video: Nvidia's Vanishing Eagle 2 & Hugging Face's Tiny VLM
In the world of vision and video, Nvidia teased us with Eagle 2, a series of frontier vision-language models promising 4K HD input, long-context video, and grounding capabilities with some VERY impressive evals. Weights were released, then…yanked. "NVIDIA released Eagle 2 and then yanked it back. So I don't know what's that about," I commented. Mysterious Nvidia strikes again.
On the brighter side, Hugging Face released SmolVLM, a truly tiny vision-language model, coming in at just 256 million and 500 million parameters. "This tiny model that runs in like one gigabyte of RAM or some, some crazy things, like a smart fridge" I exclaimed, impressed. The 256M model even outperforms their previous 80 billion parameter Idefics model from just 17 months ago. Progress marches on, even in tiny packages.
AI Art & Diffusion & 3D: Hunyuan 3D 2.0 is State of the Art
For the artists and 3D enthusiasts, Tencent's Hunyuan 3D 2.0 dropped this week, and it's looking seriously impressive. "Just look at this beauty," I said, showcasing a generated dragon skull. "Just look at this."
Hunyuan 3D 2.0 boasts two models: Hunyuan3D-DiT-v2-0 for shape generation and Hunyuan3D-Paint-v2-0 for coloring. Text-to-3D and image-to-3D workflows are both supported, and the results are, well, see for yourself:
If you're looking to move beyond 2D images, Hunyuan 3D 2.0 is definitely worth checking out.
Tools: ByteDance Clones Cursor with Trae
And finally, in the "tools" department, ByteDance continues its open source blitzkrieg with Trae, a free Cursor competitor. "ByteDance drops Trae, which is a cursor competitor, which is free for now" I announced on the show, so if you don't mind your code being sent to... china somewhere, and can't afford Cursor, this is not a bad alternative!
Trae imports your Cursor configs, supports Claude 3.5 and GPT-4o, and offers a similar AI-powered code editing experience, complete with chat interface and "builder" (composer) mode. The catch? Your code gets sent to a server in China. If you're okay with that, you've got yourself a free Cursor alternative. "If you're okay with your like code getting shared with ByteDance, this is a good option for you," I summarized. Decisions, decisions.
Phew! That was a whirlwind tour through another insane week in AI. From DeepSeek R1's open source reasoning revolution to OpenAI's Operator going live, and Google's million-token Gemini brain, it's clear that the pace of innovation is showing no signs of slowing down.
Open source is booming, agents are inching closer to reality, and the big companies are throwing down massive infrastructure investments. We're accelerating as f**k, and it's only just beginning, hold on to your butts.
Make sure to dive into the show notes below for all the links and details on everything we covered. And don't forget to give R1 a spin – and maybe try out that "reasoning_effort" hack. Just don't blame me if your AI starts having an existential crisis.
And as a final thought, channeling my inner Woody Allen-R1, "Don't overthink too much. enjoy our one. Enjoy the incredible things we received this week from open source."
See you all next week for more ThursdAI madness! And hopefully, by then, Operator will actually be operating. 😉
TL;DR and show notes
* Open Source LLMs
* DeepSeek R1 - MIT licensed SOTA open source reasoning model (HF, X)
* ByteDance UI-TARS - PC control models (HF, Github )
* HLE - Humanity's Last Exam benchmark (Website)
* Big CO LLMs + APIs
* SoftBank, Oracle, OpenAI Stargate Project - $500B AI infrastructure (OpenAI Blog)
* Google Gemini Flash Thinking 01-21 - 1M context, Code execution, Better Evals (X)
* OpenAI Operator - Agentic browser in ChatGPT Pro operator.chatgpt.com
* Anthropic launches citations in API (blog)
* Perplexity SonarPRO Search API and an Android AI assistant (X)
* This weeks Buzz 🐝
* W&B broke SOTA SWE-bench verified (W&B Blog)
* Vision & Video
* HuggingFace SmolVLM - Tiny VLMs - runs even on WebGPU (HF)
* AI Art & Diffusion & 3D
* Hunyuan 3D 2.0 - SOTA open-source 3D (HF)
* Tools
* ByteDance Trae - Cursor competitor (Trae AI: https://trae.ai/)
* Show Notes:
* Pietro Skirano RAT - Retrieval augmented generation (X)
* Run DeepSeek with more “thinking” script (Gist)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
📆 ThursdAI - Jan 16, 2025 - Hailuo 4M context LLM, SOTA TTS in browser, OpenHands interview & more AI news
17 jan· ThursdAI - The top AI news from the past week
Hey everyone, Alex here 👋
Welcome back, to an absolute banger of a week in AI releases, highlighted with just massive Open Source AI push. We're talking a MASSIVE 4M context window context window model from Hailuo (remember when a jump from 4K to 16K seemed like a big deal?), a 8B omni model that lets you livestream video and glimpses of Agentic ChatGPT?
This week's ThursdAI was jam-packed with so much open source goodness that the big companies were practically silent. But don't worry, we still managed to squeeze in some updates from OpenAI and Mistral, along with a fascinating new paper from Sakana AI on self-adaptive LLMs. Plus, we had the incredible Graham Neubig, from All Hands AI, join us to talk about Open Hands (formerly OpenDevin) and even contributed to our free, LLM Evaluation course on Weights & Biases!
Before we dive in, a friend asked me over dinner, what are the main 2 things that happened in AI in 2024, and this week highlights one of those trends. Most of the Open Source is now from China. This week, we got MiniMax from Hailuo, OpenBMB with a new MiniCPM, InternLM came back and most of the rest were Qwen finetunes. Not to mention DeepSeek. Wanted to highlight this significant narrative change and that this is being done despite the chip export restrictions.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Open Source AI & LLMs
MiniMax-01: 4 Million Context, 456 Billion Parameters, and Lightning Attention
This came absolutely from the left field, given that we've seen no prior LLMs from Haulio, the company previously releasing video models with consistent characters. Dropping a massive 456B mixture of experts model (45B active parameters) with such a long context support in open weights, but also with very significant benchmarks that compete with Gpt-4o, Claude and DeekSeek v3 (75.7 MMLU-pro, 89 IFEval, 54.4 GPQA)
They have trained the model on up to 1M context window and then extended it to 4M with ROPE scaling methods (our coverage of RoPE) during Inference. MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE) with 45B active parameters.
I gotta say, when we started talking about context window, imagining a needle in a haystack graph that shows 4M, in the open source seemed far fetched, though we did say that theoretically, there may not be a limit to context windows. I just always expected that limit to be unlocked by transformers alternative architectures like Mamba or other State Space Models.
Vision, API and Browsing - Minimax-VL-01
It feels like such a well rounded and complete release, that it highlights just how mature company that is behind it. They have also released a vision version of this model, that includes a 300M param Vision Transformer on top (trained with 512B vision language tokens) that features dynamic resolution and boasts very high DocVQA and ChartQA scores.
Not only did these two models were released in open weights, they also launched as a unified API endpoint (supporting up to 1M tokens) and it's cheap! $0.2/1M input and $1.1/1M output tokens! AFAIK this is only the 3rd API that supports this much context, after Gemini at 2M and Qwen Turbo that supports 1M as well.
Surprising web browsing capabilities
You can play around with the model on their website, hailuo.ai which also includes web grounding, which I found quite surprising to find out, that they are beating chatGPT and Perplexity on how fast they can find information that just happened that same day! Not sure what search API they are using under the hood but they are very quick.
8B chat with video model omni-model from OpenBMB
OpenBMB has been around for a while and we've seen consistently great updates from them on the MiniCPM front, but this one takes the cake!
This is a complete omni modal end to end model, that does video streaming, audio to audio and text understanding, all on a model that can run on an iPad!
They have a demo interface that is very similar to the chatGPT demo from spring of last year, and allows you to stream your webcam and talk to the model, but this is just an 8B parameter model we're talking about! It's bonkers!
They are boasting some incredible numbers, and to be honest, I highly doubt their methodology in textual understanding, because, well, based on my experience alone, this model understands less than close to chatGPT advanced voice mode, but miniCPM has been doing great visual understanding for a while, so ChartQA and DocVQA are close to SOTA.
But all of this doesn't matter, because, I say again, just a little over a year ago, Google released a video announcing these capabilities, having an AI react to a video in real time, and it absolutely blew everyone away, and it was FAKED. And this time a year after, we have these capabilities, essentially, in an 8B model that runs on device 🤯
Voice & Audio
This week seems to be very multimodal, not only did we get an omni-modal from OpenBMB that can speak, and last week's Kokoro still makes a lot of waves, but this week there were a lot of voice updates as well
Kokoro.js - run the SOTA open TTS now in your browser
Thanks to friend of the pod Xenova (and the fact that Kokoro was released with ONNX weights), we now have kokoro.js, or npm -i kokoro-js if you will.
This allows you to install and run Kokoro, the best tiny TTS model, completely within your browser, with a tiny 90MB download and it sounds really good (demo here)
Hailuo T2A - Emotional text to speech + API
Hailuo didn't rest on their laurels of releasing a huge context window LLM, they also released a new voice framework (tho not open sourced) this week, and it sounds remarkably good (competing with 11labs)
They have all the standard features like Voice Cloning, but claim to have a way to preserve the emotional undertones of a voice. They also have 300 voices to choose from and professional effects applied on the fly, like acoustics or telephone filters. (Remember, they have a video model as well, so assuming that some of this is to for the holistic video production)
What I specifically noticed is their "emotional intelligence system" that's either automatic or can be selected from a dropdown. I also noticed their "lax" copyright restrictions, as one of the voices that was called "Imposing Queen" sounded just like a certain blonde haired heiress to the iron throne from a certain HBO series.
When I generated a speech worth of that queen, I noticed that the emotion in that speech sounded very much like an actress would read them, and unlike any old TTS, just listen to it in the clip above, I don't remember getting TTS outputs with this much emotion from anything, maybe outside of advanced voice mode! Quite impressive!
This Weeks Buzz from Weights & Biases - AGENTS!
Breaking news from W&B as our CTO just broke SWE-bench Verified SOTA, with his own o1 agentic framework he calls W&B Programmer 😮 at 64.6% of the issues!
Shawn describes how he achieved this massive breakthrough here and we'll be publishing more on this soon, but the highlight for me is he ran over 900 evaluations during the course of this, and tracked all of them in Weave!
We also have an upcoming event in NY, on Jan 22nd, if you're there, come by and learn how to evaluate your AI agents, RAG applications and hang out with our team! (Sign up here)
Big Companies & APIs
OpenAI adds chatGPT tasks - first agentic feature with more to come!
We finally get a glimpse of an agentic chatGPT, in the form of scheduled tasks! Deployed to all users, it is now possible to select gpt-4o with tasks, and schedule tasks in the future.
You can schedule them in natural language, and then will execute a chat (and maybe perform a search or do a calculation) and then send you a notification (and an email!) when the task is done!
A bit underwhelming at first, as I didn't really find a good use for this yet, I don't doubt that this is just a building block for something more Agentic to come that can connect to my email or calendar and do actual tasks for me, not just... save me from typing the chatGPT query at "that time"
Mistral CodeStral 25.01 - a new #1 coding assistant model
An updated Codestral was released at the beginning of the week, and TBH I've never seen the vibes split this fast on a model.
While it's super exciting that Mistral is placing a coding model at #1 on the LMArena CoPilot's arena, near Claude 3.5 and DeepSeek, the fact that this new model is not released weights is really a bummer (especially as a reference to the paragraph I mentioned on top)
We seem to be closing down on OpenSource in the west, while the Chinese labs are absolutely crushing it (while also releasing in the open, including Weights, Technical papers).
Mistral has released this model in API and via a collab with the Continue dot dev coding agent, but they used to be the darling of the open source community by releasing great models!
Also notable, a very quick new benchmark post release was dropped that showed a significant difference between their reported benchmarks and how it performs on Aider polyglot
There was way more things for this week than we were able to cover, including a new and exciting transformers squared new architecture from Sakana, a new open source TTS with voice cloning and a few other open source LLMs, one of which cost only $450 to train! All the links in the TL;DR below!
TL;DR and show notes
* Open Source LLMs
* MiniMax-01 from Hailuo - 4M context 456B (45B A) LLM (Github, HF, Blog, Report)
* Jina - reader V2 model - HTML 2 Markdown/JSON (HF)
* InternLM3-8B-Instruct - apache 2 License (Github, HF)
* OpenBMB - MiniCPM-o 2.6 - Multimodal Live Streaming on Your Phone (HF, Github, Demo)
* KyutAI - Helium-1 2B - Base (X, HF)
* Dria-Agent-α - 3B model that outputs python code (HF)
* Sky-T1, a ‘reasoning’ AI model that can be trained for less than $450 (blog)
* Big CO LLMs + APIs
* OpenAI launches ChatGPT tasks (X)
* Mistral - new CodeStral 25.01 (Blog, no Weights)
* Sakana AI - Transformer²: Self-Adaptive LLMs (Blog)
* This weeks Buzz
* Evaluating RAG Applications Workshop - NY, Jan 22, W&B and PineCone (Free Signup)
* Our evaluations course is going very strong! (chat w/ Graham Neubig) (https://wandb.me/evals-t)
* Vision & Video
* Luma releases Ray2 video model (Web)
* Voice & Audio
* Hailuo T2A-01-HD - Emotions Audio Model from Hailuo (X, Try It)
* OuteTTS 0.3 - 1B & 500M - zero shot voice cloning model (HF)
* Kokoro.js - 80M SOTA TTS in your browser! (X, Github, try it )
* AI Art & Diffusion & 3D
* Black Forest Labs - Finetuning for Flux Pro and Ultra via API (Blog)
* Show Notes and other Links
* Hosts - Alex Volkov (@altryne), Wolfram RavenWlf (@WolframRvnwlf), Nisten Tahiraj (@nisten)
* Guest - Graham Neubig (@gneubig) from All Hands AI (@allhands_ai)
* Graham’s mentioned Agents blogpost - 8 things that agents can do right now
* Projects - Open Hands (previously Open Devin) - Github
* Germany meetup in Cologne (here)
* Toronto Tinkerer Meetup *Sold OUT* (Here)
* YaRN conversation we had with the Authors (coverage)
See you folks next week! Have a great long weekend if you’re in the US 🫡
Please help to promote the podcast and newsletter by sharing with a friend!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
- Lyssna Lyssna igen Fortsätt Lyssnar...
- Lyssna senare Lyssna senare
Visa fler

Avsnitt

📆 ThursdAI - May 29 - DeepSeek R1 Resurfaces, VEO3 viral moments, Opus 4 a week after, Flux Kontext image editing & more AI news

📆 ThursdAI - Veo3, Google IO25, Claude 4 Opus/Sonnet, OpenAI x Jony Ive, Codex, Copilot Agent - INSANE AI week

📆 ThursdAI - May 15 - Genocidal Grok, ChatGPT 4.1, AM-Thinking, Distributed LLM training & more AI news

ThursdAI - May 8th - new Gemini pro, Mistral Medium, OpenAI restructuring, HeyGen Realistic Avatars & more AI news

📆 ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate, RIP GPT4, LlamaCon, LMArena in hot water & more AI news

ThursdAI - Apr 23rd - GPT Image & Grok APIs Drop, OpenAI ❤️ OS? Dia's Wild TTS & Building Better Agents!

ThursdAI - Apr 17 - OpenAI o3 is SOTA llm, o4-mini, 4.1, mini, nano, G. Flash 2.5, Kling 2.0 and 🐬 Gemma? Huge AI week + A2A protocol interview

💯 ThursdAI - 100th episode 🎉 - Meta LLama 4, Google tons of updates, ChatGPT memory, WandB MCP manifesto & more AI news

ThursdAI - Apr 3rd - OpenAI Goes Open?! Gemini Crushes Math, AI Actors Go Hollywood & MCP, Now with Observability?

📆 ThursdAI - Mar 27 - Gemini 2.5 Takes #1, OpenAI Goes Ghibli, DeepSeek V3 Roars, Qwen Omni, Wandb MCP & more AI news

ThursdAI - Mar 20 - OpenAIs new voices, Mistral Small, NVIDIA GTC recap & Nemotron, new SOTA vision from Roboflow & more AI news

📆 ThursdAI Turns Two! 🎉 Gemma 3, Gemini Native Image, new OpenAI tools, tons of open source & more AI news

ThursdAI - Mar 6, 2025 - Alibaba's R1 Killer QwQ, Exclusive Google AI Mode Chat, and MCP fever sweeping the community!

📆 Feb 27, 2025 - GPT-4.5 Drops TODAY?!, Claude 3.7 Coding BEAST, Grok's Unhinged Voice, Humanlike AI voices & more AI news

📆 ThursdAI - Feb 20 - Live from AI Eng in NY - Grok 3, Unified Reasoners, Anthropic's Bombshell, and Robot Handoffs!

📆 ThursdAI - Feb 13 - my Personal Rogue AI, DeepHermes, Fast R1, OpenAI Roadmap / RIP GPT6, new Claude & Grok 3 imminent?

📆 ThursdAI - Feb 6 - OpenAI DeepResearch is your personal PHD scientist, o3-mini & Gemini 2.0, OmniHuman-1 breaks reality & more AI news

📆 ThursdAI - Jan 30 - DeepSeek vs. Nasdaq, R1 everywhere, Qwen Max & Video, Open Source SUNO, Goose agents & more AI news

📆 ThursdAI - Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent, $500B AI manhattan project, ByteDance UI-Tars, new Gemini Thinker & more AI news

📆 ThursdAI - Jan 16, 2025 - Hailuo 4M context LLM, SOTA TTS in browser, OpenHands interview & more AI news