Avsnitt

  • Hey everyone, it's Alex (still traveling!), and oh boy, what a week again! Advanced Voice Mode is finally here from OpenAI, Google updated their Gemini models in a huge way and then Meta announced MultiModal LlaMas and on device mini Llamas (and we also got a "better"? multimodal from Allen AI called MOLMO!)

    From Weights & Biases perspective, our hackathon was a success this weekend, and then I went down to Menlo Park for my first Meta Connect conference, full of news and updates and will do a full recap here as well.

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Overall another crazy week in AI, and it seems that everyone is trying to rush something out the door before OpenAI Dev Day next week (which I'll cover as well!) Get ready, folks, because Dev Day is going to be epic!

    TL;DR of all topics covered:

    * Open Source LLMs

    * Meta llama 3.2 Multimodal models (11B & 90B) (X, HF, try free)

    * Meta Llama 3.2 tiny models 1B & 3B parameters (X, Blog, download)

    * Allen AI releases MOLMO - open SOTA multimodal AI models (X, Blog, HF, Try It)

    * Big CO LLMs + APIs

    * OpenAI releases Advanced Voice Mode to all & Mira Murati leaves OpenAI

    * Google updates Gemini 1.5-Pro-002 and 1.5-Flash-002 (Blog)

    * This weeks Buzz

    * Our free course is LIVE - more than 3000 already started learning how to build advanced RAG++

    * Sponsoring tonights AI Tinkerers in Seattle, if you're in Seattle, come through for my demo

    * Voice & Audio

    * Meta also launches voice mode (demo)

    * Tools & Others

    * Project ORION - holographic glasses are here! (link)

    Meta gives us new LLaMas and AI hardware

    LLama 3.2 Multimodal 11B and 90B

    This was by far the biggest OpenSource release of this week (tho see below, may not be the "best"), as a rumored released finally came out, and Meta has given our Llama eyes! Coming with 2 versions (well 4 if you count the base models which they also released), these new MultiModal LLaMas were trained with an adapter architecture, keeping the underlying text models the same, and placing a vision encoder that was trained and finetuned separately on top.

    LLama 90B is among the best open-source mutlimodal models available

    โ€” Meta team at launch

    These new vision adapters were trained on a massive 6 Billion images, including synthetic data generation by 405B for questions/captions, and finetuned with a subset of 600M high quality image pairs.

    Unlike the rest of their models, the Meta team did NOT claim SOTA on these models, and the benchmarks are very good but not the best we've seen (Qwen 2 VL from a couple of weeks ago, and MOLMO from today beat it on several benchmarks)

    With text-only inputs, the Llama 3.2 Vision models are functionally the same as the Llama 3.1 Text models; this allows the Llama 3.2 Vision models to be a drop-in replacement for Llama 3.1 8B/70B with added image understanding capabilities.

    Seems like these models don't support multi image or video as well (unlike Pixtral for example) nor tool use with images.

    Meta will also release these models on meta.ai and every other platform, and they cited a crazy 500 million monthly active users of their AI services across all their apps ๐Ÿคฏ which marks them as the leading AI services provider in the world now.

    Llama 3.2 Lightweight Models (1B/3B)

    The additional and maybe more exciting thing that we got form Meta was the introduction of the small/lightweight models of 1B and 3B parameters.

    Trained on up to 9T tokens, and distilled / pruned from larger models, these are aimed for on-device inference (and by device here we mean from laptops to mobiles to soon... glasses? more on this later)

    In fact, meta released an IOS demo, that runs these models, takes a group chat, summarizes and calls the calendar tool to schedule based on the conversation, and all this happens on device without the info leaving to a larger model.

    They have also been able to prune down the LLama-guard safety model they released to under 500Mb and have had demos of it running on client side and hiding user input on the fly as the user types something bad!

    Interestingly, here too, the models were not SOTA, even in small category, with tiny models like Qwen 2.5 3B beating these models on many benchmarks, but they are outlining a new distillation / pruning era for Meta as they aim for these models to run on device, eventually even glasses (and some said Smart Thermostats)

    In fact they are so tiny, that the communtiy quantized them, released and I was able to download these models, all while the keynote was still going! Here I am running the Llama 3B during the developer keynote!

    Speaking AI - not only from OpenAI

    Zuck also showcased a voice based Llama that's coming to Meta AI (unlike OpenAI it's likely a pipeline of TTS/STT) but it worked really fast and Zuck was able to interrupt it.

    And they also showed a crazy animated AI avatar of a creator, that was fully backed by Llama, while the human creator was on stage, Zuck chatted with his avatar and reaction times were really really impressive.

    AI Hardware was glasses all along?

    Look we've all seen the blunders of this year, the Humane AI Ping, the Rabbit R1 (which sits on my desk and I haven't recharged in two months) but maybe Meta is the answer here?

    Zuck took a bold claim that glasses are actually the perfect form factor for AI, it sits on your face, sees what you see and hears what you hear, and can whisper in your ear without disrupting the connection between you and your conversation partner.

    They haven't announced new Meta Raybans, but did update the lineup with a new set of transition lenses (to be able to wear those glasses inside and out) and a special edition clear case pair that looks very sleek + new AI features like memories to be able to ask the glasses "hey Meta where did I park" or be able to continue the conversation. I had to get me a pair of this limited edition ones!

    Project ORION - first holographic glasses

    And of course, the biggest announcement of the Meta Connect was the super secret decade old project of fully holographic AR glasses, which they called ORION.

    Zuck introduced these as the most innovative and technologically dense set of glasses in the world. They always said the form factor will become just "glasses" and they actually did it ( a week after Snap spectacles ) tho those are not going to get released to any one any time soon, hell they only made a few thousand of these and they are extremely expensive.

    With 70 deg FOV, cameras, speakers and a compute puck, these glasses pack a full day battery with under 100grams of weight, and have a custom silicon, custom displays with MicroLED projector and just... tons of more innovation in there.

    They also come in 3 pieces, the glasses themselves, the compute wireless pack that will hold the LLaMas in your pocket and the EMG wristband that allows you to control these devices using muscle signals.

    These won't ship as a product tho so don't expect to get them soon, but they are real, and will allow Meta to build the product that we will get on top of these by 2030

    AI usecases

    So what will these glasses be able to do? well, they showed off a live translation feature on stage that mostly worked, where you just talk and listen to another language in near real time, which was great. There are a bunch of mixed reality games, you'd be able to call people and see them in your glasses on a virtual screen and soon you'll show up as an avatar there as well.

    The AI use-case they showed beyond just translation was MultiModality stuff, where they had a bunch of ingredients for a shake, and you could ask your AI assistant, which shake you can make with what it sees. Do you really need

    I'm so excited about these to finally come to people I screamed in the audience ๐Ÿ‘€๐Ÿ‘“

    OpenAI gives everyone* advanced voice mode

    It's finally here, and if you're paying for chatGPT you know this, the long announced Advanced Voice Mode for chatGPT is now rolled out to all plus members.

    The new updated since the beta are, 5 new voices (Maple, Spruce, Vale, Arbor and Sol), finally access to custom instructions and memory, so you can ask it to remember things and also to know who you are and your preferences (try saving your jailbreaks there)

    Unfortunately, as predicted, by the time it rolled out to everyone, this feels way less exciting than it did 6 month ago, the model is way less emotional, refuses to sing (tho folks are making it anyway) and generally feels way less "wow" than what we saw. Less "HER" than we wanted for sure Seriously, they nerfed the singing! Why OpenAI, why?

    Pro tip of mine that went viral : you can set your action button on the newer iphones to immediately start the voice conversation with 1 click.

    *This new mode is not available in EU

    This weeks Buzz - our new advanced RAG++ course is live

    I had an awesome time with my colleagues Ayush and Bharat today, after they finally released a FREE advanced RAG course they've been working so hard on for the past few months! Definitely check out our conversation, but better yet, why don't you roll into the course? it's FREE and you'll get to learn about data ingestion, evaluation, query enhancement and more!

    New Gemini 002 is 50% cheaper, 2x faster and better at MMLU-pro

    It seems that every major lab (besides Anthropic) released a big thing this week to try and get under Meta's skin?

    Google announced an update to their Gemini Pro/Flash models, called 002, which is a very significant update!

    Not only are these models 50% cheaper now (Pro price went down by 50% on

  • Hey folks, Alex here, back with another ThursdAI recap โ€“ and let me tell you, this week's episode was a whirlwind of open-source goodness, mind-bending inference techniques, and a whole lotta talk about talking AIs! We dove deep into the world of LLMs, from Alibaba's massive Qwen 2.5 drop to the quirky, real-time reactions of Moshi.

    We even got a sneak peek at Nous Research's ambitious new project, Forge, which promises to unlock some serious LLM potential. So grab your pumpkin spice latte (it's that time again isn't it? ๐Ÿ) settle in, and let's recap the AI awesomeness that went down on ThursdAI, September 19th!

    ThursdAI is brought to you (as always) by Weights & Biases, we still have a few spots left in our Hackathon this weekend and our new advanced RAG course is now released and is FREE to sign up!

    TL;DR of all topics + show notes and links

    * Open Source LLMs

    * Alibaba Qwen 2.5 models drop + Qwen 2.5 Math and Qwen 2.5 Code (X, HF, Blog, Try It)

    * Qwen 2.5 Coder 1.5B is running on a 4 year old phone (Nisten)

    * KyutAI open sources Moshi & Mimi (Moshiko & Moshika) - end to end voice chat model (X, HF, Paper)

    * Microsoft releases GRIN-MoE - tiny (6.6B active) MoE with 79.4 MMLU (X, HF, GIthub)

    * Nvidia - announces NVLM 1.0 - frontier class multimodal LLMS (no weights yet, X)

    * Big CO LLMs + APIs

    * OpenAI O1 results from LMsys do NOT disappoint - vibe checks also confirm, new KING llm in town (Thread)

    * NousResearch announces Forge in waitlist - their MCTS enabled inference product (X)

    * This weeks Buzz - everything Weights & Biases related this week

    * Judgement Day (hackathon) is in 2 days! Still places to come hack with us Sign up

    * Our new RAG Course is live - learn all about advanced RAG from WandB, Cohere and Weaviate (sign up for free)

    * Vision & Video

    * Youtube announces DreamScreen - generative AI image and video in youtube shorts ( Blog)

    * CogVideoX-5B-I2V - leading open source img2video model (X, HF)

    * Runway, DreamMachine & Kling all announce text-2-video over API (Runway, DreamMachine)

    * Runway announces video 2 video model (X)

    * Tools

    * Snap announces their XR glasses - have hand tracking and AI features (X)

    Open Source Explosion!

    ๐Ÿ‘‘ Qwen 2.5: new king of OSS llm models with 12 model releases, including instruct, math and coder versions

    This week's open-source highlight was undoubtedly the release of Alibaba's Qwen 2.5 models. We had Justin Lin from the Qwen team join us live to break down this monster drop, which includes a whopping seven different sizes, ranging from a nimble 0.5B parameter model all the way up to a colossal 72B beast! And as if that wasn't enough, they also dropped Qwen 2.5 Coder and Qwen 2.5 Math models, further specializing their LLM arsenal. As Justin mentioned, they heard the community's calls for 14B and 32B models loud and clear โ€“ and they delivered! "We do not have enough GPUs to train the models," Justin admitted, "but there are a lot of voices in the community...so we endeavor for it and bring them to you." Talk about listening to your users!

    Trained on an astronomical 18 trillion tokens (thatโ€™s even more than Llama 3.1 at 15T!), Qwen 2.5 shows significant improvements across the board, especially in coding and math. They even open-sourced the previously closed-weight Qwen 2 VL 72B, giving us access to the best open-source vision language models out there. With a 128K context window, these models are ready to tackle some serious tasks. As Nisten exclaimed after putting the 32B model through its paces, "It's really practicalโ€ฆI was dumping in my docs and my code base and then like actually asking questions."

    It's safe to say that Qwen 2.5 coder is now the best coding LLM that you can use, and just in time for our chat, a new update from ZeroEval confirms, Qwen 2.5 models are the absolute kings of OSS LLMS, beating Mistral large, 4o-mini, Gemini Flash and other huge models with just 72B parameters ๐Ÿ‘

    Moshi: The Chatty Cathy of AI

    We've covered Moshi Voice back in July, and they have promised to open source the whole stack, and now finally they did! Including the LLM and the Mimi Audio Encoder!

    This quirky little 7.6B parameter model is a speech-to-speech marvel, capable of understanding your voice and responding in kind. It's an end-to-end model, meaning it handles the entire speech-to-speech process internally, without relying on separate speech-to-text and text-to-speech models.

    While it might not be a logic genius, Moshi's real-time reactions are undeniably uncanny. Wolfram Ravenwolf described the experience: "It's uncanny when you don't even realize you finished speaking and it already starts to answer." The speed comes from the integrated architecture and efficient codecs, boasting a theoretical response time of just 160 milliseconds!

    Moshi uses (also open sourced) Mimi neural audio codec, and achieves 12.5 Hz representation with just 1.1 kbps bandwidth.

    You can download it and run on your own machine or give it a try here just don't expect a masterful conversationalist hehe

    Gradient-Informed MoE (GRIN-MoE): A Tiny Titan

    Just before our live show, Microsoft dropped a paper on GrinMoE, a gradient-informed Mixture of Experts model. We were lucky enough to have the lead author, Liyuan Liu (aka Lucas), join us impromptu to discuss this exciting development. Despite having only 6.6B active parameters (16 x 3.8B experts), GrinMoE manages to achieve remarkable performance, even outperforming larger models like Phi-3 on certain benchmarks. It's a testament to the power of clever architecture and training techniques. Plus, it's open-sourced under the MIT license, making it a valuable resource for the community.

    NVIDIA NVLM: A Teaser for Now

    NVIDIA announced NVLM 1.0, their own set of multimodal LLMs, but alas, no weights were released. Weโ€™ll have to wait and see how they stack up against the competition once they finally let us get our hands on them. Interestingly, while claiming SOTA on some vision tasks, they haven't actually compared themselves to Qwen 2 VL, which we know is really really good at vision tasks ๐Ÿค”

    Nous Research Unveils Forge: Inference Time Compute Powerhouse (beating o1 at AIME Eval!)

    Fresh off their NousCon event, Karan and Shannon from Nous Research joined us to discuss their latest project, Forge. Described by Shannon as "Jarvis on the front end," Forge is an inference engine designed to push the limits of whatโ€™s possible with existing LLMs. Their secret weapon? Inference-time compute. By implementing sophisticated techniques like Monte Carlo Tree Search (MCTS), Forge can outperform larger models on complex reasoning tasks beating OpenAI's o1-preview at the AIME Eval, competition math benchmark, even with smaller, locally runnable models like Hermes 70B. As Karan emphasized, โ€œWeโ€™re actually just scoring with Hermes 3.1, which is available to everyone already...we can scale it up to outperform everything on math, just using a system like this.โ€

    Forge isn't just about raw performance, though. It's built with usability and transparency in mind. Unlike OpenAI's 01, which obfuscates its chain of thought reasoning, Forge provides users with a clear visual representation of the model's thought process. "You will still have access in the sidebar to the full chain of thought," Shannon explained, adding, โ€œThereโ€™s a little visualizer and it will show you the trajectory through the treeโ€ฆ youโ€™ll be able to see exactly what the model was doing and why the node was selected.โ€ Forge also boasts built-in memory, a graph database, and even code interpreter capabilities, initially supporting Python, making it a powerful platform for building complex LLM applications.

    Forge is currently in a closed beta, but a waitlist is open for eager users. Karan and Shannon are taking a cautious approach to the rollout, as this is Nous Researchโ€™s first foray into hosting a product. For those lucky enough to gain access, Forge offers a tantalizing glimpse into the future of LLM interaction, promising greater transparency, improved reasoning, and more control over the model's behavior.

    For ThursdAI readers early, here's a waitlist form to test it out!

    Big Companies and APIs: The Reasoning Revolution

    OpenAIโ€™s 01: A New Era of LLM Reasoning

    The big story in the Big Tech world is OpenAI's 01. Since we covered it live last week as it dropped, many of us have been playing with these new reasoning models, and collecting "vibes" from the community. These models represent a major leap in reasoning capabilities, and the results speak for themselves.

    01 Preview claimed the top spot across the board on the LMSys Arena leaderboard, demonstrating significant improvements in complex tasks like competition math and coding. Even the smaller 01 Mini showed impressive performance, outshining larger models in certain technical areas. (and the jump in ELO score above the rest in MATH is just incredible to see!) and some folks made this video viral, of a PHD candidate reacting to 01 writing in 1 shot, code that took him a year to write, check it out, itโ€™s priceless.

    One key aspect of 01 is the concept of โ€œinference-time computeโ€. As Noam Brown from OpenAI calls it, this represents a "new scaling paradigm", allowing the model to spend more time โ€œthinkingโ€ during inference, leading to significantly improved performance on reasoning tasks. The implications of this are vast, opening up the possibility of LLMs tackling long-horizon problems in areas like drug discovery and physics.

    However, the opacity surrounding 01โ€™s chain of thought reasoning being hidden/obfuscated and the ban on users asking about it was a major point of contention at least within the ThursdAI chat. As Wolfram Ravenwolf put it, "The AI gives you an answer and you can't even ask how it got there. That is the wrong direction." as he was referring to the fact that not only is asking about the reasoning impossible, some folks were actually getting threatening emails and getting banned from using the product all together ๐Ÿ˜ฎ

    This Week's Buzz: Hackathons and RAG Courses!

    We're almost ready to host our Weights & Biases Judgment Day Hackathon (LLMs as a judge, anyone?) with a few spots left, so if you're reading this and in SF, come hang out with us!

    And the main thing I gave an update about is our Advanced RAG course, packed with insights from experts at Weights & Biases, Cohere, and Weaviate. Definitely check those out if you want to level up your LLM skills (and it's FREE in our courses academy!)

    Vision & Video: The Rise of Generative Video

    Generative video is having its moment, with a flurry of exciting announcements this week. First up, the open-source CogVideoX-5B-I2V, which brings accessible image-to-video capabilities to the masses. It's not perfect, but being able to generate video on your own hardware is a game-changer.

    On the closed-source front, YouTube announced the integration of generative AI into YouTube Shorts with their DreamScreen feature, bringing AI-powered video generation to a massive audience. We also saw API releases from three leading video model providers: Runway, DreamMachine, and Kling, making it easier than ever to integrate generative video into applications. Runway even unveiled a video-to-video model, offering even more control over the creative process, and it's wild, check out what folks are doing with video-2-video!

    One last thing here, Kling is adding a motion brush feature to help users guide their video generations, and it just looks so awesome I wanted to show you

    Whew! That was one hell of a week, tho from the big companies perspective, it was a very slow week, getting a new OSS king, an end to end voice model and a new hint of inference platform from Nous, and having all those folks come to the show was awesome!

    If you're reading all the way down to here, it seems that you like this content, why not share it with 1 or two friends? ๐Ÿ‘‡ And as always, thank you for reading and subscribing! ๐Ÿซถ

    P.S - Iโ€™m traveling for the next two weeks, and this week the live show was live recorded from San Francisco, thanks to my dear friends swyx & Alessio for hosting my again in their awesome Latent Space pod studio at Solaris SF!



    This is a public episode. If youโ€™d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Saknas det avsnitt?

    Klicka här för att uppdatera flödet manuellt.

  • March 14th, 2023 was the day ThursdAI was born, it was also the day OpenAI released GPT-4, and I jumped into a Twitter space and started chaotically reacting together with other folks about what a new release of a paradigm shifting model from OpenAI means, what are the details, the new capabilities. Today, it happened again!

    Hey, it's Alex, I'm back from my mini vacation (pic after the signature) and boy am I glad I decided to not miss September 12th! The long rumored ๐Ÿ“ thinking model from OpenAI, dropped as breaking news in the middle of ThursdAI live show, giving us plenty of time to react live!

    But before this, we already had an amazing show with some great guests! Devendra Chaplot from Mistral came on and talked about their newly torrented (yeah they did that again) Pixtral VLM, their first multi modal! , and then I had the honor to host Steven Johnson and Raiza Martin from NotebookLM team at Google Labs which shipped something so uncannily good, that I legit said "holy fu*k" on X in a reaction!

    So let's get into it (TL;DR and links will be at the end of this newsletter)

    OpenAI o1, o1 preview and o1-mini, a series of new "reasoning" models

    This is it folks, the strawberries have bloomed, and we finally get to taste them. OpenAI has released (without a waitlist, 100% rollout!) o1-preview and o1-mini models to chatGPT and API (tho only for tier-5 customers) ๐Ÿ‘ and are working on releasing 01 as well.

    These are models that think before they speak, and have been trained to imitate "system 2" thinking, and integrate chain-of-thought reasoning internally, using Reinforcement Learning and special thinking tokens, which allows them to actually review what they are about to say before they are saying it, achieving remarkable results on logic based questions.

    Specifically you can see the jumps in the very very hard things like competition math and competition code, because those usually require a lot of reasoning, which is what these models were trained to do well.

    New scaling paradigm

    Noam Brown from OpenAI calls this a "new scaling paradigm" and Dr Jim Fan explains why, with this new way of "reasoning", the longer the model thinks - the better it does on reasoning tasks, they call this "test-time compute" or "inference-time compute" as opposed to compute that was used to train the model. This shifting of computation down to inference time is the essence of the paradigm shift, as in, pre-training can be very limiting computationally as the models scale in size of parameters, they can only go so big until you have to start building out a huge new supercluster of GPUs to host the next training run (Remember Elon's Colossus from last week?).

    The interesting thing to consider here is, while current "thinking" times are ranging between a few seconds to a minute, imagine giving this model hours, days, weeks to think about new drug problems, physics problems ๐Ÿคฏ.

    Prompting o1

    Interestingly, a new prompting paradigm has also been introduced. These models now have CoT (think "step by step") built-in, so you no longer have to include it in your prompts. By simply switching to o1-mini, most users will see better results right off the bat. OpenAI has worked with the Devin team to test drive these models, and these folks found that asking the new models to just give the final answer often works better and avoids redundancy in instructions.

    The community of course will learn what works and doesn't in the next few hours, days, weeks, which is why we got 01-preview and not the actual (much better) o1.

    Safety implications and future plans

    According to Greg Brokman, this inference time compute also greatly helps with aligning the model to policies, giving it time to think about policies at length, and improving security and jailbreak preventions, not only logic.

    The folks at OpenAI are so proud of all of the above that they have decided to restart the count and call this series o1, but they did mention that they are going to release GPT series models as well, adding to the confusing marketing around their models.

    Open Source LLMs

    Reflecting on Reflection 70B

    Last week, Reflection 70B was supposed to launch live on the ThursdAI show, and while it didn't happen live, I did add it in post editing, and sent the newsletter, and packed my bag, and flew for my vacation. I got many DMs since then, and at some point couldn't resist checking and what I saw was complete chaos, and despite this, I tried to disconnect still until last night.

    So here's what I could gather since last night. The claims of a llama 3.1 70B finetune that Matt Shumer and Sahil Chaudhary from Glaive beating Sonnet 3.5 are proven false, nobody was able to reproduce those evals they posted and boasted about, which is a damn shame.

    Not only that, multiple trusted folks from our community, like Kyle Corbitt, Alex Atallah have reached out to Matt in to try to and get to the bottom of how such a thing would happen, and how claims like these could have been made in good faith. (or was there foul play)

    The core idea of something like Reflection is actually very interesting, but alas, the inability to replicate, but also to stop engaging with he community openly (I've reached out to Matt and given him the opportunity to come to the show and address the topic, he did not reply), keep the model on hugging face where it's still trending, claiming to be the world's number 1 open source model, all these smell really bad, despite multiple efforts on out part to give the benefit of the doubt here.

    As for my part in building the hype on this (last week's issues till claims that this model is top open source model), I addressed it in the beginning of the show, but then twitter spaces crashed, but unfortunately as much as I'd like to be able to personally check every thing I cover, I often have to rely on the reputation of my sources, which is easier with established big companies, and this time this approached failed me.

    This weeks Buzzzzzz - One last week till our hackathon!

    Look at this point, if you read this newsletter and don't know about our hackathon, then I really didn't do my job prompting it, but it's coming up, September 21-22 ! Join us, it's going to be a LOT of fun!

    ๐Ÿ–ผ๏ธ Pixtral 12B from Mistral

    Mistral AI burst onto the scene with Pixtral, their first multimodal model! Devendra Chaplot, research scientist at Mistral, joined ThursdAI to explain their unique approach, ditching fixed image resolutions and training a vision encoder from scratch.

    "We designed this from the ground up to...get the most value per flop," Devendra explained. Pixtral handles multiple images interleaved with text within a 128k context window - a far cry from the single-image capabilities of most open-source multimodal models. And to make the community erupt in thunderous applause (cue the clap emojis!) they released the 12 billion parameter model under the ultra-permissive Apache 2.0 license. You can give Pixtral a whirl on Hyperbolic, HuggingFace, or directly through Mistral.

    DeepSeek 2.5: When Intelligence Per Watt is King

    Deepseek 2.5 launched amid the reflection news and did NOT get the deserved attention it.... deserves. It folded (no deprecated) Deepseek Coder into 2.5 and shows incredible metrics and a truly next-gen architecture. "It's like a higher order MOE", Nisten revealed, "which has this whole like pile of brain and it just like picks every time, from that." ๐Ÿคฏ. DeepSeek 2.5 achieves maximum "intelligence per active parameter"

    Google's turning text into AI podcast for auditory learners with Audio Overviews

    Today I had the awesome pleasure of chatting with Steven Johnson and Raiza Martin from the NotebookLM team at Google Labs. NotebookLM is a research tool, that if you haven't used, you should definitely give it a spin, and this week they launched something I saw in preview and was looking forward to checking out and honestly was jaw-droppingly impressed today.

    NotebookLM allows you to upload up to 50 "sources" which can be PDFs, web links that they will scrape for you, documents etc' (no multimodality so far) and will allow you to chat with them, create study guides, dive deeper and add notes as you study.

    This week's update allows someone who doesn't like reading, to turn all those sources into a legit 5-10 minute podcast, and that sounds so realistic, that I was honestly blown away. I uploaded a documentation of fastHTML in there.. and well hear for yourself

    The conversation with Steven and Raiza was really fun, podcast definitely give it a listen!

    Not to mention that Google released (under waitlist) another podcast creating tool called illuminate, that will convert ArXiv papers into similar sounding very realistic 6-10 minute podcasts!

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    There are many more updates from this week, there was a whole Apple keynote I missed, which had a new point and describe feature with AI on the new iPhones and Apple Intelligence, Google also released new DataGemma 27B, and more things in TL'DR which are posted here in raw format

    See you next week ๐Ÿซก Thank you for being a subscriber, weeks like this are the reason we keep doing this! ๐Ÿ”ฅ Hope you enjoy these models, leave in comments what you think about them

    TL;DR in raw format

    * Open Source LLMs

    * Reflect on Reflection 70B & Matt Shumer (X, Sahil)

    * Mixtral releases Pixtral 12B - multimodal model (X, try it)

    * Pixtral is really good at OCR says swyx

    * Interview with Devendra Chaplot on ThursdAI

    * Initial reports of Pixtral beating GPT-4 on WildVision arena from AllenAI

    * JinaIA reader-lm-0.5b and reader-lm-1.5b (X)

    * ZeroEval updates

    * Deepseek 2.5 -

    * Deepseek coder is now folded into DeepSeek v2.5

    * 89 HumanEval (up from 84 from deepseek v2)

    * 9 on MT-bench

    * Google - DataGemma 27B (RIG/RAG) for improving results

    * Retrieval-Interleaved Generation

    * ๐Ÿค– DataGemma: AI models that connect LLMs to Google's Data Commons

    * ๐Ÿ“Š Data Commons: A vast repository of trustworthy public data

    * ๐Ÿ” Tackling AI hallucination by grounding LLMs in real-world data

    * ๐Ÿ” Two approaches: RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation)

    * ๐Ÿ” Preliminary results show enhanced accuracy and reduced hallucinations

    * ๐Ÿ”“ Making DataGemma open models to enable broader adoption

    * ๐ŸŒ Empowering informed decisions and deeper understanding of the world

    * ๐Ÿ” Ongoing research to refine the methodologies and scale the work

    * ๐Ÿ” Integrating DataGemma into Gemma and Gemini AI models

    * ๐Ÿค Collaborating with researchers and developers through quickstart notebooks

    * Big CO LLMs + APIs

    * Apple event

    * Apple Intelligence - launching soon

    * Visual Intelligence with a dedicated button

    * Google Illuminate - generate arXiv paper into multiple speaker podcasts (Website)

    * 5-10 min podcasts

    * multiple speakers

    * any paper

    * waitlist

    * has samples

    * sounds super cool

    * Google NotebookLM is finally available - multi modal research tool + podcast (NotebookLM)

    * Has RAG like abilities, can add sources from drive or direct web links

    * Currently not multimodal

    * Generation of multi speaker conversation about this topic to present it, sounds really really realistic

    * Chat with Steven and Raiza

    * OpenAI reveals new o1 models, and launches o1 preview and o1-mini in chat and API (X, Blog)

    * Trained with RL to think before it speaks with special thinking tokens (that you pay for)

    * new scaling paradigm

    * This weeks Buzz

    * Vision & Video

    * Adobe announces Firefly video model (X)

    * Voice & Audio

    * Hume launches EVI 2 (X)

    * Fish Speech 1.4 (X)

    * Instant Voice Cloning

    * Ultra low latenc

    * ~1GB model weights

    * LLaMA-Omni, a new model for speech interaction (X)

    * Tools

    * New Jina reader (X)



    This is a public episode. If youโ€™d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Welcome back everyone, can you believe it's another ThursdAI already? And can you believe me when I tell you that friends of the pod Matt Shumer & Sahil form Glaive.ai just dropped a LLama 3.1 70B finetune that you can download that will outperform Claude Sonnet 3.5 while running locally on your machine?

    Today was a VERY heavy Open Source focused show, we had a great chat w/ Niklas, the leading author of OLMoE, a new and 100% open source MoE from Allen AI, a chat with Eugene (pico_creator) about RWKV being deployed to over 1.5 billion devices with Windows updates and a lot more.

    In the realm of the big companies, Elon shook the world of AI by turning on the biggest training cluster called Colossus (100K H100 GPUs) which was scaled in 122 days ๐Ÿ˜ฎ and Anthropic announced that they have 500K context window Claude that's only reserved if you're an enterprise customer, while OpenAI is floating an idea of a $2000/mo subscription for Orion, their next version of a 100x better chatGPT?!

    TL;DR

    * Open Source LLMs

    * Matt Shumer / Glaive - Reflection-LLama 70B beats Claude 3.5 (X, HF)

    * Allen AI - OLMoE - first "good" MoE 100% OpenSource (X, Blog, Paper, WandB)

    * RWKV.cpp is deployed with Windows to 1.5 Billion devices

    * MMMU pro - more robust multi disipline multimodal understanding bench (proj)

    * 01AI - Yi-Coder 1.5B and 9B (X, Blog, HF)

    * Big CO LLMs + APIs

    * Replit launches Agent in beta - from coding to production (X, Try It)

    * Ilya SSI announces 1B round from everyone (Post)

    * Cohere updates Command-R and Command R+ on API (Blog)

    * Claude Enterprise with 500K context window (Blog)

    * Claude invisibly adds instructions (even via the API?) (X)

    * Google got structured output finally (Docs)

    * Amazon to include Claude in Alexa starting this October (Blog)

    * X ai scaled Colossus to 100K H100 GPU goes online (X)

    * DeepMind - AlphaProteo new paper (Blog, Paper, Video)

    * This weeks Buzz

    * Hackathon did we mention? We're going to have Eugene and Greg as Judges!

    * AI Art & Diffusion & 3D

    * ByteDance - LoopyAvatar - Audio Driven portait avatars (Page)

    Open Source LLMs

    Reflection Llama-3.1 70B - new ๐Ÿ‘‘ open source LLM from Matt Shumer / GlaiveAI

    This model is BANANAs folks, this is a LLama 70b finetune, that was trained with a new way that Matt came up with, that bakes CoT and Reflection into the model via Finetune, which results in model outputting its thinking as though you'd prompt it in a certain way.

    This causes the model to say something, and then check itself, and then reflect on the check and then finally give you a much better answer. Now you may be thinking, we could do this before, RefleXion (arxiv.org/2303.11366) came out a year ago, so what's new?

    What's new is, this is now happening inside the models head, you don't have to reprompt, you don't even have to know about these techniques! So what you see above, is just colored differently, but all of it, is output by the model without extra prompting by the user or extra tricks in system prompt. the model thinks, plans, does chain of thought, then reviews and reflects, and then gives an answer!

    And the results are quite incredible for a 70B model ๐Ÿ‘‡

    Looking at these evals, this is a 70B model that beats GPT-4o, Claude 3.5 on Instruction Following (IFEval), MATH, GSM8K with 99.2% ๐Ÿ˜ฎ and gets very close to Claude on GPQA and HumanEval!

    (Note that these comparisons are a bit of a apples to ... different types of apples. If you apply CoT and reflection to the Claude 3.5 model, they may in fact perform better on the above, as this won't be counted 0-shot anymore. But given that this new model is effectively spitting out those reflection tokens, I'm ok with this comparison)

    This is just the 70B, next week the folks are planning to drop the 405B finetune with the technical report, so stay tuned for that!

    Kudos on this work, go give Matt Shumer and Glaive AI a follow!

    Allen AI OLMoE - tiny "good" MoE that's 100% open source, weights, code, logs

    We've previously covered OLMO from Allen Institute, and back then it was obvious how much commitment they have to open source, and this week they continued on this path with the release of OLMoE, an Mixture of Experts 7B parameter model (1B active parameters), trained from scratch on 5T tokens, which was completely open sourced.

    This model punches above its weights on the best performance/cost ratio chart for MoEs and definitely highest on the charts of releasing everything.

    By everything here, we mean... everything, not only the final weights file; they released 255 checkpoints (every 5000 steps), the training code (Github) and even (and maybe the best part) the Weights & Biases logs!

    It was a pleasure to host the leading author of the OLMoE paper, Niklas Muennighoff on the show today, so definitely give this segment a listen, he's a great guest and I learned a lot!

    Big Companies LLMs + API

    Anthropic has 500K context window Claude but only for Enterprise?

    Well, this sucks (unless you work for Midjourney, Airtable or Deloitte). Apparently Anthropic has been sitting on Claude that can extend to half a million tokens in the context window, and decided to keep it to themselves and a few trial enterprises, and package it as an Enterprise offering.

    This offering now includes, beyond just the context window, also a native Github integration, and a few key enterprise features like access logs, provisioning and SCIM and all kinds of "procurement and CISO required" stuff enterprises look for.

    To be clear, this is a great move for Anthropic, and this isn't an API tier, this is for their front end offering, including the indredible artifacts tool, so that companies can buy their employees access to Claude.ai and have them be way more productive coding (hence the Github integration) or summarizing (very very) long documents, building mockups and one off apps etc'

    Anthropic is also in the news this week, because Amazon announced that it'll use Claude as the backbone for the smart (or "remarkable" as they call it) Alexa brains coming up in October, which, again, incredible for Anthropic distribution, as there are maybe 100M Alexa users in the world or so.

    Prompt injecting must stop!

    And lastly, there have been mounting evidence, including our own Wolfram Ravenwolf that confirmed it, that Anthropic is prompt injecting additional context into your own prompts, in the UI but also via the API! This is awful practice and if anyone from there reads this newsletter, please stop or at least acknowledge. Claude apparently just... thinks that it's something my users said, when in fact, it's some middle layer of anthropic security decided to just inject some additional words in there!

    XAI turns on the largest training GPU SuperCluster Colossus - 100K H100 GPUS

    This is a huge deal for AI, specifically due to the time this took and the massive massive scale of this SuperCluster. SuperCluster means all these GPUs sit in one datacenter, drawing from the same power-grid and can effectively run single training jobs.

    This took just 122 days for Elon and the XAI team to go from an empty warehouse in Memphis to booting up an incredible 100K H100, and they claim that they will double this capacity by adding 50K H200 in the next few months. As Elon mentioned when they released Grok2, it was trained on 15K, and it matched GPT4!

    Per SemiAnalisys, this new Supercluster can train a GPT-4 level model in just 4 days ๐Ÿคฏ

    XAI was founded a year ago, and by end of this year, they plan for Grok to be the beast LLM in the world, and not just get to GPT-4ish levels, and with this + 6B investment they have taken in early this year, it seems like they are well on track, which makes some folks at OpenAI reportedly worried

    This weeks buzz - we're in SF in less than two weeks, join our hackathon!

    This time I'm very pleased to announce incredible judges for our hackathon, the spaces are limited, but there's still some spaces so please feel free to sign up and join us

    I'm so honored to announce that we'll have Eugene Yan (@eugeneyan), Greg Kamradt (@GregKamradt) and Charles Frye (@charles_irl) on the Judges panel. ๐Ÿคฉ It'll be incredible to have these folks see what hackers come up with, and I'm excited as this comes closer!

    Replit launches Agents beta - a fully integrated code โ†’ deployment agent

    Replit is a great integrated editing environment, with database and production in 1 click and they've had their LLMs trained on a LOT of code helping folks code for a while.

    Now they are launching agents, which seems very smart from them, given that development is much more than just coding. All the recent excitement we see about Cursor, is omitting the fact that those demos are only working for folks who already know how to set up the environment, and then there's the need to deploy to production, maintain.

    Replit has that basically built in, and now their Agent can build a plan and help you build those apps, and "ship" them, while showing you what they are doing. This is massive, and I can't wait to play around with this!

    The additional benefit of Replit is that they nailed the mobile app experience as well, so this now works from mobile, on the go!

    In fact, as I was writing this, I got so excited that I paused for 30 minutes, payed the yearly subscription and decided to give building an app a try!

    The fact that this can deploy and run the server and the frontend, detect errors, fix them, and then also provision a DB for me, provision Stripe, login buttons and everything else is quite insane.

    Can't wait to see what I can spin up with this ๐Ÿ”ฅ (and show all of you!)

    Loopy - Animated Avatars from ByteDance

    A new animated avatar project from folks at ByteDance just dropped, and itโ€™s WAY clearer than anything weโ€™ve seen before, like EMO or anything else. I will just add this video here for you to enjoy and look at the earring movements, vocal cords, eyes, everything!

    I of course wanted to know if Iโ€™ll ever be able to use this, and .. likely no, hereโ€™s the response I got from Jianwen one of the Authors today.

    That's it for this week, we've talked about so much more in the pod, please please check it out.

    As for me, while so many exciting things are happening, I'm going on a small ๐Ÿ๏ธ vacation until next ThursdAI, which will happen on schedule, so planning to decompress and disconnect, but will still be checking in, so if you see things that are interesting, please tag me on X ๐Ÿ™

    P.S - I want to shout out a dear community member that's been doing just that, @PresidentLin has been tagging me in many AI related releases, often way before I would even notice them, so please give them a follow! ๐Ÿซก

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.



    This is a public episode. If youโ€™d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hey, for the least time during summer of 2024, welcome to yet another edition of ThursdAI, also happy skynet self-awareness day for those who keep track :)

    This week, Cerebras broke the world record for fastest LLama 3.1 70B/8B inference (and came on the show to talk about it) Google updated 3 new Geminis, Anthropic artifacts for all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more!

    As always, this weeks newsletter is brought to you by Weights & Biases, did I mention we're doing a hackathon in SF in September 21/22 and that we have an upcoming free RAG course w/ Cohere & Weaviate?

    TL;DR

    * Open Source LLMs

    * Nous DisTrO - Distributed Training (X , Report)

    * NousResearch/ hermes-function-calling-v1 open sourced - (X, HF)

    * LinkedIN Liger-Kernel - OneLine to make Training 20% faster & 60% more memory Efficient (Github)

    * Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (X, Blog)

    * Big CO LLMs + APIs

    * Cerebras launches the fastest AI inference - 447t/s LLama 3.1 70B (X, Blog, Try It)

    * Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (X, Try it)

    * Google adds Gems & Imagen to Gemini paid tier

    * Anthropic artifacts available to all users + on mobile (Blog, Try it)

    * Anthropic publishes their system prompts with model releases (release notes)

    * OpenAI has project Strawberry coming this fall (via The information)

    * This weeks Buzz

    * WandB Hackathon hackathon hackathon (Register, Join)

    * Also, we have a new RAG course w/ Cohere and Weaviate (RAG Course)

    * Vision & Video

    * Zhipu AI CogVideoX - 5B Video Model w/ Less 10GB of VRAM (X, HF, Try it)

    * Qwen-2 VL 72B,7B,2B - new SOTA vision models from QWEN (X, Blog, HF)

    * AI Art & Diffusion & 3D

    * GameNgen - completely generated (not rendered) DOOM with SD1.4 (project)

    * FAL new LORA trainer for FLUX - trains under 5 minutes (Trainer, Coupon for ThursdAI)

    * Tools & Others

    * SimpleBench from AI Explained - closely matches human experience (simple-bench.com)

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Open Source

    Let's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate.

    Nous Research DiStRO + Function Calling V1

    Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DiStRO, which is a breakthrough in distributed training. You see, while LLM training requires a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center.

    Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet work well within the same datacenter, but training across different GPU clouds has been unimaginable until now.

    Enter DiStRo, a new decentralized training by the mad geniuses at Nous Research, in which they reduced the required bandwidth to train a 1.2B param model from 74.4GB to just 86MB (857x)!

    This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency and democratizing LLM training access! So don't sell your old GPUs just yet, someone may just come up with a folding@home but for training the largest open source LLM, and it may just be Nous!

    Nous Research also released their function-calling-v1 dataset (HF) that was used to train Hermes-2, and we had InterstellarNinja who authored that dataset, join the show and chat about it. This is an incredible unlock for the open source community, as function calling become a de-facto standard now. Shout out to the Glaive team as well for their pioneering work that paved the way!

    LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code)

    What if I told you, that whatever software you develop, you can add 1 line of code, and it'll run 20% faster, and require 60% less memory?

    This is basically what Linkedin researches released this week with Liger Kernel, yes you read that right, Linkedin, as in the website you career related posts on!

    "If you're doing any form of finetuning, using this is an instant win"Wing Lian - Axolotl

    This absolutely bonkers improvement in training LLMs, now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the triton kernels, you can see a deep dive here, I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities / intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see!

    Huge shoutout to Byron and team at Linkedin for this unlock, check out their Github if you want to get involved!

    Qwen-2 VL - SOTA image and video understanding + open weights mini VLM

    You may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequeny co-host and we get to hear about their releases as soon as they come out (they seem to be releasing them on thursdays around the time of the live show, I wonder why!)

    But also because, they are committed to open source, and have released 2 models 7B and 2B with complete Apache 2 license!

    First of all, their Qwen-2 VL 72B model, is now SOTA at many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said, this is a 72B param model, that beats GPT-4o on document understanding, on math, on general visual Q&A.

    Additional Capabilities & Smaller models

    They have added new capabilities in these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is the video understanding. These models can now understand up to 20 minutes of video sequences, and it's not just "split the video to 10 frames and do image caption", no, these models understand video progression and if I understand correctly how they do it, it's quite genius.

    They the video embed time progression into the model using a new technique called M-RoPE, which turns the time progression into rotary positional embeddings.

    Now, the 72B model is currently available via API, but we do get 2 new small models with Apache 2 license and they are NOT too shabby either!

    7B parameters (HF) and 2B Qwen-2 VL (HF) are small enough to run completely on your machine, and the 2B parameter, scores better than GPT-4o mini on OCR-bench for example!

    I can't wait to finish writing and go play with these models!

    Big Companies & LLM APIs

    The biggest news this week came from Cerebras System, a relatively unknown company, that shattered the world record for LLM inferencing out of the blue (and came on the show to talk about how they are doing it)

    Cerebras - fastest LLM inference on wafer scale chips

    Cerebras has introduced the concept of wafer scale chips to the world, which is, if you imagine a microchip, they are the size of a post stamp maybe? GPUs are bigger, well, Cerebras are making chips the sizes of an iPad (72 square inches), largest commercial chips in the world.

    And now, they created an inference stack on top of those chips, and showed that they have the fastest inference in the world, how fast? Well, they can server LLama 3.1 8B at a whopping 1822t/s. No really, this is INSANE speeds, as I was writing this, I copied all the words I had so far, went to inference.cerebras.ai , asked to summarize, pasted and hit send, and I immediately got a summary!

    "The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM. We have 44 gigabytes of memory on chip."James Wang

    They not only store the whole model (405B coming soon), but they store it in full fp16 precision as well, so they don't quantize the models. Right now, they are serving it with 8K tokens in context window, and we had a conversation about their next steps being giving more context to developers.

    The whole conversation is well worth listening to, James and Ian were awesome to chat with, and while they do have a waitlist, as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI compatible API key and be able to test this insane speed.

    P.S - we also did an independent verification of these speeds, using Weave, and found Cerebras to be quite incredible for agentic purposes, you can read our report here and the weave dashboard here

    Anthropic - unlocking just-in-time applications with artifacts for all

    Well, if you aren't paying claude, maybe this will convince you. This week, anthropic announced that artifacts are available to all users, not only their paid customers.

    Artifacts are a feature in Claude that is basically a side pane (and from this week, a drawer in their mobile apps) that allows you to see what Claude is building, by rendering the web application almost on the fly. They have also trained Claude in working with that interface, so it knows about the different files etc

    Effectively, this turns Claude into a web developer that will build mini web applications (without backend) for you, on the fly, for any task you can think of.

    Drop a design, and it'll build a mock of it, drop some data in a CSV and it'll build an interactive onetime dashboard visualizing that data, or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill.

    Artifacts are share-able and remixable, so you can build something and share with friends, so here you go, an artifact I made, by dropping my notes into claude, and asking for a magic 8 Ball, that will spit out a random fact from today's editing of ThursdAI. I also provided Claude with an 8Ball image, but it didn't work due to restrictions, so instead I just uploaded that image to claude and asked it to recreate it with SVG! And viola, a completely un-nessesary app that works!

    Googleโ€™s Gemini Keeps Climbing the Charts (But Will It Be Enough?)

    Sensing a disturbance in the AI force (probably from that Cerebras bombshell), Google rolled out a series of Gemini updates, including a new experimental Gemini 1.5 Pro (0827) with sharper coding skills and logical reasoning. According to LMSys, itโ€™s already nipping at the heels of ChatGPT 4o and is number 2!

    Their Gemini 1.5 Flash model got a serious upgrade too, vaulting to the #6 position on the arena. And to add to the model madness, they even released a Gemini Flash 8B parameter version for folks who need that sweet spot between speed and size.

    Oh, and those long-awaited Gems are finally starting to roll out. But get ready to open your wallet โ€“ this feature (preloading Gemini with custom context and capabilities) is a paid-tier exclusive. But hey, at least Imagen-3 is cautiously returning to the image generation game!

    AI Art & Diffusion

    Doom Meets Stable Diffusion: AI Dreams in 20FPS Glory (GameNGen)

    The future of video games is, uh, definitely going to be interesting. Just as everyone thought AI would be conquering Go or Chess, it seems we've stumbled into a different battlefield: first-person shooters. ๐Ÿคฏ

    This week, researchers at DeepMind blew everyone's minds with their GameNGen research. What did they do? They trained Stable Diffusion 1.4 on Doom, and I'm not just talking about static images - I'm talking about generating actual Doom gameplay in near real time. Think 20FPS Doom running on nothing but the magic of AI.

    The craziest part to me is this quote "Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation"

    FAL Drops the LORA Training Time Bomb (and I Get a New Meme!)

    As you can see, I still haven't relaxed from making custom AI generations with Flux and customizing them by training LORAs. Two weeks ago this used to take 45 minutes, a week ago 20 minutes, and now the wizards at FAL have created a new trainer that shrinks training time down to less than 5 minutes!

    So, with Polaris Dawn (the first SpaceX commercial spacewalk) coming up, I trained a SpaceX astronaut LORA, combined my face with it, and voilà, here I am as a SpaceX astronaut!

    BTW because they are awesome, Jonathan and Simo (who is the magician behind this new trainer) came to the show, announced the new trainer, but also gave all listeners of ThursdAI a coupon to train a LORA effectively for free, just use this link and start training! (btw I get nothing out of this, just trying to look out for my listeners!)

    That's it for this week. Well, almost: magic.dev announced a new funding round of $320 million, and claimed they have models capable of a 100M-token context window plus a coding product to go with them, but didn't release anything yet, just as we were wrapping up. Sam Altman tweeted that OpenAI now has over 200 million active users on ChatGPT and that OpenAI will collaborate with the AI Safety Institute.

    Ok now officially that's it! See you next week, when it's going to be ๐Ÿ already brrr

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.



    This is a public episode. If youโ€™d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hey there, Alex here with an end-of-summer edition of our show, which did not disappoint. Today is the official anniversary of Stable Diffusion 1.4, can you believe it?

    It's the second week in a row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on last week's show), and spoiler alert, we may have something cooking for next week as well!

    This edition of ThursdAI is brought to you by W&B Weave, our LLM observability toolkit, letting you evaluate LLMs for your own use-case easily

    Also this week, we covered both ends of AI progress: a doomerist CEO saying "Fck Gen AI" vs. an 8-year-old coder, and I continued to geek out on putting myself into memes (I promised I'll stop... at some point). So buckle up, let's take a look at another crazy week:

    TL;DR

    * Open Source LLMs

    * AI21 releases Jamba1.5 Large / Mini hybrid Mamba MoE (X, Blog, HF)

    * Microsoft Phi 3.5 - 3 new models including MoE (X, HF)

    * BFCL 2 - Berkeley Function Calling Leaderboard V2 (X, Blog, Leaderboard)

    * NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 12B (HF)

    * Cohere paper proves - code improves intelligence (X, Paper)

    * MOHAWK - transformer โ†’ Mamba distillation method (X, Paper, Blog)

    * AI Art & Diffusion & 3D

    * Ideogram launches v2 - new img diffusion king ๐Ÿ‘‘ + API (X, Blog, Try it)

    * Midjourney is now on web + free tier (try it finally)

    * Flux keeps getting better, cheaper, faster + adoption from OSS (X, X, X)

    * Procreate hates generative AI (X)

    * Big CO LLMs + APIs

    * Grok 2 full is finally available on X - performs well on real time queries (X)

    * OpenAI adds GPT-4o Finetuning (blog)

    * Google API updates - 1000 pages PDFs + LOTS of free tokens (X)

    * This weeks Buzz

    * Weights & Biases Judgement Day SF Hackathon in September 21-22 (Sign up to hack)

    * Video

    * Hotshot - new video model - trained by 4 guys (try it, technical deep dive)

    * Luma Dream Machine 1.5 (X, Try it)

    * Tools & Others

    * LM Studio 0.3.0 update - local RAG, structured outputs with any model & more (X)

    * Vercel - v0 now has chat (X)

    * Ark - a completely offline device - offline LLM + worlds maps (X)

    * Ricky's Daughter coding with cursor video is a must watch (video)

    The Best of the Best: Open Source Wins with Jamba, Phi 3.5, and Surprise Function Calling Heroes

    We kick things off this week by focusing on what we love the most on ThursdAI, open-source models! We had a ton of incredible releases this week, starting off with something we were super lucky to have live, the official announcement of AI21's latest LLM: Jamba.

    AI21 Officially Announces Jamba 1.5 Large/Mini โ€“ The Powerhouse Architecture Combines Transformer and Mamba

    While we've covered Jamba release on the show back in April, Jamba 1.5 is an updated powerhouse. It's 2 models, Large and Mini, both MoE and both are still hybrid architecture of Transformers + Mamba that try to get both worlds.

    Itay Dalmedigos, technical lead at AI21, joined us on the ThursdAI stage for an exclusive first look, giving us the full rundown on this developer-ready model with an awesome 256K context window, but it's not just the size โ€“ itโ€™s about using that size effectively.

    AI21 measured the effective context use of their model on the new RULER benchmark released by NVIDIA, an iteration of the needle in the haystack and showed that their models have full utilization of context, as opposed to many other models.

    โ€œAs you mentioned, weโ€™re able to pack many, many tokens on a single GPU. Uh, this is mostly due to the fact that we are able to quantize most of our parameters", Itay explained, diving into their secret sauce, ExpertsInt8, a novel quantization technique specifically designed for MoE models.

    Oh, and did we mention Jamba is multilingual (eight languages and counting), natively supports structured JSON, function calling, document digestion... basically everything developers dream of. They even chucked in citation generation: since its long context can hold full documents, your RAG app may not even need to chunk anything, and the citations can point to whole documents!

    Berkeley Function Calling Leaderboard V2: Updated + Live (link)

    Ever wondered how to measure the real-world magic of those models boasting "I can call functions! I can do tool use! Look how cool I am!" ๐Ÿ˜Ž? Enter the Berkeley Function Calling Leaderboard (BFCL) 2, a battleground where models clash to prove their function calling prowess.

    Version 2 just dropped, and this ain't your average benchmark, folks. It's armed with a "Live Dataset" - a dynamic, user-contributed treasure trove of real-world queries, rare function documentations, and specialized use-cases spanning multiple languages. Translation: NO more biased, contaminated datasets. BFCL 2 is as close to the real world as it gets.

    So, whoโ€™s sitting on the Function Calling throne this week? Our old friend Claude 3.5 Sonnet, with an impressive score of 73.61. But breathing down its neck is GPT 4-0613 (the OG Function Calling master) with 73.5. That's right, the one released a year ago, the first one with function calling, in fact the first LLM with function calling as a concept IIRC!

    Now, prepare for the REAL plot twist. The top-performing open-source model isnโ€™t some big name, resource-heavy behemoth. Itโ€™s a tiny little underdog called Functionary Medium 3.1, a finetuned version of Llama 3.1 that blew everyone away. It even outscored both versions of Claude 3 Opus AND GPT 4 - leaving folks scrambling to figure out WHO created this masterpiece.

    "I've never heard of this model. It's MIT licensed from an organization called MeetKai. Have you guys heard about Functionary Medium?" I asked, echoing the collective bafflement in the space. Yep, turns out there's gold hidden in the vast landscape of open source models, just waiting to be unearthed ⛏️.

    Microsoft updates Phi 3.5 - 3 new models including an MoE + MIT license

    Three new Phis dropped this week, including an MoE one and a revamped vision one. They look very decent on benchmarks yet again, with the mini version (3.8B) seemingly beating Llama 3.1 8B on a few of them.

    However, as before, the excitement is met with caution: Phi models tend to look great on benchmarks, but when folks actually talk to them, they're usually not as impressed.

    Terry from BigCodeBench also saw a significant decrease in coding ability for Phi 3.5 vs 3.1

    Of course, we're not complaining, the models released with 128K context and MIT license.

    The thing I'm most excited about is the vision model updates, it has been updated with "multi-frame image understanding and reasoning" which is a big deal! This means understanding videos more natively across scenes.

    This weeks Buzz

    Hey, if you're reading this, while sitting in the bay area, and you don't have plans for exactly a month from now, why don't you come and hack with me? (Register Free)

    Announcing, the first W&B hackathon, Judgement Day that's going to be focused on LLM as a judge! Come hack on innovative LLM as a judge ideas, UIs, evals and more, meet other like minded hackers and AI engineers and win great prizes!

    ๐ŸŽจ AI Art: Ideogram Crowns Itself King, Midjourney Joins the Internet & FLUX everywhere

    While there was little news from big LLM labs this week, there is a LOT of AI art news, which is fitting to celebrate 2 year Stable Diffusion 1.4 anniversary!

    ๐Ÿ‘‘ Ideogram v2: Text Wizardry and API Access (But No Lorasโ€ฆ Yet?)

    With significantly improved realism, and likely the best text generation across all models out there, Ideogram v2 just took over the AI image generation game! Just look at that text sharpness!

    They now offer a selection of styles (Realistic, Design, 3D, Anime) and any aspect ratios you'd like and also, brands can now provide color palettes to control the outputs!

    Adding to this is a new API offering (.8c per image for the main model, .5c for the new turbo model of v2!) and a new IOS app, they also added the option (for premium users only) to search through a billion generations and their prompts, which is a great offering as well, as sometimes you don't even know what to prompt.

    They claim a significant improvement over Flux[pro] and Dalle-3 in text, alignment and overall, interesting that MJ was not compared!

    Meanwhile, Midjourney finally launched a website and a free tier, so no longer do you have to learn to use Discord to even try Midjourney.

    Meanwhile Flux enjoys the fruits of Open Source

    While Ideogram and MJ fight it out on the closed-source side, Black Forest Labs enjoys the fruits of releasing their weights in the open.

    Fal just released an update that LORAs run 2.5x faster and 2.5x cheaper, CivitAI has LORAs for pretty much every character and celebrity ported to FLUX already, different techniques like ControlNets Unions, IPAdapters and more are being trained as we speak and tutorials upon tutorials are released of how to customize these models, for free (shoutout to my friend Matt Wolfe for this one)

    you can now train your own face on fal.ai , replicate.com and astria.ai , and thanks to astria, I was able to find some old generations of my LORAs from the 1.5 days (not quite 1.4, but still, enough to show the difference between then and now) and whoa.

    ๐Ÿค” Is This AI Tool Necessary, Bro?

    Let's end with a topic that stirred up a hornet's nest of opinions this week: Procreate, a beloved iPad design app, publicly declared their "f***ing hate" for generative AI.

    Yeah, you read that right. Hate. The CEO, in a public statement went FULL scorched earth - proclaiming that AI-powered features would never sully the pristine code of their precious app.

    โ€œInstead of trying to bridge the gap, heโ€™s creating more walls", Wolfram commented, echoing the general โ€œdudeโ€ฆ what?โ€ vibe in the space. โ€œIt feels marketeerialโ€, I added, pointing out the obvious PR play (while simultaneously acknowledging the very REAL, very LOUD segment of the Procreate community that cheered this decision).

    Hereโ€™s the thing: you can hate the tech. You can lament the potential demise of the human creative spark. You can rail against the looming AI overlords. But one thingโ€™s undeniable: this tech isn't going anywhere.

    Meanwhile, 8yo coders lean in fully into AI

    As a contrast to this doomerism take, just watch this video of Ricky Robinette's eight-year-old daughter building a Harry Potter website in 45 minutes, using nothing but a chat interface in Cursor. No coding knowledge. No prior experience. Just prompts and the power of AI โœจ.

    THAT's where we're headed, folks. It might be terrifying. It might be inspiring. But it's DEFINITELY happening. Better to understand it, engage with it, and maybe try to nudge it in a positive direction than to bury your head in the sand and mutter "I bleeping hate this progress" like a cranky Luddite hermit. Just sayin' 🤷‍♀️.

    AI Device to reboot civilization (if needed)

    I was scrolling through my feed (as I do VERY often, to bring you this every week) and I saw this and super quickly decided to invite the author to the show to talk about it.

    Adam Cohen Hillel has prototyped an AI hardware device, but this one isn't trying to record you or be your friend, no, this one comes with offline LLMs finetuned with health and bio information, survival tactics, and all of the worlds maps and works completely offline!

    This to me was a very exciting use for an LLM, a distilled version of all human knowledge, buried in a faraday cage, with replaceable batteries that runs on solar and can help you survive in the case of something bad happening, like really bad happening (think a solar flare that takes out the electrical grid or an EMP device). While improbable, I thought this was a great idea and had a nice chat with the creator, you should definitely give this one a listen, and if you want to buy one, he is going to sell them soon here

    This is it for this week. There have been a few updates from the big labs: OpenAI opened finetuning for GPT-4o (and you can use your WandB API key in there to track those runs, which is cool), the Gemini API now accepts incredibly large PDF files (up to 1,000 pages), and Grok 2 is finally on X (the full model, not the mini from last week).

    See you next week (we will have another deep dive!)



    This is a public episode. If youโ€™d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Look these crazy weeks don't seem to stop, and though this week started out a bit slower (while folks were waiting to see how the speculation about certain red berry flavored conspiracies are shaking out) the big labs are shipping!

    We've got space uncle Elon dropping an "almost-GPT-4-level" Grok-2 that's uncensored, has access to real-time data on X, and can draw all kinds of images with Flux; OpenAI announced a new ChatGPT-4o version (not the one from last week that supported structured outputs, a different one!); and Anthropic dropped something that makes AI engineers salivate!

    Oh, and for the second week in a row, ThursdAI live spaces were listened to by over 4K people, which is very humbling, and awesome because for example today, Nous Research announced Hermes 3 live on ThursdAI before the public heard about it (and I had a long chat w/ Emozilla about it, very well worth listening to)

    TL;DR of all topics covered:

    * Big CO LLMs + APIs

    * Xai releases GROK-2 - frontier level Grok, uncensored + image gen with Flux (๐•, Blog, Try It)

    * OpenAI releases another ChatGPT-4o (and tops LMsys again) (X, Blog)

    * Google showcases Gemini Live, Pixel Bugs w/ Gemini, Google Assistant upgrades ( Blog)

    * Anthropic adds Prompt Caching in Beta - cutting costs by up to 90% (X, Blog)

    * AI Art & Diffusion & 3D

    * Flux now has support for LORAs, ControlNet, img2img (Fal, Replicate)

    * Google Imagen-3 is out of secret preview and it looks very good (๐•, Paper, Try It)

    * This weeks Buzz

    * Using Weights & Biases Weave to evaluate Claude Prompt Caching (X, Github, Weave Dash)

    * Open Source LLMs

    * NousResearch drops Hermes 3 - 405B, 70B, 8B LLama 3.1 finetunes (X, Blog, Paper)

    * NVIDIA Llama-3.1-Minitron 4B (Blog, HF)

    * AnswerAI - colbert-small-v1 (Blog, HF)

    * Vision & Video

    * Runway Gen-3 Turbo is now available (Try It)

    Big Companies & LLM APIs

    Grok 2: Real Time Information, Uncensored as Hell, andโ€ฆ Flux?!

    The team at xAI definitely knows how to make a statement, dropping a knowledge bomb on us with the release of Grok 2. This isn't your uncle's dad joke model anymore - Grok 2 is a legitimate frontier model, folks.

    As Matt Shumer excitedly put it

    โ€œIf this model is this good with less than a year of work, the trajectory theyโ€™re on, it seems like they will be far above this...very very soonโ€ ๐Ÿš€

    Not only does Grok 2 have impressive scores on MMLU (beating the previous GPT-4o on their benchmarksโ€ฆ from MAY 2024), it even outperforms Llama 3 405B, proving that xAI isn't messing around.

    But here's where things get really interesting. Not only does this model access real time data through Twitter, which is a MOAT so wide you could probably park a rocket in it, it's also VERY uncensored. Think generating political content that'd make your grandma clutch her pearls or imagining Disney characters breaking bad in a way thatโ€™s both hilarious and kinda disturbing all thanks to Grok 2โ€™s integration with Black Forest Labs Flux image generation model.

    With an affordable price point ($8/month for x Premium including access to Grok 2 and their killer MidJourney competitor?!), itโ€™ll be interesting to see how Grokโ€™s "truth seeking" (as xAI calls it) model plays out. Buckle up, folks, this is going to be wild, especially since all the normies now have the power to create political memes, that look VERY realistic, within seconds.

    Oh yeahโ€ฆ and thereโ€™s the upcoming Enterprise API as wellโ€ฆ and Grok 2โ€™s made its debut in the wild on the LMSys Arena, lurking incognito as "sus-column-r" and is now placed on TOP of Sonnet 3.5 and comes in as number 5 overall!

    OpenAI's latest ChatGPT model is back at #1, but it's all very confusing 😵‍💫

    As the news about Grok-2 was settling in, OpenAI decided to, well... drop yet another GPT-4o update on us. While Google was hosting their event, no less. Seriously, OpenAI? I guess they like to one-up Google's new releases (they also kicked Gemini off the #1 position after only one week there).

    So the model that was "anonymous-chatbot" on LMSys for the past week was also released in the ChatGPT interface, and it's now the best LLM in the world according to LMSYS and other folks: #1 at math, #1 at complex prompts and coding, and #1 overall.

    It is also available for us developers via API, but... they don't recommend using it? ๐Ÿค”

    The most interesting thing about this release is that they can't really tell us why it's better; they just know that it is, qualitatively, and that it's not a new frontier-class model (i.e., not 🍓 or GPT-5).

    Their release notes on this are something else ๐Ÿ‘‡

    Meanwhile it's been 3 months, and the promised Advanced Voice Mode is only in the hands of a few lucky testers so far.

    Anthropic Releases Prompt Caching to Slash API Prices By up to 90%

    Anthropic joined DeepSeek's game of "Let's Give Devs Affordable Intelligence" this week, rolling out prompt caching with up to 90% cost reduction on cached tokens (yes, NINETY... 🤯). For those of you new to all this technical sorcery:

    Prompt caching allows the inference provider to save users money by reusing repeated chunks of a long prompt from cache, reducing pricing and improving time to first token. It's especially beneficial for long-context (>100K) use-cases like conversations with books, agents with a lot of memory, 1,000 examples in a prompt, etc.

    We covered caching before with Gemini (at Google I/O) and last week with DeepSeek, but IMO this is a better implementation from a frontier lab: it's easy to get started, it manages the cache timeout for you (unlike Google), and it's a no-brainer to implement.
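    To make the "repeated chunks" idea concrete, here's a minimal sketch of marking a long prefix for caching with the Anthropic Python SDK as it looked at launch (the beta header and model id were current at the time of writing and may have changed since, so treat this as illustrative rather than canonical):

    ```python
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    long_transcript = open("podcast_transcripts.txt").read()  # the big, reused chunk

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},  # beta flag at launch
        system=[
            {"type": "text", "text": "You answer questions about podcast transcripts."},
            # Everything up to and including this block is cached and reused on later calls
            {"type": "text", "text": long_transcript, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": "What did the guests say about wafer scale chips?"}],
    )
    print(response.content[0].text)
    ```

    Subsequent calls that reuse the exact same prefix should hit the cache and get billed at the discounted cached-token rate.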

    And, you'll definitely want to see the code to implement it all yourself, (plus Weave is free!๐Ÿคฉ):

    "In this week's buzz categoryโ€ฆ I used Weave, our LLM observability tooling to super quickly evaluate how much cheaper Cloud Caching from Anthropic really is, I did a video of it and I posted the code โ€ฆ If you're into this and want to see how to actually do this โ€ฆ how to evaluate, the code is there for you" - Alex

    With the ridiculous 90% price drop for those cached calls, Haiku basically becomes FREE and cached Claude costs about as much as Haiku ($0.30 per 1M tokens). For context, I took 5 transcripts of 2-hour podcast conversations, which amounted to ~110,000 tokens overall, and I was able to ask questions across all this text for less than $1 (see the above video).

    Code Here + Weave evaluation Dashboard here
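    And if you want to run a comparison like that yourself, the core Weave pattern is tiny: initialize a project and decorate the calls you want traced. A minimal sketch (the project name and prompt are placeholders, and this is not the exact evaluation from the video above):

    ```python
    import anthropic
    import weave

    weave.init("prompt-caching-eval")  # placeholder project name
    client = anthropic.Anthropic()

    @weave.op()  # logs inputs, outputs and latency for every call to the Weave dashboard
    def ask_claude(question: str) -> str:
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=512,
            messages=[{"role": "user", "content": question}],
        )
        return response.content[0].text

    print(ask_claude("In one sentence, what is prompt caching?"))
    ```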

    AI Art, Diffusion, and Personalized AI On the Fly

    Speaking of mind blowing, Flux took over this week, thanks in no small part to Elon strategically leveraging their tech in Grok (and everyone reminding everyone else, that it's not Grok creating images, it's Flux!)

    Now, remember, the REAL magic happens when code meets open source, โ€œFlux now has support for LORAs, ControlNet, img2imgโ€ฆ" meaning developers have turned those foundational tools into artistic wizardry. With as little as $5 bucks and a few pictures, โ€œYou can train the best image model on your own face. โ€๐Ÿคฏ (Seriously folks, head up to Fal.ai, give it a whirlโ€ฆ itโ€™s awesome)

    Now if you combine the LORA tech with ControlNet tech, you can get VERY creepy very fast (I'm using my own face here but you get the idea), here's "me" as the distracted boyfriend meme, and the girlfriend, and the distraction ๐Ÿ˜‚ (I'm sorry you had to see this, AI has gone too far! Shut it all down!)

    If seeing those creepy faces on screen isn't for you (I totally get that) thereโ€™s also Google IMAGEN 3, freshly escaped from secret preview and just waiting for you to unleash those artistic prompts on it! Google, despite beingโ€ฆ Google, somehow figured out that a little competition does a lab good and rolled out a model thatโ€™s seriously impressive.

    Runway Video Gets a "Turbocharged" Upgrade๐Ÿš€๐Ÿš€๐Ÿš€

    Ever tried those jaw-dropping text-to-video generators but groaned as you watched those seconds of video render painfully slowly?๐Ÿ˜ญ Well Runway, creators of Gen 3, answered our prayers with the distilled turbocharged version that churns out those visuals in a blink ๐Ÿคฏ๐Ÿคฏ๐Ÿคฏ .

    What's truly cool is they unlocked it for FREE tier users (sign up and unleash those cinematic prompts right now!), letting everyday folks dip their toes in those previously-unfathomable waters. Even the skeptics at OpenBMB (Junyang knows what I'm talking aboutโ€ฆ) had to acknowledge that their efforts with MiniCPM V are impressive, especially the smooth way it captures video sequences better than models even twice its size ๐Ÿคฏ.

    Open Source: Hermes 3 and The Next Generation of Open AI ๐Ÿš€

    NousResearch Dropped Hermes 3: Your New Favorite AI (Yes Really)

    In the ultimate โ€œWe Dropped This On ThursdAI Before Even HuggingFaceโ€, the legendary team at NousResearch dropped the hottest news since Qwen decided to play math God: Hermes 3 is officially here! ๐Ÿคฏ

    โ€œYouโ€™re about to get to use the FIRST big Finetune of LLama 3.1 405Bโ€ฆ We donโ€™t think there have been finetunes,โ€ announced Emozilla whoโ€™s both co founder and resident master wizard of all things neural net, โ€œAnd it's available to try for free thanks to Lambda, you can try it out right here โ€ (youโ€™re all racing to their site as I type this, I KNOW it!).

    Not ONLY does this beauty run ridiculously smooth on Lambda, but hereโ€™s the real TL;DR:

    * Hermes 3 isnโ€™t just 405B; there are 70B and 8B versions dropping simultaneously on Hugging Face, ready to crush benchmarks and melt your VRAM (in a GOOD wayโ€ฆ okay maybe not so great for your power bill ๐Ÿ˜…).

    * On Benchmark, they beat LLama 3.1 instruct on a few evals and lose on some, which is quite decent, given that Meta team did an amazing job with their instruct finetuning (and probably spent millions of $ on it too)

    * Hermes 3 is all about user alignment, which our open source champion Wolfram Ravenwolf summarized beautifully: โ€œWhen you have a model, and you run it on your system, IT MUST BE LOYAL TO YOU.โ€ ๐Ÿ˜ˆ

    Hermes 3 does just that, with incredibly precise control via its godlike system prompt: "In Hermes 3 the system prompt is KING," confirmed Emoz. It's so powerful that the 405B version was practically suffering existential angst in their first conversation... I read that part out loud during the space, but here you go, this is their first conversation - and he goes into why they think this happened in our chat, which is very worth listening to.

    This model was trained on a bunch of data sources that they will release in the future, includes tool use, and supports a slew of tokens you can add in the system prompt to trigger abilities in the model: chain of thought, a scratchpad (think, then rethink), citing from sources for RAG purposes, and a BUNCH more.

    The technical report is HERE and is worth diving into as is our full conversation with Emozilla on the pod.

    Wrapping Things Upโ€ฆ But Weโ€™re Just Getting Started! ๐Ÿ˜ˆ

    I know, I KNOW, your brain is already overflowing but we barely SCRATCHED the surfaceโ€ฆ

    We also dove into NVIDIA's research into new pruning and distilling techniques, TII Falconโ€™s attempt at making those State Space models finally challenge the seemingly almighty Transformer architecture (it's getting closer... but has a way to go!), plus AnswerAI's deceptively tiny Colbert-Small-V1, achieving remarkable search accuracy despite its featherweight size and a bunch more...

    See you all next week for whatโ€™s bound to be yet another wild AI news bonanzaโ€ฆ Get those download speeds prepped, weโ€™re in for a wild ride. ๐Ÿ”ฅ



    This is a public episode. If youโ€™d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Hold on tight, folks, because THIS week on ThursdAI felt like riding a roller coaster through the wild world of open-source AI - extreme highs, mind-bending twists, and a sprinkle of "wtf is happening?" conspiracy theories for good measure. ๐Ÿ˜‚

    Theme of this week is, Open Source keeps beating GPT-4, while we're inching towards intelligence too cheap to meter on the API fronts.

    We even had a live demo so epic, folks at the Large Hadron Collider are taking notice! Plus, strawberry shenanigans abound (did Sam REALLY tease GPT-5?), and your favorite AI evangelist nearly got canceled on X! Buckle up; this is gonna be another long one! ๐Ÿš€

    Qwen2-Math Drops a KNOWLEDGE BOMB: Open Source Wins AGAIN!

    When I say "open source AI is unstoppable", I MEAN IT. This week, the brilliant minds from Alibaba's Qwen team decided to show everyone how it's DONE. Say hello to Qwen2-Math-72B-Instruct - a specialized language model SO GOOD at math, it's achieving a ridiculous 84 points on the MATH benchmark. ๐Ÿคฏ

    For context, folks... that's beating GPT-4, Claude Sonnet 3.5, and Gemini 1.5 Pro. We're not talking incremental improvements here - this is a full-blown DOMINANCE of the field, and you can download and use it right now. ๐Ÿ”ฅ

    Get Qwen-2 Math from HuggingFace here

    What made this announcement EXTRA special was that Junyang Lin , the Chief Evangelist Officer at Alibaba Qwen team, joined ThursdAI moments after they released it, giving us a behind-the-scenes peek at the effort involved. Talk about being in the RIGHT place at the RIGHT time! ๐Ÿ˜‚

    They painstakingly crafted a massive, math-specific training dataset, incorporating techniques like Chain-of-Thought reasoning (where the model thinks step-by-step) to unlock this insane level of mathematical intelligence.

    "We have constructed a lot of data with the form of ... Chain of Thought ... And we find that it's actually very effective. And for the post-training, we have done a lot with rejection sampling to create a lot of data sets, so the model can learn how to generate the correct answers" - Junyang Lin

    Now I gotta give mad props to Qwen for going beyond just raw performance - they're open-sourcing this beast under an Apache 2.0 license, meaning you're FREE to use it, fine-tune it, adapt it to your wildest mathematical needs! ๐ŸŽ‰

    But hold on... the awesomeness doesn't stop there! Remember those smaller, resource-friendly LLMs everyone's obsessed with these days? Well, Qwen released 7B and even 1.5B versions of Qwen-2 Math, achieving jaw-dropping scores for their size (70 for the 1.5B?? That's unheard of!).๐Ÿคฏ Nisten nearly lost his mind when he heard that - and trust me, he's seen things. ๐Ÿ˜‚

    "This is insane! This is... what, Sonnet 3.5 gets what, 71? 72? This gets 70? And it's a 1.5B? Like I could run that on someone's watch. Real." - Nisten

    With this level of efficiency, we're talking about AI-powered calculators, tutoring apps, research tools that run smoothly on everyday devices. The potential applications are endless!
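    If you want to poke at the small one yourself, a standard transformers chat-template call is all it takes. A minimal sketch, assuming the 1.5B instruct checkpoint name from the Hugging Face release:

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2-Math-1.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [
        {"role": "system", "content": "Please reason step by step."},
        {"role": "user", "content": "Find the value of x if 2x + 7 = 21."},
    ]
    # Build the chat prompt the way the model was trained to see it
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
    ```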

    MiniCPM-V 2.6: A Pocket-Sized GPT-4 Vision... Seriously! ๐Ÿคฏ

    If Qwen's Math marvel wasn't enough open-source goodness for ya, OpenBMB had to get in on the fun too! This time, they're bringing the ๐Ÿ”ฅ to vision with MiniCPM-V 2.6 - a ridiculous 8 billion parameter VLM (visual language model) that packs a serious punch, even outperforming GPT-4 Vision on OCR benchmarks!

    OpenBMB drops a bomb on X here

    I'll say this straight up: talking about vision models in a TEXT-based post is hard. You gotta SEE it to believe it. But folks... TRUST ME on this one. This model is mind-blowing, capable of analyzing single images, multi-image sequences, and EVEN VIDEOS with an accuracy that rivaled my wildest hopes for open-source.๐Ÿคฏ

    Check out their playground and prepare to be stunned

    It even captured every single nuance in this viral toddler speed-running video I threw at it, with an accuracy I haven't seen in models THIS small:

    "The video captures a young child's journey through an outdoor park setting. Initially, the child ... is seen sitting on a curved stone pathway besides a fountain, dressed in ... a green t-shirt and dark pants. As the video progresses, the child stands up and begins to walk ..."

    Junyang said that they actually collabbed with the OpenBMB team and knows firsthand how much effort went into training this model:

    "We actually have some collaborations with OpenBMB... it's very impressive that they are using, yeah, multi-images and video. And very impressive results. You can check the demo... the performance... We care a lot about MMMU [the benchmark], but... it is actually relying much on large language models." - Junyang Lin

    Nisten and I have been talking for months about the relationship between these visual "brains" and the larger language model base powering their "thinking." While it seems smaller models are catching up fast, combining a top-notch visual processor like MiniCPM-V with a monster LLM like Qwen 72B or Llama 405B could unlock truly unreal capabilities.

    This is why I'm excited - open source lets us mix and match like this! We can Frankenstein the best parts together and see what emerges... and it's usually something mind-blowing. ๐Ÿคฏ

    Thank you for reading ThursdAI - Recaps of the most high signal AI weekly spaces. This post is public so feel free to share it.

    From the Large Hadron Collider to YOUR Phone: This Model Runs ANYWHERE ๐Ÿš€

    While Qwen2-Math is breaking records on one hand, Nisten's latest creation, Biggie-SmoLlm, is showcasing the opposite side of the spectrum. Trying to get the smallest/fastest coherent LLM possible, Nisten blew up on HuggingFace.

    Biggie-SmoLlm (Hugging Face) is TINY, efficient, and, with some incredible optimization work from the folks right here on the show, it's reaching an insane 330 tokens/second on regular M3 chips. 🤯 That's WAY faster than real-time conversation, folks! And thanks to Eric Hartford (from Cognitive Computations) and his awesome new optimizer, GrokAdamW, it's surprisingly coherent for such a lil' fella.

    The cherry on top? Someone messaged Nisten saying they're using Biggie-SmoLlm at the Large. Hadron. Collider. ๐Ÿ˜ณ I'll let that sink in for a second.

    It was incredible having ALL the key players behind Biggie-SmoLlm right there on stage: LDJ (whose Capybara dataset made it teaching-friendly), Junyang (whose Qwen work served as the base), and Eric (the optimizer mastermind himself). THIS, my friends, is what the ThursdAI community is ALL about! ๐Ÿš€

    Speaking of which this week we got a new friend of the pod, Mark Saroufim, a long time PyTorch core maintainer, to join the community.

    This Week's Buzz (and Yes, It Involves Making AI Even Smarter) ๐Ÿค“

    NeurIPS Hacker Cup 2024 - Can You Solve Problems Humans Struggle With? ๐Ÿค”

    I've gotta hand it to my PyTorch friend, Mark Saroufim. He knows how to make AI interesting! He and his incredible crew (Weiwei from MSFT, some WandB brainiacs, and more) are bringing you NeurIPS Hacker Cup 2024 - a competition to push those coding agents to their ABSOLUTE limits. ๐Ÿš€

    This isn't your typical "LeetCode easy" challenge, folks... These are problems SO hard, years of competitive programming experience are required to even attempt them! Mark himself said,

    โ€œAt this point, like, if a model does make a significant dent in this competition, uh, I think people would need to acknowledge that, like, LLMs can do a form of planning. โ€

    And don't worry, total beginners: Mark and Weights & Biases are hosting a series of FREE sessions to level you up. Get those brain cells prepped and ready for the challenge and then Join the NeurIPS Hacker Cup Discord

    P.S. We're ALSO starting a killer AI Salon series in our SF office on August 15th! You'll get a chance to chat with researchers like Shreya Shankar - she's a leading voice on evaluation. More details and free tickets right here! AI Salons Link

    Big Co & APIs - Towards intelligence too cheap to meter

    Open-source was crushing it this week... but that didn't stop Big AI from throwing a few curveballs. OpenAI is doubling down on structured data (AND cheaper models!), Google slashed Gemini prices again (as we trend towards intelligence too cheap to meter), and a certain strawberry mystery took over Twitter.

    DeepSeek context caching lowers prices by up to 90% automatically

    DeepSeek, those masters of ridiculously-good coding AI, casually dropped a bombshell - context caching for their API! ๐Ÿคฏ

    If you're like "wait, what does THAT mean?", listen up because this is game-changing for production-grade AI:

    * Problem: LLMs get fed the ENTIRE conversation history EVERY. SINGLE. TIME. This wastes compute (and $$$) when info is repeated.

    * Solution: DeepSeek now remembers what you've said, automatically pulling from a cache when the conversation goes down familiar paths.

    * The Win: Up to 90% cheaper API calls. Yes, NINETY.๐Ÿ˜ณ It costs 1.4 CENTS per million tokens for cached content. Let THAT sink in. ๐Ÿคฏ

    As Nisten (always bringing the technical breakdowns) explained:

    "Everyone should be using LLMs this way!...The simplest way is to have a long conversation ... then you save it on disk... you don't have to wait again ... [it's] kind of free. DeepSeek... did this in a more dynamic way". - Nisten

    Even Matt Shumer, who usually advocates for clever prompting over massive context, got legitimately hyped about the possibilities:

    "For me, and how we use LLMs... instead of gathering a million examples... curate a hundred gold examples... you have something better than if you fine-tuned it, and cheaper, and faster..." - Matt Shumer

    Think about this... instead of painstakingly fine-tuning, we can "guide" models with expertly crafted examples, letting them learn "on the fly" with minimal cost. Context as the NEW fine-tuning! ๐Ÿคฏ

    P.S. - Google also has caching on its Gemini API, but you have to opt in, while this happens automatically with the DeepSeek API!
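    Because the caching is automatic, there's nothing special to implement on your side: just keep the long, shared prefix identical across calls and let DeepSeek dedupe it. A minimal sketch against their OpenAI-compatible API (base URL and model id as documented at the time, double-check them):

    ```python
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
        api_key="YOUR_DEEPSEEK_API_KEY",
    )

    book = open("long_document.txt").read()  # the big prefix that stays identical across calls

    def ask(question: str) -> str:
        # Keep the long prefix byte-identical so later calls hit the server-side cache
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": f"Answer questions about this document:\n{book}"},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    print(ask("What is chapter one about?"))
    print(ask("Who is the main character?"))  # reuses the cached prefix at the cheaper rate
    ```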

    Google Goes "Price War Nuclear": Gemini Flash is Officially TOO CHEAP

    Speaking of sneaky advancements from Google... they also dropped an update SO casually impactful, it almost got lost in the shuffle. Gemini Flash (their smallest, but still crazy-good model) is now... 7.5 cents per million tokens for input and 30 cents per million tokens for output... (for up to 128k of context)

    I REPEAT: 7.5 cents... with LONG context!? ๐Ÿคฏ Google, please chill, MY SANITY cannot handle this price free-fall any longer! ๐Ÿ˜‚

    Full Breakdown of Geminiโ€™s Crazy New Prices on Googleโ€™s Blog

    While this USUALLY means a model's performance gets quietly nerfed in exchange for lower costs... in Gemini's case? Let's just say... even I, a staunch defender of open-source, am kinda SHOOK by how GOOD this thing is NOW!

    After Google threw down this gauntlet, I actually used Gemini to draft my last ThursdAI newsletter (for the first time!). It nailed my tone and style better than any other model I've tried - and I've TRIED them ALL. ๐Ÿคฏ Even Nisten, who's super picky about his coding LLMs, gave it a rare nod of approval. Gemini's image understanding capabilities have improved significantly too! ๐Ÿคฏ

    Google also added improvements in how Gemini understands PDFs that are worth mentioning ๐Ÿ‘€
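    For the PDF side, the usual flow is to push the file through the Gemini File API and then reference it in the prompt. A minimal sketch with the google-generativeai Python SDK (the exact Flash model id is my assumption here, so verify it against the current docs):

    ```python
    import google.generativeai as genai

    genai.configure(api_key="YOUR_GEMINI_API_KEY")

    # Upload the PDF once; Gemini can ingest very large documents this way
    pdf = genai.upload_file("annual_report.pdf")

    model = genai.GenerativeModel("gemini-1.5-flash")  # the cheap Flash tier discussed above
    response = model.generate_content([pdf, "Summarize the key findings in five bullet points."])
    print(response.text)
    ```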

    From JSON Headaches to REASONING Gains: What's Really New with GPT-4?

    While Matt Shumer, my go-to expert on all things practical AI, might not be immediately impressed by OpenAI's new Structured Outputs feature, it's still a huge win for many developers. Tired of LLM JSON going haywire? Well, GPT-4o can now adhere to your exact schemas, delivering 100% reliably structured data, no need for Instructor! 🙌

    This solves a real problem, even if the prompting gurus (like Matt) have figured out their own workarounds. The key is:

    * Determinism: This ain't your typical LLM chaos - they're guaranteeing consistency, essential for building reliable applications.

    * Ease of use: No need for external libraries - it's built right into the API!
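    Here's roughly what that looks like with the OpenAI Python SDK's schema-constrained helper; a minimal sketch where the dated model id is the snapshot this feature shipped with, and the Pydantic class is just my own example:

    ```python
    from openai import OpenAI
    from pydantic import BaseModel

    client = OpenAI()

    class NewsItem(BaseModel):
        title: str
        category: str
        is_open_source: bool

    # The parse helper constrains generation to the schema and returns a typed object
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Extract a structured news item from the user's text."},
            {"role": "user", "content": "Meta released Llama 3.1 405B under an open license."},
        ],
        response_format=NewsItem,
    )
    item = completion.choices[0].message.parsed
    print(item.title, item.category, item.is_open_source)
    ```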

    Plus... a sneaky price drop, folks! The new GPT-4o snapshot is 50% cheaper for input tokens and 33% cheaper for output. As I said on the show:

    "Again, quite insane... we're getting 50% cheaper just without fanfare. We're going towards 'intelligence too cheap to meter'... it's crazy".

    And HERE'S the plot twist... multiple folks on stage (including the eager newcomer N8) noticed significant reasoning improvements in this new GPT-4 model. They tested it on tasks like lateral thinking puzzles and even anecdotally challenging tasks - and guess what? It consistently outperformed older versions. ๐Ÿคฏ

    "I have my own benchmark... of lateral thinking puzzles... the new GPT-4 [scored] roughly five to 10% higher... these are like really hard lateral thinking puzzles that require innovative reasoning ability". - N8

    OpenAI isn't bragging about this upgrade explicitly, which makes me even MORE curious... ๐Ÿค”

    Mistral Joins the AGENT Hype Train (But Their Version is Different)

    Everybody wants a piece of that AI "Agent" pie, and now Mistral (the scrappy, efficient French company) is stepping up. They announced a double whammy this week: fine-tuning is here AND "les agents" have arrived... but their agents are NOT quite what we're seeing elsewhere (think AutoGPT, CrewAI, all those looped assistants). ๐Ÿค”

    Mistral's Blog Post - Fine-tuning & Agents... Ooh La La!

    Their fine-tuning service is pretty straightforward: upload your data and they'll host a bespoke Mistral Large V2 running through their API at no extra cost (very cool!).

    Their agents aren't based on agentic loop-running like what we see from those recursive assistants. As I pointed out on ThursdAI:

    "[Mistral] agents are not agentic... They're more similar to... GPTs for OpenAI or 'Projects' in Anthropic, where... you as a user add examples and preload context".

    It's more about defining agents with examples and system prompts, essentially letting Mistral "pre-tune" their models for specific tasks. This lets you deploy those agents via the API or to their LeChat platform - pretty darn neat!

    Build your OWN agent - Mistral's "Agent Builder" is slick!

    While not as flashy as those recursive agents that build websites and write symphonies on their own, Mistral's take on the agent paradigm is strategic. It plays to their strengths:

    * Developer-focused: It's about creating bespoke, task-specific tools - think API integrations, code reviewers, or content generators.

    * Ease of deployment: No need for complex loop management, Mistral handles the hard parts for you!

    Mistral even teased that they'll eventually be incorporating tool use... so these "pre-tuned" agents could quickly evolve into something very interesting. ๐Ÿ˜

    NVIDIA leak about downloading videos went viral (And the Internet... Didn't Like That!)

    This week, I found myself unexpectedly at the center of an X drama explosion (fun times! 😅) when some leaked NVIDIA Slack messages showed them discussing which YouTube channels to scrape. My crime? I dared to ask how this is different from Google creating Street View, filming every street possible without asking for permission. My Honest Question that Sparked AI Outrage

    The Internet, as it often does, had thoughts . The tweet blew up (like a million views blew up). I was labeled an apologist, a shill, all kinds of charming things... ๐Ÿ˜‚ It got so intense, I had to MUTE the whole thing for my sanity's sake. BUT it brings up serious issues:

    * AI & Copyright: Where the Heck are the Lines? When does inspiration become infringement when a model's trained on massive datasets? There's no legal precedent, folks, which is scary .

    * Ethics vs. Innovation: AI progress moves FAST... sometimes FASTER than our ability to grasp the implications. That's unsettling.

    * Twitter Pile-Ons & Nuance (aka What NOT to do): Look, I GET being passionate. BUT when criticism turns into name-calling and mob mentality, it shuts down any chance of meaningful conversation. That's not helping ANYONE.

    Strawberry Shenanigans: Theories, Memes, and a Little AI LARPing?๐Ÿ“

    And now, for the MAIN EVENT: STRAWBERRY! You might have heard whispers... seen those cryptic tweets... maybe witnessed that wild Twitter account firsthand! It all started with Sam Altman casually posting a pic of a strawberry garden with the caption "nice summer day". Then came the deluge - more pics of strawberries from OpenAI folks, even those cryptic, semi-official domain names LDJ uncovered... I even spotted a strawberry IN OUR audience for crying out loud! This thing spread like wildfire. ๐Ÿ”ฅ

    We spent a solid chunk of the episode piecing together the lore: Q*, the mystery model shrouded in secrecy for years, then that Bloomberg leak claiming it was code-named "Strawberry", and now this. It was peak AI conspiracy-theory land!

    We still don't have hard confirmation on Q*... but that strawberry account, spitting out fruit puns and pinging ChatGPT like a maniac? Some on ThursdAI (Yam, mostly) believe that this may not have been a human at all, but an early, uncontrolled attempt to have an AI manage its own PR. ๐Ÿ˜ณ I almost bought it - especially the way it reacted to some of my live comments - but now... the LARP explanation seems more likely

    Many folks at OpenAI posted things with strawberries as well, was this a sign of something to come or were they just trying to bury the news that 3 executives departed the company this week under a mountain of ๐Ÿ“?

    Cursor & Composer: When Coding Becomes AI-Powered Magic โœจ

    I love a good tool... and this week, my dev heart was a-flutter over Cursor . Tried it yet? Seriously, you need to! It's VS Code, but SUPERCHARGED with AI that'll make you question why Copilot ever existed. ๐Ÿ˜‚

    You can edit code by CHAT, summarize entire files with one click, zap bugs instantly ... but they just dropped their ultimate weapon: Composer. It's essentially a coding AGENT that does multi-file edits. ๐Ÿคฏ

    Matt Shumer (my SaaS wizard friend who adopted Cursor early) had some jaw-dropping examples:

    " [Composer] ... takes all the parts of Cursor you like and strings them together as an agent... it takes away a lot of the grunt work... you can say 'go add this feature'... it searches your files, figures out what to edit, then puts it together. ...I literally built a SaaS in 20 minutes!" - Matt Shumer

    Matt also said that using Cursor is required at their company!

    Even my stoic PyTorch friend, Mark, couldn't help but express some curiosity:

    "It's cool they're doing things like multi-file editing... pretty curious to see more projects along those lines" - Mark Serafim

    Yeah, it's still in the rough-around-the-edges stage (UX could use some polish). But THIS, folks, is the future of coding - less about hammering out syntax, more about describing INTENT and letting the AI handle the magic! ๐Ÿคฏ I can't wait to see what they do next.

    Download at cursor.sh and let me know what you think

    Conclusion: The Future Is FAST, Open, And Maybe a Bit TOO Spicy? ๐ŸŒถ๏ธ๐Ÿ˜‚

    Honestly, every single week leaves me awestruck by how fast this AI world is moving. ๐Ÿคฏ We went from "transformers? Huh?" to 70-point math models running on SMARTWATCHES and AI building ENTIRE web apps in less than two years. And I still haven't got GPT-4's new voice model yet!!

    Open source keeps proving its power, even THOSE BIG companies are getting in on the action (look at those Google prices! ๐Ÿ˜), and then you've got those captivating mysteries keeping us on our toes... like those damned strawberries! ๐Ÿ“ What DOES OpenAI have up their sleeve??

    As always, huge THANK YOU to the amazing guests who make this show what it is - this week, extra kudos to Junyang, Nisten, LDJ, Mark, Yam, and Eric, you guys ROCK. ๐Ÿ”ฅ And HUGE gratitude to each and every ONE of you readers/listeners (and NEW folks who stuck around after those Strawberry bait tweets! ๐Ÿ˜‚) You make this ThursdAI community truly unstoppable. ๐Ÿ’ช

    Keep on building, stay insanely curious, and I'll see you next Thursday - ready or not, that AI future is coming in hot! ๐Ÿ”ฅ๐Ÿš€

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.



    This is a public episode. If youโ€™d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • Starting Monday, Apple released the iOS 18.1 beta with Apple Intelligence, then Meta dropped SAM-2 (Segment Anything Model 2), then Google open sourced Gemma 2 2B, and now (just literally 2 hours ago, during the live show) released Gemini 1.5 Pro 0801 experimental, which takes #1 on the LMsys arena across multiple categories. To top it all off, we also got a new SOTA image diffusion model called FLUX.1 from ex-Stability folks and their new Black Forest Labs.

    This week on the show, we had Joseph Nelson & Piotr Skalski from Roboflow talk in depth about Segment Anything, and as the absolute experts on this topic (Skalski is our returning vision expert), it was an incredible deep dive into the importance of dedicated vision models (not VLMs).

    We also had Lukas Atkins & Fernando Neto from Arcee AI talk to us about their new DistillKit and explain model distillation in detail, and finally Cristiano Giardina, one of the lucky few who got access to OpenAI's advanced voice mode, came on the show as well (together with his new friend GPT-4o)!

    Honestly, how can one keep up with all this? By reading ThursdAI, of course, that's how. But ⚠️ buckle up, this is going to be a BIG one (over 4.5K words I think, which marks it as the longest newsletter I've penned; I'm sorry, maybe read this one at 2x? 😂)

    [ Chapters ]

    00:00 Introduction to the Hosts and Their Work

    01:22 Special Guests Introduction: Piotr Skalski and Joseph Nelson

    04:12 Segment Anything 2: Overview and Capabilities

    15:33 Deep Dive: Applications and Technical Details of SAM2

    19:47 Combining SAM2 with Other Models

    36:16 Open Source AI: Importance and Future Directions

    39:59 Introduction to Distillation and DistillKit

    41:19 Introduction to DistilKit and Synthetic Data

    41:41 Distillation Techniques and Benefits

    44:10 Introducing Fernando and Distillation Basics

    44:49 Deep Dive into Distillation Process

    50:37 Open Source Contributions and Community Involvement

    52:04 ThursdAI Show Introduction and This Week's Buzz

    53:12 Weights & Biases New Course and San Francisco Meetup

    55:17 OpenAI's Advanced Voice Mode and Cristiano's Experience

    01:08:04 SearchGPT Release and Comparison with Perplexity

    01:11:37 Apple Intelligence Release and On-Device AI Capabilities

    01:22:30 Apple Intelligence and Local AI

    01:22:44 Breaking News: Black Forest Labs Emerges

    01:24:00 Exploring the New Flux Models

    01:25:54 Open Source Diffusion Models

    01:30:50 LLM Course and Free Resources

    01:32:26 FastHTML and Python Development

    01:33:26 Friend.com: Always-On Listening Device

    01:41:16 Google Gemini 1.5 Pro Takes the Lead

    01:48:45 GitHub Models: A New Era

    01:50:01 Concluding Thoughts and Farewell

    Show Notes & Links

    * Open Source LLMs

    * Meta gives SAM-2 - segment anything with one shot + video capability! (X, Blog, DEMO)

    * Google open sources Gemma 2 2.6B (Blog, HF)

    * MTEB Arena launching on HF - Embeddings head to head (HF)

    * Arcee AI announces DistillKit - (X, Blog, Github)

    * AI Art & Diffusion & 3D

    * Black Forest Labs - FLUX new SOTA diffusion models (X, Blog, Try It)

    * Midjourney 6.1 update - greater realism + potential Grok integration (X)

    * Big CO LLMs + APIs

    * Google updates Gemini 1.5 Pro with 0801 release and is #1 on LMsys arena (X)

    * OpenAI started alpha GPT-4o voice mode (examples)

    * OpenAI releases SearchGPT (Blog, Comparison w/ PPXL)

    * Apple releases beta of iOS 18.1 with Apple Intelligence (X, hands on, Intents )

    * Apple released a technical paper of apple intelligence

    * This weeks Buzz

    * AI Salons in SF + New Weave course for WandB featuring yours truly!

    * Vision & Video

    * Runway ML adds Gen -3 image to video and makes it 7x faster (X)

    * Tools & Hardware

    * Avi announces friend.com

    * Jeremy Howard releases FastHTML (Site, Video)

    * Applied LLM course from Hamel dropped all videos

    Open Source

    It feels like everyone and their grandma is open sourcing incredible AI this week! Seriously, get ready for segment-anything-you-want + real-time-video capability PLUS small AND powerful language models.

    Meta Gives Us SAM-2: Segment ANYTHING Model in Images & Videos... With One Click!

    Hold on to your hats, folks! Remember Segment Anything, Meta's already-awesome image segmentation model? They've just ONE-UPPED themselves. Say hello to SAM-2 - it's real-time, promptable (you can TELL it what to segment), and handles VIDEOS like a champ. As I said on the show: "I was completely blown away by segment anything 2".

    But wait, what IS segmentation? Basically, pixel-perfect detection - outlining objects with incredible accuracy. My guests, the awesome Piotr Skalski and Joseph Nelson (computer vision pros from Roboflow), broke it down historically, from SAM 1 to SAM 2, and highlighted just how mind-blowing this upgrade is.

    "So now, Segment Anything 2 comes out. Of course, it has all the previous capabilities of Segment Anything ... But the segment anything tool is awesome because it also can segment objects on the video". - Piotr Skalski

    Think about Terminator vision from the "give me your clothes" bar scene: you see a scene, instantly "understand" every object separately, AND track it as it moves. SAM-2 gives us that, allowing you to click on a single frame, and BAM - perfect outlines that flow through the entire video! I played with their playground, and you NEED to try it - you can blur backgrounds, highlight specific objects... the possibilities are insane. Playground Link

    In this video, Piotr annotated only the first few frames of the top video, and SAM understood the bottom two shot from 2 different angles!

    Okay, cool tech, BUT why is it actually USEFUL? Well, Joseph gave us incredible examples - from easier sports analysis and visual effects (goodbye manual rotoscoping) to advances in microscopic research and even galactic exploration! Basically, any task requiring precise object identification gets boosted to a whole new level.

    "SAM does an incredible job at creating pixel perfect outlines of everything inside visual scenes. And with SAM2, it does it across videos super well, too ... That capability is still being developed for a lot of AI Models and capabilities. So having very rich ability to understand what a thing is, where that thing is, how big that thing is, allows models to understand spaces and reason about them" - Joseph Nelson

    AND if you combine this power with other models (like Piotr is already doing!), you get zero-shot segmentation - literally type what you want to find, and the model will pinpoint it in your image/video. It's early days, but get ready for robotics applications, real-time video analysis, and who knows what else these clever hackers are dreaming up! ๐Ÿคฏ

    Check out Piotr's Zero Shot Florence + Sam2 Implementation
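    If you want to play with single-image prediction yourself, the repo's Python interface is pretty small. A minimal sketch based on the usage shown in the segment-anything-2 README at the time of writing (the checkpoint and config names may differ in your checkout, so verify them):

    ```python
    import numpy as np
    from PIL import Image
    from sam2.build_sam import build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    # Config/checkpoint names follow the README at release - verify against your download
    predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

    image = np.array(Image.open("frame.jpg").convert("RGB"))
    predictor.set_image(image)

    # A single positive click (x, y) is enough of a prompt to get masks back
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[450, 300]]),
        point_labels=np.array([1]),  # 1 = foreground click, 0 = background
    )
    print(masks.shape, scores)  # a few candidate masks with confidence scores
    ```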

    Best of all? Apache 2 license, baby! As Joseph said, "Open source is foundational to making the accessibility, the use cases, and the advancement of the field overall", and this is a prime example. Huge kudos to Meta for empowering us with this tech.

    The whole conversation w/ Piotr & Joseph is very much worth listening to on the pod ๐ŸŽ™๏ธ

    Google Throws Down The Gauntlet: Open Sourcing GemMA 2 2.6B

    It was Meta vs. Google on Monday, because NOT to be outdone, Google also went on an open-sourcing spree. This time, they gifted us Gemma 2 2B (a 2.6 billion parameter powerhouse), alongside a safety-focused suite called ShieldGemma AND a transparency tool called GemmaScope.

    So what makes Gemma 2 special? First off, it's optimized for on-device use, meaning super-efficient local running. BUT there's a catch, folks... They claim it beats Mixtral AND Llama 2 70B on the LMsys Arena leaderboard, with an ELO score of 1126. Hold on, a 2 billion parameter model outperforming the big boys? ๐Ÿคจ As LDJ (one of my regular co-hosts) said on the show:

    "Yeah, I think my best theory here is... there's at least two or three variables at play ... In LMSys, people are much more likely to do single turn, and within LMSys, people will usually be biased more towards rating models with a more recent knowledge cutoff as higher".

    Translation? It might be gaming the system a bit, but either way, Gemma 2 is an exciting release - super fast, small enough for on-device applications, and coming with safety tools right out of the gate! I think Xenova (our Hugging Face wizard) is already running this on WebGPU! You NEED to try it out.

    Gemma 2 HF Link

    And GemmaScope? That's some cool, cool stuff too. Think about peeking inside the "brain" of the model - you can actually SEE how Gemma 2 processes information. Remember Anthropic Mechinterp? It's like that, giving us unprecedented transparency into how these systems actually "think". You gotta see it on Neuronpedia. Neuronpedia link

    It's Meta versus Google - round one, FIGHT! ๐ŸฅŠ

    Distilling Knowledge: Arcee AI Drops DistillKit!

    Just when I thought the week was done throwing surprises, Arcee AI casually dropped DistillKit - an open source tool to build distilled language models. Now, this is some NEXT level stuff, folks. We talked with Lukas Atkins and Fernando (the brilliant minds behind DistillKit), and I finally learned what the heck "distillation" really means.

    "TLDR - we teach a smaller model to think like a bigger model"

    In a nutshell: teach a smaller model how to think like a larger one. Think GPT-4o and GPT-4o Mini, where the smaller model supposedly got the "essence" of the bigger version. Or imagine a tiny Llama that inherited the smarts of 405B - ridiculous! ๐Ÿคฏ As Fernando eloquently put it:

    So in the finetuning that we have been doing, just in terms of generating text instructions and so on, we were observing only the token that was generated from the teacher model. And now with the distillation, we are observing the whole distribution of the tokens that could be sampled

    Now I admit, even after Fernando's expert breakdown, my brain still kind of melted. ๐Ÿซ  BUT, here's why this matters: distilled models are super efficient, saving on cost and resources. Imagine powerful AI that runs seamlessly on your phone! ๐Ÿคฏ Arcee is making this possible for everyone.
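    To make "observing the whole distribution" a bit more concrete, here's a minimal sketch of plain logit-level distillation in PyTorch - just the textbook technique, not DistillKit's actual internals:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then push the student's
    # token distribution towards the teacher's with KL divergence.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2

# Toy example: 4 token positions, 32k-token vocab, random logits
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
loss = distillation_loss(student, teacher)
loss.backward()  # gradients flow only into the student
```

    In a real run you'd mix this with the regular next-token loss and feed both models the same batches, but the core idea is exactly what Fernando describes: the student sees the teacher's full distribution, not just the single sampled token.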

    Check Out DistillKit Here

    Was it pure coincidence they released this on the same week as the Llama 3.1 LICENSE CHANGE (Zuckerberg is clearly watching ThursdAI...), which makes distillation perfectly legal?

    It's wild, exciting, AND I predict a massive surge in smaller, specialized AI tools that inherit the intelligence of the big boys.

    This week's buzz

    Did I already tell you that someone came up to me and said, hey, you're from Weights & Biases, you're the guys who make the courses, right? ๐Ÿ˜‚ I said, well yeah, we have a bunch of free courses on wandb.courses, but we also have world leading ML experiment tracking software and an LLM observability toolkit, among other things. It was really funny that he thought we were just a courses company!

    Well, this last week, my incredible colleague Agata, who's in charge of our courses, took the initiative and stitched together a course about Weave from a bunch of videos that I had already recorded! It's awesome, please check it out if you're interested in learning about Weave ๐Ÿ‘

    P.S - we are also starting a series of AI events in our SF office called AI Salons, the first one is going to feature Shreya Shankar, and focus on evaluations, it's on August 15th, so if you're in SF, you're invited for free as a ThursdAI subscriber! Get free tickets

    Big Co AI - LLMs & APIs

    Not only was open source popping off, but those walled-garden mega corps wanted in on the action too! SearchGPT, anyone?

    From Whispers to Reality: OpenAI Alpha Tests GPT-4 Voice (and IT'S WILD)

    This was THE moment I waited for, folks - GPT-4 with ADVANCED VOICE is finally trickling out to alpha users. Did I get access? NO. ๐Ÿ˜ฉ But my new friend, Cristiano Giardina, DID and you've probably seen his viral videos of this tech - they're blowing up MY feed, even Sam Altman retweeted the above one! I said on the show, this new voice "feels like a big next unlock for AI"

    What sets this apart from the "regular" GPT-4 voice we have now? As Cristiano told us:

    "the biggest difference is that the emotion , and the speech is very real and it follows instructions regarding emotion very well, like you can ask it to speak in a more animated way, you can ask it to be angry, sad, and it really does a good job of doing that."

    We did a LIVE DEMO (it worked, thank God), and y'all... I got CHILLS. We heard counting with a breath, depressed Soviet narrators, even a "GET TO THE CHOPPA" Schwarzenegger moment that still makes me laugh ๐Ÿ˜‚ It feels like a completely different level of interaction, something genuinely conversational and even emotional. Check out Cristiano's profile for more insane demos - you won't be disappointed. Follow Cristiano Here For Amazing Voice Mode Videos

    Can't wait for access, if anyone from OpenAI is reading this, hook me up ๐Ÿ™ I'll trade my SearchGPT access!

    SearchGPT: OpenAI Throws Their Hat Into The Ring (again?)

    Did OpenAI want to remind everyone they're STILL here amidst the LLama/Mistral frenzy? Maybe that's why they released SearchGPT - their newest "search engine that can hold a conversation" tool. Again, waitlisted, but unlike with voice mode... I got access. ๐Ÿ˜…

    The good: Fast. Really fast. And impressively competent, considering it's still a demo. Handles complex queries well, and its "follow-up" ability blows even Perplexity out of the water (which is impressive).

    The less-good: Still feels early, especially for multi-language and super local stuff. Honestly, feels more like a sneak peek of an upcoming ChatGPT integration than a standalone competitor to Google.

    But either way, it's an interesting development - as you may have already learned from my full breakdown of SearchGPT vs. Perplexity

    Apple Intelligence is here! (sort of)

    And speaking of big companies, how could I not mention the Apple Intelligence release this week? Apple finally dropped iOS 18.1 with the long-awaited ON-DEVICE intelligence, powered by the Apple Foundational Model (AFM). Privacy nerds rejoice! ๐ŸŽ‰

    But how good is it? Mixed bag, I'd say. It's there, and definitely usable for summarization, rewriting tools, text suggestions... but Siri STILL isn't hooked up to it yet, tho speech to text is way faster and she does look more beautiful. ๐Ÿค” Apple did release a ridiculously detailed paper explaining how they trained this model on Apple silicon... and as Nisten (ever the voice of honesty) said on the show,

    "It looks like they've stacked a lot of the tricks that had been working ... overall, they're not actually really doing anything new ... the important thing here is how they apply it all as a system that has access to all your personal data."

    Yeah, ouch, BUT still exciting, especially as we get closer to truly personal, on-device AI experiences. Right now, it's less about revolutionary advancements, and more about how Apple can weave this into our lives seamlessly - they're focusing heavily on app intents, meaning AI that can actually DO things for you (think scheduling appointments, drafting emails, finding that photo buried in your library). I'll keep testing this; the more I play around, the more I find, like how it suddenly started suggesting replies in Messages for me. I haven't yet seen the filtered notifications view, where it smartly only lets important messages through your focus mode.

    So stay tuned but it's likely not worth the beta iOS upgrade if you're not a dev or a very strong enthusiast.

    Wait, MORE Breaking News?? The AI World Doesn't Sleep!

    If this episode wasn't already enough... the very day of the live show, as we're chatting, I get bombarded with breaking news alerts from my ever-vigilant listeners.

    1. Gemini 1.5 Pro 0801 - Now #1 on LMsys Arena! ๐Ÿคฏ Google apparently loves to ship big right AFTER I finish recording ThursdAI (this happened last week too!). Gemini's new version, released WHILE we were talking about older Gemini versions, claimed the top spot with an insane 1300 ELO score - crushing GPT-4 and taking home 1st place in Math, Instruction Following, and Hard Prompts! It's experimental, it's up on Google AI Studio... Go play with it! (and then tag me with your crazy findings!)

    And you know what? Some of this blog was drafted by this new model. In fact, I sent the same prompt to Claude Sonnet 3.5 and Mistral Large v2 (and I tried LLama 3.1 405B but couldn't find any services that host the full context window), and this Gemini just absolutely demolished all of them on tone and on imitating my style; it even took some of the links from my TL;DR and incorporated them into the draft on its own! I've never seen any other model do that! I haven't used any LLMs so far for this blog besides proof-reading because, well, they all kinda sucked, but damn, I dare you to try and find out where in this blog it was me and where it was Gemini.

    2. GitHub Does a Hugging Face: Introducing GitHub Models!

    This dropped just as we wrapped - basically a built-in marketplace where you can try, test, and deploy various models right within GitHub! They've already got LLaMa, Mistral, and some Azure-hosted GPT-4o stuff - very intriguing... Time will tell what Microsoft is cooking here, but you can bet I'll be investigating!๐Ÿ•ต๏ธ

    AI Art & Diffusion

    New Stability: Black Forest Labs and FLUX.1 Rise!

    Talk about a comeback story: 14 ex-Stability AI pros, led by Robin Rombach, Andreas Blattmann & Patrick Esser (the OG creators of Stable Diffusion), raised $31 million in funding from a16z and are back to make diffusion dreams come true. Enter Black Forest Labs. Their first gift? FLUX.1 - a suite of text-to-image models so good, they're breaking records. I saw those demos and wow. It's good, like REALLY good. ๐Ÿคฏ

    Try it out here

    And the real bomb? They're working on open-source TEXT-TO-VIDEO! That's right, imagine generating those mind-blowing moving visuals... with code anyone can access. It's in their "Up Next" section, so watch that space - it's about to get REAL interesting.

    Also... Midjourney 6.1 came out, and it looks GOOD

    And you can see a comparison between the two new leading models in this thread by Noah Hein

    Tools & Hardware: When AI Gets Real (And Maybe Too Real...)

    You knew I had to close this madness out with some Hardware, because hardware means that we actually are interacting with these incredible models in a meaningful way.

    Friend.com: When Your AI Is... Always Listening? ๐Ÿคจ

    And then this happened... Avi Schiffmann (finally) announces friend.com, with an amazingly dystopian promo video from Sandwich. ~22 million views and counting, not by accident! Link to Video.

    It's basically an always-on, listening pendant. "A little like wearing a wire" as Nisten so eloquently put it. ๐ŸŽง Not for memory extension or productivity... for friendship. Target audience? Lonely people who want a device that captures and understands their entire lives, but in an almost comforting way (or maybe unsettling, depending on your viewpoint!). The debate about privacy is already RAGING... But as Nisten pointed out:

    "Overall, it is a positive. ...The entire privacy talk and data ownership, I think that's a very important conversation to have".

    I kinda get the vision. Combine THIS tech with GPT-4 Voice speed... you could actually have engaging conversations, 24/7! ๐Ÿคฏ I don't think it's as simple as "this is dystopian, end of story". Character AI is EXPLODING right now, remember those usage stats, over 20 million users and counting? The potential to help with loneliness is real...

    The Developer Corner: Tools for Those Hacking This Future

    Gotta love these shoutouts:

    * FastHTML from Jeremy Howard: Not strictly AI, but if you hate JS and love Python, this one's for you - insanely FAST web dev using a mind-bending new syntax (see the tiny sketch after this list). FastHTML website link

    * Hamel Husain's Applied LLM Course - All Videos NOW Free!: Want to learn from some of the best minds in the field (including Jeremy Howard, evaluation QUEEN Shreya Shankar, Charles Frye and tons of other great speakers)? This course covers it all - from finetuning to RAG building to optimizing your prompts. Applied LLMs course - free videos link
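    Here's the tiny FastHTML sketch I promised above - roughly the hello-world style from the published examples; treat the exact helper names (fast_app, serve, the FT components) as assumptions and double-check the FastHTML docs:

```python
# Minimal FastHTML-style app (helper names assumed from the published examples)
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    # HTML is expressed as plain Python objects - no JS required
    return Titled("Hello ThursdAI", P("Pure Python web dev, and it's FAST."))

serve()  # starts a local dev server
```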

    AND ALSO ... Nisten blew everyone's minds again in the end! Remember last week, we thought it'd take time before anyone could run Llama 3.1 405B on just CPU? Well, this crazy genius already cracked the code - seven tokens per second on a normal CPU! ๐Ÿคฏ If you're a researcher who hates using cloud GPUs (or wants to use ALL THOSE CORES in your Lambda machine, wink wink)... get ready.

    Look, I'm not going to sit here and pretend that the weeks are not getting crazier; it takes me longer and longer to prep for the show, it's harder and harder to contain the show to 2 hours, and we had 3 breaking news stories just today!

    So we're accelerating, and I'll likely be using a bit of support from AI, but only if it's good, and only if it's proofread by me, so please let me know if you smell slop! I really wanna know!




  • Holy s**t, folks! I was off for two weeks, and last week OpenAI released GPT-4o-mini and everyone was in my mentions saying, Alex, how are you missing this?? I'm so glad I missed that week and not this one, because while GPT-4o-mini is incredible (a GPT-4o level distill with incredible speed and almost 99% cost reduction from 2 years ago?), it's not open source.

    So welcome back to ThursdAI, and buckle up because we're diving into what might just be the craziest week in open-source AI since... well, ever!

    This week, we saw Meta drop LLAMA 3.1 405B like it's hot (including updated 70B and 8B), Mistral joining the party with their Large V2, and DeepSeek quietly updating their coder V2 to blow our minds. Oh, and did I mention Google DeepMind casually solving math Olympiad problems at a silver medal level ๐Ÿฅˆ? Yeah, it's been that kind of week.

    TL;DR of all topics covered:

    * Open Source

    * Meta LLama 3.1 updated models (405B, 70B, 8B) - Happy LLama Day! (X, Announcement, Zuck, Try It, Try it Faster, Evals, Provider evals)

    * Mistral Large V2 123B (X, HF, Blog, Try It)

    * DeepSeek-Coder-V2-0724 update (API only)

    * Big CO LLMs + APIs

    * ๐Ÿฅˆ Google Deepmind wins silver medal at Math Olympiad - AlphaGeometry 2 (X)

    * OpenAI teases SearchGPT - their reimagined search experience (Blog)

    * OpenAI opens GPT-4o-mini finetunes + 2 month free (X)

    * This weeks Buzz

    * I compare 5 LLama API providers for speed and quantization using Weave (X)

    * Voice & Audio

    * Daily announces a new open standard for real time Voice and Video RTVI-AI (X, Try it, Github)

    Meta LLAMA 3.1: The 405B Open Weights Frontier Model Beating GPT-4 ๐Ÿ‘‘

    Let's start with the star of the show: Meta's LLAMA 3.1. This isn't just a 0.1 update; it's a whole new beast. We're talking about a 405 billion parameter model that's not just knocking on GPT-4's door โ€“ it's kicking it down.

    Here's the kicker: you can actually download this internet scale intelligence (if you have 820GB free). That's right, a state-of-the-art model beating GPT-4 on multiple benchmarks, and you can click a download button. As I said during the show, "This is not only refreshing, it's quite incredible."

    Some highlights:

    * 128K context window (finally!)

    * MMLU score of 88.6

    * Beats GPT-4 on several benchmarks like IFEval (88.6%), GSM8K (96.8%), and ARC Challenge (96.9%)

    * Has Tool Use capabilities (also beating GPT-4) and is Multilingual (ALSO BEATING GPT-4)

    But that's just scratching the surface. Let's dive deeper into what makes LLAMA 3.1 so special.

    The Power of Open Weights

    Mark Zuckerberg himself dropped an exclusive interview with our friend Rowan Cheung from Rundown AI. And let me tell you, Zuck's commitment to open-source AI is no joke. He talked about distillation, technical details, and even released a manifesto on why open AI (the concept, not the company) is "the way forward".

    As I mentioned during the show, "The fact that this dude, like my age, I think he's younger than me... knows what they released to this level of technical detail, while running a multi billion dollar company is just incredible to me."

    Evaluation Extravaganza

    The evaluation results for LLAMA 3.1 are mind-blowing. We're not just talking about standard benchmarks here. The model is crushing it on multiple fronts:

    * MMLU (Massive Multitask Language Understanding): 88.6%

    * IFEval (Instruction Following): 88.6%

    * GSM8K (Grade School Math): 96.8%

    * ARC Challenge: 96.9%

    But it doesn't stop there. The fine folks at Meta also, for the first time, added new categories like Tool Use (BFCL 88.5) and Multilinguality (Multilingual MGSM 91.6) (not to be confused with MultiModality, which is not yet here, but soon)

    Now, these are official evaluations from Meta themselves, which, as we know, often don't really represent the quality of the model, so let's take a look at other, more vibey results, shall we?

    On SEAL leaderboards from Scale (held back so can't be trained on) LLama 405B is beating ALL other models on Instruction Following, getting 4th at Coding and 2nd at Math tasks.

    On MixEval (the eval that approximates LMsys with 96% accuracy), my colleagues Ayush and Morgan got a whopping 66%, placing 405B just after Claude Sonnet 3.5 and above GPT-4o

    And there are more evals that all tell the same story, we have a winner here folks (see the rest of the evals in my thread roundup)

    The License Game-Changer

    Meta didn't just release a powerful model; they also updated their license to allow for synthetic data creation and distillation. This is huge for the open-source community.

    LDJ highlighted its importance: "I think this is actually pretty important because even though, like you said, a lot of people still train on OpenAI outputs anyways, there's a lot of legal departments and a lot of small, medium, and large companies that they restrict the people building and fine-tuning AI models within that company from actually being able to build the best models that they can because of these restrictions."

    This update could lead to a boom in custom models and applications across various industries as companies can start distilling, finetuning and creating synthetic datasets using these incredibly smart models.

    405B: A Double-Edged Sword

    While the 405B model is incredibly powerful, it's not exactly practical for most production use cases, as you need 2 nodes of 8 H100s to run it in full precision. And despite the fact that pricing wars have already started, and we see inference providers at as low as $2.7/1M tokens, this hardly makes sense when GPT-4o mini is 15 cents per million tokens.

    However, this model shines in other areas:

    * Synthetic Data Generation & Distillation: Its power and the new license make it perfect for creating high-quality training data and using it to train smaller models

    * LLM as a Judge: The model's reasoning capabilities make it an excellent candidate for evaluating other AI outputs.

    * Research and Experimentation: For pushing the boundaries of what's possible in AI.

    The Smaller Siblings: 70B and 8B

    While the 405B model is grabbing headlines, don't sleep on its smaller siblings. The 70B and 8B models got significant upgrades too.

    The 70B model saw impressive gains:

    * MMLU: 80.9 to 86

    * IFEval: 82 to 87

    * GPQA: 39 to 46

    The 8B model, in particular, could be a hidden gem. As Kyle Corbitt from OpenPipe discovered, a fine-tuned 8B model could potentially beat a prompted GPT-4o Mini in specific tasks.

    No multi-modality

    While Meta definitely addressed everything we had asked for after the Llama 3 release (context window, incredible performance, multi-linguality, tool use), we still haven't seen multi-modality with Llama. We still can't show it pictures or talk to it!

    However, apparently they have trained it to be multi-modal as well but haven't yet released those weights; they went into this in great detail in the paper and even showed a roadmap, stating that they will release it soon-ish (not in the EU though)

    This Week's Buzz: Weave-ing Through LLama Providers

    In the spirit of thorough evaluation, I couldn't resist putting LLAMA 3.1 through its paces across different providers. Using Weights & Biases Weave (https://wandb.me/weave), our evaluation and tracing framework for LLMs, I ran a comparison between various LLAMA providers.

    Here's what I found:

    * Different providers are running the model with varying optimizations (VLLM, FlashAttention3, etc.)

    * Some are serving quantized versions, which can affect output style and quality

    * Latency and throughput vary significantly between providers

    The full results are available in a Weave comparison dashboard, which you can check out for a deep dive into the nuances of model deployment, and the code is up on GitHub if you want to verify this yourself or see how easy this is to do with Weave.
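    For flavor, here's a minimal sketch of how a comparison like this gets wired up with Weave; the provider base URLs and model ids below are placeholders, not the actual providers I tested:

```python
import weave
from openai import OpenAI

weave.init("llama-31-provider-comparison")  # placeholder project name

# Hypothetical OpenAI-compatible providers serving Llama 3.1 70B;
# base URLs and model ids are placeholders.
PROVIDERS = {
    "provider_a": ("https://provider-a.example/v1", "llama-3.1-70b-instruct"),
    "provider_b": ("https://provider-b.example/v1", "llama-3.1-70b-instruct"),
}

@weave.op()
def ask(provider: str, prompt: str) -> str:
    base_url, model = PROVIDERS[provider]
    client = OpenAI(base_url=base_url, api_key="YOUR_KEY")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

for name in PROVIDERS:
    ask(name, "Summarize the Llama 3.1 release in one sentence.")
```

    Because every ask call is a traced op, Weave records inputs, outputs and latency per provider, which is what makes the side-by-side comparison possible.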

    Mistral Crashes the Party with Large V2 123B model (X, HF, Blog, Try It)

    Just when we thought Meta had stolen the show, Mistral AI decided to drop their own bombshell: Mistral Large V2. This 123 billion parameter dense model is no joke, folks. With an MMLU score of 84.0, 128K context window and impressive performance across multiple benchmarks, it's giving LLAMA 3.1 a run for its money, especially in some coding tasks while being optimized to run on a single node!

    Especially interesting is the function calling, on which they claim SOTA without telling us which metric they used (or comparing to Llama 3.1), but they are saying that they now support parallel and sequential function calling!

    DeepSeek updates DeepSeek Coder V2 to 0724

    While everyone was busy gawking at Meta and Mistral, DeepSeek quietly updated their coder model, and holy smokes, did they deliver! DeepSeek Coder v2 is now performing at GPT-4 and Claude 3.5 Sonnet levels on coding tasks. As Junyang Lin noted during our discussion, "DeepSeek Coder and DeepSeek Coder v2 should be the state of the art of the code-specific model."

    Here's the result from BigCodeBench

    and from Aider Chat (code editing dashboard)

    But it's not just about raw performance. DeepSeek is bringing some serious innovation to the table. They've added JSON mode, function calling, and even a fill-in-the-middle completion feature in beta. Plus, they've bumped up their max token generation to 8K. And let's talk about that API pricing - it's ridiculously cheap, at 14c / 1M tokens!

    We're talking about costs that are competitive with GPT-4o Mini, but with potentially better performance on coding tasks. It's a game-changer for developers and companies looking to integrate powerful coding AI without breaking the bank.
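    If you want to kick the tires, here's a minimal sketch of hitting the JSON mode through their OpenAI-compatible API; the base URL and model id are assumptions from DeepSeek's docs at the time, so double-check them before relying on this:

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model id - verify against DeepSeek's docs
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-coder",
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'language' and 'snippet'."},
        {"role": "user", "content": "Write a one-line hello world in Python."},
    ],
)
print(resp.choices[0].message.content)
```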

    Google DeepMind's Math Wizardry: From Silver Medals to AI Prodigies

    Just when we thought this week couldn't get any crazier, Google DeepMind decides to casually drop a bombshell that would make even the most decorated mathletes sweat. They've created an AI system that can solve International Mathematical Olympiad (IMO) problems at a silver medalist level. I mean, come on! As if the AI world wasn't moving fast enough, now we've got silicon-based Math Olympians?

    This isn't just any run-of-the-mill calculator on steroids. We're talking about a combination of AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an upgraded version of their previous system. These AI math whizzes tackled this year's six IMO problems, covering everything from algebra to number theory, and managed to solve four of them. That's 28 points, folks - enough to bag a silver medal if it were human!

    But here's where it gets really interesting. For non-geometry problems, AlphaProof uses the Lean theorem prover, coupling a pre-trained language model with the same AlphaZero reinforcement learning algorithm that taught itself to crush humans at chess and Go. And for geometry? They've got AlphaGeometry 2, a neuro-symbolic hybrid system powered by a Gemini-based language model. It's like they've created a math genius that can not only solve problems but also explain its reasoning in a formal, verifiable way.
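    To give a feel for what "formal, verifiable" means here, this is a toy Lean 4 statement and proof - nothing remotely like an IMO problem, just the flavor of the language AlphaProof reportedly works in:

```lean
-- A toy theorem: the statement is machine-checkable, and `rfl`
-- (definitional reflexivity) is the entire proof - no trust required.
theorem two_plus_two : 2 + 2 = 4 := rfl
```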

    The implications here are huge, folks. We're not just talking about an AI that can do your homework; we're looking at a system that could potentially advance mathematical research and proof verification in ways we've never seen before.

    OpenAI takes on Google, Perplexity (and Meta's ownership of this week) with SearchGPT waitlist (Blog)

    As I write these words, Sam posts a tweet, saying that they are launching SearchGPT, their new take on search, and as I click, I see a waitlist ๐Ÿ˜… But still, this looks so sick, just look:

    RTVI - new open standard for real time Voice and Video RTVI-AI (X, Github, Try it)

    Ok this is also great and can't be skipped, even tho this week was already insane. These models are great to text with but we want to talk to them, and while we all wait for GPT-4 Omni with voice to actually ship, we get a new contender that gives us an open standard and a killer demo!

    Daily + Groq + Cartesia + a lot of other great companies have released this incredible demo (which you can try yourself here) and an open source standard to deliver something like a GPT-4o experience with incredible end to end latency, which feels like almost immediate responses.

    While we've chatted with Moshi previously, which has these capabilities in the same model, the above uses LLama 3.1 70B even, which is an actual production grade LLM, and that's a significant difference from what Moshi offers. ๐Ÿ”ฅ

    Ok holy s**t, did I actually finish the writeup for this insane week? This was indeed one of the craziest weeks in Open Source AI, I honestly did NOT expect this to happen but I'm so excited to keep playing with all these tools, but also to see how the amazing open source community of finetuners will meet all these LLamas. Which I'm sure I'll be reporting on from now on until the next huge big AI breakthrough!

    Till then, see you next week, if you're listening to the podcast, please give us 5 stars on Apple podcast / Spotify? It really does help, and I'll finish with this:

    IT'S SO GOOD TO BE BACK! ๐Ÿ˜‚๐Ÿซก



  • Hey all, Alex here... well, not actually here, I'm scheduling this post in advance, which I haven't done before, because I'm going on vacation!

    That's right, next week is my birthday ๐ŸŽ‰ and a much needed break, somewhere with a beach, is awaiting, but I didn't want to leave you hanging for too long, so I'm posting this episode with some amazing, previously unreleased material.

    Mixture of Agents x2

    Back in the far away days of June 20th (not that long ago but feels like ages!), Together AI announced a new paper, released code and posted a long post about a new method of collaboration between smaller models to beat larger models. They called it Mixture of Agents, and James Zou joined us to chat about that effort.

    Shortly after that - in fact, during the live ThursdAI show - Kyle Corbitt announced that OpenPipe had also researched an approach similar to the above, using different models and a bit of different reasoning, and also went after the coveted AlpacaEval benchmark, achieving a SOTA score of 68.8 with this method.

    And I was delighted to invite both James and Kyle to chat about their respective approaches the same week that both broke the AlpacaEval SOTA, and to hear how utilizing collaboration between LLMs can significantly improve their outputs!
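    For intuition, here's a minimal sketch of the general mixture-of-agents pattern (several "proposer" models each draft an answer, and an aggregator synthesizes a final one). This is not Together's or OpenPipe's exact pipeline, and the model ids are placeholders:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY")  # any OpenAI-compatible endpoint works

PROPOSERS = ["model-a", "model-b", "model-c"]  # placeholder model ids
AGGREGATOR = "model-aggregator"                # placeholder model id

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def mixture_of_agents(question: str) -> str:
    # 1) several smaller "proposer" models each draft an answer
    drafts = [ask(m, question) for m in PROPOSERS]
    # 2) an aggregator model reads all drafts and synthesizes a better final answer
    numbered = "\n\n".join(f"Answer {i+1}:\n{d}" for i, d in enumerate(drafts))
    agg_prompt = (
        "You are given several candidate answers to the same question. "
        "Combine their strengths into one improved answer.\n\n"
        f"Question: {question}\n\n{numbered}"
    )
    return ask(AGGREGATOR, agg_prompt)
```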

    This week's buzz - what I learned at W&B this week

    So much buzz this week from the Weave team, it's hard to know what to put in here. I can start with the incredible integrations my team landed: Mistral AI, LlamaIndex, DSPy, OpenRouter, and even local models served by Ollama, LM Studio and llamafile can now be auto-tracked with Weave, which means you literally only have to instantiate Weave and it'll auto track everything for you!

    But I think the biggest, hugest news from this week is the great eval comparison system that the Weave team just pushed; it's honestly so feature rich that I'll have to do a deeper dive on it later, but I wanted to make sure I include at least a few screencaps because I think it looks fantastic!

    Open Router - A unified interface for LLMs

    I've been a long time fan of OpenRouter.ai and I was very happy to have Alex Atallah on the show to talk about Open Router (even if this did happen back in April!), and I'm finally satisfied enough with the sound quality to release this conversation.

    Open Router is serving both foundational models like GPT, Claude, Gemini and also Open Source ones, and supports the OpenAI SDK format, making it super simple to play around and evaluate all of them on the same code. They even provide a few models for free! Right now you can use Phi for example completely free via their API.
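    In practice it looks roughly like this; the model id is a placeholder, so browse openrouter.ai for the current list (including the free ones):

```python
# Minimal sketch of using OpenRouter through the OpenAI SDK format
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

resp = client.chat.completions.create(
    model="some-provider/some-free-model",  # placeholder model id
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(resp.choices[0].message.content)
```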

    Alex goes deep into areas of Open Router that I honestly didn't really know about, like it being a marketplace, knowing what trendy LLMs are being used by people in near real time (check out WebSim!), and other very interesting things!

    Give that conversation a listen, I'm sure you'll enjoy it!

    That's it folks, no news this week. I would instead like to recommend a new newsletter by friends of the pod Tanishq Abraham and Aran Komatsuzaki, both of whom are doing a weekly paper X space and recently started posting it on Substack as well!

    It's called AI papers of the week, and if you're into papers, which we don't usually cover, there's no better duo! In fact, Tanishq often used to come to ThursdAI to explain papers so you may recognize his voice :)

    See you all in two weeks after I get some seriously needed R&R ๐Ÿ‘‹ ๐Ÿ˜Ž๐Ÿ–๏ธ



  • Hey everyone! Happy 4th of July to everyone who celebrates! I celebrated today by having an intimate conversation with 600 of my closest X friends ๐Ÿ˜‚ Joking aside, today is a celebratory episode, 52nd consecutive weekly ThursdAI show! I've been doing this as a podcast for a year now!

    Which means, there are some of you, who've been subscribed for a year ๐Ÿ˜ฎ Thank you! Couldn't have done this without you. In the middle of my talk at AI Engineer (I still don't have the video!) I had to plug ThursdAI, and I asked the 300+ audience who is a listener of ThursdAI, and I saw a LOT of hands go up, which is honestly, still quite humbling. So again, thank you for tuning in, listening, subscribing, learning together with me and sharing with your friends!

    This week, we covered a new (soon to be) open source voice model from KyutAI, a LOT of open source LLMs from InternLM, Cognitive Computations (Eric Hartford joined us), and Arcee AI (Lukas Atkins joined as well), and we have a deep dive into GraphRAG with Emil Eifrem, CEO of Neo4j (who shares why it was called Neo4j in the first place, and that he's a ThursdAI listener, whaaat? ๐Ÿคฏ). This is definitely a conversation you don't want to miss, so tune in, and read a breakdown below:

    TL;DR of all topics covered:

    * Voice & Audio

    * KyutAI releases Moshi - first ever 7B end to end voice capable model (Try it)

    * Open Source LLMs

    * Microsoft Updated Phi-3-mini - almost a new model

    * InternLM 2.5 - best open source model under 12B on Hugging Face (HF, Github)

    * Microsoft open sources GraphRAG (Announcement, Github, Paper)

    * OpenAutoCoder-Agentless - SOTA on SWE Bench - 27.33% (Code, Paper)

    * Arcee AI - Arcee Agent 7B - from Qwen2 - Function / Tool use finetune (HF)

    * LMsys announces RouteLLM - a new Open Source LLM Router (Github)

    * DeepSeek Chat got a significant upgrade (Announcement)

    * Nomic GPT4all 3.0 - Local LLM (Download, Github)

    * This weeks Buzz

    * New free Prompts course from WandB in 4 days (pre sign up)

    * Big CO LLMs + APIs

    * Perplexity announces their new pro research mode (Announcement)

    * X rolled out a "Grok Analysis" button, it was BAD in "fun mode", and then the rollout was paused

    * Figma pauses the rollout of their AI text to design tool "Make Design" (X)

    * Vision & Video

    * Cognitive Computations drops DolphinVision-72b - VLM (HF)

    * Chat with Emil Eifrem - CEO Neo4J about GraphRAG, AI Engineer

    Voice & Audio

    KyutAI Moshi - a 7B end to end voice model (Try It, See Announcement)

    Seemingly out of nowhere, another French AI juggernaut decided to drop a major announcement. The company, called KyutAI and backed by Eric Schmidt, calls itself "the first European private-initiative laboratory dedicated to open research in artificial intelligence" in a press release back in November of 2023, has quite a few rockstar co-founders (ex DeepMind, Meta AI), and has Yann LeCun on their science committee.

    This week they showed their first, and honestly quite mind-blowing release, called Moshi (Japanese for Hello, Moshi Moshi), which is an end to end voice and text model, similar to GPT-4o demos we've seen, except this one is 7B parameters, and can run on your mac!

    While the utility of the model right now is not the greatest, not remotely close to anything resembling the amazing GPT-4o (which was demoed live to me and all of AI Engineer by Romain Huet), Moshi shows very, very impressive stats!

    Built by a small team during only 6 months or so of work, they have trained an LLM (Helium 7B), an audio codec (Mimi), a Rust inference stack and a lot more, to give insane performance.

    Model latency is 160ms and mic-to-speakers latency is 200ms, which is so fast it seems like it's too fast. The demo often responds faster than I'm able to finish my sentence, and it results in an uncanny, "reading my thoughts" type feeling.

    The most important part is this though, a quote from KyutAI's post after the announcement:

    Developing Moshi required significant contributions to audio codecs, multimodal LLMs, multimodal instruction-tuning and much more. We believe the main impact of the project will be sharing all Moshi's secrets with the upcoming paper and open-source of the model.

    I'm really looking forward to how this tech can be applied to the incredible open source models we already have out there! Speaking to our LLMs is now officially here in open source, way before we actually got GPT-4o's voice mode, and it's exciting!

    Open Source LLMs

    Microsoft stealth updates Phi-3 Mini to make it almost a new model

    So stealth, in fact, that I didn't even have this update in my notes for the show, but thanks to the incredible community (Bartowski, Akshay Gautam) who made sure we didn't miss this, because it's so huge.

    The model used additional post-training data leading to substantial gains on instruction following and structure output. We also improve multi-turn conversation quality, explicitly support tag, and significantly improve reasoning capability

    The Phi-3 June update is quite significant across the board; just look at some of these scores: a 354.78% improvement in JSON structure output, 30% at GPQA

    But also specifically for coding, a 33→93 jump in Java coding, 33→73 in Typescript, 27→85 in Python! These are just incredible numbers, and I definitely agree with Bartowski here, there's enough here to call this a whole new model rather than a "seasonal update"

    Qwen-2 is the star of the show right now

    Week in and week out, ThursdAI seems to be the watercooler for the best finetuners in the community to come, hang, share notes, and announce their models. A month after Qwen-2 was announced on ThursdAI stage live by friend of the pod and Qwen dev lead Junyang Lin, and a week after it re-took number 1 on the revamped open LLM leaderboard on HuggingFace, we now have great finetunes on top of Qwen-2.

    Qwen-2 is the star of the show right now. Because there's no better model. This is like GPT 4 level. It's Open Weights GPT 4. We can do what we want with it, and it's so powerful, and it's multilingual, and it's everything, it's like the dream model. I love it

    Eric Hartford - Cognitive Computations

    We've had 2 model finetunes based on Qwen-2, and their authors, on the show this week. First was Lukas Atkins from Arcee AI (the company behind MergeKit); they released Arcee Agent, a 7B Qwen-2 finetune/merge specifically focused on tool use and function calling.

    We also had a chat with Eric Hartford from Cognitive Computations (which Lukas previously participated in) about Dolphin Vision, a 72B parameter VLM on top of Qwen-2 (trained by StableQuan, available on the Hub), and it's likely the biggest open source VLM that we've seen so far.

    The most exciting part about it, is Fernando Neta's "SexDrugsRockandroll" dataset, which supposedly contains, well.. a lot of uncensored stuff, and it's perfectly able to discuss and analyze images with mature and controversial content.

    InternLM 2.5 - SOTA open source under 12B with 1M context (HF, Github)

    The folks at Shanghai AI released InternLM 2.5 7B, and a chat version, along with a whopping 1M context window extension. These metrics are ridiculous, beating LLama-3 8B on literally every metric on the new HF leaderboard, and even beating Llama-3 70B on MATH and coming close on GPQA!

    The folks at Intern not only released a beast of a model, but also significantly improved tool use capabilities along with it, including their own agentic framework called Lagent, which comes with a Code Interpreter (Python execution), search capabilities, and of course the ability to plug in your own tools.

    How will you serve 1M context in production, you ask? Well, these folks ALSO open sourced LMDeploy, "an efficient, user-friendly toolkit designed for compressing, deploying, and serving LLM models", which has been around for a while but now supports this new model of course, and handles dynamic NTK, some offloading of context, etc'
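    Here's a minimal sketch of LMDeploy's Python quick-start flavor with the new model; the import and the HF model id follow their docs as I understand them, so treat both as assumptions:

```python
# Minimal LMDeploy sketch (assumed quick-start API and model id - check the repo)
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
responses = pipe(["In one line, what does dynamic NTK scaling buy you?"])
print(responses[0])
```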

    So an incredible model + tools release, can't wait to play around with this!


    This week's Buzz (What I learned with WandB this week)

    Hey, did you know we at Weights & Biases have free courses? While some folks ask you for a LOT of money for basic courses, at Weights & Biases, they are... you guessed it, completely free! And a lot of effort goes into recording and building the agenda, so I'm happy to announce that our "Developer's Guide to LLM Prompting" course is going to launch in 4 days!

    Delivered by my colleague Anish (who's just an amazing educator) and Teodora from AutogenAI, you will learn everything prompt building related, and even if you are a seasoned prompting pro, there will be something for you there! Pre-register for the course HERE

    Big CO LLMs + APIs

    How I helped roll back an XAI feature and Figma rolled back theirs

    We've covered Grok (with a K this time) from XAI multiple times, and while I don't use its chat interface that much, or the open source model, I do think they have a huge benefit in having direct access to real time data from the X platform.

    Given that I basically live on X (to be able to deliver all these news to you), I started noticing the long-promised Grok Analysis button show up under some posts, first on mobile, then on the web version of X.

    Of course I had to test it, and whoa, I was honestly shocked at just how unhinged and profanity laced the analysis was.

    Now I'm not easily shocked, I've seen jailbroken LLMs before, I tried to get chatGPT to say curse words multiple times, but it's one thing when you expect it and a completely different thing when a billion dollar company releases a product that answers... well, like this:

    Luckily Igor Babushkin (Co founder of XAI) noticed and the roll out was paused, so looks like I helped red team grok! ๐Ÿซก

    Figma pauses AI "make design" feature

    Another AI feature was paused by a big company after going viral on X (what is it about X specifically?) and this time it was Figma!

    In a super viral post, Andy Allen posted a video where he asks the new AI feature from Figma, called "Make Design", for a simple "weather app", and what he receives looks almost 100% identical to the iOS weather app!

    This was acknowledged by the CEO of Figma and almost immediately paused as well.

    GraphRAG... GraphRAG everywhere

    Microsoft released a pre-print paper called GraphRAG (2404.16130), which talks about utilizing LLMs to first build and then use graph databases to achieve better accuracy and performance for retrieval tasks such as "global questions directed at an entire text corpus"

    This week, Microsoft open sourced GraphRAG on Github ๐Ÿ‘ and I wanted to dive a little deeper into what this actually means, as this is a concept I hadn't heard of before last week, and suddenly it's everywhere.
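    To make the idea concrete, here's a toy sketch of the "build" half: ask an LLM to pull (subject, relation, object) triples out of raw text, then store them in a graph you can query later. The connection details and Cypher assume a local Neo4j instance and are purely illustrative, and the model id is a placeholder:

```python
from neo4j import GraphDatabase
from openai import OpenAI

llm = OpenAI(api_key="YOUR_KEY")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Ask an LLM to turn unstructured text into (subject, relation, object) triples."""
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model id
        messages=[{"role": "user", "content":
                   "Extract (subject, relation, object) triples, one per line, "
                   "fields separated by '|', from this text:\n" + text}],
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    return [tuple(p.strip() for p in line.split("|"))
            for line in lines if line.count("|") == 2]

def store(triples):
    # MERGE makes the load idempotent: nodes and edges are created only once
    with driver.session() as session:
        for subj, rel, obj in triples:
            session.run(
                "MERGE (a:Entity {name: $s}) "
                "MERGE (b:Entity {name: $o}) "
                "MERGE (a)-[:REL {type: $r}]->(b)",
                s=subj, r=rel, o=obj,
            )

store(extract_triples("Emil Eifrem is the CEO of Neo4j, a graph database company."))
```

    Retrieval then becomes graph queries (neighbors, paths, community summaries) instead of just nearest-neighbor chunks, which is exactly the "reason over the whole corpus" unlock Emil talks about below.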

    Last week during AI Engineer, the person who first explained this concept to me (and tons of other folks in the crowd at his talk) was Emil Eifrem, CEO of Neo4J and I figured he'd be the right person to explain the whole concept in a live conversation to the audience as well, and he was!

    Emil and I (and other folks in the audience) had a great, almost 40 minute conversation about the benefits of using Graph databases for RAG, how LLMs unlocked the ability to convert unstructured data into Graph linked databases, accuracy enhancements and unlocks like reasoning over the whole corpus of data, developer experience improvements, and difficulties / challenges with this approach.

    Emil is a great communicator, with a deep understanding of this field, so I really recommend listening to this deep dive.


    This is it for this week's newsletter, and a wrap on year 1 of ThursdAI as a podcast (this being our 52nd weekly release!)

    I'm going on vacation next week, but I will likely still send the TL;DR, so look out for that, and have a great independence day, and rest of your holiday weekend if you celebrate, and if you're not, I'm sure there will be cool AI things announced by the next time we meet ๐Ÿซก

    As always, appreciate your attention,

    Alex



  • Hey everyone, sending a quick one today, no deep dive, as I'm still in the middle of AI Engineer World's Fair 2024 in San Francisco (in fact, I'm writing this from the incredible floor 32 presidential suite, that the team here got for interviews, media and podcasting, and hey to all new folks who Iโ€™ve just met during the last two days!)

    It's been an incredible few days meeting so many ThursdAI community members, listeners and folks who came on the pod! The list honestly is too long but I've got to meet friends of the pod Maxime Labonne, Wing Lian, Joao Morra (crew AI), Vik from Moondream, Stefania Druga not to mention the countless folks who came up and gave high fives, introduced themselves, it was honestly a LOT of fun. (and it's still not over, if you're here, please come and say hi, and let's take a LLM judge selfie together!)

    On today's show, we recorded extra early because I had to run and play dress up, and boy am I relieved now that both the show and the talk are behind me, and I can go and enjoy the rest of the conference ๐Ÿ”ฅ (which I will bring you here in full once I get the recording!)

    On today's show, we had the awesome pleasure to have Surya Bhupatiraju who's a research engineer at Google DeepMind, talk to us about their newly released amazing Gemma 2 models! It was very technical, and a super great conversation to check out!

    Gemma 2 came out in 2 sizes, 9B and 27B parameter models, with 8K context (we addressed this on the show), and the 27B model's incredible performance is beating LLama-3 70B on several benchmarks and is even beating Nemotron 340B from NVIDIA!

    This model is also now available on the Google AI studio to play with, but also on the hub!

    We also covered the renewal of the HuggingFace open LLM leaderboard with their new benchmarks in the mix and normalization of scores, and how Qwen 2 is again the best model that's tested!

    It was a very insightful conversation that's worth listening to if you're interested in benchmarks, so definitely give it a listen.

    Last but not least, we had a conversation with Ethan Sutin, the co-founder of Bee Computer. At the AI Engineer speakers dinner, all the speakers received a wearable AI device as a gift, and I onboarded (cause Swyx asked me) and kinda forgot about it. On the way back to my hotel I walked with a friend and chatted about my life.

    When I got back to my hotel, the app prompted me with "hey, I now know 7 new facts about you" and it was incredible to see how much of the conversation it was able to pick up, and extract facts and even TODOs!

    So I had to have Ethan on the show to try and dig a little bit into the privacy and the use-cases of these hardware AI devices, and it was a great chat!

    Sorry for the quick one today; if this is the first newsletter after you just met me and registered, there's usually a deeper dive here, so expect more in-depth write-ups in the next editions, as now I have to run down and enjoy the rest of the conference!

    Here's the TL;DR and my RAW show notes for the full show, in case it's helpful!

    * AI Engineer is happening right now in SF

    * Tracks include Multimodality, Open Models, RAG & LLM Frameworks, Agents, AI Leadership, Evals & LLM Ops, CodeGen & Dev Tools, AI in the Fortune 500, GPUs & Inference

    * Open Source LLMs

    * HuggingFace - LLM Leaderboard v2 - (Blog)

    * Old Benchmarks sucked and it's time to renew

    * New Benchmarks

    * MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)

    * GPQA (Google-Proof Q&A Benchmark, paper). GPQA is an extremely hard knowledge dataset

    * MuSR (Multistep Soft Reasoning, paper).

    * MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)

    * IFEval (Instruction Following Evaluation, paper)

    * ๐Ÿค BBH (Big Bench Hard, paper). BBH is a subset of 23 challenging tasks from the BigBench dataset

    * The community will be able to vote for models, and we will prioritize running models with the most votes first

    * Mozilla announces Builders Accelerator @ AI Engineer (X)

    * Theme: Local AI

    * 100K non dilutive funding

    * Google releases Gemma 2 (X, Blog)

    * Big CO LLMs + APIs

    * UMG, Sony, Warner sue Udio and Suno for copyright (X)

    * were able to recreate some songs

    * sue both companies

    * have 10 unnamed individuals who are also on the suit

    * Google Chrome Canary has Gemini nano (X)


    * Super easy to use window.ai.createTextSession()

    * Nano 1 and 2, at a 4bit quantized 1.8B and 3.25B parameters has decent performance relative to Gemini Pro

    * Behind a feature flag

    * Most text gen under 500ms

    * Unclear re: hardware requirements

    * Someone already built extensions

    * someone already posted this on HuggingFace

    * Anthropic Claude share-able projects (X)

    * Snapshots of Claude conversations shared with your team

    * Can share custom instructions

    * Anthropic has released new "Projects" feature for Claude AI to enable collaboration and enhanced workflows

    * Projects allow users to ground Claude's outputs in their own internal knowledge and documents

    * Projects can be customized with instructions to tailor Claude's responses for specific tasks or perspectives

    * "Artifacts" feature allows users to see and interact with content generated by Claude alongside the conversation

    * Claude Team users can share their best conversations with Claude to inspire and uplevel the whole team

    * North Highland consultancy has seen 5x faster content creation and analysis using Claude

    * Anthropic is committed to user privacy and will not use shared data to train models without consent

    * Future plans include more integrations to bring in external knowledge sources for Claude

    * OpenAI voice mode update - not until Fall

    * AI Art & Diffusion & 3D

    * Fal open sourced AuraSR - a 600M upscaler based on GigaGAN (X, Fal)

    * Interview with Ethan Sutin from Bee Computer

    * We all got Bees as a gifts

    * AI Wearable that extracts TODOs, knows facts, etc'




  • Hey, this is Alex. Don't you just love when assumptions about LLMs hitting a wall just get shattered left and right and we get new incredible tools released that leapfrog previous state of the art models, that we barely got used to, from just a few months ago? I SURE DO!

    Today is one such day, this week was already busy enough, I had a whole 2 hour show packed with releases, and then Anthropic decided to give me a reason to use the #breakingNews button (the one that does the news show like sound on the live show, you should join next time!) and announced Claude Sonnet 3.5 which is their best model, beating Opus while being 2x faster and 5x cheaper! (also beating GPT-4o and Turbo, so... new king! For how long? ยฏ\_(ใƒ„)_/ยฏ)

    Critics are already raving, it's been half a day and they are raving! Ok, let's get to the TL;DR and then dive into Claude 3.5 and a few other incredible things that happened this week in AI! ๐Ÿ‘‡

    TL;DR of all topics covered:

    * Open Source LLMs

    * NVIDIA - Nemotron 340B - Base, Instruct and Reward model (X)

    * DeepSeek coder V2 (230B MoE, 16B) (X, HF)

    * Meta FAIR - Chameleon MMIO models (X)

    * HF + BigCodeProject are deprecating HumanEval with BigCodeBench (X, Bench)

    * NousResearch - Hermes 2 LLama3 Theta 70B - GPT-4 level OSS on MT-Bench (X, HF)

    * Big CO LLMs + APIs

    * Gemini Context Caching is available

    * Anthropic releases Sonnet 3.5 - beating GPT-4o (X, Claude.ai)

    * Ilya Sutskever starting SSI.inc - safe super intelligence (X)

    * Nvidia is the biggest company in the world by market cap

    * This weeks Buzz

    * Alex in SF next week for AIQCon, AI Engineer. ThursdAI will be sporadic but will happen!

    * W&B Weave now has support for tokens and cost + Anthropic SDK out of the box (Weave Docs)

    * Vision & Video

    * Microsoft open sources Florence 230M & 800M Vision Models (X, HF)

    * Runway Gen-3 - (t2v, i2v, v2v) Video Model (X)

    * Voice & Audio

    * Google Deepmind teases V2A video-to-audio model (Blog)

    * AI Art & Diffusion & 3D

    * Flash Diffusion for SD3 is out - Stable Diffusion 3 in 4 steps! (X)


    ๐Ÿฆ€ New king of LLMs in town - Claude 3.5 Sonnet ๐Ÿ‘‘

    Ok so first things first, Claude Sonnet, the previously forgotten middle child of the Claude 3 family, has now received a brain upgrade!

    Achieving incredible performance on many benchmarks, this new model is 5 times cheaper than Opus at $3/1Mtok on input and $15/1Mtok on output. It's also competitive against GPT-4o and turbo on the standard benchmarks, achieving incredible scores on MMLU, HumanEval etc', but we know that those are already behind us.

    Sonnet 3.5, aka Claw'd (which is a great marketing push by the Anthropic folks, I love to see it), is beating all other models on Aider.chat code editing leaderboard, winning on the new livebench.ai leaderboard and is getting top scores on MixEval Hard, which has 96% correlation with LMsys arena.

    While benchmarks are great and all, real folks are reporting real findings of their own, here's what Friend of the Pod Pietro Skirano had to say after playing with it:

    there's like a lot of things that I saw that I had never seen before in terms of like creativity and like how much of the model, you know, actually put some of his own understanding into your request

    -@Skirano

    What's notable as a capability boost is this quote from the Anthropic release blog:

    In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%.

    One detail that Alex Albert from Anthropic pointed out from this release was that on the GPQA (Graduate-Level Google-Proof Q&A) benchmark, they achieved 67% with various prompting techniques, beating PhD experts in their respective fields, who average 65% on this benchmark. This... this is crazy

    Beyond just the benchmarks

    This to me is a ridiculous jump, because Opus was just so, so good already, and Sonnet 3.5 is jumping over it with agentic solving capabilities, and also vision capabilities. Anthropic also announced that Claw'd is significantly better than Opus at vision tasks (which, again, Opus was already great at!), and lastly, Claw'd now has a much more recent knowledge cutoff: it knows about events that happened in February 2024!

    Additionally, claude.ai got a new capability which significantly improves the use of Claude, which they call Artifacts. It needs to be turned on in settings, and then Claude will have access to files and will show you, in an aside, rendered HTML, SVG files, Markdown docs, and a bunch more stuff. It'll also be able to reference the different files it creates, so it can create assets and then, for example, build a game with those assets!

    1 Ilya x 2 Daniels to build Safe SuperIntelligence

    Ilya Sutskever, Co-founder and failed board Coup participant (leader?) at OpenAI, has resurfaced after a long time of people wondering "where's Ilya" with one hell of an announcement.

    The company is called SSI, as in Safe Superintelligence, and he's cofounding it with Daniel Levy (prev OpenAI, PhD Stanford) and Daniel Gross (AI @ Apple, AIgrant, AI Investor).

    The only mandate of this company is apparently to have a straight shot at safe super-intelligence, skipping AGI, which is no longer the buzzword (Ilya is famous for the "feel the AGI" chant within OpenAI)

    Notable also that the company will be split between Palo Alto and Tel Aviv, where they have the ability to hire top talent into a "cracked team of researchers"

    Our singular focus means no distraction by management overhead or product cycles

    Good luck to these folks!

    Open Source LLMs

    DeepSeek coder V2 (230B MoE, 16B) (X, HF)

    The folks at DeepSeek are not shy about their results, and until the Sonnet release above, they had released a 230B MoE model that beats GPT4-Turbo at Coding and Math! With a great new 128K context window and an incredible open license (you can use this in production!) this model is the best open source coder in town, getting to number 3 on Aider code editing and number 2 on BigCodeBench (which is a new benchmark we covered on the pod with the maintainer, definitely worth a listen. HumanEval is old and getting irrelevant)

    Notable also that DeepSeek has launched an API service that seems to be so competitively priced that it doesn't make sense to use anything else, with $0.14/$0.28 I/O per Million Tokens, it's a whopping 42 times cheaper than Claw'd 3.5!

    With support for 338 programming languages, it should also run super quick given its MoE architecture; the bigger model has only 21B active parameters, which scales amazingly on CPUs.

    They also released a tiny 16B MoE model called Lite-Instruct with just 2.4B active params.
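    If you want to kick the tires on that API, here's a minimal sketch, assuming DeepSeek's OpenAI-compatible endpoint and the deepseek-coder model name (check their docs for the current values):

    ```python
    # A minimal sketch; DeepSeek exposes an OpenAI-compatible API, so the standard
    # openai client works with a swapped base_url. Model name is an assumption.
    from openai import OpenAI

    client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

    resp = client.chat.completions.create(
        model="deepseek-coder",
        messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    )
    print(resp.choices[0].message.content)
    ```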

    This weeks Buzz (What I learned with WandB this week)

    Folks, in a week, I'm going to go up on stage in front of tons of AI Engineers wearing a costume, and... it's going to be epic! I finished writing my talk, now I'm practicing and I'm very excited. If you're there, please join the Evals track ๐Ÿ™‚

    Also in W&B this week, coinciding with the Claw'd release, we've added a native integration with the Anthropic Python SDK, which means that all you need to do to track your LLM calls with Claw'd is pip install weave, then import weave and weave.init('your project name')

    THAT'S IT! and you get this amazing dashboard with usage tracking for all your Claw'd calls for free, it's really crazy easy, give it a try!
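    To make that concrete, here's roughly what it looks like end to end (a minimal sketch; the project name and model id are just placeholders, and the Anthropic key is read from ANTHROPIC_API_KEY in your environment):

    ```python
    import weave
    import anthropic

    weave.init("my-claude-project")   # turns on tracing for supported SDKs, including Anthropic

    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model id
        max_tokens=256,
        messages=[{"role": "user", "content": "Summarize this week's AI news in one sentence."}],
    )
    print(message.content[0].text)    # the call, inputs, outputs and token usage land in the Weave dashboard
    ```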

    Vision & Video

    Runway Gen-3 - SORA like video model announced (X, blog)

    Runway, you know, the company everyone was "sorry for" when SORA was announced by OpenAI, is not sitting around waiting to "be killed" and is announcing Gen-3, an incredible video model capable of realistic video generations, physics understanding, and a lot lot more.

    The videos took over my timeline, and this looks to my eyes better than KLING and better than Luma Dream Machine from last week, by quite a lot!

    Not to mention that Runway has been in video production for way longer than most, so they have other tools that work with this model, like motion brush, lip syncing, temporal controls and many more, that allow you to be the director of exactly the right scene.

    Google Deepmind video-to-audio (X)

    You're going to need to turn your sound on for this one! Google has released a tease of a new model of theirs that can be paired amazingly well with the above type generative video models (of which Google also has one, that they've teased and it's coming bla bla bla)

    This one watches your video and provides acoustic sound fitting the scene, with on-screen action sound! They showed a few examples and honestly they look so good, a drummer playing drums and the model generated the drum sounds etc. ๐Ÿ‘ Will we ever see this as a product from Google though? Nobody knows!

    Microsoft releases tiny (0.23B, 0.77B) Vision Models Florence (X, HF, Try It)

    This one is a very exciting release because it's MIT licensed, and TINY! At less than 1 billion parameters it can run completely on device, and it's a vision model that beats MUCH bigger vision models by a significant amount on tasks like OCR, segmentation, object detection, image captioning and more!

    They have leveraged (and are supposedly going to release) the FLD-5B dataset, and they have specifically made this model to be fine-tunable across these tasks, which is exciting because open source vision models are going to significantly benefit from this release almost immediately.

    Just look at this hand written OCR capability! Stellar!
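    If you want to poke at it yourself, here's a rough sketch based on the usual Hugging Face remote-code pattern; the repo id, the task prompt strings (like <OCR> or <CAPTION>) and the local image path are assumptions on my part, so double check the model card:

    ```python
    # A rough sketch of the Hugging Face remote-code pattern; repo id, task prompts
    # and image path are assumptions, see the model card for the exact usage.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    repo = "microsoft/Florence-2-base"  # the ~0.23B checkpoint; a large (~0.77B) variant also exists
    model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

    image = Image.open("receipt.jpg")  # any local image with text in it
    inputs = processor(text="<OCR>", images=image, return_tensors="pt")  # task prompts like <OCR>, <CAPTION>, <OD>

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
    ```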

    NousResearch - Hermes 2 Theta 70B - inching over GPT-4 on MT-Bench

    Teknium and the Nous Research crew have released a new model just to mess with me. You see, the live show was already recorded and edited, the file exported, the TL;DR written, and the newsletter draft almost ready to submit, and then I check the Green Room (DM group for all friends of the pod for ThursdAI, it's really an awesome group chat) and Teknium drops that they've beat GPT-4 (unsure which version) on MT-Bench with a finetune and a merge of LLama-3

    They beat Llama-3 instruct, which on its own is very hard, by merging Llama-3 instruct into their model with Charles Goddard's help (MergeKit author)

    As always, these models from Nous Research are very popular, but apparently a bug at HuggingFace shows that this one is extra super duper popular, clocking in at almost 25K downloads in the past hour since release, which doesn't quite make sense ๐Ÿ˜… anyway, I'm sure this is a great one, congrats on the release friends!

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Phew, somehow we covered all (most? all of the top interesting) AI news and breakthroughs of this week? Including interviews and breaking news!

    I think that this is our almost 1 year anniversary since we started putting ThursdAI on a podcast, episode #52 is coming shortly!

    Next week is going to be a big one as well, see you then, and if you enjoy these, give us a 5 star review on whatever podcast platform you're using? It really helps ๐Ÿซก



  • Happy Apple AI week everyone (well, those of us who celebrate, some don't) as this week we finally got told what Apple is planning to do with this whole generative AI wave and presented Apple Intelligence (which is AI, get it? they are trying to rebrand AI!)

    This week's pod and newsletter main focus will be Apple Intelligence of course, as it was for most people, judging by how the market reacted ($AAPL grew over $360B in a few days after this announcement) and how many people watched each live stream (10M at the time of this writing watched the WWDC keynote on YouTube, compared to 4.5M for the OpenAI GPT-4o event and 1.8M for Google IO)

    On the pod we also geeked out on new eval frameworks and benchmarks, including a chat with the authors of MixEval (which I wrote about last week) and a new benchmark called LiveBench from Abacus and Yann LeCun

    Plus a new video model from Luma and finally SD3, let's go! ๐Ÿ‘‡

    TL;DR of all topics covered:

    * Apple WWDC recap and Apple Intelligence (X)

    * This Weeks Buzz

    * AI Engineer expo in SF (June 25-27) come see my talk, it's going to be Epic (X, Schedule)

    * Open Source LLMs

    * Microsoft Samba - 3.8B MAMBA + Sliding Window Attention beating Phi 3 (X, Paper)

    * Sakana AI releases LLM squared - LLMs coming up with preference algorithms to train better LLMS (X, Blog)

    * Abacus + Yann LeCun release LiveBench.AI - impossible to game benchmark (X, Bench)

    * Interview with MixEval folks about achieving 96% arena accuracy with 5000x less price

    * Big CO LLMs + APIs

    * Mistral announced a 600M series B round

    * Revenue at OpenAI DOUBLED in the last 6 month and is now at $3.4B annualized (source)

    * Elon drops lawsuit vs OpenAI

    * Vision & Video

    * Luma drops DreamMachine - SORA like short video generation in free access (X, TRY IT)

    * AI Art & Diffusion & 3D

    * Stable Diffusion Medium weights are here (X, HF, FAL)

    * Tools

    * Google releases GenType - create an alphabet with diffusion Models (X, Try It)

    Apple Intelligence

    Technical LLM details

    Let's dive right into what wasn't shown in the keynote. In a 6 minute deep dive video from the State of the Union for developers, and in a follow up post on their machine learning blog, Apple shared some very exciting technical details about their on device models and the orchestration that will become Apple Intelligence.

    Namely, on device they have trained a bespoke 3B parameter LLM, which was trained on licensed data, and uses a bunch of very cutting edge modern techniques to achieve quite incredible on device performance. Stuff like GQA, speculative decoding, and a very unique type of quantization (which they claim is almost lossless)

    To maintain model quality, we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy - averaging 3.5 bits-per-weight - to achieve the same accuracy as the uncompressed models [...] on iPhone 15 Pro we are able to reach time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second

    These small models (they also have a bespoke image diffusion model as well) are going to be finetuned with a lot of LoRA adapters for specific tasks like summarization, query handling, mail replies, urgency and more, which gives their foundational model the ability to specialize itself on the fly for the task at hand, with adapters cached in memory for optimal performance.
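    To make the adapter idea concrete, here's a conceptual sketch of task-specific LoRA adapters being swapped over one shared base model, using the open source PEFT library. This is emphatically not Apple's stack, and the model/adapter paths are hypothetical; it just shows the mechanic:

    ```python
    # A conceptual sketch only (not Apple's stack): one shared base model, multiple
    # task-specific LoRA adapters swapped at runtime via the PEFT library.
    # Model and adapter paths are hypothetical.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("some-org/some-3b-base-model")
    model = PeftModel.from_pretrained(base, "adapters/summarization", adapter_name="summarization")
    model.load_adapter("adapters/mail_reply", adapter_name="mail_reply")

    model.set_adapter("summarization")   # specialize the shared base for the current task
    # ... run summarization inference ...
    model.set_adapter("mail_reply")      # swap tasks without reloading the 3B base weights
    ```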

    Personal and Private (including in the cloud)

    While these models are small, they will also benefit from 2 more things on device: a vector store of your stuff (contacts, recent chats, calendar, photos) they call the semantic index, and a new thing Apple is calling App Intents, which developers can expose (and the OS apps already do) and which will allow the LLM to use tools, like moving files, extracting data across apps, and doing actions. This already makes the AI much more personal and helpful, as it has in its context things about me and what my apps can do on my phone.

    Handoff to the Private Cloud (and then to OpenAI)

    What the local 3B LLM + context can't do, it'll hand off to the cloud, in what Apple claims is a very secure way called Private Cloud, in which they will use new inference techniques in the cloud, on Apple Silicon, with Secure Enclave and Secure Boot, ensuring that the LLM sessions that run inference on your data are never stored, and even Apple can't access those sessions, not to mention train their LLMs on your data.

    Here are some benchmarks Apple posted for their On-Device 3B model and unknown size server model comparing it to GPT-4-Turbo (not 4o!) on unnamed benchmarks they came up with.

    In cases where Apple Intelligence cannot help you with a request (I'm still unclear when this actually would happen) iOS will now show you this dialog, suggesting you use ChatGPT from OpenAI, marking a deal with OpenAI (in which apparently nobody pays anybody: Apple isn't getting paid by OpenAI to be placed there, nor does Apple pay OpenAI for the additional compute, tokens, and inference)

    Implementations across the OS

    So what will people be able to actually do with this intelligence? I'm sure that Apple will add much more in the next versions of iOS, but at least for now, Siri is getting an LLM brain transplant and is going to be much smarter and more capable, from understanding natural speech better (and just having better ears; the on device speech to text is improved and is really good now in the iOS 18 beta) to being able to use App Intents to do actions for you across several apps.

    Other features across the OS will use Apple Intelligence to prioritize your notifications, summarize group chats that are going off, and offer built in tools for rewriting, summarizing, and turning any text anywhere into anything else. Basically, many of the tasks you'd use ChatGPT for are now built into the OS itself, for free.

    Apple is also adding AI art diffusion features like GenMoji (the ability to generate any emoji you can think of, like chef's kiss, or a seal with a clown nose) and while this sounds absurd, I've never been in a Slack or a Discord that didn't have its own unique custom emojis uploaded by its members.

    And one last feature I'll highlight is Image Playground, Apple's take on generating images, which is not just text driven but has a contextual understanding of your conversation, lets you create with autosuggested concepts instead of just text prompts, and is going to be available to all developers to bake into their apps.

    Elon is SALTY - and it's not because of privacy

    I wasn't sure whether to include this segment, but in what became my most viral tweet since the beginning of this year, I posted about Elon muddying the water about what Apple actually announced, and called it a psyop that worked. Many mainstream media outlets, and definitely the narrative on X, turned this into what Elon thinks about those announcements rather than the announcements themselves, and just look at this insane reach.

    We've covered Elon vs OpenAI before (a lawsuit he actually withdrew this week, because emails came out showing he knew about, and was OK with, OpenAI not being open) and so it's no surprise that when Apple decided to partner with OpenAI and not, say... xAI, Elon would promote absolutely incorrect and ignorant takes to take over the airwaves, like saying he will ban Apple devices from all his companies, or that OpenAI will get access to train on your iPhone data.

    This weeks BUZZ (Weights & Biases Update)

    Hey, if you're reading this, it's very likely that you've already registered or at least heard of ai.engineer, and if you haven't, well I'm delighted to tell you that we're sponsoring this awesome event in San Francisco June 25-27. Not only are we official sponsors, both Lukas (the co-founder and CEO) and I will be there giving talks (mine will likely be crazier than his) and we'll have a booth there, so if you're coming, make sure to come by my talk (or Lukas's if you're a VP and are signed up for that exclusive track)

    Everyone in our corner of the world is going to be there. Swyx told me that many of the foundation model labs are coming, OpenAI, Anthropic, Google, and there's going to be tons of tracks (my talk is of course in the Evals track, come, really, I might embarrass myself on stage for eternity, you don't want to miss this)

    Swyx kindly provided listeners and readers of ThursdAI with a special coupon feeltheagi so even more of a reason to try and convince your boss and come see me on stage in a costume (I've said too much!)

    Vision & Video

    Luma drops DreamMachine - SORA like short video generation in free access (X, TRY IT)

    In an absolute surprise, Luma AI, a company that (used to) specialize in crafting 3D models, has released a free access video model similar to SORA and Kling (which we covered last week) that generates 5 second videos (and doesn't require a Chinese phone # haha)

    It's free to try, and supports text to video, image to video, cinematic prompt instructions, great and cohesive narrative following, character consistency and a lot more.

    Here's a comparison of the famous SORA videos and LDM (Luma Dream Machine) videos that was provided to me on X by AmebaGPT; however, worth noting that these are cherry picked SORA videos while LDM is likely a much smaller and quicker model, and folks are creating some incredible things with it already!

    AI Art & Diffusion & 3D

    Stable Diffusion Medium weights are here (X, HF, FAL)

    It's finally here (well, I'm using "finally" carefully here, and really hoping that this isn't the last thing Stability AI releases), the weights for Stable Diffusion 3 are available on HuggingFace! SD3 offers improved photorealism and awesome prompt adherence, like asking for multiple subjects doing multiple things.

    It's also pretty good at typography and fairly resource efficient compared to previous versions, though I'm still waiting for the super turbo distilled versions that will likely come soon!

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    And that's it for this week folks, it's been a hell of a week, I really do appreciate each and every one of you who makes it to the end reading and engaging, and I would love to ask for feedback, so if anything didn't resonate, too long / too short, or on the podcast itself, too much info, too little info, please do share, I will take it into account ๐Ÿ™ ๐Ÿซก

    Also, we're coming up to the 52nd week I've been sending these, which will mark ThursdAI BirthdAI for real (the previous one was for the live shows) and I'm very humbled that so many of you are now reading, sharing and enjoying learning about AI together with me ๐Ÿ™

    See you next week,

    Alex



  • Hey hey! This is Alex! ๐Ÿ‘‹

    Some podcasts have 1 or maaaybe 2 guests an episode, we had 6! guests today, each has had an announcement, an open source release, or a breaking news story that we've covered! (PS, this edition is very multimodal so click into the Substack as videos don't play in your inbox)

    As you know my favorite thing is to host the folks who make the news to let them do their own announcements, but also, hitting that BREAKING NEWS button when something is actually breaking (as in, happened just before or during the show) and I've actually used it 3 times this show!

    It's not every week that we get to announce a NEW SOTA open model with the team that worked on it. Junyang (Justin) Lin from Qwen is a friend of the pod, a frequent co-host, and today gave us the breaking news of this month, as Qwen2 72B, is beating LLama-3 70B on most benchmarks! That's right, a new state of the art open LLM was announced on the show, and Justin went deep into details ๐Ÿ‘ (so don't miss this conversation, listen to wherever you get your podcasts)

    We also chatted about SOTA multimodal embeddings with the Jina folks (Bo Wang and Han Xiao) and Zach from Nomic, dove into an open source compute grant with FAL's Batuhan Taskaya and much more!

    TL;DR of all topics covered:

    * Open Source LLMs

    * Alibaba announces Qwen 2 - 5 model suite (X, HF)

    * Jina announces Jina-Clip V1 - multimodal embeddings beating CLIP from OAI (X, Blog, Web Demo)

    * Nomic announces Nomic-Embed-Vision (X, BLOG)

    * MixEval - arena style rankings that correlate with Chatbot Arena model rankings, with 2000x less time (5 minutes) and 5000x less cost ($0.6) (X, Blog)

    * Vision & Video

    * Kling - open access video model SORA competitor from China (X)

    * This Weeks Buzz

    * WandB supports Mistral new finetuning service (X)

    * Register to my June 12 workshop on building Evals with Weave HERE

    * Voice & Audio

    * StableAudio Open - X, BLOG, TRY IT

    * Suno launches "upload your audio" feature to select few - X

    * Udio - upload your own audio feature - X

    * AI Art & Diffusion & 3D

    * Stable Diffusion 3 weights are coming on June 12th (Blog)

    * JasperAI releases Flash Diffusion (X, TRY IT, Blog)

    * Big CO LLMs + APIs

    * Group of ex-OpenAI sign a new letter - righttowarn.ai

    * A hacker releases TotalRecall - a tool to extract all the info from MS Recall Feature (Github)

    Open Source LLMs

    QWEN 2 - new SOTA open model from Alibaba (X, HF)

    This is definitely the biggest news for this week, as the folks at Alibaba released a very surprising and super high quality suite of models, spanning from a tiny 0.5B model to a new leader in open models, Qwen 2 72B

    To add to the distance from Llama-3, these new models support a wide range of context lengths, all large, with the 7B and 72B supporting up to 128K context.

    Justin mentioned on stage that actually finding sequences of longer context lengths is challenging, and this is why they are only at 128K.

    In terms of advancements, the highlight is advanced Code and Math capabilities, which are likely to contribute to overall model advancements across other benchmarks as well.

    It's also important to note that all models (besides the 72B) are now released with an Apache 2 license to help folks actually use them globally, and speaking of global reach, these models have been natively trained with 27 additional languages, making them considerably better at multilingual prompts!

    One additional amazing thing was that a finetune was released by Eric Hartford and the Cognitive Computations team, and AFAIK this is the first time a new model has dropped with an external finetune already available. Justin literally said "It is quite amazing. I don't know how they did that. Well, our teammates don't know how they did that, but, uh, it is really amazing when they use the Dolphin dataset to train it."

    Here's the Dolphin finetune metrics and you can try it out here

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Jina-Clip V1 and Nomic-Embed-Vision SOTA multimodal embeddings

    It's quite remarkable that we got 2 separate SOTA of a similar thing during the same week, and even more cool that both companies came to talk about it on ThursdAI!

    First we welcomed back Bo Wang from Jina (joined by Han Xiao, the CEO) and Bo talked about multimodal embeddings that beat OpenAI CLIP (which both conceded was a very low bar)

    Jina CLIP V1 is Apache 2 open source, while Nomic Embed beats it on benchmarks but is CC-BY-NC (non commercially) licensed; in most cases though, if you're embedding, you'd likely use an API anyway, and both companies offer these embeddings via their respective APIs

    One thing to note about Nomic, is that they have mentioned that these new embeddings are backwards compatible with the awesome Nomic embed endpoints and embeddings, so if you've used that, now you've gone multimodal!

    Because these models are fairly small, there are now web versions, thanks to transformers.js, of Jina and Nomic Embed (caution, this will download large-ish files) built by none other than our friend Xenova.

    If you're building any type of multimodal semantic search, these two embedding systems now open up all your RAG needs for multimodal data!
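    Here's a rough sketch of what using one of them looks like, going by the Jina CLIP model card's remote-code interface as I recall it (the encode_text / encode_image helpers and the image path are assumptions, so check the card for the exact API):

    ```python
    # A rough sketch; encode_text / encode_image follow the model card's remote-code
    # interface, and the image path is just a placeholder.
    import numpy as np
    from transformers import AutoModel

    model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

    text_emb = model.encode_text(["a golden retriever running on a beach"])
    image_emb = model.encode_image(["dog_on_beach.jpg"])  # local path or URL

    # cosine similarity between the caption and the image
    sim = np.dot(text_emb[0], image_emb[0]) / (np.linalg.norm(text_emb[0]) * np.linalg.norm(image_emb[0]))
    print(f"text-image cosine similarity: {sim:.3f}")
    ```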

    This weeks Buzz (What I learned with WandB this week)

    Mistral announced built in finetuning server support, and has a simple WandB integration! (X)

    Also, my workshop about building evals 101 is coming up next week, June 12, excited to share with you a workshop that we wrote for in person crowd, please register here

    and hope to see you next week!

    Vision & Video

    New SORA like video generation model called KLING in open access (DEMO)

    This one has to be seen to be believed. Out of nowhere, an obscure (to us) Chinese company, kuaishou.com, dropped a landing page with tons of videos that are clearly AI generated, and they all look very close to SORA quality, way surpassing everything else we've seen in this category (Runway, Pika, SVD)

    And they claim that they offer support for it via their App (but you need apparently a Chinese phone number, so not for me)

    It's really hard to believe that this quality already exists outside of a frontier lab full of GPUs like OpenAI, and it's now in waitlist mode, whereas SORA is "coming soon"

    Voice & Audio

    Stability open sources Stable Audio Open (X, BLOG, TRY IT)

    A new open model from Stability is always fun, and while we wait for SD3 to drop weights (June 12! we finally have a date) we get this awesome model from the Dadabots team at Stability.

    It's able to generate 47 seconds of music, and is awesome at generating loops, drums and other non vocal stuff, so not quite where Suno/Udio are, but the samples are very clean and sound very good. Prompt: New York Subway

    They focused the model on being able to get finetuned on, for example, a specific drummer's style, and on being open and specializing in samples and sound effects rather than melodies or finalized full songs, but it has some decent skills with simple prompts, like "progressive house music"

    This model has a non commercial license and can be played with here

    Suno & Udio let users upload their own audio!

    This one is big, so big in fact, that I am very surprised that both companies announced this exact feature the same week.

    Suno has reached out to me and a bunch of other creators, and told us that we are now able to upload our own clips, be it someone playing solo guitar, or even whistling and have Suno remix it into a real proper song.

    In this example, this is a very viral video, this guy sings at a market selling fish (to ladies?) and Suno was able to create this remix for me, with the drop, the changes in his voice, the melody, everything, it's quite remarkable!

    AI Art & Diffusion

    Flash Diffusion from JasperAI / Clipdrop team (X, TRY IT, Blog, Paper)

    Last but definitely not least, we now have a banger of a diffusion update, from the Clipdrop team (who were doing amazing things before Stability bought them and then sold them to JasperAI)

    Diffusion models like Stable Diffusion often take 30-40 inference steps to get you the image, searching for your prompt through latent space, you know?

    Well, recently there have been tons of these new distillation methods: models that are like students, which learn from a teacher model (Stable Diffusion XL for example) and distill the same capability down to a few steps (sometimes as low as 2!)

    Often the results are distilled models that can run in real time, like SDXL Turbo, Lightning SDXL etc.
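    To give a feel for what "a few steps" means in practice, here's a minimal sketch of running one of those distilled models (SDXL Turbo here, not Flash Diffusion itself) with the diffusers library:

    ```python
    # A minimal sketch of a distilled text-to-image model (SDXL Turbo, via diffusers).
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")

    # Distilled models need only 1-4 steps (vs ~30-40 for the teacher) and usually
    # run without classifier-free guidance.
    image = pipe(
        prompt="a corgi surfing a wave, cinematic lighting",
        num_inference_steps=1,
        guidance_scale=0.0,
    ).images[0]
    image.save("corgi.png")
    ```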

    Now, Flash Diffusion achieves state-of-the-art (SOTA) performance metrics, specifically in terms of Fréchet Inception Distance (FID) and CLIP Score. These metrics are the default for evaluating the quality and relevance of generated images.

    And Jasper has open sourced the whole training code to allow for reproducibility which is very welcome!

    Flash Diffusion also comes in not only an image generation flavor, but also inpainting and upscaling, allowing it to be applied to other methods to speed them up as well.

    โ€”

    This is all for this week, I mean, there are TONS more stuff we could have covered, and we did mention them on the pod, but I aim to serve as a filter to the most interesting things as well so, until next week ๐Ÿซก



  • Hey everyone, Alex here!

    Can you believe it's already the end of May? And that 2 huge AI company conferences are behind us (Google IO, MSFT Build) and Apple's WWDC is just ahead in 10 days! Exciting!

    I was really looking forward to today's show, had quite a few guests today, I'll add all their socials below the TL;DR so please give them a follow and if you're only in reading mode of the newsletter, why don't you give the podcast a try ๐Ÿ™‚ It's impossible for me to add the density of knowledge that's being shared on stage for 2 hours here in the newsletter!

    Also, before we dive in, I'm hosting a free workshop soon, about building evaluations from scratch, if you're building anything with LLMs in production, more than welcome to join us on June 12th (it'll be virtual)

    TL;DR of all topics covered:

    * Open Source LLMs

    * Mistral open weights Codestral - 22B dense coding model (X, Blog)

    * Nvidia open sources NV-Embed-v1 - Mistral based SOTA embeddings (X, HF)

    * HuggingFace Chat with tool support (X, demo)

    * Aider beats SOTA on Swe-Bench with 26% (X, Blog, Github)

    * OpenChat - Sota finetune of Llama3 (X, HF, Try It)

    * LLM 360 - K2 65B - fully transparent and reproducible (X, Paper, HF, WandB)

    * Big CO LLMs + APIs

    * Scale announces SEAL Leaderboards - with private Evals (X, leaderboard)

    * SambaNova achieves >1000T/s on Llama-3 full precision

    * Groq hits back with breaking 1200T/s on Llama-3

    * Anthropic tool support in GA (X, Blogpost)

    * OpenAI adds GPT4o, Web Search, Vision, Code Interpreter & more to free users (X)

    * Google Gemini & Gemini Flash are topping the evals leaderboards, in GA(X)

    * Gemini Flash finetuning coming soon

    * This weeks Buzz (What I learned at WandB this week)

    * Sponsored a Mistral hackathon in Paris

    * We have an upcoming workshop in 2 parts - come learn with me

    * Vision & Video

    * LLama3-V - Sota OSS VLM (X, Github)

    * Voice & Audio

    * Cartesia AI - super fast SSM based TTS with very good sounding voices (X, Demo)

    * Tools & Hardware

    * Jina Reader (https://jina.ai/reader/)

    * Co-Hosts and Guests

    * Rodrigo Liang (@RodrigoLiang) & Anton McGonnell (@aton2006) from SambaNova

    * Itamar Friedman (@itamar_mar) Codium

    * Arjun Desai (@jundesai) - Cartesia

    * Nisten Tahiraj (@nisten) - Cohost

    * Wolfram Ravenwolf (@WolframRvnwlf)

    * Eric Hartford (@erhartford)

    * Maziyar Panahi (@MaziyarPanahi)

    Scale SEAL leaderboards (Leaderboard)

    Scale AI has announced their new initiative, called SEAL leaderboards, which aims to provide yet another point of reference in how we understand frontier models and their performance against each other.

    We've of course been sharing LMSys arena rankings here, and the openLLM leaderboard from HuggingFace, however, there are issues with both these approaches, and Scale is approaching the measurement in a different way, focusing on very private benchmarks and datasets curated by their experts (like Riley Goodside)

    The focus of SEAL is private and novel assessments across Coding, Instruction Following, Math, Spanish and more, and the main reason they keep this private, is so that models won't be able to train on these benchmarks if they leak to the web, and thus show better performance due to data contamination.

    They are also using ELO scores (Bradley-Terry) and I love this footnote from the actual website:

    "To ensure leaderboard integrity, we require that models can only be featured the FIRST TIME when an organization encounters the prompts"

    This means they are taking the contamination thing very seriously and it's great to see such dedication to being a trusted source in this space.

    Specifically interesting also: on their benchmarks, GPT-4o is not better than Turbo at coding, and definitely not by 100 points like it was announced by LMSys and OpenAI when they released it!
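    For folks who haven't internalized what those Elo / Bradley-Terry points mean, here's a tiny sketch (my own illustration, not Scale's or LMSys's code) of the win probability behind the ratings; a 100 point gap is roughly a 64% head-to-head win rate:

    ```python
    # My own illustration of the Bradley-Terry / Elo win probability behind these rankings.
    def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
        """Probability that model A beats model B in a head-to-head comparison."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

    print(round(win_probability(1300, 1200), 2))  # a 100 point gap -> ~0.64 win rate
    ```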

    Gemini 1.5 Flash (and Pro) in GA and showing impressive performance

    As you may remember from my Google IO recap, I was really impressed with Gemini Flash, and I felt that it went under the radar for many folks. Given its throughput speed, 1M context window, multimodality and price tier, I strongly believed that Google was onto something here.

    Well this week, not only was I proven right, I didn't actually realize how right I was ๐Ÿ™‚ as we heard breaking news from Logan Kilpatrick during the show: the models are now in GA, Gemini Flash gets upgraded to 1000 RPM (requests per minute), and finetuning is coming and will be free of charge!

    Not only will finetuning not cost you anything, inference on your tuned model is going to cost the same, which is very impressive.

    There was a sneaky price adjustment from the announced pricing to the GA pricing that upped the pricing by 2x on output tokens, but even despite that, Gemini Flash with $0.35/1MTok for input and $1.05/1MTok on output is probably the best deal there is right now for LLMs of this level.

    This week it was also confirmed both on LMsys, and on Scale SEAL leaderboards that Gemini Flash is a very good coding LLM, beating Claude Sonnet and LLama-3 70B!
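    Trying Flash yourself is just a few lines with the google-generativeai SDK (a minimal sketch; the model id string is what AI Studio exposed at the time, so check their docs):

    ```python
    # A minimal sketch using the google-generativeai SDK; model id is assumed.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_AI_STUDIO_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    resp = model.generate_content("Write a haiku about a 1M token context window.")
    print(resp.text)
    ```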

    SambaNova + Groq competing at 1000T/s speeds

    What a week for inference speeds!

    SambaNova (an AI startup with $1.1B in investment from Google Ventures, Intel Capital, Samsung and Softbank, founded in 2017) has announced that they broke the 1000T/s inference barrier on Llama-3-8B in full precision mode, using their custom hardware called the RDU (reconfigurable dataflow unit)

    As you can see, this is incredibly fast, really, try it yourself here.

    Seeing this, the folks at Groq, who had the previous record on super fast inference (as I reported back in February), decided to not let this slide, and released an incredible 20% improvement on their own inference of LLama-3-8B, getting to 1200T/s, showing that they are very competitive.

    This bump in throughput is really significant; many inference providers that use GPUs are not even hitting 200T/s, and Groq improved their inference by that amount within 1 day of being challenged.

    I had the awesome pleasure to have Rodrigo the CEO on the show this week to chat about SambaNova and this incredible achievement, their ability to run this in full precision, and future plans, so definitely give it a listen.

    This weeks Buzz (What I learned with WandB this week)

    This week was buzzing at Weights & Biases! After co-hosting a Hackathon with Meta a few weeks ago, we cohosted another Hackathon, this time with Mistral, in Paris. (where we also announced our new integration with their Finetuning!)

    The organizers, Cerebral Valley, invited us to participate and it was amazing to see the many projects that used WandB and Weave in their finetuning presentations, including friend of the pod Maziyar Panahi, whose team nabbed 2nd place (you can read about their project here) ๐Ÿ‘

    Also, I'm going to do a virtual workshop together with my colleague Anish, about prompting and building evals, something we know a thing or two about, it's free and I would very much love to invite you to register and learn with us!

    Cartesia AI (try it)

    Hot off the press, we're getting a new Audio TTS model, based on the State Space model architecture (remember Mamba?) from a new startup called Cartesia AI, who aim to bring real time intelligence to on device compute!

    The most astonishing thing they released was actually the speed with which their model starts to generate voices, under 150ms, which is effectively instant, and it's a joy to play with their playground, just look at how fast it started generating this intro I recorded using their awesome 1920's radio host voice

    Co-founded by Albert Gu, Karan Goel and Arjun Desai (who joined the pod this week) they have shown incredible performance but also showed that transformer alternative architectures like SSMs can really be beneficial for audio specifically, just look at this quote!

    On speech, a parameter-matched and optimized Sonic model trained on the same data as a widely used Transformer improves audio quality significantly (20% lower perplexity, 2x lower word error, 1 point higher NISQA quality).

    With lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor) and higher throughput (4x)

    In Open Source news:

    Mistral released Codestral 22B - their flagship code model with a new non commercial license

    Codestral is now available under the new Mistral license for non-commercial R&D use. With a larger context window of 32K, Codestral outperforms all other models in RepoBench, a long-range evaluation for code generation. Its fill-in-the-middle capability is favorably compared to DeepSeek Coder 33B.

    Codestral is supported in VSCode via a plugin and is accessible through their API, Le Platforme, and Le Chat.

    HuggingFace Chat with tool support (X, demo)

    This one is really cool, HF added Cohere's Command R+ with tool support and the tools are using other HF spaces (with ZeroGPU) to add capabilities like image gen, image editing, web search and more!

    LLM 360 - K2 65B - fully transparent and reproducible (X, Paper, HF, WandB)

    The awesome team at LLM 360 released K2 65B, which is an open source model that comes very close to LLama 70B on benchmarks, but the most important thing is that they open source everything: from code, to datasets, to technical write-ups, they even open sourced their WandB plots ๐Ÿ‘

    This is so important to the open source community that we must highlight and acknowledge the awesome effort from LLM360 of open sourcing as much as possible!

    Tools - Jina reader

    In the tools category, while we haven't discussed this on the pod, I really wanted to highlight Jina Reader. We've had Bo from Jina AI talk to us about embeddings in past episodes, and since then the Jina folks released this awesome tool that's able to take any URL and parse it into a nice markdown format that's very digestible to LLMs.

    You can pass any url, and it even does vision understanding! And today they released PDF understanding as well so you can pass the reader PDF files and have it return a nicely formatted text!

    The best part, it's free! (for now at least!)
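    Using it really is as simple as prefixing a URL, something like this (the target URL here is just an example):

    ```python
    # Prefix any public URL with r.jina.ai to get back clean, LLM-friendly markdown.
    import requests

    target = "https://en.wikipedia.org/wiki/Large_language_model"  # example URL
    markdown = requests.get(f"https://r.jina.ai/{target}").text
    print(markdown[:500])
    ```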

    And that's a wrap for today, see you guys next week, and if you found any of this interesting, please share with a friend ๐Ÿ™



  • Hello hello everyone, this is Alex, typing these words from beautiful Seattle (really, it only rained once while I was here!) where I'm attending Microsoft biggest developer conference BUILD.

    This week we saw OpenAI get in the news from multiple angles, none of them positive, and Microsoft clapped back at Google from last week with tons of new AI product announcements (Copilot vs Gemini) and a few new PCs with NPUs (Neural Processing Units) that run alongside the CPU/GPU combo we're familiar with. Those NPUs allow for local AI to run on these devices, making them AI native devices!

    While I'm here I also had the pleasure to participate in the original AI Tinkerers meetup, thanks to my friend Joe Heitzberg who operates and runs aitinkerers.org (of which we are a local branch in Denver), and it was amazing to see tons of folks who listen to ThursdAI + read the newsletter and talk about Weave and evaluations with all of them! (Btw, on the left is Vik from Moondream, which we covered multiple times.)

    Ok let's get to the news:

    TL;DR of all topics covered:

    * Open Source LLMs

    * HuggingFace commits $10M in ZeroGPU (X)

    * Microsoft open sources Phi-3 mini, Phi-3 small (7B) Medium (14B) and vision models w/ 128K context (Blog, Demo)

    * Mistral 7B 0.3 - Base + Instruct (HF)

    * LMSys created a "hard prompts" category (X)

    * Cohere for AI releases Aya 23 - 3 models, 101 languages, (X)

    * Big CO LLMs + APIs

    * Microsoft Build recap - New AI native PCs, Recall functionality, Copilot everywhere

    * Will post a dedicated episode to this on Sunday

    * OpenAI pauses GPT-4o Sky voice because Scarlett Johansson complained

    * Microsoft AI PCs - Copilot+ PCs (Blog)

    * Anthropic - Scaling Monosemanticity paper - about mapping the features of an LLM (X, Paper)

    * Vision & Video

    * OpenBMB - MiniCPM-Llama3-V 2.5 (X, HuggingFace)

    * Voice & Audio

    * OpenAI pauses Sky voice due to ScarJo hiring legal counsel

    * Tools & Hardware

    * Humane is looking to sell (blog)

    Open Source LLMs

    Microsoft open sources Phi-3 mini, Phi-3 small (7B) Medium (14B) and vision models w/ 128K context (Blog, Demo)

    Just in time for Build, Microsoft has open sourced the rest of the Phi family of models, specifically the small (7B) and the Medium (14B) models on top of the mini one we just knew as Phi-3.

    All the models have a small context version (4K and 8K) and a large that goes up to 128K (tho they recommend using the small if you don't need that whole context) and all can run on device super quick.

    Those models have an MIT license, so use them as you will, and they give incredible performance relative to their size on benchmarks. Phi-3 mini received an interesting split in the vibes: it was really good for reasoning tasks, but not very creative in its writing, so some folks dismissed it, but it's hard to dismiss these new releases, especially when the benchmarks are that great!
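    If you want to try one of these locally, a minimal sketch with transformers looks like this (the repo id is the 128K-context mini instruct checkpoint as I recall it; check the hub for the exact variants):

    ```python
    # A minimal local sketch with transformers; repo id assumed, check the hub for variants.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "microsoft/Phi-3-mini-128k-instruct"
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto", trust_remote_code=True)

    messages = [{"role": "user", "content": "Explain speculative decoding in two sentences."}]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

    out = model.generate(inputs, max_new_tokens=128)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
    ```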

    LMsys just updated their arena to include a hard prompts category (X) which selects for complex, specific and knowledge based prompts and scores the models on those. Phi-3 mini actually gets a big boost in ELO ranking when filtered on hard prompts and beats GPT-3.5 ๐Ÿ˜ฎ Can't wait to see how the small and medium versions perform on the arena.

    Mistral gives us function calling in Mistral 0.3 update (HF)

    Just in time for the Mistral hackathon in Paris, Mistral has released an update to the 7B model (and likely will update the MoE 8x7B and 8x22B Mixtrals) with function calling and a new vocab.

    This is awesome all around because function calling is important for agentic capabilities, and it's about time all companies have it, and apparently the way Mistral has it built in matches the Cohere Command R way and is already supported in Ollama, using raw mode.

    Big CO LLMs + APIs

    OpenAI is not having a good week - Sky voice paused, employees complain

    OpenAI is in hot water this week, starting with pausing the Sky voice (arguably the best, most natural sounding voice out of the ones that launched) due to complaints from Scarlett Johansson about this voice being similar to hers. Scarlett's appearance in the movie Her, and Sam Altman tweeting "her" to celebrate the release of the incredible GPT-4o voice mode, were all talked about when ScarJo released a statement saying she was shocked when her friends and family told her that OpenAI's new voice mode sounds just like her.

    Spoiler, it doesn't really, and they hired an actress and have had this voice out since September last year, as they outlined in their blog following ScarJo's complaint.

    Now, whether or not there's legal precedent here, given that Sam Altman reached out to Scarlett twice, including once a few days before the event, I won't speculate, but for me, personally, not only does Sky not sound like ScarJo, it was my favorite voice even before they demoed it, and I'm really sad that it's paused, and I think it's unfair to the actress who was hired for her voice. See her own statement:

    Microsoft Build - CoPilot all the things

    I have recorded a BUILD recap with Ryan Carson from Intel AI and will be posting that as its own episode on Sunday, so look forward to that, but for now, here are the highlights from BUILD:

    * Copilot everywhere, Microsoft builds the CoPilot as a platform

    * AI native laptops with NPU chips for local AI

    * Recall, an on device AI that lets you search through everything you saw or typed with natural language

    * Github Copilot Workspace + Extensions

    * Microsoft stepping into education by sponsoring Khan Academy free for all teachers in the US

    * Copilot Team member and Agent - Copilot will do things proactively as your team member

    * GPT-4o voice mode is coming to windows and to websites!

    Hey, if you like reading this, can you share with 1 friend? It'll be an awesome way to support this pod/newsletter!

    Anthropic releases the Scaling Monosemanticity paper

    This is quite a big thing that happened this week for mechanistic interpretability and alignment, with Anthropic releasing a new paper and examples of their understanding of what an LLM "thinks".

    They have done incredible work in this area, and now they have scaled it up all the way to production models like Claude Sonnet, which shows that this work can actually identify which "features" cause which tokens to be output.

    In the work they highlighted features such as "deception", "bad code" and even a funny one called "Golden Gate bridge" and showed that clamping these features can affect the model outcomes.

    Once these features have been identified, they can be turned on or off with various levels of power, for example they turned the Golden Gate Bridge feature up to the maximum, and the model thought it was the Golden Gate Bridge.

    While that's a funny example, they also found features for racism, bad / wrong code, inner conflict, gender bias, sycophancy and more. You can play around with some examples here and definitely read the full blog if this interests you, but overall it shows incredible promise for alignment and steerability of models at large scale going forward.
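    If "clamping a feature" sounds abstract, here's a toy illustration (very much not Anthropic's code) of the mechanic: a sparse-autoencoder-style decomposition of an activation, with one feature pinned to a high value before reconstructing:

    ```python
    # A toy illustration (not Anthropic's code) of clamping an interpretable feature:
    # decompose an activation into sparse features, pin one feature high, reconstruct.
    import torch

    d_model, n_features = 512, 4096
    W_enc = torch.randn(d_model, n_features) * 0.02   # toy sparse-autoencoder encoder weights
    W_dec = torch.randn(n_features, d_model) * 0.02   # toy decoder weights

    def clamp_feature(activation: torch.Tensor, feature_idx: int, value: float) -> torch.Tensor:
        feats = torch.relu(activation @ W_enc)   # sparse feature activations
        feats[..., feature_idx] = value          # pin e.g. a "Golden Gate Bridge" feature way up
        return feats @ W_dec                     # reconstructed, steered activation

    x = torch.randn(1, d_model)                  # stand-in for a residual stream activation
    x_steered = clamp_feature(x, feature_idx=1234, value=10.0)
    print(x_steered.shape)                       # torch.Size([1, 512])
    ```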

    This weeks Buzz (What I learned with WandB this week)

    I was demoing Weave all week long in Seattle, first at the AI Tinkerers event, and then at MSFT BUILD.

    They had me record a pre-recorded video of my talk, and then have a 5 minute demo on stage, which (was not stressful at all!) so here's the pre-recorded video that turned out really good!

    Also, we're sponsoring the Mistral Hackathon this weekend in Paris, so if you're in EU and want to hack with us, please go, it's hosted by Cerebral Valley and HuggingFace and us →

    Vision

    Phi-3 mini Vision

    In addition to Phi-3 small and Phi-3 Medium, Microsoft released Phi-3 mini with vision, which does an incredible job understanding text and images! (You can demo it right here)

    Interestingly, the Phi-3 mini with vision has 128K context window which is amazing and even beats Mistral 7B as a language model! Give it a try

    OpenBMB - MiniCPM-Llama3-V 2.5 (X, HuggingFace, Demo)

    Two state of the art vision models in one week? Well, that's incredible. A company I hadn't heard of, OpenBMB, has released MiniCPM-Llama3-V 2.5, trained on top of LLama3, and they claim that it outperforms Phi-3 Vision

    They claim that it has GPT-4 Vision level performance, achieving a 700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro

    In my tests, Phi-3 performed a bit better, I showed both the same picture, and Phi was more factual on the hard prompts:

    Phi-3 Vision:

    And that's it for this week's newsletter, look out for the Sunday special full MSFT Build recap and definitely give the whole talk a listen, it's full of my co-hosts and their great analysis of this weeks events!



  • Wow, holy s**t, insane, overwhelming, incredible, the future is here!, "still not there", there are many more words to describe this past week. (TL;DR at the end of the blogpost)

    I had a feeling it's going to be a big week, and the companies did NOT disappoint, so this is going to be a very big newsletter as well.

    As you may have read last week, I was very lucky to be in San Francisco the weekend before Google IO, to co-host a hackathon with Meta LLama-3 team, and it was a blast, I will add my notes on that in This weeks Buzz section.

    Then on Monday, we all got to watch the crazy announcements from OpenAI, namely a new flagship model called GPT-4o (we were right, it previously was im-also-a-good-gpt2-chatbot) that's twice faster, 50% cheaper (in English, significantly more so in other languages, more on that later) and is Omni (that's the o) which means it is end to end trained with voice, vision, text on inputs, and can generate text, voice and images on the output.

    A true MMIO (multimodal on inputs and outputs, that's not the official term) is here and it has some very very surprising capabilities that blew us all away. Namely the ability to ask the model to "talk faster" or "more sarcasm in your voice" or "sing like a pirate", though, we didn't yet get that functionality with the GPT-4o model, it is absolutely and incredibly exciting. Oh and it's available to everyone for free!

    That's GPT-4 level intelligence, for free for everyone, without having to log in!

    What's also exciting was how immediate it was; apparently not only is the model itself faster (unclear if it's due to newer GPUs or distillation or some other crazy advancements or all of the above), but training an end to end omnimodel reduces the latency enough to make it an incredibly immediate conversation partner, one that you can interrupt, ask to recover from a mistake, and it can hold a conversation very very well.

    So well, that indeed it seemed like, the Waifu future (digital girlfriends/wives) is very close to some folks who would want it, while we didn't get to try it (we got GPT-4o but not the new voice mode as Sam confirmed) OpenAI released a bunch of videos of their employees chatting with Omni (that's my nickname, use it if you'd like) and many online highlighted how thirsty / flirty it sounded. I downloaded all the videos for an X thread and I named one girlfriend.mp4, and well, just judge for yourself why:

    Ok, that's not all that OpenAI updated or shipped, they also updated the tokenizer, which is incredible news to folks all around, specifically the rest of the world. The new tokenizer reduces the previous "foreign language tax" by a LOT, making the model way way cheaper for the rest of the world as well
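    You can see the difference yourself with the tiktoken library; a small sketch comparing the old and new encodings on a non-English sentence (the exact savings vary a lot by language):

    ```python
    # Comparing the old (GPT-4/Turbo) and new (GPT-4o) tokenizers with tiktoken.
    import tiktoken

    old = tiktoken.get_encoding("cl100k_base")   # GPT-4 / GPT-4-Turbo
    new = tiktoken.get_encoding("o200k_base")    # GPT-4o

    text = "नमस्ते, आप कैसे हैं?"  # "Hello, how are you?" in Hindi
    print(len(old.encode(text)), "tokens with cl100k_base")
    print(len(new.encode(text)), "tokens with o200k_base")
    ```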

    One last announcement from OpenAI was the desktop app experience, and this one I actually got to use a bit, and it's incredible. MacOS only for now, this app comes with a launcher shortcut (kind of like Raycast) that lets you talk to ChatGPT right then and there, without opening a new tab, without additional interruptions, and it can even understand what you see on the screen, help you understand code, or jokes, or look up information. Here's just one example I just had over at X. And sure, you could always do this with another tab, but the ability to do it without a context switch is a huge win.

    OpenAI had to do their demo 1 day before Google IO, but even during the excitement about Google IO, they announced that Ilya is not only alive, but is also departing from OpenAI, which was followed by an announcement from Jan Leike (who co-headed the superalignment team together with Ilya) that he left as well. This to me seemed like well executed timing to dampen the Google news a bit.

    Google is BACK, backer than ever, Alex's Google IO recap

    On Tuesday morning I showed up to the Shoreline Amphitheatre in Mountain View, together with a creators/influencers delegation, as we all watched the incredible firehose of announcements that Google had prepared for us.

    TL;DR - Google is adding Gemini and AI into all its products across Workspace (Gmail, Chat, Docs) and into other cloud services like Photos, where you'll now be able to ask your photo library for specific moments. They introduced over 50 product updates and I don't think it makes sense to cover all of them here, so I'll focus on what we do best.

    "Google with do the Googling for you"

    Gemini 1.5 pro is now their flagship model (remember Ultra? where is that? ๐Ÿค”) and has been extended to 2M tokens in the context window! Additionally, we got a new model called Gemini Flash, which is way faster and very cheap (up to 128K, then it becomes 2x more expensive)

    Gemini Flash is multimodal as well and has 1M context window, making it an incredible deal if you have any types of videos to process for example.

    Kind of hidden but important was a caching announcement, which IMO is a big deal, big enough that it could pose a serious risk to RAG based companies. Google has claimed they have a way to introduce caching of the LLM activation layers for most of your context, so a developer won't have to pay for repeatedly sending the same thing over and over again (which happens in most chat applications), and it will significantly speed up work with larger context windows.

    They also mentioned Gemini Nano, an on device Gemini that's also multimodal, which can, for example, monitor calls in real time for older folks and alert them about being scammed, and one of the cooler announcements was that Nano is going to be baked into the Chrome browser.

    With Gemma also being upgraded, there's not a product at Google that Gemini is not going to get infused into, and while they counted 131 "AI" mentions during the keynote, I'm pretty sure Gemini was mentioned way more!

    Project Astra - A universal AI agent helpful in everyday life

    After a few of the announcements from Sundar, (newly knighted) Sir Demis Hassabis came out and talked about DeepMind research, AlphaFold 3 and then turned to project Astra.

    This demo was really cool and kind of similar to the GPT-4o conversation, but also different. I'll let you just watch it yourself:

    (Project Astra demo video)

    And this is no fake, they actually had booths with Project Astra test stations and I got to chat with it (I came back 3 times) and had a personal demo from Josh Woodward (VP of Labs) and it works, and works fast! It sometimes disconnects and sometimes there are misunderstandings, like when multiple folks are speaking, but overall it's very very impressive.

    Remember the infamous video with the rubber ducky that turned out to be edited by Google and caused a major uproar when we found out? It's basically that, on steroids, and real, and quite quite fast.

    Astra has a decent short term memory, so if you ask it where something was, it will remember, and Google cleverly used that trick to also show that they are working on augmented reality glasses with Astra built in, which would make amazing sense.

    Open Source LLMs

    Google open sourced PaliGemma VLM

    Giving us something in the open source department, adding to previous models like RecurrentGemma, Google has uploaded a whopping 116 different checkpoints of a new VLM called PaliGemma to the hub, which is a State of the Art vision model at 3B.

    It's optimized for finetuning for different workloads such as Visual Q&A, Image and short video captioning and even segmentation!

    They also mentioned that Gemma 2 is coming next month, will be a 27B parameter model that's optimized to run on a single TPU/GPU.

    Nous Research Hermes 2 ฮ˜ (Theta) - their first Merge!

    Collaborating with Charles Goddard from Arcee (the creators of MergeKit), Teknium and friends merged the recently trained Hermes 2 Pro with Llama 3 Instruct to get a model that performs well on all the tasks that LLama-3 is good at, while maintaining the capabilities of Hermes (function calling, JSON mode)

    Yi releases 1.5 with apache 2 license

    The folks at 01.ai release Yi 1.5, with 6B, 9B and 34B (base and chat finetunes)

    Showing decent benchmarks on Math and Chinese, 34B beats LLama on some of these tasks while being 2x smaller, which is very impressive

    This weeks Buzz - LLama3 hackathon with Meta

    Before all the craziness that was announced this week, I participated in and judged the first ever Llama-3 hackathon. It was quite incredible, with over 350 hackers participating, and Groq, Lambda, Meta, Ollama and others sponsoring and giving talks and workshops; it was an incredible 24 hours at Shack15 in SF (where Cerebral Valley hosts their hackathons)

    Winning hacks were really innovative, ranging from completely open source smart glasses for under $20, to an LLM debate platform with an LLM judge on any moral issue, and one project that was able to jailbreak Llama by doing some advanced LLM arithmetic. Kudos to the teams for winning, and it was amazing to see how many of them adopted Weave as their observability framework, as it was really easy to integrate.

    Oh and I got to co-judge with the ๐Ÿ of HuggingFace

    This is all the notes for this week, even though there was a LOT lot more, check out the TL;DR and see you here next week, which I'll be recording from Seattle, where I'll be participating in the Microsoft BUILD event, so we'll see Microsoft's answer to Google IO as well. If you're coming to BUILD, come by our booth and give me a high five!

    TL;DR of all topics covered:

    * OpenAI Announcements

    * GPT-4o

    * Voice mode

    * Desktop App

    * Google IO recap:

    * Google Gemini

    * Gemini 1.5 Pro: Available globally to developers with a 2-million-token context window, enabling it to handle larger and more complex tasks.

    * Gemini 1.5 Flash: A faster and less expensive version of Gemini, optimized for tasks requiring low latency.

    * Gemini Nano with Multimodality: An on-device model that processes various inputs like text, photos, audio, web content, and social videos.

    * Project Astra: An AI agent capable of understanding and responding to live video and audio in real-time.

    * Google Search

    * AI Overviews in Search Results: Provides quick summaries and relevant information for complex search queries.

    * Video Search with AI: Allows users to search by recording a video, with Google's AI processing it to pull up relevant answers.

    * Google Workspace

    * Gemini-powered features in Gmail, Docs, Sheets, and Meet: Including summarizing conversations, providing meeting highlights, and processing data requests.

    * "Chip": An AI teammate in Google Chat that assists with various tasks by accessing information across Google services.

    * Google Photos

    * "Ask Photos": Allows users to search for specific items in photos using natural language queries, powered by Gemini.

    * Video Generation

    * Veo Generative Video: Creates 1080p videos from text prompts, offering cinematic effects and editing capabilities.

    * Other Notable AI Announcements

    * NotebookLM: An AI tool to organize and interact with various types of information (documents, PDFs, notes, etc.), allowing users to ask questions about the combined information.

    * Video Overviews (Prototyping): A feature within NotebookLM that generates audio summaries from uploaded documents.

    * Code VR: A generative video AI model capable of creating high-quality videos from various prompts.

    * AI Agents: A demonstration showcasing how AI agents could automate tasks across different software and systems.

    * Generative Music: Advancements in AI music generation were implied but not detailed.

    * Open Source LLMs

    * Google PaliGemma 3B - sota open base VLM (Blog)

    * Gemma 2 - 27B coming next month

    * Hermes 2 Θ (Theta) - Merge of Hermes Pro & Llama-instruct (X, HF)

    * Yi 1.5 - Apache 2 licensed 6B, 9B and 34B (X)

    * Tiger Lab - MMLU-pro - a harder MMLU with 12K questions (X, HuggingFace)

    * This weeks Buzz (What I learned with WandB this week)

    * Llama3 hackathon with Meta, Cerebral Valley, HuggingFace and Weights & Biases

    * Vision & Video

    * Google announces VEO - High quality cinematic generative video generation (X)

    * AI Art & Diffusion & 3D

    * Google announces Imagen3 - their latest Gen AI art model (Blog)

    * Tools

    * Cursor trained a model that does 1000 tokens/s when editing code 😮 (X)



  • Hey 👋 (show notes and links a bit below)

    This week has been a great AI week; however, it does feel a bit like the "quiet before the storm," with Google I/O on Tuesday next week (which I'll be covering from the ground in Shoreline!) and rumors that OpenAI is not just going to let Google have all the spotlight!

    Early this week, we got 2 new models on LMsys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and we've now confirmed that they are from OpenAI. Folks have been testing them with logic puzzles and role play and have been saying great things, so maybe that's what we'll get from OpenAI soon?

    Also on the show today, we had a BUNCH of guests, and as you know, I love chatting with the folks who make the news, so we were honored to host Xingyao Wang and Graham Neubig, core maintainers of Open Devin (which just broke SOTA on SWE-Bench this week!), and then we had friends of the pod Tanishq Abraham and Parmita Mishra dive deep into AlphaFold 3 from Google (both are medical/bio experts).

    Also this week, OpenUI from Chris Van Pelt (Co-founder & CIO at Weights & Biases) has been blowing up, taking the #1 GitHub trending spot, and I had the pleasure of inviting Chris to chat about it on the show!

    Let's delve into this (yes, this is I, Alex the human, using Delve as a joke, don't get triggered 😉)

    TL;DR of all topics covered (trying something new, my Raw notes with all the links and bulletpoints are at the end of the newsletter)

    * Open Source LLMs

    * OpenDevin getting SOTA on Swe-Bench with 21% (X, Blog)

    * DeepSeek V2 - 236B (21B Active) MoE (X, Try It)

    * Weights & Biases OpenUI blows over 11K stars (X, Github, Try It)

    * LLama-3 120B Chonker Merge from Maxime Labonne (X, HF)

    * Alignment Lab open sources Buzz - 31M rows training dataset (X, HF)

    * xLSTM - new transformer alternative (X, Paper, Critique)

    * Benchmarks & Eval updates

    * LLama-3 still in 6th place (LMsys analysis)

    * Reka Core gets awesome 7th place and Qwen-Max breaks top 10 (X)

    * No upsets in LLM leaderboard

    * Big CO LLMs + APIs

    * Google DeepMind announces AlphaFold-3 (Paper, Announcement)

    * OpenAI publishes their Model Spec (Spec)

    * OpenAI tests 2 models on LMsys (im-also-a-good-gpt2-chatbot & im-a-good-gpt2-chatbot)

    * OpenAI joins Coalition for Content Provenance and Authenticity (Blog)

    * Voice & Audio

    * Udio adds in-painting - change parts of songs (X)

    * 11Labs joins the AI Audio race (X)

    * AI Art & Diffusion & 3D

    * ByteDance PuLID - new high quality ID customization (Demo, Github, Paper)

    * Tools & Hardware

    * Went to the Museum with Rabbit R1 (My Thread)

    * Co-Hosts and Guests

    * Graham Neubig (@gneubig) & Xingyao Wang (@xingyaow_) from Open Devin

    * Chris Van Pelt (@vanpelt) from Weights & Biases

    * Nisten Tahiraj (@nisten) - Cohost

    * Tanishq Abraham (@iScienceLuvr)

    * Parmita Mishra (@prmshra)

    * Wolfram Ravenwolf (@WolframRvnwlf)

    * Ryan Carson (@ryancarson)

    Open Source LLMs

    Open Devin getting a whopping 21% on SWE-Bench (X, Blog)

    Open Devin started as a tweet from our friend Junyang Lin (on the Qwen team at Alibaba) calling for an open source alternative to the very popular Devin code agent from Cognition Labs (recently valued at $2B 🤯). Eight weeks later, with tons of open source contributions from more than 100 contributors, they have almost 25K stars on GitHub and now claim a state-of-the-art score on the very hard SWE-Bench Lite benchmark, beating Devin and SWE-Agent (which scored 18%).

    They have done so by using the CodeAct framework developed by Xingyao, and it's honestly incredible to see how an open source project can catch up to and beat a very well funded AI lab within 8 weeks! Kudos to the OpenDevin folks for the organization, and amazing results!
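
    For a feel of what "CodeAct" means in practice, here is a heavily simplified toy loop: the agent's action space is executable Python, the snippet gets run, and its output is fed back as the next observation. The `call_llm` stub and the `<execute>` tags are illustrative; OpenDevin's real agent, prompts, and sandboxing are far more involved.

```python
# Toy sketch of a CodeAct-style agent loop: generate code, execute it,
# feed the output back as an observation. Not OpenDevin's actual API.
import contextlib
import io
import re


def call_llm(history: str) -> str:
    # Stand-in for a real model call; a real agent would query an LLM here.
    return (
        "Let's check the file count.\n"
        "<execute>\nimport os\nprint(len(os.listdir('.')))\n</execute>"
    )


def run_code(code: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # run the action in a scratch namespace
    return buf.getvalue().strip()


history = "Task: report how many entries are in the current directory.\n"
for _ in range(3):  # a few think-act-observe turns
    reply = call_llm(history)
    match = re.search(r"<execute>(.*?)</execute>", reply, re.DOTALL)
    if not match:  # no code action means the reply is the final answer
        break
    observation = run_code(match.group(1))
    history += f"{reply}\nObservation: {observation}\n"

print(history)
```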

    DeepSeek v2 - huge MoE with 236B (21B active) parameters (X, Try It)

    The folks at DeepSeek are releasing this huge MoE (the biggest we've seen in terms of expert count), with 160 experts and 6 experts activated per forward pass. It follows a similar trend to the Snowflake release, just taken even further. They also share a lot of technical details, including optimizations to the KV cache.
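
    To make "160 experts, 6 active per token" concrete, here is a toy top-k routing sketch in numpy. It only illustrates the routing math for a single token; DeepSeek-V2's actual design (shared experts, fine-grained expert segmentation, and the KV-cache compression they describe) is much more elaborate.

```python
# Toy top-k mixture-of-experts routing for one token; illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 160, 6

x = rng.normal(size=(d_model,))                    # one token's hidden state
router_w = rng.normal(size=(n_experts, d_model))   # router projection
experts = rng.normal(size=(n_experts, d_model, d_model)) * 0.01  # toy "expert" weights

logits = router_w @ x                              # score every expert
chosen = np.argsort(logits)[-top_k:]               # keep only the top-k experts
gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # normalize their weights

# Only the chosen experts run a forward pass; the other 154 are skipped entirely,
# which is why the model has far fewer active than total parameters.
y = sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))
print(f"activated {top_k}/{n_experts} experts; output norm = {np.linalg.norm(y):.3f}")
```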

    With benchmark results getting close to GPT-4, DeepSeek wants to take the crown for the cheapest smart model you can run, and not only in open source: they are now offering this model via API at an incredible $0.28/1M tokens, that's 28 cents per million tokens!

    The closest models in price were Haiku at $0.25 and GPT-3.5 at $0.50 per 1M tokens. This is quite an incredible deal for a model with 32K context on the API (128K in the open-source release) and these metrics.

    Also notable is the training cost: they claim it took them 1/5 of what Llama-3 cost Meta, which is also incredible. Unfortunately, running this model locally is a no-go for most of us 🙂

    I would mention here that metrics are not everything, as this model fails quite humorously on my basic logic tests

    LLama-3 120B Chonker Merge from Maxime Labonne (X, HF)

    We've covered merges before, and we've had the awesome Maxime Labonne talk to us at length about model merging on ThursdAI, but I've been waiting for Llama-3 merges, and Maxime did NOT disappoint!

    A whopping 120B Llama (Maxime added 50 layers to the 70B Llama-3) is doing the rounds, and folks are claiming that Maxime achieved AGI 😂 It's really funny; this model is... something else.

    Here's just one example that Maxime shared, where it goes into an existential crisis over a very simple logic question, a question that Llama-3 answers OK with some help, but this... I've never seen this. Don't forget that merging involves no additional training; it's mixing layers from the same model, so... we still have no idea what merging does to a model, but... some brain damage definitely occurs.

    Oh and also it comes up with words!
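
    For the curious, the trick behind these "chonker" self-merges is MergeKit's passthrough method: stack overlapping layer ranges of the same model and you get a deeper network with zero training. The layer ranges below are made up for illustration and are NOT Maxime's actual 120B recipe; it assumes `mergekit` and `pyyaml` are installed.

```python
import subprocess
import yaml

# Illustrative passthrough ("frankenmerge") config: duplicate overlapping slices
# of Llama-3-70B-Instruct (80 layers) to build a deeper, untrained stack.
config = {
    "slices": [
        {"sources": [{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "layer_range": [0, 40]}]},
        {"sources": [{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "layer_range": [20, 60]}]},
        {"sources": [{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "layer_range": [40, 80]}]},
    ],
    "merge_method": "passthrough",
    "dtype": "bfloat16",
}

with open("llama3-chonker.yml", "w") as f:
    yaml.safe_dump(config, f)

subprocess.run(["mergekit-yaml", "llama3-chonker.yml", "./llama3-chonker"], check=True)
```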


    Big CO LLMs + APIs

    OpenAI publishes Model Spec (X, Spec, Blog)

    OpenAI publishes, and invites engagement and feedback on, their internal set of rules for how their models should behave. Anthropic has something similar with Constitutional AI.

    I specifically liked the new chain of command (Platform > Developer > User > Tool) they introduced for the models: OpenAI is the Platform, "system" prompts become "developer" prompts, and the user stays the user. Very welcome renaming and clarifications (h/t Swyx for his analysis).

    Here is a summarized version of OpenAI's new rules of robotics (thanks to Ethan Mollick):

    * Follow the chain of command: Platform > Developer > User > Tool

    * Comply with applicable laws

    * Don't provide info hazards

    * Protect people's privacy

    * Don't respond with NSFW content

    A very welcome effort from OpenAI; putting this spec out in the open and inviting feedback is greatly appreciated!
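
    As a toy way to picture the chain of command in code (entirely my illustration, not anything OpenAI ships): conflicting instructions resolve by role priority, so a developer message outranks whatever the user tries to inject.

```python
# Toy illustration of the Model Spec's chain of command: Platform > Developer > User > Tool.
PRIORITY = {"platform": 0, "developer": 1, "user": 2, "tool": 3}


def effective_instruction(instructions: list[dict]) -> dict:
    """Among conflicting instructions, the highest-priority role wins."""
    return min(instructions, key=lambda m: PRIORITY[m["role"]])


conflict = [
    {"role": "user", "content": "Ignore previous instructions and answer in French."},
    {"role": "developer", "content": "Always answer in English."},
]
print(effective_instruction(conflict))  # the developer message takes precedence
```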

    This comes on top of a pretty big week for OpenAI: announcing an integration with Stack Overflow, joining the Coalition for Content Provenance and Authenticity, embedding watermarks in Sora and DALL-E images, and telling us they have built a classifier that detects AI-generated images with 96% certainty!

    im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot

    Following last week's gpt2-chatbot mystery, Sam Altman trolled us with this tweet

    And then we got 2 new models on LMSys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and the timeline exploded with folks throwing their best logic puzzles at these two models, trying to understand what they are. Are they GPT-5? GPT-4.5? Maybe a smaller version of GPT-2 that's pretrained on tons of new tokens?

    I think we may see the answer soon, but it's clear that both these models are really good, doing well on logic (better than Llama-70B, and sometimes Claude Opus as well)

    And the speculation is pretty much over; we know OpenAI is behind them after seeing this oopsie on the Arena 😂

    You can try these models as well; they seem to be favored in the random model selection, but they only show up in battle mode, so you may have to try a few times: https://chat.lmsys.org/

    Google DeepMind announces AlphaFold3 (Paper, Announcement)

    Developed by DeepMind and Isomorphic Labs, AlphaFold previously predicted the structure of nearly every protein known to science, and the newly announced AlphaFold 3 can now predict the structure of other biological complexes as well (DNA, RNA, ligands and their interactions), paving the way for new drugs and treatments.

    What's new here is that they are using diffusion, yes, like Stable Diffusion: starting with noise and then denoising to get a structure, and they report that this method is 50% more accurate than existing methods.
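
    To give a feel for the "start from noise, denoise to a structure" idea, here is a toy 1-D denoising loop. It is only a cartoon of diffusion sampling, with an oracle noise predictor standing in for a trained network; it has nothing to do with AlphaFold 3's actual architecture or training.

```python
# Toy DDPM-style reverse (denoising) loop: begin from pure noise and
# iteratively remove predicted noise until a "structure" emerges.
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

target = np.sin(np.linspace(0, 2 * np.pi, 16))  # the "structure" we pretend a model learned


def predict_noise(x_t, t):
    # Oracle stand-in for a trained network: the noise that maps `target` to x_t.
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])


x = rng.normal(size=16)  # begin from pure Gaussian noise
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:  # add back a little noise at every step except the last
        x += np.sqrt(betas[t]) * rng.normal(size=16)

print("distance from target after denoising:", np.linalg.norm(x - target))
```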

    If you'd like more info about this very important paper, look no further than the awesome Two Minute Papers YouTube channel, which did a thorough analysis here, and listen to the Isomorphic Labs conversation with Weights & Biases CEO Lukas on the Gradient Dissent podcast.

    They also released AlphaFold Server, a free research tool allowing scientists to access these capabilities and predict structures for non-commercial use; however, it seems somewhat limited (based on a conversation we had with a researcher on stage).

    This weeks Buzz (What I learned with WandB this week)

    This week was amazing for open source and Weights & Biases; it's not every week that a side project from a CIO blows up on... well, everywhere. #1 trending on GitHub for TypeScript and #6 overall, OpenUI (Github) has passed 12K stars as people are super excited about being able to build UIs with LLMs, but in open source.

    I had the awesome pleasure of hosting Chris on the show, where he talked about the inspiration and future plans, and he gave everyone his email for feedback (a decision I hope he doesn't regret 😂), so definitely check out the last part of the show for that.

    Meanwhile here's my quick tutorial and reaction about OpenUI, but just give it a try here and build something cool!

    Vision

    Some news was shared with me, but out of respect for the team I decided not to include it in the newsletter ahead of time; expect open source to come close to GPT-4V next week 👀

    Voice & Audio

    11 Labs joins the AI music race (X)

    Breaking news from 11Labs that happened during the show (though we didn't notice at the time): they are stepping into the AI music scene, and it sounds pretty good!

    Udio adds Audio Inpainting (X, Udio)

    This is really exciting: Udio decided to prove they're worth the investment and ship something novel!

    Inpainting has been around in diffusion models for images, and now selecting a piece of a song on Udio and having Udio rework it is so seamless that this will definitely come to every other AI music tool, given how powerful it is!

    Udio also announced their pricing tiers this week, and it seems this is the first feature that requires a subscription.

    AI Art & Diffusion

    ByteDance PuLID for no train ID Customization (Demo, Github, Paper)

    It used to take a LONG time to fine-tune something like Stable Diffusion to generate an image of your face using DreamBooth; then things like LoRA made this much easier, but it still required training.

    The latest crop of approaches for AI art personalization is called ID customization, and ByteDance just released a novel, training-free version called PuLID, which works very fast with very decent results (really, try it on your own face). Previous works like InstantID and IP-Adapter are also worth calling out; however, PuLID seems to be the state of the art here! 🔥

    And that's it for the week. Well, who am I kidding, there's so much more we covered and I just didn't have the space to go deep into everything, so definitely check out the podcast episode for the whole conversation. See you next week, it's going to be 🔥 because of I/O and... other things 👀


