Episodes

  • In this episode of How AI is Built, Nicolay Gerold interviews Doug Turnbull, a search engineer at Reddit and author of “Relevant Search”. They discuss how different methods and technologies, including large language models (LLMs) and semantic search, contribute to relevant search results.

    Key Highlights:

    Defining relevance is challenging and depends heavily on user intent and context
    Combining multiple search techniques (keyword, semantic, etc.) in tiers can improve results
    LLMs are emerging as a powerful tool for augmenting traditional search approaches
    Operational concerns often drive architectural decisions in large-scale search systems
    Underappreciated techniques like LambdaMART may see a resurgence
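    One way to combine keyword and semantic retrievers run as separate tiers is reciprocal rank fusion (RRF). The episode does not prescribe a specific fusion method, so this is just a minimal sketch of one common option, with the result lists hard-coded as stand-ins for real BM25 and vector search output.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc ids into a single ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # documents ranked highly by any retriever accumulate more score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d3", "d1", "d7"]      # hypothetical keyword tier output
vector_results = ["d1", "d5", "d3"]    # hypothetical semantic tier output
print(reciprocal_rank_fusion([bm25_results, vector_results]))
```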

    Key Quotes:

    "There's not like a perfect measure or definition of what a relevant search result is for a given application. There are a lot of really good proxies, and a lot of really good like things, but you can't just like blindly follow the one objective, if you want to build a good search product." - Doug Turnbull

    "I think 10 years ago, what people would do is they would just put everything in Solr, Elasticsearch or whatever, and they would make the query to Elasticsearch pretty complicated to rank what they wanted... What I see people doing more and more these days is that they'll use each retrieval source as like an independent piece of infrastructure." - Doug Turnbull on the evolution of search architecture

    "Honestly, I feel like that's a very practical and underappreciated thing. People talk about RAG and I talk, I call this GAR - generative AI augmented retrieval, so you're making search smarter with generative AI." - Doug Turnbull on using LLMs to enhance search

    "LambdaMART and gradient boosted decision trees are really powerful, especially for when you're expressing your re-ranking as some kind of structured learning problem... I feel like we'll see that and like you're seeing papers now where people are like finding new ways of making BM25 better." - Doug Turnbull on underappreciated techniques
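    LambdaMART-style re-ranking is available off the shelf in gradient boosting libraries. Below is a minimal, self-contained sketch using LightGBM's lambdarank objective on synthetic data, purely to illustrate the structured learning setup Doug describes; the features, labels, and query groups are made up.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((100, 10))                 # per-document ranking features
y = rng.integers(0, 4, size=100)          # graded relevance labels (0-3)
group = [10] * 10                         # 10 queries with 10 candidate docs each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=group)

scores = ranker.predict(X[:10])           # re-rank the first query's candidates
print(np.argsort(-scores))
```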

    Doug Turnbull

    LinkedIn X (Twitter) Web

    Nicolay Gerold:

    ⁠LinkedIn⁠⁠X (Twitter)

    Chapters

    00:00 Introduction and Guest Introduction
    00:52 Understanding Relevant Search Results
    01:18 Search Behavior on Social Media
    02:14 Challenges in Defining Relevance
    05:12 Query Understanding and Ranking Signals
    10:57 Evolution of Search Technologies
    15:15 Combining Search Techniques
    21:49 Leveraging LLMs and Embeddings
    25:49 Operational Considerations in Search Systems
    39:09 Concluding Thoughts and Future Directions

  • In this episode, we talk data-driven search optimizations with Charlie Hull.

    Charlie is a search expert from Open Source Connections. He has built Flax, one of the leading open source search companies in the UK, has written “Searching the Enterprise”, and is one of the main voices on data-driven search.

    We discuss strategies to improve search systems quantitatively and much more.

    Key Points:

    Relevance in search is subjective and context-dependent, making it challenging to measure consistently.
    Common mistakes in assessing search systems include overemphasizing processing speed and relying solely on user complaints.
    Three main methods to measure search system performance: human evaluation, user interaction data analysis, and AI-assisted judgment (with caution).
    Importance of balancing business objectives with user needs when optimizing search results.
    Technical components for assessing search systems: query logs analysis, source data quality examination, and test queries and cases setup.
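    Human evaluation usually ends up as a set of test queries with graded relevance judgments (collected, for example, in a tool like Quepid), which can then be turned into metrics. Below is a minimal sketch of NDCG@k over a hand-made judgment list; the judgments themselves are invented for illustration.

```python
import math

def ndcg_at_k(relevances, k):
    """relevances: graded judgments (0 = bad ... 3 = perfect) in ranked order."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# judgments for the top 5 results of one test query
judged = [3, 2, 0, 1, 3]
print(round(ndcg_at_k(judged, k=5), 3))
```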

    Resources mentioned:

    Quepid: Open-source tool for search quality testing
    Haystack conference: Upcoming event in Berlin (September 30 - October 1)
    Relevance Slack community
    OpenSource Connections

    Charlie Hull:

    LinkedIn X (Twitter)

    Nicolay Gerold:

    ⁠LinkedIn⁠⁠X (Twitter)

    search results, search systems, assessing, evaluation, improvement, data quality, user behavior, proactive, test dataset, search engine optimization, SEO, search quality, metadata, query classification, user intent, search results, metrics, business objectives, user objectives, experimentation, continuous improvement, data modeling, embeddings, machine learning, information retrieval

    00:00 Introduction
    01:35 Challenges in Measuring Search Relevance
    02:19 Common Mistakes in Search System Assessment
    03:22 Methods to Measure Search System Performance
    04:28 Human Evaluation in Search Systems
    05:18 Leveraging User Interaction Data
    06:04 Implementing AI for Search Evaluation
    09:14 Technical Components for Assessing Search Systems
    12:07 Improving Search Quality Through Data Analysis
    17:16 Proactive Search System Monitoring
    24:26 Balancing Business and User Objectives in Search
    25:08 Search Metrics and KPIs: A Contract Between Teams
    26:56 The Role of Recency and Popularity in Search Algorithms
    28:56 Experimentation: The Key to Optimizing Search
    30:57 Offline Search Labs and A/B Testing
    34:05 Simple Levers to Improve Search
    37:38 Data Modeling and Its Importance in Search
    43:29 Combining Keyword and Vector Search
    44:24 Bridging the Gap Between Machine Learning and Information Retrieval
    47:13 Closing Remarks and Contact Information


  • Welcome back to How AI Is Built.

    We have got a very special episode to kick off season two.

    Daniel Tunkelang is a search consultant currently working with Algolia. He is a leader in the field of information retrieval, recommender systems, and AI-powered search. He has worked with Canva, Algolia, Cisco, Gartner, and Handshake, to name a few.

    His core focus is query understanding.

    **Query understanding is about focusing less on the results and more on the query.** The query of the user is the first-class citizen. It is about figuring out what the user wants and then finding, scoring, and ranking results based on that. So most of the work happens before you hit the database.

    **Key Takeaways:**

    - The "bag of documents" model for queries and "bag of queries" model for documents are useful approaches for representing queries and documents in search systems.
    - Query specificity is an important factor in query understanding. It can be measured using cosine similarity between query vectors and document vectors.
    - Query classification into broad categories (e.g., product taxonomy) is a high-leverage technique for improving search relevance and can act as a guardrail for query expansion and relaxation.
    - Large Language Models (LLMs) can be useful for search, but simpler techniques like query similarity using embeddings can often solve many problems without the complexity and cost of full LLM implementations.
    - Offline processing to enhance document representations (e.g., filling in missing metadata, inferring categories) can significantly improve search quality.
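    As a rough illustration of the query-specificity takeaway above, the sketch below embeds a query and a few candidate documents and averages the cosine similarities of the closest matches as a proxy for specificity. The model name, the sample documents, and the averaging choice are assumptions for the example, not something prescribed in the episode.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # MiniLM is mentioned in the episode keywords

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def query_specificity(query, candidate_docs, top_n=3):
    """A rough proxy: how strongly the query's nearest documents agree with it."""
    q = model.encode(query)
    sims = sorted((cosine(q, d) for d in model.encode(candidate_docs)), reverse=True)
    return float(np.mean(sims[:top_n]))

docs = ["trail running shoes for men", "nike road running shoe", "garden hose 50 ft", "usb-c cable"]
print(query_specificity("running shoes", docs))                 # broad query
print(query_specificity("nike pegasus 40 men size 10", docs))   # narrower query
```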

    **Daniel Tunkelang**

    - [LinkedIn](https://www.linkedin.com/in/dtunkelang/)
    - [Medium](https://queryunderstanding.com/)

    **Nicolay Gerold:**

    - [⁠LinkedIn⁠](https://www.linkedin.com/in/nicolay-gerold/)
    - [⁠X (Twitter)](https://twitter.com/nicolaygerold)
    - [Substack](https://nicolaygerold.substack.com/)

    Query understanding, search relevance, bag of documents, bag of queries, query specificity, query classification, named entity recognition, pre-retrieval processing, caching, large language models (LLMs), embeddings, offline processing, metadata enhancement, FastText, MiniLM, sentence transformers, visualization, precision, recall

    [00:00:00] 1. Introduction to Query Understanding

    Definition and importance in search systems
    Evolution of query understanding techniques

    [00:05:30] 2. Query Representation Models

    The "bag of documents" model for queriesThe "bag of queries" model for documentsAdvantages of holistic query representation

    [00:12:00] 3. Query Specificity and Classification

    Measuring query specificity using cosine similarity
    Importance of query classification in search relevance
    Implementing and leveraging query classifiers

    [00:19:30] 4. Named Entity Recognition in Query Understanding

    Role of NER in query processing
    Challenges with unique or tail entities

    [00:24:00] 5. Pre-Retrieval Query Processing

    Importance of early-stage query analysis
    Balancing computational resources and impact

    [00:28:30] 6. Performance Optimization Techniques

    Caching strategies for query understanding
    Offline processing for document enhancement

    [00:33:00] 7. Advanced Techniques: Embeddings and Language Models

    Using embeddings for query similarity
    Role of Large Language Models (LLMs) in search
    When to use simpler techniques vs. complex models

    [00:39:00] 8. Practical Implementation Strategies

    Starting points for engineers new to query understanding
    Tools and libraries for query understanding (FastText, MiniLM, etc.)
    Balancing precision and recall in search systems

    [00:44:00] 9. Visualization and Analysis of Query Spaces

    Discussion on t-SNE, UMAP, and other visualization techniques
    Limitations and alternatives to embedding visualizations

    [00:47:00] 10. Future Directions and Closing Thoughts - Emerging trends in query understanding - Key takeaways for search system engineers

    [00:53:00] End of Episode

  • Today we are launching the season 2 of How AI Is Built.

    Over the last few weeks, we spoke to a lot of regular listeners and past guests, collected feedback, and analyzed our episode data. We will be applying those learnings to season 2.

    This season will be all about search.

    We are trying to make it better, more actionable, and more in-depth. The goal is that at the end of this season, you have a full-fledged course on search in podcast form, with mini-courses on specific elements like RAG.

    We will be talking to experts from information retrieval, information architecture, recommendation systems, and RAG; from academia and industry. Fields that do not really talk to each other.

    We will try to unify and transfer the knowledge and give you a full tour of search, so you can build your next search application or feature with confidence.

    We will be talking to Charlie Hull on how to systematically improve search systems, with Nils Reimers on the fundamental flaws of embeddings and how to fix them, with Daniel Tunkelang on how to actually understand the queries of the user, and many more.


    We will try to bridge the gaps. How to use decades of research and practice in iteratively improving traditional search and apply it to RAG. How to take new methods from recommendation systems and vector databases and bring them into traditional search systems. How to use all of the different methods as search signals and combine them to deliver the results your user actually wants.

    We will be using two types of episodes:

    Traditional deep dives, like we have done so far. Each one will dive into one specific topic within search, interviewing an expert on that topic.
    Supplementary episodes, which answer one additional question; often either complementary or precursory knowledge for the episode, which we did not get to in the deep dive.

    We will be starting with episodes next week, looking at the first, last, and overarching action in search: understanding user intent and understanding the queries with Daniel Tunkelang.

    I am really excited to kick this off.

    I would love to hear from you:

    What would you love to learn in this season?
    What guest should I have on?
    What topics should I make a deep dive on (try to be specific)?

    Yeah, let me know in the comments or just slide into my DMs on Twitter or LinkedIn.

    I am looking forward to hearing from you guys.

    I want to try to be more interactive. So anytime you encounter anything unclear or any question pops up in one of the episodes, give me a shout and I will try to answer it for you and for everyone.

    Enough of me rambling. Let’s kick this off. I will see you next Thursday, when we start with query understanding.

    Shoot me a message and stay up to date:

    ⁠LinkedIn⁠⁠X (Twitter)
  • In this episode of "How AI is Built," host Nicolay Gerold interviews Jonathan Yarkoni, founder of Reach Latent. Jonathan shares his expertise in extracting value from unstructured data using AI, discussing challenging projects, the impact of ChatGPT, and the future of generative AI. From weather prediction to legal tech, Jonathan provides valuable insights into the practical applications of AI across various industries.

    Key Takeaways

    Generative AI projects often require less data cleaning due to the models' tolerance for "dirty" data, allowing for faster implementation in some cases.
    The success of AI projects post-delivery is ensured through monitoring, but automatic retraining of generative AI applications is not yet common due to evaluation challenges.
    Industries ripe for AI disruption include text-heavy fields like legal, education, software engineering, and marketing, as well as biotech and entertainment.
    The adoption of AI is expected to occur in waves, with 2024 likely focusing on internal use cases and 2025 potentially seeing more customer-facing applications as models improve.
    Synthetic data generation, using models like GPT-4, can be a valuable approach for training AI systems when real data is scarce or sensitive.
    Evaluation frameworks like RAGAS and custom metrics are essential for assessing the quality of synthetic data and AI model outputs.
    Jonathan's ideal tech stack for generative AI projects includes tools like Instructor, Guardrails, Semantic Routing, DSPY, LangChain, and LlamaIndex, with a growing emphasis on evaluation stacks.
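    As a rough illustration of the synthetic-data takeaway above, the sketch below asks a model to invent an evaluation question for a document chunk. The prompt, the model name, and the helper function are illustrative assumptions, not Jonathan's actual setup, and the call requires an OpenAI API key.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthetic_question(chunk: str) -> str:
    """Generate one evaluation question answerable only from the given chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable model works
        messages=[
            {"role": "system", "content": "You write realistic user questions for evaluation datasets."},
            {"role": "user", "content": f"Write one question answerable only from this text:\n\n{chunk}"},
        ],
    )
    return response.choices[0].message.content

print(synthetic_question("Refunds are processed within 14 days of receiving the returned item."))
```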

    Key Quotes

    "I think we're going to see another wave in 2024 and another one in 2025. And people are familiarized. That's kind of the wave of 2023. 2024 is probably still going to be a lot of internal use cases because it's a low risk environment and there was a lot of opportunity to be had."

    "To really get to production reliably, we have to have these tools evolve further and get more standardized so people can still use the old ways of doing production with the new technology."

    Jonathan Yarkoni

    LinkedIn YouTube X (Twitter) Reach Latent

    Nicolay Gerold:

    ⁠LinkedIn⁠⁠X (Twitter)

    Chapters

    00:00 Introduction: Extracting Value from Unstructured Data
    03:16 Flexible Tailoring Solutions to Client Needs
    05:39 Monitoring and Retraining Models in the Evolving AI Landscape
    09:15 Generative AI: Disrupting Industries and Unlocking New Possibilities
    17:47 Balancing Immediate Results and Cutting-Edge Solutions in AI Development
    28:29 Dream Tech Stack for Generative AI

    unstructured data, textual data, automation, weather prediction, data cleaning, chat GPT, AI disruption, legal, education, software engineering, marketing, biotech, immediate results, cutting-edge solutions, tech stack

  • This episode of "How AI Is Built" is all about data processing for AI. Abhishek Choudhary and Nicolay discuss Spark and alternatives to process data so it is AI-ready.

    Spark is a distributed system that enables fast data processing by keeping data in memory. Its core abstraction is the Resilient Distributed Dataset (RDD), with DataFrames built on top to simplify data processing.

    When should you use Spark to process your data for your AI Systems?

    → Use Spark when:

    Your data exceeds terabytes in volume
    You expect unpredictable data growth
    Your pipeline involves multiple complex operations
    You already have a Spark cluster (e.g., Databricks)
    Your team has strong Spark expertise
    You need distributed computing for performance
    Budget allows for Spark infrastructure costs

    → Consider alternatives when:

    Dealing with datasets under 1TB
    In early stages of AI development
    Budget constraints limit infrastructure spending
    Simpler tools like Pandas or DuckDB suffice

    Spark isn't always necessary. Evaluate your specific needs and resources before committing to a Spark-based solution for AI data processing.
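    To make the trade-off concrete, here is a minimal sketch of the same aggregation written for PySpark and for DuckDB. The paths and column names are invented; the point is only that the single-node version needs no cluster at all.

```python
# Spark: worth it for multi-terabyte data, an existing cluster, or complex pipelines.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ai-data-prep").getOrCreate()
events = spark.read.parquet("s3://my-bucket/events/")          # hypothetical path
features = events.groupBy("user_id").agg(F.count("*").alias("event_count"))
features.write.mode("overwrite").parquet("s3://my-bucket/features/")

# DuckDB: often enough below ~1 TB, runs in-process with no infrastructure.
import duckdb

duckdb.sql("""
    COPY (
        SELECT user_id, count(*) AS event_count
        FROM 'events/*.parquet'
        GROUP BY user_id
    ) TO 'features.parquet' (FORMAT PARQUET)
""")
```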

    In today’s episode of How AI Is Built, Abhishek and I discuss data processing:

    When to use Spark vs. alternatives for data processing
    Key components of Spark: RDDs, DataFrames, and SQL
    Integrating AI into data pipelines
    Challenges with LLM latency and consistency
    Data storage strategies for AI workloads
    Orchestration tools for data pipelines
    Tips for making LLMs more reliable in production

    Abhishek Choudhary:

    LinkedIn GitHub X (Twitter)

    Nicolay Gerold:

    ⁠LinkedIn⁠⁠X (Twitter)
  • In this episode, Nicolay talks with Rahul Parundekar, founder of AI Hero, about the current state and future of AI agents. Drawing from over a decade of experience working on agent technology at companies like Toyota, Rahul emphasizes the importance of focusing on realistic, bounded use cases rather than chasing full autonomy.

    They dive into the key challenges, like effectively capturing expert workflows and decision processes, delivering seamless user experiences that integrate into existing routines, and managing costs through techniques like guardrails and optimized model choices. The conversation also explores potential new paradigms for agent interactions beyond just chat.

    Key Takeaways:

    Agents need to focus on realistic use cases rather than trying to be fully autonomous. Enterprises are unlikely to allow agents full autonomy anytime soon.
    Capturing the logic and workflows in the user's head is the key challenge. Shadowing experts and having them demonstrate workflows is more effective than asking them to document processes.
    User experience is crucial - agents must integrate seamlessly into existing user workflows without major disruptions. Interfaces beyond just chat may be needed.
    Cost control is important - techniques like guardrails, context windowing, model choice optimization, and dev vs production modes can help manage costs.
    New paradigms beyond just chat could be powerful - e.g. workflow specification, state/declarative definition of desired end-state.
    Prompt engineering and dynamic prompt improvement based on feedback remain an open challenge.
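    Context windowing is one of the cost levers listed above. A minimal sketch follows, assuming tiktoken for token counting and a flat token budget; both are illustrative choices, not something prescribed in the episode.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def window_messages(messages, budget_tokens=2000):
    """Keep only the most recent messages that fit inside a token budget."""
    kept, used = [], 0
    for msg in reversed(messages):                  # walk from newest to oldest
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))                     # restore chronological order

history = [{"role": "user", "content": f"message number {i}"} for i in range(500)]
print(len(window_messages(history, budget_tokens=200)))   # only the most recent ones make the cut
```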

    Key Quotes:

    "Empowering users to create their own workflows is essential for effective agent usage."
    "Capturing workflows accurately is a significant challenge in agent development."
    "Preferences, right? So a lot of the work becomes like, hey, can you do preference learning for this user so that the next time the user doesn't have to enter the same information again, things like that."

    Rahul Parundekar:

    AI Hero AI Hero Docs

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    00:00 Exploring the Potential of Autonomous Agents

    02:23 Challenges of Accuracy and Repeatability in Agents

    08:31 Capturing User Workflows and Improving Prompts

    13:37 Tech Stack for Implementing Agents in the Enterprise

    agent development, determinism, user experience, agent paradigms, private use, human-agent interaction, user workflows, agent deployment, human-in-the-loop, LLMs, declarative ways, scalability, AI Hero

  • In this conversation, Nicolay and Richmond Alake discuss various topics related to building AI agents and using MongoDB in the AI space. They cover the use of agents and multi-agents, the challenges of controlling agent behavior, and the importance of prompt compression.

    When you are building agents, build them iteratively. Start with simple LLM calls before moving to multi-agent systems.

    Main Takeaways:

    Prompt Compression: Using techniques like prompt compression can significantly reduce the cost of running LLM-based applications by reducing the number of tokens sent to the model. This becomes crucial when scaling to production.
    Memory Management: Effective memory management is key for building reliable agents. Consider different memory components like long-term memory (knowledge base), short-term memory (conversation history), semantic cache, and operational data (system logs). Store each in separate collections for easy access and reference.
    Performance Optimization: Optimize performance across multiple dimensions - output quality (by tuning context and knowledge base), latency (using semantic caching), and scalability (using auto-scaling databases like MongoDB).
    Prompting Techniques: Leverage prompting techniques like ReAct (observe, plan, act) and structured prompts (JSON, pseudo-code) to improve agent predictability and output quality.
    Experimentation: Continuous experimentation is crucial in this rapidly evolving field. Try different frameworks (LangChain, Crew AI, Haystack), models (Claude, Anthropic, open-source), and techniques to find the best fit for your use case.
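    The memory layout above maps naturally onto one MongoDB collection per memory type. The sketch below is a minimal illustration with pymongo; the connection string, collection names, and documents are invented for the example.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
db = client["agent_memory"]

# one collection per memory component
db["short_term"].insert_one({
    "session_id": "abc123",
    "role": "user",
    "content": "What is our refund policy?",
    "ts": datetime.now(timezone.utc),
})
db["long_term"].insert_one({"fact": "Refunds are processed within 14 days."})
db["semantic_cache"].insert_one({"query": "refund policy", "answer": "Within 14 days.", "embedding": [0.12, 0.87]})
db["operational"].insert_one({"event": "llm_call", "tokens": 512, "latency_ms": 830})

# rebuild the conversation window for the next prompt
history = list(db["short_term"].find({"session_id": "abc123"}).sort("ts", 1))
print(len(history))
```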

    Richmond Alake:

    LinkedIn Medium Find Richmond on MongoDB X (Twitter) YouTube GenAI Showcase MongoDB MongoDB AI Stack

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    00:00 Reducing the Scope of AI Agents

    01:55 Seamless Data Ingestion

    03:20 Challenges and Considerations in Implementing Multi-Agents

    06:05 Memory Modeling for Robust Agents with MongoDB

    15:05 Performance Optimization in AI Agents

    18:19 RAG Setup

    AI agents, multi-agents, prompt compression, MongoDB, data storage, data ingestion, performance optimization, tooling, generative AI

  • In this episode, Kirk Marple, CEO and founder of Graphlit, shares his expertise on building efficient data integrations.

    Kirk breaks down his approach using relatable concepts:

    The "Two-Sided Funnel": This model streamlines data flow by converting various data sources into a standard format before distributing it. Universal Data Streams: Kirk explains how he transforms diverse data into a single, manageable stream of information. Parallel Processing: Learn about the "competing consumer model" that allows for faster data handling. Building Blocks for Success: Discover the importance of well-defined interfaces and actor models in creating robust data systems. Tech Talk: Kirk discusses data normalization techniques and the potential shift towards a more streamlined "Kappa architecture." Reusable Patterns: Find out how Kirk's methods can speed up the integration of new data sources.

    Kirk Marple:

    LinkedIn X (Twitter) Graphlit Graphlit Docs

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Chapters

    00:00 Building Integrations into Different Tools

    00:44 The Two-Sided Funnel Model for Data Flow

    04:07 Using Well-Defined Interfaces for Faster Integration

    04:36 Managing Feeds and State with Actor Models

    06:05 The Importance of Data Normalization

    10:54 Tech Stack for Data Flow

    11:52 Progression towards a Kappa Architecture

    13:45 Reusability of Patterns for Faster Integration

    data integration, data sources, data flow, two-sided funnel model, canonical format, stream of ingestible objects, competing consumer model, well-defined interfaces, actor model, data normalization, tech stack, Kappa architecture, reusability of patterns

  • In our latest episode, we sit down with Derek Tu, Founder and CEO of Carbon, a cutting-edge ETL tool designed specifically for large language models (LLMs).

    Carbon is streamlining AI development by providing a platform for integrating unstructured data from various sources, enabling businesses to build innovative AI applications more efficiently while addressing data privacy and ethical concerns.

    "I think people are trying to optimize around the chunking strategy... But for me, that seems a bit maybe not focusing on the right area of optimization. These embedding models themselves have gone just like, so much more advanced over the past five to 10 years that regardless of what representation you're passing in, they do a pretty good job of being able to understand that information semantically and returning the relevant chunks." - Derek Tu on the importance of embedding models over chunking strategies
    "If you are cost conscious and if you're worried about performance, I would definitely look at quantizing your embeddings. I think we've probably been able to, I don't have like the exact numbers here, but I think we might be saving at least half, right, in storage costs by quantizing everything." - Derek Tu on optimizing costs and performance with vector databases

    Derek Tu:

    LinkedIn Carbon

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Key Takeaways:

    Understand your data sources: Before building your ETL pipeline, thoroughly assess the various data sources you'll be working with, such as Slack, Email, Google Docs, and more. Consider the unique characteristics of each source, including data format, structure, and metadata.
    Normalize and preprocess data: Develop strategies to normalize and preprocess the unstructured data from different sources. This may involve parsing, cleaning, and transforming the data into a standardized format that can be easily consumed by your AI models.
    Experiment with chunking strategies: While there's no one-size-fits-all approach to chunking, it's essential to experiment with different strategies to find what works best for your specific use case. Consider factors like data format, structure, and the desired granularity of the chunks.
    Leverage metadata and tagging: Metadata and tagging can play a crucial role in organizing and retrieving relevant data for your AI models. Implement mechanisms to capture and store important metadata, such as document types, topics, and timestamps, and consider using AI-powered tagging to automatically categorize your data.
    Choose the right embedding model: Embedding models have advanced significantly in recent years, so focus on selecting the right model for your needs rather than over-optimizing chunking strategies. Consider factors like model performance, dimensionality, and compatibility with your data types.
    Optimize vector database usage: When working with vector databases, consider techniques like quantization to reduce storage costs and improve performance. Experiment with different configurations and settings to find the optimal balance for your specific use case.
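    Quantization, mentioned in the last takeaway and in Derek's quote, can be as simple as mapping each float32 dimension to a single byte. The sketch below shows int8-style scalar quantization with NumPy on random vectors, purely to illustrate the storage saving; the exact scheme Carbon uses is not specified in the episode.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 768)).astype(np.float32)

# scalar quantization: map every dimension from float32 to one unsigned byte
lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
scale = (hi - lo) / 255.0
codes = np.round((embeddings - lo) / scale).astype(np.uint8)      # 4x less storage

# approximate reconstruction when computing distances
restored = codes.astype(np.float32) * scale + lo

print(f"{embeddings.nbytes / 1e6:.1f} MB -> {codes.nbytes / 1e6:.1f} MB")
print("max reconstruction error:", float(np.abs(embeddings - restored).max()))
```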

    00:00 Introduction and Optimizing Embedding Models

    03:00 The Evolution of Carbon and Focus on Unstructured Data

    06:19 Customer Progression and Target Group

    09:43 Interesting Use Cases and Handling Different Data Representations

    13:30 Chunking Strategies and Normalization

    20:14 Approach to Chunking and Choosing a Vector Database

    23:06 Tech Stack and Recommended Tools

    28:19 Future of Carbon: Multimodal Models and Building a Platform

    Carbon, LLMs, RAG, chunking, data processing, global customer base, GDPR compliance, AI founders, AI agents, enterprises

  • In this episode, Nicolay sits down with Hugo Lu, founder and CEO of Orchestra, a modern data orchestration platform. As data pipelines and analytics workflows become increasingly complex, spanning multiple teams, tools and cloud services, the need for unified orchestration and visibility has never been greater.

    Orchestra is a serverless data orchestration tool that aims to provide a unified control plane for managing data pipelines, infrastructure, and analytics across an organization's modern data stack.

    The core architecture involves users building pipelines as code which then run on Orchestra's serverless infrastructure. It can orchestrate tasks like data ingestion, transformation, AI calls, as well as monitoring and getting analytics on data products. All with end-to-end visibility, data lineage and governance even when organizations have a scattered, modular data architecture across teams and tools.

    Key Quotes:

    Find the right level of abstraction when building data orchestration tasks/workflows. "I think the right level of abstraction is always good. I think like Prefect do this really well, right? Their big sell was, just put a decorator on a function and it becomes a task. That is a great idea. You know, just make tasks modular and have them do all the boilerplate stuff like error logging, monitoring of data, all of that stuff."
    Modularize data pipeline components: "It's just around understanding what that dev workflow should look like. I think it should be a bit more modular." Having a modular architecture where different components like data ingestion, transformation, model training are decoupled allows better flexibility and scalability.
    Adopt a streaming/event-driven architecture for low-latency AI use cases: "If you've got an event-driven architecture, then, you know, that's not what you use an orchestration tool for... if you're having a conversation with a chatbot, like, you know, you're sending messages, you're sending events, you're getting a response back. That I would argue should be dealt with by microservices."
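    The "decorator on a function becomes a task" idea from the first quote looks roughly like this in Prefect. This is a minimal sketch assuming Prefect 2-style @task/@flow decorators, with the pipeline content invented for the example.

```python
from prefect import flow, task

@task(retries=2)
def extract() -> list[dict]:
    # the decorator supplies the boilerplate: retries, logging, observability
    return [{"id": 1, "value": 10}, {"id": 2, "value": 32}]

@task
def transform(rows: list[dict]) -> list[dict]:
    return [{**row, "value": row["value"] * 2} for row in rows]

@flow
def pipeline():
    return transform(extract())

if __name__ == "__main__":
    print(pipeline())
```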

    Hugo Lu:

    LinkedIn Newsletter Orchestra Orchestra Docs

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    00:00 Introduction to Orchestra and its Focus on Data Products

    08:03 Unified Control Plane for Data Stack and End-to-End Control

    14:42 Use Cases and Unique Applications of Orchestra

    19:31 Retaining Existing Dev Workflows and Best Practices in Orchestra

    22:23 Event-Driven Architectures and Monitoring in Orchestra

    23:49 Putting Data Products First and Monitoring Health and Usage

    25:40 The Future of Data Orchestration: Stream-Based and Cost-Effective

    data orchestration, Orchestra, serverless architecture, versatility, use cases, maturity levels, challenges, AI workloads

  • Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hasan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers!

    Zain Hasan:

    LinkedIn X (Twitter) Weaviate

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Key Insights:

    Vector databases can handle not just text, but also image, audio, and video data
    Quantization is a powerful technique to significantly reduce costs and enable in-memory search
    Binary quantization allows efficient brute force search for smaller datasets
    Multi-vector search enables retrieval of heterogeneous data types within the same index
    The future lies in multimodal search and recommendations across different senses
    Brain-computer interfaces and EEG foundation models are exciting areas to watch

    Key Quotes:

    "Vector databases are pretty much the commercialization and the productization of representation learning."
    "I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality."
    "Going from text to multimedia in vector databases is really simple."
    "Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application."

    Chapters

    00:00 - 01:24 Introduction

    01:24 - 03:48 Underappreciated aspects of vector databases

    03:48 - 06:06 Quantization trade-offs and techniques

    Various quantization techniques: binary quantization, product quantization, scalar quantization

    06:06 - 08:24 Binary quantization

    Reducing vectors from 32-bits per dimension down to 1-bit
    Enables efficient in-memory brute force search for smaller datasets
    Requires normally distributed data between negative and positive values
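    A minimal sketch of what binary quantization plus brute-force search looks like in practice, using NumPy only: keep the sign bit of every dimension and compare vectors by Hamming distance. The data here is random and the code is illustrative, not Weaviate's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 768)).astype(np.float32)   # ~300 MB as float32
codes = np.packbits(vectors > 0, axis=1)                        # ~10 MB: 1 bit per dimension

def hamming_search(query, codes, k=10):
    """Brute-force nearest neighbours over binary codes via Hamming distance."""
    q = np.packbits(query > 0)
    distances = np.unpackbits(codes ^ q, axis=1).sum(axis=1)    # XOR + popcount per row
    return np.argsort(distances)[:k]

print(hamming_search(rng.normal(size=768), codes))
```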

    08:24 - 10:44 Product quantization and other techniques

    Alternative to binary quantization, segments vectors and clusters each segment
    Scalar quantization reduces vectors to 8-bits per dimension

    10:44 - 13:08 Quantization as a "superpower" to reduce costs

    13:08 - 15:34 Comparing quantization approaches

    15:34 - 17:51 Placing vector databases in the database landscape

    17:51 - 20:12 Pruning unused vectors and nodes

    20:12 - 22:37 Improving precision beyond similarity thresholds

    22:37 - 25:03 Multi-vector search

    25:03 - 27:11 Impact of vector databases on data interaction

    27:11 - 29:35 Interesting and weird use cases

    29:35 - 32:00 Future of multimodal search and recommendations

    32:00 - 34:22 Extending recommendations to user data

    34:22 - 36:39 What's next for Weaviate

    36:39 - 38:57 Exciting technologies beyond vector databases and LLMs

    vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications

  • In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.

    Summary by Section

    Introduction

    Anjan Banerjee, a data architect, discusses building complex AI and data systems
    Explains the basics of data architecture using Lego and chat app examples

    Sources and Tools

    Identifying data sources is the first step in designing a data architecture
    Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
    Use one tool for most activities if possible, but specialized tools offer benefits
    Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)

    Airflow and Orchestration

    Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
    For less technical orgs, GUI-based tools like Talend, Alteryx may be better
    AWS Step Functions and managed Airflow are improving native orchestration capabilities
    For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte

    AI and Data Processing

    ML is key for data-intensive use cases to avoid storing/processing petabytes in cloud
    TinyML and edge computing enable ML inference on device (drones, manufacturing)
    Cloud batch processing still dominates for user targeting, recommendations

    Data Lakes and Storage

    Storage choice depends on data types, use cases, cloud ecosystem
    Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata
    Pulling data into a separate system is often needed for advanced analytics beyond the source system

    Data Quality and Standardization

    "Poka-yoke" error-proofing of input screens is vital for downstream data quality Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization

    Hot Takes and Wishes

    Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
    Automated data set joining and entity resolution across systems would be a game-changer

    Anjan Banerjee:

    LinkedIn

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    00:00 Understanding Data Architecture

    12:36 Choosing the Right Tools

    20:36 The Benefits of Serverless Functions

    21:34 Integrating AI in Data Acquisition

    24:31 The Trend Towards Single Node Engines

    26:51 Choosing the Right Database Management System and Storage

    29:45 Adding Additional Storage Components

    32:35 Reducing Human Errors for Better Data Quality

    39:07 Overhyped and Underutilized Tools

    Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution

  • Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.

    Lake house architecture: A blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
    Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
    Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
    Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
    Data ingress with dlt: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.
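    As a small illustration of the table-format and partitioning decisions above, the sketch below writes a Delta Lake table partitioned by date using the deltalake (delta-rs) Python package. The data and path are invented, and other formats like Iceberg or Hudi would serve the same role.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "user_id": [1, 2, 1],
    "events": [3, 7, 5],
})

# the transaction log gives versioning and consistent reads; partitioning prunes scans
write_deltalake("./lake/events", df, partition_by=["event_date"], mode="overwrite")

table = DeltaTable("./lake/events")
print(table.version(), table.files())
```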

    Key Takeaways:

    Lake houses offer a powerful and flexible architecture for modern data analytics.
    Open-source solutions provide cost-effective and customizable alternatives.
    Carefully consider your specific use cases and preferences when choosing tools and components.
    Tools like DLT simplify data ingress and can be easily integrated with serverless functions.
    The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.

    Sound Bites

    "The Lake house is sort of a modular setup where you decouple the storage and the compute."
    "A lake house is an architecture, an architecture for data analytics platforms."
    "The most popular table formats for a lake house are Delta, Iceberg, and Apache Hudi."

    Jorrit Sandbrink:

    LinkedIn dlt

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Chapters

    00:00 Introduction to the Lake House Architecture

    03:59 Choosing Storage and Table Formats

    06:19 Comparing Compute Engines

    21:37 Simplifying Data Ingress

    25:01 Building a Preferred Data Stack

    lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage

  • Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.

    Key Points:

    Knowledge Graphs: Graphlit utilizes knowledge graphs as a filtering layer on top of keyword metadata and vector search, aiding in information retrieval.
    Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases.
    Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
    Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
    Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.
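    To make the "knowledge graph as a filtering layer" idea concrete, here is a toy sketch: the graph constrains the candidate documents to those connected to an entity, and vector similarity ranks what is left. networkx and the in-memory vectors are stand-ins for Graphlit's actual graph store and search index.

```python
import networkx as nx
import numpy as np

# toy graph linking extracted entities to the documents that mention them
g = nx.Graph()
g.add_edges_from([("entity:acme", "doc:1"), ("entity:acme", "doc:3"), ("entity:globex", "doc:2")])

doc_vectors = {
    "doc:1": np.array([0.9, 0.1]),
    "doc:2": np.array([0.1, 0.9]),
    "doc:3": np.array([0.7, 0.3]),
}

def graph_filtered_search(entity, query_vec, k=2):
    """Filter candidates through the graph, then rank the survivors by cosine similarity."""
    candidates = [n for n in g.neighbors(entity) if n.startswith("doc:")]
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates, key=lambda d: cosine(doc_vectors[d], query_vec), reverse=True)[:k]

print(graph_filtered_search("entity:acme", np.array([1.0, 0.0])))
```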

    Notable Quotes:

    "Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
    "Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
    "[For RAG]...you have to find constraints to make it workable."
    "Entity resolution, deduping, I think is probably the number one thing."
    "I've essentially built a connector infrastructure that would be like a FiveTran or something that Airflow would have..."
    "One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
    "Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."

    Kirk Marple:

    LinkedIn X (Twitter) Graphlit Graphlit Docs

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Chapters

    00:00 Graphlit's Hybrid Approach
    02:23 Use Cases and Transition to Graphlit
    04:19 Knowledge Graphs as a Filtering Mechanism
    13:23 Using Gremlin for Querying the Graph
    32:36 XML in Prompts for Better Segmentation
    35:04 The Future of LLMs and Graphlit
    36:25 Getting Started with Graphlit

    Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal

  • From Problem to Requirements to Architecture.

    In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They explore the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.

    "Don't overcomplicate what you're actually doing." "Getting your basic programming software development skills down is super important to becoming a good data engineer." "Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."

    Key Takeaways:

    Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
    Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
    Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
    Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management.
    The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
    The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.
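    Software-defined assets in Dagster look roughly like this: you declare what data should exist as decorated functions, and dependencies are wired through parameter names. A minimal sketch with invented assets follows; it is illustrative, not the setup discussed in the episode.

```python
from dagster import asset, materialize

@asset
def raw_orders():
    # declares that this dataset should exist, not when or how to schedule it
    return [{"order_id": 1, "amount": 19.99}, {"order_id": 2, "amount": 5.00}]

@asset
def daily_revenue(raw_orders):
    # the parameter name wires the dependency on the upstream asset
    return sum(order["amount"] for order in raw_orders)

if __name__ == "__main__":
    result = materialize([raw_orders, daily_revenue])
    print(result.success)
```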

    Jon Erik Kemi Warghed:

    LinkedIn

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Chapters

    00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords

    00:57 How to Choose the Right Tools: Considerations for startups and large companies

    03:13 Evaluating Open Source Tools: Background checks and due diligence

    07:52 Defining Data Governance: Transparency and understanding of data

    10:15 The Importance of Data Governance: Challenges and solutions

    12:21 Data Governance Tools: dbt and Dagster

    17:05 The Impact of Dagster: Software-defined assets and declarative thinking

    19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage

    21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management

    26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines

    28:47 The Importance of Tool Selection: Thinking about long-term sustainability

    31:10 When to Adopt Orchestration: Identifying the need for orchestration tools

  • In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They also explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project. They touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They also discuss data residency considerations and the future of orchestration tools.

    Sound Bites

    "The modern era, definitely airflow. Took the market share, a lot of people running it themselves."
    "It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
    "The DAG introduced two other components. It's directed acyclic graph is what DAG means, but direct is like there's a start and there's a finish and the acyclic is there's no loops."

    Key Topics

    The evolution of data orchestration: From basic task scheduling to complex DAG-based solutions
    What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines.
    The crowded market: A look at popular options like Airflow, Dagster, Prefect, and more.
    Best practices: Choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
    Data residency and GDPR: How regulations influence tool selection, especially in Europe.
    Future of the field: The need for consolidation and finding the right balance between features and usability.
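    The DAG definition from the sound bites (directed: a start and a finish; acyclic: no loops) maps directly onto how orchestrators are configured. Below is a minimal Airflow-style sketch with invented task names, assuming a recent Airflow 2.x API.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")

def load():
    print("writing data to the warehouse")

with DAG(
    dag_id="minimal_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # directed: extract runs before load; acyclic: nothing can depend back on itself
    extract_task >> load_task
```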

    John Wessel:

    LinkedIn Data Stack Show Agreeable Data

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI.

    Chapters

    00:00 Introduction and Overview

    00:34 The Evolution of Data Orchestration Tools

    04:54 Components and Flow of Data in Orchestrators

    08:24 Deployment Options: Serverless vs. Kubernetes

    11:14 Considerations for Data Residency and Security

    13:02 The Need for a Clear Winner in the Orchestration Space

    20:47 Optimization Techniques for Memory and Time-Limited Issues

    23:09 Integrating Orchestrators with Infrastructure-as-Code

    24:33 Bridging the Gap Between Data and Engineering Practices

    27:22 Exciting Technologies Outside of Data Orchestration

    30:09 The Feature of Dagster

  • In this episode of "How AI is Built", we learn how to build and evaluate real-world language model applications with Shahul and Jithin, creators of Ragas. Ragas is a powerful open-source library that helps developers test, evaluate, and fine-tune Retrieval Augmented Generation (RAG) applications, streamlining their path to production readiness.

    Main Insights

    Challenges of Open-Source Models: Open-source large language models (LLMs) can be powerful tools, but require significant post-training optimization for specific use cases.
    Evaluation Before Deployment: Thorough testing and evaluation are key to preventing unexpected behaviors and hallucinations in deployed RAGs. Ragas offers metrics and synthetic data generation to support this process.
    Data is Key: The quality and distribution of data used to train and evaluate LLMs dramatically impact their performance. Ragas is enabling novel synthetic data generation techniques to make this process more effective and cost-efficient.
    RAG Evolution: Techniques for improving RAGs are continuously evolving. Developers must be prepared to experiment and keep up with the latest advancements in chunk embedding, query transformation, and model alignment.

    Practical Takeaways

    Start with a solid testing strategy: Before launching, define the quality metrics aligned with your RAG's purpose. Ragas helps in this process.
    Embrace synthetic data: Manually creating test data sets is time-consuming. Tools within Ragas help automate the creation of synthetic data to mirror real-world use cases.
    RAGs are iterative: Be prepared for continuous improvement as better techniques and models emerge.
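    A minimal sketch of scoring a tiny hand-built sample with Ragas, assuming the evaluate API and metric names of recent ragas releases (an LLM provider key, e.g. OpenAI, is needed because several metrics use an LLM judge). The example rows are invented.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

eval_set = Dataset.from_dict({
    "question": ["How long do refunds take?"],
    "answer": ["Refunds are processed within 14 days."],
    "contexts": [["Refunds are processed within 14 days of receiving the returned item."]],
    "ground_truth": ["Refunds take up to 14 days."],
})

# each metric targets a different failure mode: hallucination, off-topic answers, noisy retrieval
result = evaluate(eval_set, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```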

    Interesting Quotes

    "...models are very stochastic and grading it directly would rather trigger them to give some random number..." - Shahul, on the dangers of naive model evaluation.
    "Reducing the developer time in acquiring these test data sets by 90%." - Shahul, on the efficiency gains of Ragas' synthetic data generation.
    "We want to ensure maximum diversity..." - Shahul, on creating realistic and challenging test data for RAG evaluation.

    Ragas:

    Web Docs

    Jithin James:

    LinkedIn

    Shahul ES:

    LinkedIn X (Twitter)

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    00:00 Introduction

    02:03 Introduction to Open Assistant project

    04:05 Creating Customizable and Fine-Tunable Models

    06:07 Ragas and the LLM Use Case

    08:09 Introduction to Language Model Metrics (LLMs)

    11:12 Reducing the Cost of Data Generation

    13:19 Evaluation of Components at Melvess

    15:40 Combining Ragas Metrics with AutoML Providers

    20:08 Improving Performance with Fine-tuning and Reranking

    22:56 End-to-End Metrics and Component-Specific Metrics

    25:14 The Importance of Deep Knowledge and Understanding

    25:53 Robustness vs Optimization

    26:32 Challenges of Evaluating Models

    27:18 Creating a Dream Tech Stack

    27:47 The Future Roadmap for Ragas

    28:02 Doubling Down on Grid Data Generation

    28:12 Open-Source Models and Expanded Support

    28:20 More Metrics for Different Applications

    RAG, Ragas, LLM, Evaluation, Synthetic Data, Open-Source, Language Model Applications, Testing.

  • In this episode of Changelog, Weston Pace dives into the latest updates to LanceDB, an open-source vector database and file format. Lance's new V2 file format redefines the traditional notion of columnar storage, allowing for more efficient handling of large multimodal datasets like images and embeddings. Weston discusses the goals driving LanceDB's development, including null value support, multimodal data handling, and finding an optimal balance for search performance.

    Sound Bites

    "A little bit more power to actually just try."
    "We're becoming a little bit more feature complete with returns of arrow."
    "Weird data representations that are actually really optimized for your use case."

    Key Points

    Weston introduces LanceDB, an open-source multimodal vector database and file format.
    The goals behind LanceDB's design: handling null values, multimodal data, and finding the right balance between point lookups and full dataset scan performance.
    Lance V2 File Format: Potential Use Cases

    Conversation Highlights

    On the benefits of Arrow integration: Strengthening the connection with the Arrow data ecosystem for seamless data handling.
    Why "columnar container format"?: A broader definition than "table format" to encompass more unconventional use cases.
    Tackling multimodal data: How LanceDB V2 enables storage of large multimodal data efficiently and without needing tons of memory.
    Python's role in encoding experimentation: Providing a way to rapidly prototype custom encodings and plug them into LanceDB.
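    For orientation, here is a minimal sketch of working with LanceDB from Python, assuming the lancedb client API (connect, create_table, search); the data, filter, and path are invented for the example.

```python
import lancedb

db = lancedb.connect("./lancedb")   # a directory of Lance files on disk or object storage

table = db.create_table(
    "docs",
    data=[
        {"vector": [0.9, 0.1], "text": "columnar file formats", "source": "blog"},
        {"vector": [0.1, 0.9], "text": "cute cat pictures", "source": "images"},
    ],
)

# nearest-neighbour search combined with a metadata filter, returned as a pandas DataFrame
hits = table.search([1.0, 0.0]).where("source = 'blog'").limit(1).to_pandas()
print(hits)
```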

    LanceDB:

    X (Twitter) GitHub Web Discord VectorDB Recipes Lance V2

    Weston Pace:

    LinkedIn GitHub

    Nicolay Gerold:

    ⁠LinkedIn⁠ ⁠X (Twitter)

    Chapters

    00:00 Introducing Lance: A New File Format

    06:46 Enabling Custom Encodings in Lance

    11:51 Exploring the Relationship Between Lance and Arrow

    20:04 New Chapter

    Lance file format, nulls, round-tripping data, optimized data representations, full-text search, encodings, downsides, multimodal data, compression, point lookups, full scan performance, non-contiguous columns, custom encodings

  • Had a fantastic conversation with Christopher Williams, Solutions Architect at Supabase, about setting up Postgres the right way for AI. We dug deep into Supabase, exploring:

    Core components and how they power real-time AI solutions
    Optimizing Postgres for AI workloads
    The magic of PG Vector and other key extensions
    Supabase's future and exciting new features
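    A minimal sketch of the pgvector workflow discussed here, using psycopg2 against a Postgres instance where the vector extension is available; the DSN, table, and vectors are invented for the example.

```python
import psycopg2

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(3)
    );
""")
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("supabase realtime docs", "[0.1, 0.9, 0.2]"),
)
# <=> is pgvector's cosine distance operator
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s LIMIT 3",
    ("[0.2, 0.8, 0.1]",),
)
print(cur.fetchall())
```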
