Episodes
-
An AWS engineer discovered a 50% regression in Postgres throughput while testing the new Linux 7.0 kernel. The cause turned out to be a large number of TLB misses and page faults, exacerbated by Postgres's process-based design. In this episode of the backend engineering show I dive deep into how this was discovered, the root cause, and the possible fixes and workarounds.
Intermediate and Advanced Backend Engineering Course Bundle
https://courses.husseinnasser.com/bundle
My Book, Root Cause: Stories and Lessons from Two Decades of Backend Engineering Bugs
https://amzn.to/4cKfZhe
0:00 Intro
2:30 The Discovery
6:30 Spinlocks
9:25 Preemption
13:00 Root Cause
17:00 How Postgres Processes exacerbated the problem
22:30 Is the fix easy?
25:50 Summary
-
A discussion about why many engineers still love the struggle, the mistakes, and the process of figuring things out themselves. This is how we grow and get better and stronger. Letting AI do everything (even though it can’t) robs us of this feeling.
-
I wrote a new book that has been in the works for years. It is called Root Cause, and it is for those who enjoy the art of backend engineering.
Early in my career, 20 years ago, I built backend and database applications without fully grasping their inner mechanics. Performance issues, race conditions, bugs, and even data corruption often left me lost.
Since that day, I resolved to truly understand how systems work. From networking protocols and intermediary proxies to backend services and various database engines. I made it a habit to follow every request on its journey through the dark alleys of the network, down to the bowels of the database engine, meanwhile interacting with various kernel data structures in the process at every hop, and back.
I became obsessed with understanding what happens behind the scenes in software. Not just what breaks and how, but also why, and what was the source of the bleed.
Root Cause is a collection of the most interesting bugs I encountered, ranging from performance bottlenecks and non-deterministic crashes to subtle data inconsistencies and incorrect results.
This book is for anyone curious about how production backend systems really behave under pressure, and how to debug them when they don’t. Even when you don’t have access to the source code.
Root Cause consists of 15 chapters. Each is a story about a backend bug, with an investigation, diagrams, and a section on a fundamental concept, until the root cause is revealed.
Grab your copy here, in paperback or Kindle ebook:
paperback
https://amzn.to/4cKfZhe
ebook
https://amzn.to/4cfQjJj
-
Page tables provide the mapping between virtual memory and physical memory for each process. This means they need to be as efficient and as fast as possible. I explore the inner workings of page tables in this episode.
0:00 Intro
2:00 Virtual Memory
8:00 MMU
10:00 Page Tables
11:30 Single Table Byte Addressability
16:00 Single Table Page addressability
19:00 Multi-level Paging (Radix tree)
31:00 Huge Tables
33:00 TLB
Summary
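To make the multi-level paging idea concrete, here is a minimal sketch of a radix-tree lookup. It assumes the common x86-64 layout with 4 KiB pages: a 48-bit virtual address split into four 9-bit indexes plus a 12-bit page offset. The nested dicts and the names `split_virtual_address`, `map_page` and `walk` are made up for illustration and are not how the kernel stores page tables.

```python
from typing import List, Optional, Tuple

PAGE_OFFSET_BITS = 12  # 4 KiB pages (assumption)
INDEX_BITS = 9         # 512 entries per table (assumption)
LEVELS = 4             # four-level paging (assumption)


def split_virtual_address(vaddr: int) -> Tuple[List[int], int]:
    """Return ([top-level index, ..., leaf index], offset within the page)."""
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = vaddr >> PAGE_OFFSET_BITS  # virtual page number
    indexes = []
    for level in range(LEVELS):
        shift = INDEX_BITS * (LEVELS - 1 - level)
        indexes.append((vpn >> shift) & ((1 << INDEX_BITS) - 1))
    return indexes, offset


def map_page(root: dict, vaddr: int, frame: int) -> None:
    """Install a mapping, creating intermediate tables only when needed."""
    indexes, _ = split_virtual_address(vaddr)
    node = root
    for index in indexes[:-1]:
        node = node.setdefault(index, {})
    node[indexes[-1]] = frame  # the leaf entry holds the physical frame number


def walk(root: dict, vaddr: int) -> Optional[int]:
    """Translate a virtual address; None means the page is not mapped."""
    indexes, offset = split_virtual_address(vaddr)
    node = root
    for index in indexes:
        node = node.get(index)
        if node is None:
            return None  # real hardware would raise a page fault here
    return (node << PAGE_OFFSET_BITS) | offset
```

Because intermediate tables are created lazily, a sparse address space only pays for the branches it actually uses, which is the main reason a radix tree beats one flat table.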
-
A page fault occurs when a process tries to access memory that isn’t backed by a physical page. The CPU raises a fault and the kernel handles it, typically by loading a page. It happens on first access, stack expansion, CoW, swap and much more. However, it comes with a cost.
In this episode of the backend engineering show I dissect the need for page faults in the kernel and what they cost.
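If you want to see first-access faults for yourself, here is a small sketch that maps anonymous memory, writes one byte to each page and reads the process's minor-fault counter before and after. It assumes a Unix-like system where `resource.getrusage` reports `ru_minflt`. The helper names are made up, and the exact numbers will vary with the kernel, huge page settings and the Python runtime.

```python
import mmap
import resource


def minor_faults() -> int:
    """Minor page faults charged to this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def touch_pages(num_pages: int) -> int:
    """Map anonymous memory, write one byte per page, return the fault delta."""
    page = mmap.PAGESIZE
    # Anonymous mapping: usually no physical pages are assigned until first touch.
    region = mmap.mmap(-1, num_pages * page)
    before = minor_faults()
    for i in range(num_pages):
        region[i * page] = 1  # first write to a page is what can trigger the fault
    after = minor_faults()
    region.close()
    return after - before


if __name__ == "__main__":
    print("minor faults while touching 1024 pages:", touch_pages(1024))
```

On a typical Linux box the delta is roughly one fault per touched page, but treat that as something to measure rather than a guarantee.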
0:00 Intro
4:00 Virtual memory
Abstraction of physical memory
Memory sharing
Allow more processes to run, unused go to disk
NUMA, kernel can place memory near the CPU
12:00 VMA areas
Text/code
Data
BSS
Heap
Stack
19:50 Kernel mode
25:30 What is a Page fault?
30:30 First access page fault
33:00 Stack Expansion page fault
34:30 CoW page fault
38:00 Swap page fault
39:39 File backed page fault
40:29 Permission page fault
45:30 Summary
-
On October 19, 2025, AWS experienced an outage that lasted over a day. Ten days later we finally got the root cause analysis, and we now know exactly what caused the DNS to fail.
0:00 Summary
5:30 How did Dynamo lose its DNS?
13:41 EC2 Errors
16:16 Network Load Balancer Errors
RCA here https://aws.amazon.com/message/101925/
-
There are cases where the backend may need to close the connection to prevent unexpected situations, prevent bad actors or simply just free up resources. Closing a connection gracefully allows clients and backends to clean up and finish any pending requests.
In this episode of the backend engineering show I discuss graceful connection shutdown in both HTTP/1.1, via the Connection header, and HTTP/2, via the GOAWAY frame.
0:00 Intro
4:58 Why shutdown connection?
6:46 HTTP/1.1 Graceful shutdown
12:26 Cost of HTTP/2
17:40 HTTP/2 GOAWAY frame
23:40 Summary
Links
https://www.youtube.com/watch?v=fVKPrDrEwTI&t=1s
https://chromium.googlesource.com/chromium/src/net/%2B/master/socket/client_socket_pool_manager.cc#76
https://issues.chromium.org/issues/40555364
https://issues.chromium.org/issues/40501721
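As a rough illustration of the two mechanisms, the sketch below builds the bytes a server might send when it wants to drain a connection: an HTTP/1.1 response carrying `Connection: close`, and an HTTP/2 GOAWAY frame (type 0x7, sent on stream 0, carrying the last stream ID and an error code). The function names and the `draining` flag are made up for the example. A real server would do this through its HTTP library rather than by hand.

```python
import struct

GOAWAY_TYPE = 0x7  # HTTP/2 frame type for GOAWAY
NO_ERROR = 0x0     # error code used for a graceful shutdown


def http1_response(body: bytes, draining: bool) -> bytes:
    """Build an HTTP/1.1 response; add 'Connection: close' when draining."""
    headers = [
        b"HTTP/1.1 200 OK",
        b"Content-Length: " + str(len(body)).encode(),
    ]
    if draining:
        # Tells the client this is the last response on this connection.
        headers.append(b"Connection: close")
    return b"\r\n".join(headers) + b"\r\n\r\n" + body


def http2_goaway(last_stream_id: int, error_code: int = NO_ERROR,
                 debug: bytes = b"") -> bytes:
    """Encode a GOAWAY frame: 9-byte frame header followed by the payload."""
    # Payload: reserved bit + 31-bit last stream ID, 32-bit error code, debug data.
    payload = struct.pack(">II", last_stream_id & 0x7FFFFFFF, error_code) + debug
    length = len(payload)
    # Header: 24-bit length, 8-bit type, 8-bit flags, 32-bit stream ID (always 0).
    header = struct.pack(">BHBBI", length >> 16, length & 0xFFFF, GOAWAY_TYPE, 0, 0)
    return header + payload
```

The last stream ID is what makes GOAWAY graceful: the client knows which in-flight requests were accepted and which ones it can safely retry on a new connection.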
-
Postgres 18 has been released with many exciting features such as UUIDv7, the pg_overexplain module, composite index skip scans, and the most anticipated one, asynchronous IO with the worker and io_uring modes, which I uncover in this show. Hope you enjoy it.
0:00 Intro
1:30 Synchronous vs Asynchronous calls
3:00 Synchronous IO
6:30 Asynchronous IO
10:00 Postgres 17 synchronous io
17:20 The challenge of Async IO in Postgres 18
20:00 io_method worker
23:00 io_method io_uring
29:30 io_method sync
31:08 Async IO isn’t done!
31:30 Support for backend writers
32:36 Improve worker io_method
33:00 direct io support
37:00 Summary
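To get a feel for the difference between synchronous reads and handing reads to workers, here is a loose analogy in Python. It is not Postgres code. A thread pool stands in for the IO workers, `os.pread` stands in for the block read, and the helper names are invented for the sketch. The only Postgres-specific fact used is the default 8 kB block size.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192  # Postgres's default block size


def make_test_file(num_blocks: int) -> str:
    """Create a temp file where block n is filled with the byte value n % 256."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for n in range(num_blocks):
            f.write(bytes([n % 256]) * BLOCK_SIZE)
    return path


def read_blocks_sync(fd: int, block_numbers: list) -> list:
    """One blocking pread at a time; the caller waits for each read to finish."""
    return [os.pread(fd, BLOCK_SIZE, n * BLOCK_SIZE) for n in block_numbers]


def read_blocks_with_workers(fd: int, block_numbers: list, workers: int = 3) -> list:
    """Submit every read up front so several can be in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(os.pread, fd, BLOCK_SIZE, n * BLOCK_SIZE)
            for n in block_numbers
        ]
        # Results are collected in submission order, like the sync version.
        return [f.result() for f in futures]
```

Both functions return the same blocks. What changes is who waits: in the second version the caller queues all the reads before blocking on the first result.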
-
Fundamentals of Operating Systems Course
https://oscourse.win
ktls is brilliant.
TLS encryption/decryption often happens in userland, while TCP lives in the kernel. With ktls, userland can hand the keys to the kernel and the kernel does the crypto. When calling write, the kernel encrypts the packet and sends it to the NIC. When calling read, the kernel decrypts the packet and hands it to userspace. This mode still taxes the host’s CPU of course, so there is another mode where the kernel offloads the crypto to the NIC device! The host CPU becomes free. Incoming packets to the NIC are decrypted in the device before they are DMAed to the kernel. Outgoing packets are encrypted before they leave the NIC for the network.
ktls still needs the handshake to happen in userspace. It also makes zero copy possible in some cases (now that the kernel has context). Deserves a video. So much good stuff.
0:00 Intro
2:00 Userspace SSL Libraries
3:00 ktls
6:00 Kernel Encrypts/Decrypts (TLS_SW)
8:20 NIC offload mode (TLS_HW)
10:15 NIC does it all (TLS_HW_RECORD)
12:00 Write TX Example
13:50 Read RX Example
17:00 Zero copy (sendfile)
https://docs.kernel.org/networking/tls-offload.html
-
If you are bored of contemporary topics of AI and need a breather, I invite you to join me to explore a mundane, fundamental and earthy topic.
The CPU.
A reading of my substack article https://hnasr.substack.com/p/the-beauty-of-the-cpu
-
This new PostgreSQL 17 feature is a game changer. They can now combine IOs when performing a sequential scan.
Grab my database course
https://courses.husseinnasser.com
-
No technical video today, just talking about the idea of discipline and consistency.
-
Fundamentals of Operating Systems Course
This video is an overview of how the operating system kernel does socket management and the different data structures it utilizes to achieve that.
timestamps
0:00 Intro
1:38 Socket vs Connections
7:50 SYN and Accept Queue
18:56 Socket Sharding
23:14 Receive and Send buffers
27:00 Summary
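Here is a small sketch that pokes at the pieces mentioned above from userland: the listen backlog (which bounds the accept queue), socket sharding through `SO_REUSEPORT`, and the per-socket receive and send buffers. It assumes a platform such as Linux where `SO_REUSEPORT` is available. The function names are made up for the example.

```python
import socket


def make_listener(port: int, backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket that allows other listeners on the same port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Socket sharding: several listening sockets can bind the same address/port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    # The backlog is a hint for how many completed connections may wait in the
    # accept queue; the kernel may cap it with its own limit.
    sock.listen(backlog)
    return sock


def buffer_sizes(sock: socket.socket) -> tuple:
    """Return (receive buffer size, send buffer size) in bytes for this socket."""
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return rcv, snd


if __name__ == "__main__":
    first = make_listener(0)              # port 0: let the kernel pick a free port
    port = first.getsockname()[1]
    second = make_listener(port)          # second listener on the same port
    print("shared port:", port, "buffers:", buffer_sizes(first))
    first.close()
    second.close()
```

With two listeners on one port, each socket gets its own accept queue and the kernel spreads incoming connections across them, so processes do not have to contend on a single queue.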
-
Polling is the ability to interrogate a backend to see if a piece of information is ready. It can introduce a chatty system, and as a result long polling was born. In this video I explain the beauty of this design pattern and how we can push it to its limit.
0:00 Intro
0:45 Polling
2:30 Problem with Polling
3:50 Long Polling
8:18 Timeouts
10:00 Long Polling Benefits
12:00 Make requests into Long Polling
17:36 Request Resumption
21:40 Summary
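The core of long polling can be sketched in a few lines: instead of answering "nothing yet" right away, the backend parks the request until data shows up or a timeout expires. The sketch below is an in-process stand-in with no HTTP involved. The `LongPollTopic` class and its `cursor` parameter are invented for illustration.

```python
import threading


class LongPollTopic:
    """In-memory message list whose poll() blocks until news or timeout."""

    def __init__(self):
        self._cond = threading.Condition()
        self._messages = []

    def publish(self, message) -> None:
        with self._cond:
            self._messages.append(message)
            self._cond.notify_all()  # wake every parked poller

    def poll(self, cursor: int, timeout: float):
        """Wait for messages past `cursor`; return (new messages, new cursor)."""
        with self._cond:
            # Parks the caller instead of returning "nothing yet" immediately.
            self._cond.wait_for(lambda: len(self._messages) > cursor, timeout=timeout)
            new = self._messages[cursor:]
            return new, cursor + len(new)
```

A client loops on `poll`, passing back the cursor it got last time. An empty result just means the timeout fired and it should ask again, which is far less chatty than asking every few milliseconds.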
-
You get better as a software engineer when you go through these stages.
0:00 Intro
1:15 Understand a technology
7:07 Articulate how it works
15:30 Understand its limitations
19:48 Try to build something better
27:45 Realize what you built also has limitations
32:48 Appreciate the original tech as is
Understand a technology
We use technologies all the time without knowing how they work. And it is OK not to know how things work if the interest isn’t there. But when you are interested in understanding how something works, pursue it. It feels good when you understand how something works because you work better with it; you swim with the tide instead of against it.
When I learned how TCP/IP works, I started to appreciate every connection request and how requests are read. You will ask questions:
What is my code doing here?
When exactly am I creating connections?
When am I reading from the connection?
Is it safe to share connections?
Articulate how it works
This one is not easy. You might think you understand something until you try to explain how it works. If you find yourself using jargon, you probably don’t understand it and are just trying to impress others. Have you seen people who talk about something just to show they understand it? It’s the opposite. Try to truly articulate how it works and you will really understand it, which takes you back to stage 1.
I thought I understood how a backend reads requests until I tried to speak about it.
Understand the technology limitations
Once stages 1 and 2 are done you will truly understand the tech. Now you are confident and excited about it. You will see where you can use the tech to its full potential, and you will also know its weak points, where it breaks. This happens a lot with TCP/IP. We know TCP’s limitations.
Try to build something better
This one is optional and can be skipped, but attempting to design or build something better than the tech, because you know its limitations, will truly reveal how much better you have become. The challenge here is the ego. You might understand the limitations, but your problem is thinking that what you will build is flawless. Proceed with this step with caution.
Realize what you built also has limitations
Dust settles. This step hurts, and it may take you a while to realize it, but whatever you build will have flaws. When you realize this, that is when you get better as an engineer.
Appreciate the tech as is
This is when you come full circle, back to the first stage. Look at the technology and understand it, but don’t judge it. Just know its limitations and its strengths and flow with it. Stop fighting and instead build around the tech. Does that mean you shouldn’t build anything new? Of course not. Go build, but don’t stress about making something better to defeat existing tech. Build it for the sake of building it.
-
Fundamentals of Operating Systems Course
https://oscourse.win
Very clever! We often call the read/recv system calls to read requests from a connection. This copies data from the kernel receive buffer to user space, which has a cost. This new patch changes this to allow zero copy with notification.
"'Reading' data out of a socket instead becomes a 'notification' mechanism, where the kernel tells userspace where the data is."
This kernel patch enables zero copy from the receive queue.
https://lore.kernel.org/io-uring/ZwW7_cRr_UpbEC-X@LQ3V64L9R2/T/
0:00 Intro
1:30 patch summary
7:00 Normal Connection Read (Kernel Copy)
12:40 Zero copy Read
15:30 Performance
-
Cloudflare built a global cache purge system that runs under 150 ms.
This is how they did it.
Using RocksDB to maintain the local CDN cache, a peer-to-peer distributed system across data centers, and clever engineering, they went from a 1.5 second purge down to 150 ms.
However, this isn’t the full picture, because that 150 ms is actually just the P50. In this video I explore how Cloudflare’s CDN works, and how the old core-based, centralized Quicksilver lazy purge compares to the new coreless, decentralized active purge. I explore the pros and cons of both systems and give you my thoughts on the new system.
0:00 Intro
4:25 From Core Base Lazy Purge to Coreless Active
12:50 CDN Basics
16:00 TTL Freshness
17:50 Purge
20:00 Core-Based Purge
24:00 Flexible Purges
26:36 Lazy Purge
30:00 Old Purge System Limitations
36:00 Coreless / Active Purge
39:00 LSM vs BTree
45:30 LSM Performance issues
48:00 How Active Purge Works
50:30 My thoughts about the new system
58:30 Summary
Cloudflare blog
https://blog.cloudflare.com/instant-purge/
Mentioned Videos
Percentile Tail Latency Explained (95%, 99%) Monitor Backend performance with this metric
https://www.youtube.com/watch?v=3JdQOExKtUY
How Discord Stores Trillions of Messages | Deep Dive
https://www.youtube.com/watch?v=xynXjChKkJc
Fundamentals of Operating Systems Course
https://os.husseinnasser.com
Backend Troubleshooting Course
https://performance.husseinnasser.com
-
Fundamentals of Database Engineering Udemy course
https://databases.win
MySQL has been having a bumpy journey since 2018 with the release of version 8.0: critical crashes that made it into the final product, significant performance regressions, and tons of stability issues and bugs. In this video I explore what happened to MySQL, whether these issues are getting fixed, and what the current state of MySQL is at the end of 2024.
0:00 Intro
2:00 MySQL 8.0 vs 5.7 Performance
11:00 Critical Crash in 8.0.38, 8.4.1 and 9.0.0
15:40 Is 8.4 better than 8.0.36?
16:30 More Features = More Bugs
22:30 Summary and my thoughts
Resources
https://x.com/MarkCallaghanDB/status/1786428909376164263
https://www.percona.com/blog/do-not-upgrade-to-any-version-of-mysql-after-8-0-37/
http://smalldatum.blogspot.com/2024/09/mysql-innodb-vs-sysbench-on-large-server.html
https://www.percona.com/blog/mysql-8-0-vs-5-7-are-the-newer-versions-more-problematic/
-
Fundamentals of Operating Systems Course
https://oscourse.win
In this video I use strace, a tool that traces the system calls a process makes, to count how many of them a program issues. We compare a simple task, reading from a file, and run the program in different runtimes, namely Node.js, Bun, Python and native C. We discuss the cost of kernel mode switches, system calls and pe
0:00 Intro
5:00 Code Explanation
6:30 Python
9:30 NodeJS
12:30 BunJS
13:12 C
16:00 Summary
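If you want to reproduce the idea on your own machine, here is a minimal sketch, not the code from the video. It reads a file with raw `os.read` calls and counts them, so you can see how the chunk size drives the number of read system calls. The helper names are made up.

```python
import os
import tempfile


def make_file(size: int) -> str:
    """Create a temp file of `size` bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size)
    return path


def count_read_calls(path: str, chunk_size: int) -> int:
    """Read the whole file with os.read and count how many calls were issued."""
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    try:
        while True:
            chunk = os.read(fd, chunk_size)  # each call enters the kernel once
            calls += 1
            if not chunk:  # an empty result signals end of file
                break
    finally:
        os.close(fd)
    return calls


if __name__ == "__main__":
    path = make_file(64 * 1024)
    for size in (512, 4096, 65536):
        print(size, "byte chunks ->", count_read_calls(path, size), "read calls")
    os.remove(path)
```

Running the script under `strace -c` prints a per-system-call summary, which also shows the extra calls the runtime itself makes on startup. That overhead is a big part of why the same task looks so different across runtimes.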