CloudFormation Drift Detection: AWS Config + Lambda Auto-Remediation

Episodes

Kubernetes Scheduler Extenders: Custom Placement Logic
15 Jul · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Learn how to write a Kubernetes scheduler extender webhook that restricts GPU pods to nodes with NVLink interconnects. This episode covers the extender contract, KubeSchedulerConfiguration registration, filter and prioritize endpoints, and the latency tradeoffs interviewers probe in senior SRE and platform engineering interviews.
Full interview prep guides and scenario walkthroughs: DevOpsInterview.Cloud
The Kubernetes Machine: From kubectl apply to Running Containers
12 Jul · DevOps & Cloud Interview Prep: Real Scenarios & Answers
What really happens when you run kubectl apply?
In Part 1 of this Kubernetes masterclass, we go far beyond basic definitions and trace how Kubernetes works as a distributed, API-driven control system.
You will learn how a YAML manifest moves through kubectl, the API server, authentication, authorization, admission, etcd, controllers, the scheduler, kubelet and the container runtime before finally becoming a running Pod.
This episode also explains the deeper ideas that make Kubernetes work:
- Desired state versus observed state
- Reconciliation loops
- spec versus status
- Watches and events
- Labels and selectors
- ReplicaSets and Deployments
- Scheduling decisions
- Pod lifecycle
- Owner references, finalizers and garbage collection
- Server-side apply and field ownership
By the end of this episode, you will be able to mentally replay the complete journey from user intent to a healthy running workload—and understand which component is responsible at every step.
Mental model:
Intent → Store → Observe → Reconcile
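The Intent → Store → Observe → Reconcile model can be sketched as a toy control loop. This is a minimal illustration and not Kubernetes code: the `Store` class, the `reconcile` function, and the replica counts are hypothetical stand-ins for stored desired state and a controller acting on it.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical stand-in for the API server plus etcd."""
    desired: dict = field(default_factory=dict)   # spec: what the user asked for
    observed: dict = field(default_factory=dict)  # status: what is actually running


def reconcile(store: Store) -> list:
    """Compare desired and observed state once and return the actions taken."""
    actions = []
    for name, want in store.desired.items():
        have = store.observed.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} pod(s) for {name}")
        elif have > want:
            actions.append(f"stop {have - want} pod(s) for {name}")
        store.observed[name] = want  # pretend the action succeeded
    # Anything observed but no longer desired is cleaned up.
    for name in list(store.observed):
        if name not in store.desired:
            actions.append(f"delete all pods for {name}")
            del store.observed[name]
    return actions


store = Store()
store.desired["web"] = 3   # Intent, written to the Store
print(reconcile(store))    # Observe and Reconcile: one "start" action
print(reconcile(store))    # Already converged: no actions
```

The second call returning an empty list is the point of the model: a reconciler is safe to run repeatedly because it acts only on the difference between desired and observed state.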
Cluster Autoscaler vs Karpenter: Choosing at 500 Nodes
8 Jul · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Most engineers assume Karpenter is always the right answer for Kubernetes node autoscaling, but at 500 nodes the tradeoffs around ASG lock-in, provisioner complexity, and migration risk get serious. This episode breaks down when to keep Cluster Autoscaler, when Karpenter wins, and how to articulate both sides clearly in a senior DevOps or SRE interview. Covers real configuration details, scaling latency numbers, and common wrong answers interviewers flag.
OOMKilled at Scale: Tuning JVM Heap in Kubernetes
5 Jul · DevOps & Cloud Interview Prep: Real Scenarios & Answers
A Java service keeps getting OOMKilled in Kubernetes even though memory requests look fine on paper. This episode explains why JVM heap defaults ignore container limits, how to set maximum heap size correctly, and what interviewers expect when they probe your understanding of Java memory in containerized environments. Covers Xmx flags, UseContainerSupport, native memory overhead, and the tradeoffs between requests and limits.
Karpenter Spot Interruption: Fallback & Graceful Drain
4 Jul · DevOps & Cloud Interview Prep: Real Scenarios & Answers
When AWS fires the 2-minute Spot reclaim notice, Karpenter's interruption queue is the difference between a blip and a batch job disaster — here's exactly how to configure it.
You'll learn:
- How to set karpenter.sh/capacity-type in a NodePool to prefer Spot with automatic On-Demand fallback
- The full interruption flow: SQS queue → cordon → graceful drain → pod rescheduling, all within the 2-minute window
- Why the order of values in the capacity-type array doesn't control selection: Karpenter uses price-capacity optimization
- When to use strict values: ['spot'] and what happens when capacity dries up
- Why Pod Disruption Budgets and gracefulTerminationPeriod are non-negotiable for fault-tolerant batch workloads
Keywords: Karpenter Spot interruption handling, Spot instance fallback on-demand, NodePool capacity type configuration, Kubernetes batch workload cost optimization, Spot 2-minute warning drain
🎧 Listen, then go deeper — DevOps & Cloud interview-prep ebooks at DevOpsInterview.Cloud
Canary Analysis for Flink Streaming: Prometheus, Loki & Pyroscope
4 Jul · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Automated canary analysis for a Flink-based streaming app is a common senior SRE interview scenario — here's how to wire Prometheus, Loki, and Pyroscope into a production-grade rollout strategy.
You'll learn:
- How to define canary success criteria using Prometheus metrics like consumer lag, throughput, and error rate on Flink jobs
- Using Loki log queries to surface structured errors in canary vs. baseline deployments side-by-side
- Continuous profiling with Pyroscope to catch CPU or memory regressions in the new Flink version before full rollout
- How automated analysis gates work (failing fast vs. baking time) and how to articulate the tradeoff in an interview
- Stitching observability signals into a single canary decision: pass, fail, or inconclusive
Keywords: canary deployment Flink, automated canary analysis SRE, Prometheus Loki Pyroscope, streaming app observability, DevOps interview questions
Grafana Mimir Storage: Tiered S3 at 10TB/day
4 Jul · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Grafana Mimir storage at 10TB/day scale forces real trade-offs — here's how to configure tiered storage to S3 without bleeding cost or tanking query performance.
You'll learn:
- How Mimir's store-gateway and compactor interact with S3-backed object storage at high ingest volume
- Configuring blocks_storage with tiered retention: keeping hot blocks in fast storage while offloading cold blocks to S3 Glacier-compatible tiers
- Tuning compaction schedules and chunk caching (memcached) to reduce S3 GET costs under sustained 10TB/day ingest
- Common pitfalls: misconfigured bucket lifecycle policies, compactor overlap errors, and index cache misses killing query latency
- Sizing ruler and alertmanager storage separately so they don't contend with block storage I/O
Keywords: Grafana Mimir S3 storage, Mimir tiered storage config, Mimir compactor tuning, metrics storage at scale, Mimir blocks_storage
SLO Error Budget Burn Rate: Azure Zone Outage Math
24 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
If your service has a 99.99% SLO and Azure drops a zone for 15 minutes, here's exactly how to calculate the error budget burn rate before your next SRE interview.
You'll learn:
- How to derive total monthly error budget from a 99.99% SLO (~4.38 minutes/month)
- Why a 15-minute outage consumes roughly 3.4x your entire monthly budget, and how to show that math
- The burn rate formula interviewers expect: burn rate = error rate / (1 − SLO target)
- How fast vs. slow burn rates map to alerting windows in Google's SRE workbook approach
- Common gotchas: partial zone failures, dependency blame, and how to frame mitigation in your answer
Keywords: SLO error budget burn rate, Azure availability zone outage, SRE interview questions, error budget calculation, 99.99 SLO math
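The error budget arithmetic in this episode can be checked with a few lines of Python. This is a sketch under stated assumptions: a month is taken as one twelfth of an average year (about 43,830 minutes), and the outage is assumed to fail every request, so the error rate during the outage is 1.0. The function names are illustrative.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # average month, about 43,830 minutes


def error_budget_minutes(slo: float) -> float:
    """Allowed downtime per average month for a given availability SLO."""
    return MINUTES_PER_MONTH * (1 - slo)


def burn_rate(error_rate: float, slo: float) -> float:
    """burn rate = error rate / (1 - SLO target)."""
    return error_rate / (1 - slo)


slo = 0.9999
budget = error_budget_minutes(slo)
outage_minutes = 15

print(f"monthly budget: {budget:.2f} min")
print(f"budget consumed by outage: {outage_minutes / budget:.1f}x")
# Assumption: every request fails during the outage (error rate = 1.0).
print(f"burn rate during outage: {burn_rate(1.0, slo):,.0f}")
```

Under these assumptions the budget comes out at about 4.38 minutes and the 15-minute outage consumes about 3.4 times that budget, matching the figures quoted in the episode description. A partial zone failure would lower the error rate and therefore the burn rate.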
PCI-DSS Serverless Payments on GCP: Confidential VMs, CEKM & Binary Authorization
23 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Designing a PCI-DSS compliant serverless payments architecture on GCP means getting Confidential VMs, Cloud External Key Manager, and Binary Authorization working together — here's how to answer that in a senior interview.
You'll learn:
- How Confidential VMs provide hardware-level memory encryption to satisfy PCI-DSS data-in-use requirements
- Why Cloud External Key Manager (CEKM) lets you hold encryption keys outside GCP's control, and what that means for scope reduction
- How Binary Authorization enforces cryptographic attestation so only verified container images reach your payment workloads
- The serverless boundary decisions (Cloud Run vs bare GKE) that affect your Cardholder Data Environment scope
- Common interview gotchas around shared responsibility, audit logging with Cloud Audit Logs, and VPC Service Controls for perimeter defence
Keywords: PCI-DSS GCP architecture, Confidential VMs interview, Cloud External Key Manager, Binary Authorization Cloud Run, serverless payments compliance
Cross-Account EKS with AWS CDK: VPC Peering and Transit Gateway
23 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Deploying EKS clusters across AWS accounts with CDK is a common senior interview scenario — here's how to handle VPC peering, Transit Gateway attachments, and IAM trust policies correctly.
You'll learn:
- How to structure a multi-account CDK app using Stacks across environments with explicit env account/region targets
- When to use VPC peering vs Transit Gateway for cross-account EKS network connectivity, and the trade-offs at scale
- How to wire up Transit Gateway attachments and route table propagation so worker nodes can reach shared services
- Cross-account IAM role assumptions and EKS RBAC config required for cluster access from a management account
- Common CDK gotchas: bootstrap trust policies, asset S3 bucket permissions, and cross-account CFN execution roles
Keywords: cross-account EKS CDK, AWS Transit Gateway EKS, VPC peering Kubernetes, multi-account EKS architecture, AWS CDK EKS interview
OpenTelemetry + CloudWatch Logs Insights: Tracing Serverless Apps
21 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Correlating OpenTelemetry traces with CloudWatch Logs Insights across Lambda and Step Functions is a common senior interview scenario — here's exactly how to answer it.
You'll learn:
- How to propagate trace context (W3C TraceContext headers) across Lambda invocations and Step Functions state transitions so trace IDs land in your structured logs
- Configuring the AWS Distro for OpenTelemetry (ADOT) Lambda layer to auto-instrument functions without cold-start penalties
- Writing CloudWatch Logs Insights queries that join on trace_id to reconstruct an end-to-end execution timeline across services
- Where correlation breaks: async Step Functions callbacks, missing X-Amzn-Trace-Id propagation, and log sampling mismatches
- Trade-offs between ADOT, X-Ray native SDK, and a third-party collector like the OpenTelemetry Collector on Fargate
Keywords: OpenTelemetry Lambda tracing, CloudWatch Logs Insights trace correlation, ADOT Step Functions, serverless observability interview questions
Terraform State Splitting: terraform state rm + moved Blocks
21 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Splitting a monolithic 4GB Terraform state file into scoped microstates is one of the nastiest live-infrastructure challenges you'll face — here's how to do it without downtime using terraform state rm and moved blocks.
You'll learn:
- Why state files balloon past 4GB and why that breaks plan/apply performance
- How to use terraform state rm to surgically extract resources without destroying them
- Using moved blocks to re-home resources into child state backends cleanly
- Sequencing the migration to avoid drift, lock contention, and accidental deletes
- How to validate microstate integrity with terraform state list and targeted plans before cutting over
Keywords: terraform state splitting, terraform state rm, moved blocks terraform, monorepo to microstate migration, terraform refactor interview
Monorepo CI at Scale: Bazel Caching for 1,000 Microservices
20 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Designing a monorepo CI pipeline that doesn't collapse under 1,000 microservices means getting Bazel remote caching and selective test execution right from the start.
You'll learn:
- How to structure a monorepo CI pipeline so only affected services trigger builds, using Bazel's dependency graph to compute the minimal affected set
- Configuring Bazel remote caching (local cache, shared remote cache via gRPC or HTTP) to avoid rebuilding unchanged targets across parallel CI workers
- Selective testing strategies: combining bazel query with --build_event_stream to identify and run only impacted test targets
- Common failure modes at scale: cache poisoning, overly broad BUILD file dependencies, and flaky remote executor connections
- How to structure the CI orchestration layer (GitHub Actions, Buildkite, or Tekton) to fan out Bazel shards without thrashing the remote cache
Keywords: monorepo CI pipeline, Bazel remote caching, selective testing microservices, CI at scale DevOps interview, platform engineering build systems
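The idea of a minimal affected set can be illustrated with a toy reverse-dependency walk. This is not Bazel itself (in Bazel, the `rdeps` query function serves this purpose); the graph, the target names, and the `affected_targets` helper are hypothetical and exist only to show the traversal.

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: target -> targets it depends on.
DEPS = {
    "//svc/cart": ["//lib/pricing", "//lib/http"],
    "//svc/search": ["//lib/http"],
    "//svc/billing": ["//lib/pricing"],
    "//lib/pricing": ["//lib/money"],
    "//lib/http": [],
    "//lib/money": [],
}


def affected_targets(deps: dict, changed: set) -> set:
    """Return the changed targets plus everything that transitively depends on them."""
    # Invert the edges so we can walk from a dependency to its dependents.
    reverse = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            reverse[dep].add(target)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(affected_targets(DEPS, {"//lib/money"})))
```

In this toy graph a change to `//lib/money` reaches the pricing library and the two services built on it, while `//svc/search` is left alone. That gap is what lets a CI run skip unaffected services.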
Azure RBAC with Pulumi: Dynamic Roles from YAML
20 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Learn how to generate dynamic Azure RBAC role assignments using Pulumi with YAML-driven definitions — including tag-scoped conditions like restricting storage access to env:prod resources only.
You'll learn:
- How to define custom Azure RBAC roles in YAML and hydrate them through Pulumi's automation layer
- Using condition and conditionVersion fields in role assignments to enforce attribute-based access control (ABAC)
- Scoping storage permissions to resources matching specific tag key/value pairs at assignment time
- Structuring Pulumi component resources so YAML definitions stay DRY across multiple environments
- Common gotchas: condition syntax errors, propagation delays, and principal vs. scope mismatches
Keywords: Azure RBAC Pulumi, dynamic role assignments Azure, Pulumi YAML infrastructure, Azure ABAC tag conditions, custom RBAC roles interview
Prometheus Cardinality: Cutting 10M Series to 500K for Istio
17 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Taming Prometheus cardinality explosion in an Istio service mesh — dropping from 10 million to 500K active series using relabel_configs and recording rules — is exactly the kind of production war story senior SRE interviews dig into.
You'll learn:
- Why Istio telemetry generates cardinality explosions and which high-cardinality labels (source_workload, destination_service, pod IPs) are the usual culprits
- How to use metric_relabel_configs to drop or rewrite labels before series are ingested into TSDB storage
- Writing recording rules to pre-aggregate high-resolution Istio metrics into lower-cardinality rollups
- Using topk and cardinality analysis queries to identify which metrics are burning your series budget
- Trade-offs between dropping labels at scrape time versus aggregating at query time, and why interviewers care about the difference
Keywords: Prometheus cardinality, Istio metrics, relabel_configs, recording rules, TSDB series limit
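How pre-aggregation lowers cardinality can be seen in a toy model of what a recording rule built on `sum without (...)` does. This sketch does not talk to Prometheus: the samples, the `pod_ip` label name, and the `aggregate_without` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: (labels, value). Label names follow the episode's examples.
SAMPLES = [
    ({"source_workload": "web", "destination_service": "cart", "pod_ip": "10.0.0.1"}, 5.0),
    ({"source_workload": "web", "destination_service": "cart", "pod_ip": "10.0.0.2"}, 7.0),
    ({"source_workload": "web", "destination_service": "search", "pod_ip": "10.0.0.1"}, 2.0),
    ({"source_workload": "api", "destination_service": "cart", "pod_ip": "10.0.0.3"}, 1.0),
]


def aggregate_without(samples, dropped):
    """Sum values across the dropped labels, like `sum without (...)` in PromQL."""
    rollup = defaultdict(float)
    for labels, value in samples:
        # The remaining label set identifies the aggregated series.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        rollup[key] += value
    return dict(rollup)


rollup = aggregate_without(SAMPLES, {"pod_ip"})
print(f"series before: {len(SAMPLES)}, after: {len(rollup)}")
```

Each distinct label combination is one series, so removing a label with many values shrinks the series count by roughly the number of values it took per remaining combination. The cost is that per-pod detail is no longer queryable from the rollup.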
Conftest in Argo CD: Block Public S3 Buckets at GitOps Gate
17 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
A developer pushes a Terraform module with a public S3 bucket — here's exactly how to catch and block it in your Argo CD pipeline using Conftest policy-as-code before it ever reaches production.
You'll learn:
- How Conftest integrates with Argo CD as a pre-sync hook to enforce OPA policies on Terraform plans
- Writing a Rego rule that flags acl = public-read or block_public_acls = false on aws_s3_bucket resources
- Where in the GitOps workflow the gate fires, and why admission controllers alone aren't enough for IaC drift
- How to surface policy failures as Argo CD sync errors so engineers see the violation before merge, not after deploy
- Common gotchas: Terraform plan JSON output format, conftest namespace mismatches, and false positives on legacy modules
Keywords: Conftest Argo CD policy, OPA Terraform GitOps, block public S3 bucket IaC, GitOps security controls, Rego policy Terraform plan
Terragrunt at Scale: Dependency Graphs, Circular Deps & OCI Versioning
17 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Managing a Terragrunt dependency graph across 500+ modules without hitting circular dependencies or version drift is one of the hardest scaling problems in platform engineering.
You'll learn:
- How to map and audit a large Terragrunt dependency graph using terragrunt graph-dependencies and DAG visualisation tools
- Patterns for structuring module hierarchies to prevent circular dependencies before they reach CI
- Enforcing module versioning with OCI registries: why OCI beats Git tags at this scale
- How to segment a 500+ module monorepo into dependency tiers so targeted runs stay fast
- Common failure modes: implicit dependencies, missing mock_outputs, and run-all ordering bugs
Keywords: Terragrunt dependency graph, Terragrunt at scale, OCI module registry, circular dependencies Terraform, platform engineering IaC
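Catching a circular dependency before CI amounts to cycle detection on the module graph. The sketch below is a generic depth-first search, not Terragrunt's own implementation; the module names and the `find_cycle` helper are hypothetical, and the graph is assumed to have been extracted already into a plain dictionary.

```python
def find_cycle(deps: dict):
    """Return one dependency cycle as a list of modules, or None if the graph is a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {module: WHITE for module in deps}
    path = []

    def visit(module):
        color[module] = GREY
        path.append(module)
        for dep in deps.get(module, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # Back edge: dep is already on the current path, so we found a cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[module] = BLACK
        return None

    for module in deps:
        if color[module] == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


# Hypothetical module graphs: module -> modules it depends on.
MODULES = {"vpc": [], "eks": ["vpc"], "dns": ["eks"], "iam": ["eks", "dns"]}
CYCLIC = {"vpc": ["dns"], "eks": ["vpc"], "dns": ["eks"]}

print(find_cycle(MODULES))
print(find_cycle(CYCLIC))
```

Returning the offending path, rather than a plain yes or no, makes the failure actionable: the engineer sees which edge to cut.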
External Secrets Operator: Vault Dynamic Secrets in Kubernetes Without Sidecars
17 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
External Secrets Operator lets you sync HashiCorp Vault dynamic secrets directly into Kubernetes Secrets — no Vault Agent sidecars, no annotation sprawl.
You'll learn:
- How ESO's ExternalSecret and SecretStore CRDs map Vault paths to Kubernetes Secrets
- Why dynamic secrets (short-lived, auto-rotated) are preferable to static tokens and how ESO handles lease renewal
- The auth methods ESO supports for talking to Vault: Kubernetes auth vs. AppRole and when to use each
- Common failure modes: stale secrets after Vault seal, RBAC misconfigs, and refresh interval gotchas
- How to scope a ClusterSecretStore safely across namespaces without over-permissioning
Keywords: External Secrets Operator, HashiCorp Vault Kubernetes integration, dynamic secrets management, Vault sidecar alternative, Kubernetes secrets sync
Jenkins Helm Deadlocks: Diagnose with jstack and Mutex Locks
16 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Parallel Jenkins jobs deploying Helm charts can deadlock silently — here's how to catch and fix mutex contention before it kills your pipeline.
You'll learn:
- Why concurrent Helm deploys compete for the same release lock and how that surfaces as a deadlock in Jenkins
- How to run jstack against the Jenkins JVM to capture thread dumps and identify which threads are waiting on a monitor lock
- Reading mutex lock output to pinpoint the blocked executor and the thread holding it
- Helm-side mitigations: namespace isolation, --atomic flag behaviour, and serialising releases with lockfiles or pipeline lock() steps
- When to escalate from a workaround to a structural fix: separate agents, dedicated namespaces, or a Helm operator pattern
Keywords: Jenkins parallel jobs deadlock, Helm chart deployment lock, jstack thread dump Jenkins, mutex lock CI/CD pipeline, Jenkins pipeline concurrency
CloudFormation Drift Detection: AWS Config + Lambda Auto-Remediation
16 Jun · DevOps & Cloud Interview Prep: Real Scenarios & Answers
Learn how to enforce CloudFormation stack drift detection at scale using AWS Config rules and Lambda-driven auto-remediation — a common architecture question in senior Cloud and DevOps interviews.
You'll learn:
- How AWS Config detects configuration drift against CloudFormation expected stack states using managed and custom rules
- Wiring an EventBridge rule to trigger a Lambda function when Config flags a stack as DRIFTED
- Lambda remediation patterns: re-running cloudformation detect-stack-drift vs. forcing a stack update to reconcile out-of-band changes
- Gotchas around drift detection cost, IAM permissions for the Config recorder, and distinguishing intentional changes from real drift
- How to scope remediation safely: alerting vs. hard auto-rollback and when each is appropriate in production
Keywords: CloudFormation drift detection, AWS Config auto-remediation, Lambda CloudFormation remediation, IaC drift enforcement, AWS Config rules interview
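The "re-run drift detection" remediation pattern might look roughly like the sketch below. It assumes a client that behaves like `boto3.client("cloudformation")` and uses that client's `detect_stack_drift` and `describe_stack_drift_detection_status` calls. The `check_stack_drift` helper, the `FakeCloudFormation` stub, the stack name, and the polling parameters are hypothetical; the stub exists only so the example runs without AWS credentials. The EventBridge and AWS Config wiring is out of scope here.

```python
import time


def check_stack_drift(cfn, stack_name: str, poll_seconds: float = 2.0, max_polls: int = 30) -> str:
    """Start drift detection and return the stack drift status, e.g. DRIFTED or IN_SYNC.

    `cfn` is expected to behave like boto3.client("cloudformation").
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(StackDriftDetectionId=detection_id)
        if status["DetectionStatus"] == "DETECTION_COMPLETE":
            return status["StackDriftStatus"]
        if status["DetectionStatus"] == "DETECTION_FAILED":
            raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"drift detection for {stack_name} did not finish")


class FakeCloudFormation:
    """Hypothetical in-memory stub so the sketch runs without AWS credentials."""

    def __init__(self, final_status: str):
        self.final_status = final_status
        self.polls = 0

    def detect_stack_drift(self, StackName):
        return {"StackDriftDetectionId": "fake-detection-id"}

    def describe_stack_drift_detection_status(self, StackDriftDetectionId):
        self.polls += 1
        if self.polls < 2:
            return {"DetectionStatus": "DETECTION_IN_PROGRESS"}
        return {"DetectionStatus": "DETECTION_COMPLETE", "StackDriftStatus": self.final_status}


drift = check_stack_drift(FakeCloudFormation("DRIFTED"), "payments-stack", poll_seconds=0)
# Alert-only remediation: report the drift instead of forcing a stack update.
print("alert" if drift == "DRIFTED" else "no action")
```

Passing the client in as a parameter keeps the function testable and makes the alert-only path the default. A hard auto-rollback would be a separate, explicitly enabled branch.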