Episodes
-
If your service has a 99.99% SLO and Azure drops a zone for 15 minutes, here's exactly how to calculate the error budget burn rate before your next SRE interview.
You'll learn:
  - How to derive total monthly error budget from a 99.99% SLO (~4.38 minutes/month)
  - Why a 15-minute outage consumes roughly 3.4x your entire monthly budget, and how to show that math
  - The burn rate formula interviewers expect: burn rate = error rate / (1 - SLO target)
  - How fast vs. slow burn rates map to alerting windows in Google's SRE workbook approach
  - Common gotchas: partial zone failures, dependency blame, and how to frame mitigation in your answer
Keywords: SLO error budget burn rate, Azure availability zone outage, SRE interview questions, error budget calculation, 99.99 SLO math
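The arithmetic behind those numbers can be sketched in a few lines of Python. This is a minimal illustration, not the episode's own worked answer: the function names are hypothetical, and it assumes an average month of 30.44 days, which is what yields the ~4.38-minute figure.

```python
# Assumption: one "month" is an average month of 30.44 days.
MINUTES_PER_MONTH = 30.44 * 24 * 60  # 43,833.6 minutes


def error_budget_minutes(slo_target, window_minutes=MINUTES_PER_MONTH):
    """Allowed downtime in the window: (1 - SLO target) * window length."""
    return (1 - slo_target) * window_minutes


def budget_consumed(outage_minutes, slo_target, window_minutes=MINUTES_PER_MONTH):
    """Fraction of the window's error budget used up by one full outage."""
    return outage_minutes / error_budget_minutes(slo_target, window_minutes)


def burn_rate(error_rate, slo_target):
    """Burn rate = observed error rate / (1 - SLO target)."""
    return error_rate / (1 - slo_target)


print(f"monthly budget: {error_budget_minutes(0.9999):.2f} min")   # about 4.38
print(f"15-min outage: {budget_consumed(15, 0.9999):.2f}x budget")  # about 3.42
print(f"burn rate at 100% errors: {burn_rate(1.0, 0.9999):.0f}")
```

A burn rate of 1 means the budget runs out exactly at the end of the window; a total outage against a 99.99% target burns it roughly ten thousand times faster than that.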
🎧 Listen, then go deeper: DevOps & Cloud interview-prep ebooks at DevOpsInterview.Cloud
-
Designing a PCI-DSS compliant serverless payments architecture on GCP means getting Confidential VMs, Cloud External Key Manager, and Binary Authorization working together. Here's how to answer that in a senior interview.
You'll learn:
  - How Confidential VMs provide hardware-level memory encryption to satisfy PCI-DSS data-in-use requirements
  - Why Cloud External Key Manager (Cloud EKM) lets you hold encryption keys outside GCP's control, and what that means for scope reduction
  - How Binary Authorization enforces cryptographic attestation so only verified container images reach your payment workloads
  - The serverless boundary decisions (Cloud Run vs bare GKE) that affect your Cardholder Data Environment scope
  - Common interview gotchas around shared responsibility, audit logging with Cloud Audit Logs, and VPC Service Controls for perimeter defence
Keywords: PCI-DSS GCP architecture, Confidential VMs interview, Cloud External Key Manager, Binary Authorization Cloud Run, serverless payments compliance
-
Deploying EKS clusters across AWS accounts with CDK is a common senior interview scenario. Here's how to handle VPC peering, Transit Gateway attachments, and IAM trust policies correctly.
You'll learn:
  - How to structure a multi-account CDK app using Stacks across environments with explicit env account/region targets
  - When to use VPC peering vs Transit Gateway for cross-account EKS network connectivity, and the trade-offs at scale
  - How to wire up Transit Gateway attachments and route table propagation so worker nodes can reach shared services
  - Cross-account IAM role assumptions and EKS RBAC config required for cluster access from a management account
  - Common CDK gotchas: bootstrap trust policies, asset S3 bucket permissions, and cross-account CFN execution roles
Keywords: cross-account EKS CDK, AWS Transit Gateway EKS, VPC peering Kubernetes, multi-account EKS architecture, AWS CDK EKS interview
-
Correlating OpenTelemetry traces with CloudWatch Logs Insights across Lambda and Step Functions is a common senior interview scenario. Here's exactly how to answer it.
You'll learn:
  - How to propagate trace context (W3C TraceContext headers) across Lambda invocations and Step Functions state transitions so trace IDs land in your structured logs
  - Configuring the AWS Distro for OpenTelemetry (ADOT) Lambda layer to auto-instrument functions without cold-start penalties
  - Writing CloudWatch Logs Insights queries that join on trace_id to reconstruct an end-to-end execution timeline across services
  - Where correlation breaks: async Step Functions callbacks, missing X-Amzn-Trace-Id propagation, and log sampling mismatches
  - Trade-offs between ADOT, X-Ray native SDK, and a third-party collector like the OpenTelemetry Collector on Fargate
Keywords: OpenTelemetry Lambda tracing, CloudWatch Logs Insights trace correlation, ADOT Step Functions, serverless observability interview questions
-
Splitting a monolithic 4GB Terraform state file into scoped microstates is one of the nastiest live-infrastructure challenges you'll face. Here's how to do it without downtime using terraform state rm and moved blocks.
You'll learn:
  - Why state files balloon past 4GB and why that breaks plan/apply performance
  - How to use terraform state rm to surgically extract resources without destroying them
  - Using moved blocks to re-home resources into child state backends cleanly
  - Sequencing the migration to avoid drift, lock contention, and accidental deletes
  - How to validate microstate integrity with terraform state list and targeted plans before cutting over
Keywords: terraform state splitting, terraform state rm, moved blocks terraform, monorepo to microstate migration, terraform refactor interview
-
Designing a monorepo CI pipeline that doesn't collapse under 1,000 microservices means getting Bazel remote caching and selective test execution right from the start.
You'll learn:
  - How to structure a monorepo CI pipeline so only affected services trigger builds, using Bazel's dependency graph to compute the minimal affected set
  - Configuring Bazel remote caching (local cache, shared remote cache via gRPC or HTTP) to avoid rebuilding unchanged targets across parallel CI workers
  - Selective testing strategies: combining bazel query with --build_event_stream to identify and run only impacted test targets
  - Common failure modes at scale: cache poisoning, overly broad BUILD file dependencies, and flaky remote executor connections
  - How to structure the CI orchestration layer (GitHub Actions, Buildkite, or Tekton) to fan out Bazel shards without thrashing the remote cache
Keywords: monorepo CI pipeline, Bazel remote caching, selective testing microservices, CI at scale DevOps interview, platform engineering build systems
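The idea of a "minimal affected set" can be illustrated without Bazel at all. The sketch below is a plain-Python stand-in, not Bazel's implementation: it walks reverse dependencies from the changed targets, which is roughly what a query like `rdeps(//..., <changed targets>)` computes. The target labels and the `affected_targets` helper are hypothetical.

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return the changed targets plus everything that transitively depends on them.

    deps maps each target to the set of targets it depends on directly.
    """
    # Invert the graph: for each target, who depends on it?
    reverse_deps = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            reverse_deps[dep].add(target)

    # Breadth-first walk over reverse edges, starting from the changed targets.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependant in reverse_deps[node]:
            if dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    return affected


# Hypothetical monorepo: two services share one library, a third is unrelated.
deps = {
    "//services/payments": {"//libs/auth", "//libs/db"},
    "//services/search": {"//libs/auth"},
    "//services/email": {"//libs/smtp"},
    "//libs/auth": set(),
    "//libs/db": set(),
    "//libs/smtp": set(),
}
print(sorted(affected_targets(deps, {"//libs/auth"})))
```

Only the two services that depend on the changed library end up in the set, so the unrelated service would not be rebuilt or retested.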
-
Learn how to generate dynamic Azure RBAC role assignments using Pulumi with YAML-driven definitions, including tag-scoped conditions like restricting storage access to env:prod resources only.
You'll learn:
  - How to define custom Azure RBAC roles in YAML and hydrate them through Pulumi's automation layer
  - Using condition and conditionVersion fields in role assignments to enforce attribute-based access control (ABAC)
  - Scoping storage permissions to resources matching specific tag key/value pairs at assignment time
  - Structuring Pulumi component resources so YAML definitions stay DRY across multiple environments
  - Common gotchas: condition syntax errors, propagation delays, and principal vs. scope mismatches
Keywords: Azure RBAC Pulumi, dynamic role assignments Azure, Pulumi YAML infrastructure, Azure ABAC tag conditions, custom RBAC roles interview
-
Taming Prometheus cardinality explosion in an Istio service mesh, dropping from 10 million to 500K active series using relabel_configs and recording rules, is exactly the kind of production war story senior SRE interviews dig into.
You'll learn:
  - Why Istio telemetry generates cardinality explosions and which high-cardinality labels (source_workload, destination_service, pod IPs) are the usual culprits
  - How to use metric_relabel_configs to drop or rewrite labels before series are ingested into TSDB storage
  - Writing recording rules to pre-aggregate high-resolution Istio metrics into lower-cardinality rollups
  - Using topk and cardinality analysis queries to identify which metrics are burning your series budget
  - Trade-offs between dropping labels at scrape time versus aggregating at query time, and why interviewers care about the difference
Keywords: Prometheus cardinality, Istio metrics, relabel_configs, recording rules, TSDB series limit
-
A developer pushes a Terraform module with a public S3 bucket. Here's exactly how to catch and block it in your Argo CD pipeline using Conftest policy-as-code before it ever reaches production.
You'll learn:
  - How Conftest integrates with Argo CD as a pre-sync hook to enforce OPA policies on Terraform plans
  - Writing a Rego rule that flags acl = public-read or block_public_acls = false on aws_s3_bucket resources
  - Where in the GitOps workflow the gate fires, and why admission controllers alone aren't enough for IaC drift
  - How to surface policy failures as Argo CD sync errors so engineers see the violation before merge, not after deploy
  - Common gotchas: Terraform plan JSON output format, conftest namespace mismatches, and false positives on legacy modules
Keywords: Conftest Argo CD policy, OPA Terraform GitOps, block public S3 bucket IaC, GitOps security controls, Rego policy Terraform plan
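To make the shape of that check concrete, here is a Python analogue of the kind of rule described above. It is not Rego and not Conftest's API, just a sketch of the same logic run against the JSON that `terraform show -json` produces, where planned values sit under `resource_changes[].change.after`. The `public_bucket_violations` helper and the sample plan are hypothetical, and the sketch assumes `block_public_acls` is read from the AWS provider's `aws_s3_bucket_public_access_block` resource.

```python
PUBLIC_ACLS = {"public-read", "public-read-write"}


def public_bucket_violations(plan):
    """Scan a Terraform plan (parsed JSON) and list public-bucket violations."""
    violations = []
    for rc in plan.get("resource_changes", []):
        # "after" is null when a resource is being destroyed, so fall back to {}.
        after = (rc.get("change") or {}).get("after") or {}
        rtype = rc.get("type")
        if rtype == "aws_s3_bucket" and after.get("acl") in PUBLIC_ACLS:
            violations.append(f"{rc['address']}: acl is {after['acl']}")
        if (rtype == "aws_s3_bucket_public_access_block"
                and after.get("block_public_acls") is False):
            violations.append(f"{rc['address']}: block_public_acls is false")
    return violations


# Hypothetical, heavily trimmed plan output for illustration only.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.assets", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "private"}}},
        {"address": "aws_s3_bucket.old", "type": "aws_s3_bucket",
         "change": {"actions": ["delete"], "after": None}},
    ]
}
for violation in public_bucket_violations(sample_plan):
    print(violation)
```

A gate built on this would fail the pipeline whenever the returned list is non-empty, which mirrors how a deny rule with at least one message fails a policy run.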
-
Managing a Terragrunt dependency graph across 500+ modules without hitting circular dependencies or version drift is one of the hardest scaling problems in platform engineering.
You'll learn:
  - How to map and audit a large Terragrunt dependency graph using terragrunt graph-dependencies and DAG visualisation tools
  - Patterns for structuring module hierarchies to prevent circular dependencies before they reach CI
  - Enforcing module versioning with OCI registries, and why OCI beats Git tags at this scale
  - How to segment a 500+ module monorepo into dependency tiers so targeted runs stay fast
  - Common failure modes: implicit dependencies, missing mock_outputs, and run-all ordering bugs
Keywords: Terragrunt dependency graph, Terragrunt at scale, OCI module registry, circular dependencies Terraform, platform engineering IaC
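Catching a circular dependency before it reaches CI comes down to a topological sort over the module graph. The sketch below uses Python's standard `graphlib` (3.9+) on a hand-written graph; it does not parse Terragrunt output, and the module names and the `apply_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def apply_order(dependencies):
    """Return modules ordered so each one comes after everything it depends on.

    dependencies maps a module to the set of modules it depends on.
    Raises ValueError naming the cycle if the graph is not a DAG.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # graphlib reports the offending cycle as the second exception argument.
        cycle = " -> ".join(err.args[1])
        raise ValueError(f"circular dependency: {cycle}") from err


# Hypothetical three-tier layout: network, then data, then application.
modules = {
    "app": {"vpc", "db"},
    "db": {"vpc"},
    "vpc": set(),
}
print(apply_order(modules))
```

Running a check like this as a pre-merge step turns a late ordering failure into an early, readable error that names the modules involved.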
-
External Secrets Operator lets you sync HashiCorp Vault dynamic secrets directly into Kubernetes Secrets: no Vault Agent sidecars, no annotation sprawl.
You'll learn:
  - How ESO's ExternalSecret and SecretStore CRDs map Vault paths to Kubernetes Secrets
  - Why dynamic secrets (short-lived, auto-rotated) are preferable to static tokens and how ESO handles lease renewal
  - The auth methods ESO supports for talking to Vault (Kubernetes auth vs. AppRole) and when to use each
  - Common failure modes: stale secrets after Vault seal, RBAC misconfigs, and refresh interval gotchas
  - How to scope a ClusterSecretStore safely across namespaces without over-permissioning
Keywords: External Secrets Operator, HashiCorp Vault Kubernetes integration, dynamic secrets management, Vault sidecar alternative, Kubernetes secrets sync
-
Parallel Jenkins jobs deploying Helm charts can deadlock silently. Here's how to catch and fix mutex contention before it kills your pipeline.
You'll learn:
  - Why concurrent Helm deploys compete for the same release lock and how that surfaces as a deadlock in Jenkins
  - How to run jstack against the Jenkins JVM to capture thread dumps and identify which threads are waiting on a monitor lock
  - Reading mutex lock output to pinpoint the blocked executor and the thread holding it
  - Helm-side mitigations: namespace isolation, --atomic flag behaviour, and serialising releases with lockfiles or pipeline lock() steps
  - When to escalate from a workaround to a structural fix: separate agents, dedicated namespaces, or a Helm operator pattern
Keywords: Jenkins parallel jobs deadlock, Helm chart deployment lock, jstack thread dump Jenkins, mutex lock CI/CD pipeline, Jenkins pipeline concurrency
-
Learn how to enforce CloudFormation stack drift detection at scale using AWS Config rules and Lambda-driven auto-remediation, a common architecture question in senior Cloud and DevOps interviews.
You'll learn:
  - How AWS Config detects configuration drift against CloudFormation expected stack states using managed and custom rules
  - Wiring an EventBridge rule to trigger a Lambda function when Config flags a stack as DRIFTED
  - Lambda remediation patterns: re-running cloudformation detect-stack-drift vs. forcing a stack update to reconcile out-of-band changes
  - Gotchas around drift detection cost, IAM permissions for the Config recorder, and distinguishing intentional changes from real drift
  - How to scope remediation safely: alerting vs. hard auto-rollback and when each is appropriate in production
Keywords: CloudFormation drift detection, AWS Config auto-remediation, Lambda CloudFormation remediation, IaC drift enforcement, AWS Config rules interview
-
Reducing DynamoDB Global Tables data transfer costs by 70% is achievable in a multi-region Active-Active setup, if you know where the money is actually going.
You'll learn:
  - Why replicated write costs dominate in DynamoDB Global Tables and how to model them accurately
  - Using write sharding and conditional writes to reduce unnecessary replication traffic
  - DAX (DynamoDB Accelerator) placement per region to cut cross-region read fallback
  - Architecting read patterns to stay local, avoiding the latency and cost of cross-region reads
  - Cost monitoring with AWS Cost Explorer tags scoped to replication vs. application traffic
Keywords: DynamoDB Global Tables cost optimization, multi-region Active-Active AWS, DynamoDB replication costs, AWS data transfer cost reduction
-
When a database migration fails mid-deploy, your Kubernetes job hooks and Flyway versioning strategy are the difference between a five-minute fix and a 2am incident.
You'll learn:
  - How to structure Flyway versioned and undo migrations so a failed V3 doesn't leave your schema in a half-applied state
  - Using Kubernetes init containers, Jobs, and container postStart/preStop lifecycle hooks to gate application rollout on migration success or failure
  - Why flyway repair matters when checksums break and how to use it safely in CI pipelines
  - Patterns for keeping application code and schema changes in sync across canary and blue-green deployments
  - What interviewers actually want to hear when they ask about zero-downtime schema migrations at scale
Keywords: Flyway rollback strategy, Kubernetes job hooks database, schema versioning DevOps interview, failed database migration recovery
-
When terraform apply times out creating 100+ IAM roles, the culprit is usually AWS API throttling combined with Terraform's default parallelism. Here's how to fix it.
You'll learn:
  - Why the default parallelism=10 isn't always safe and when raising it to -parallelism=50 helps vs. hurts
  - How AWS IAM's eventual-consistency model causes race conditions during bulk role creation
  - Batching strategies: splitting large role sets across multiple terraform apply runs or using for_each with targeted applies
  - Reading AWS API throttle errors in Terraform debug output (TF_LOG=DEBUG) to confirm the real bottleneck
  - Exponential backoff and retry tuning via the AWS provider's max_retries setting
Keywords: terraform apply timeout, AWS IAM role throttling, terraform parallelism, terraform at scale, IAM API rate limits
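For intuition about what a retry budget buys you, here is a generic sketch of exponential backoff with full jitter. It illustrates the general technique only: it is not the AWS provider's internal retry logic, and the `backoff_delays` helper, the 0.5-second base, and the 20-second cap are all assumptions chosen for the example.

```python
import random


def backoff_delays(max_retries, base=0.5, cap=20.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    The upper bound doubles each attempt (base, 2*base, 4*base, ...) up to cap,
    and the actual wait is drawn uniformly between 0 and that bound ("full
    jitter") so that parallel workers do not all retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        upper_bound = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, upper_bound))
    return delays


# Seeded generator so repeated runs of this example produce the same schedule.
for attempt, delay in enumerate(backoff_delays(6, rng=random.Random(42)), start=1):
    print(f"retry {attempt}: wait up to {delay:.2f}s")
```

The point to draw out in an answer is that more retries only help if the waits grow and are spread out; otherwise many parallel operations hit the same throttled API in lockstep.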
-
When GitHub Actions pipelines hit thousands of daily builds, your runner strategy becomes a first-class infrastructure decision. Here's how to choose between self-hosted runners, larger hosted runners, and the Kubernetes executor.
You'll learn:
  - How GitHub-hosted larger runners (up to 64-core) reduce ops overhead versus self-hosted, and where the cost curve flips
  - Self-hosted runner autoscaling with actions-runner-controller (ARC) on Kubernetes: ephemeral pods per job, KEDA-based scaling triggers
  - Kubernetes executor trade-offs: pod startup latency, RBAC isolation, and shared caching via persistent volumes or S3-backed artifact stores
  - Queue depth, job concurrency limits, and why runner group segmentation matters at 10K+ builds per day
  - Common failure modes: runner re-use contamination, Docker-in-Docker socket conflicts, and rate-limit exhaustion on the GitHub API
Keywords: GitHub Actions self-hosted runners, actions-runner-controller Kubernetes, scaling CI pipelines, GitHub larger runners, ARC autoscaling
-
Enforcing FIPS 140-3 compliance on an EKS cluster means locking down every layer, from the OS to the key management hardware, and this episode walks through exactly how Bottlerocket and AWS KMS make that possible.
You'll learn:
  - Why Bottlerocket OS ships with a FIPS-validated kernel and how to verify its cryptographic module status at node bootstrap
  - How AWS KMS custom key stores backed by CloudHSM satisfy the hardware security module requirement under FIPS 140-3
  - Enforcing TLS 1.2+ with FIPS-approved cipher suites across EKS control plane and data plane communication
  - IAM and pod-level controls to ensure workloads only call FIPS-compliant API endpoints
  - Common audit failures (weak cipher negotiation, unvalidated node images) and how to catch them before an assessor does
Keywords: FIPS 140-3 EKS, Bottlerocket FIPS compliance, AWS KMS CloudHSM, EKS security hardening, FIPS validated Kubernetes
-
When you're drowning in 1,000+ alerts a day, AWS Lookout for Metrics can route only the anomalies that matter directly to Slack or Teams. Here's how to wire it up.
You'll learn:
  - How AWS Lookout for Metrics uses ML to separate real anomalies from noise across CloudWatch, S3, and RDS data sources
  - Routing detected anomalies to Slack or Microsoft Teams via SNS topics and Lambda webhook integrations
  - Tuning sensitivity thresholds to reduce false positives without missing critical incidents
  - Grouping related alerts into a single notification so on-call engineers see context, not a flood of individual triggers
  - Where Lookout for Metrics fits alongside existing tools like PagerDuty, OpsGenie, and CloudWatch Alarms
Keywords: alert fatigue DevOps, AWS Lookout for Metrics, ML anomaly detection AWS, Slack alerting pipeline, SRE on-call interview questions
-
Auditing cross-account IAM roles is one of those senior interview topics where vague answers kill your chances. Here's how to use AWS IAM Access Analyzer and Policy Sentry to give a precise, credible response.
You'll learn:
  - How IAM Access Analyzer detects externally accessible roles and flags unintended cross-account trust relationships
  - How Policy Sentry helps you write and audit least-privilege IAM policies by mapping actions to resource ARNs
  - The difference between resource-based and identity-based policy analysis, and why interviewers expect you to know both
  - How to interpret Access Analyzer findings and translate them into remediation steps during a live interview
  - Common gotchas: why a role with no findings isn't necessarily safe, and how SCPs interact with cross-account access
Keywords: cross-account IAM roles, AWS IAM Access Analyzer, Policy Sentry, least privilege IAM, cloud security interview questions