Making systems reliable, deployments boring, and incidents rare.

[email protected] · alenabraham.me · GitHub · LinkedIn


Experience

Qure.ai
Medical imaging AI company using deep learning to detect critical findings in X-rays, CT scans, and emergency radiology — deployed across 15+ AWS regions serving hospitals globally.
Full-time · 3 yrs 7 mos+ · Bengaluru, India
Senior DevOps Engineer
Apr 2026 – Present
  • Reduced Datadog costs from $50K to $30K/month ($240K annual savings) by implementing Flex Logs and optimizing log ingestion pipelines
  • Built Jenkins monitoring dashboard on Datadog, giving developers real-time visibility into CI/CD pipeline health and build status
  • Building a public service status page (status.qure.ai) with heartbeat monitoring for real-time uptime visibility across all services
  • Migrating CPU-intensive CI tasks from Jenkins to GitHub Actions with S3 caching to improve build performance and reduce infrastructure load
  • Driving reliability and scalability improvements for qTrack — optimizing performance for production healthcare workloads
Senior Site Reliability Engineer
Apr 2025 – Mar 2026 · 1 yr
  • Designed SLO tracking using Turn Around Time (TAT) as the primary SLI, distinguishing bulk uploads from urgent scans to enable data-driven reliability decisions across 15+ regions
  • Implemented end-to-end distributed tracing using Datadog APM with custom trace tags, reducing cross-service debugging time across 3 microservices
  • Developed custom Datadog metrics pipeline that sends low-cardinality data to the Series API for dashboards and alerting and high-cardinality data to the Events API for debugging, cutting observability costs while maintaining full debuggability
  • Architected Bazel-based smart build system for a monorepo of 39+ packages, with reverse dependency resolution and parallel dispatch via semaphore-based ordering
  • Delivered 3-stage deployment pipeline (TEST → PUBLISH → DEPLOY) with CodeDeploy, decoupling application deployments from infrastructure changes
  • Owned Jenkinsfiles and CI/CD workflows across 12+ services, standardizing build, test, and release processes using a shared Jenkins library (Hawkeye)
  • Engineered self-healing EC2 instances using IMDSv2 failure detection with automatic ASG replacement, reducing recovery time to under 5 minutes with zero manual intervention
  • Optimized ASG auto-scaling (scale-out at 90% CPU, scale-in at 30%, 10-min evaluation, 5-min cooldown) and reduced ALB deregistration delay from 300s to 30s, accelerating rolling deployments by 10x
  • Deployed on-premise medical imaging solutions across 5+ countries, including UAE (SEHA visa screening, Burjeel hospital CT/X-ray, MOH UAE) and Vietnam, with end-to-end server setup and DICOM modality integration
  • Developed "Pulse" — an internal monitoring platform (Django, React, TypeScript, PostgreSQL) providing real-time health dashboards for on-premise services and DICOM gateway infrastructure globally
  • Building "Agent-Qurie" — an AI-powered knowledge base portal (LiteLLM, Open-WebUI, Prometheus) enabling vendors to self-serve L1 incident resolution, reducing escalations to the SRE team
Site Reliability Engineer
Mar 2023 – Mar 2025 · 2 yrs 1 mo
  • Created reusable AWS CDK construct library deploying across 15+ production regions (AWS, Huawei Cloud, Alibaba Cloud) with Pydantic-validated configs preventing misconfigurations before production
  • Integrated Bandit SAST scanning on every commit with baseline comparison; established split-PR enforcement to prevent cross-service merge conflicts
  • Containerized multiple services with Docker and Docker Compose across staging, production, and on-premise environments with environment-specific configurations
  • Co-built internal license management platform (Django + React) handling license lifecycle and deployment coordination for cloud and air-gapped hospital environments
  • Established per-region p95/p99 TAT dashboards and error rate monitoring, enabling real-time service visibility and SLO-driven deployment decisions
  • Authored SRE Knowledge Base with runbooks, incident response procedures, onboarding guides, and operational documentation adopted across the engineering team
  • Maintained and operated Fomema — a legacy healthcare client on Alibaba Cloud since 2022, handling ongoing infrastructure management, incident resolution, and platform stability
Technical Operations Engineer
Sep 2022 – Feb 2023 · 6 mos
  • Resolved L1-L3 production issues across cloud and on-premise environments — debugging distributed systems, container failures, network misconfigurations, and application-level errors
  • Collaborated cross-functionally with backend, frontend, product, QA, and BD teams; provided infrastructure cost analysis for solution pricing and client proposals
  • Led incident response across 15+ regions with on-call rotations including weekends; conducted client-facing technical scoping and served as interim TPM for select clients
Tata Consultancy Services
Trivandrum, India
Assistant System Engineer
Jul 2021 – Sep 2022 · 1 yr 3 mos
  • Built Jenkins CI/CD pipelines for Java Spring Boot applications deployed on GCP; volunteered for 24x7 on-call
Naas.ai
Remote
Open Source Contributor
Dec 2021 – Aug 2022 · 9 mos
  • Served as design consultant; contributed to documentation and DevOps
Cognetry Labs
Trivandrum, India
Technical Intern
Nov 2020 – Feb 2021 · 4 mos
  • Redesigned company website and designed interfaces for a mobile app and admin panel

Skills

SRE & Observability: SLOs/SLIs, Datadog (APM, Metrics, Tracing), Incident Response, Blameless Postmortems, On-Call
Cloud & Infrastructure: AWS (EC2, ASG, ALB, RDS, EFS, S3, CodeDeploy), Huawei Cloud, Alibaba Cloud, Viettel Cloud
IaC & Containers: AWS CDK (Python), CloudFormation, Ansible, Docker, Docker Compose
CI/CD & Automation: Jenkins, Bazel, GitHub Actions, CodeDeploy, Bandit SAST
Languages & Frameworks: Python, Java, TypeScript, Django, React, SQL, Bash
Tools: Git, Teleport, Jira, Postman, Cloudflare, Claude Code, Cursor

Education

College of Engineering Chengannur
2017 – 2021
B.Tech (Hons.) in Electronics and Communication Engineering · CGPA: 8.1/10

Languages

English · German · Malayalam