Making systems reliable, deployments boring, and incidents rare.
[email protected] · alenabraham.me · GitHub · LinkedIn
Experience
Qure.ai
Senior DevOps Engineer
Apr 2026 – Present
- Reduced Datadog costs from $50K to $30K/month ($240K annual savings) by implementing Flex Logs and optimizing log ingestion pipelines
- Built Jenkins monitoring dashboard on Datadog, giving developers real-time visibility into CI/CD pipeline health and build status
- Building a public service status page (status.qure.ai) with heartbeat monitoring for real-time uptime visibility across all services
- Migrating CPU-intensive CI tasks from Jenkins to GitHub Actions with S3 caching to improve build performance and reduce infrastructure load
- Driving reliability and scalability improvements for qTrack — optimizing performance for production healthcare workloads
Senior Site Reliability Engineer
Apr 2025 – Mar 2026 · 1 yr
- Designed SLO tracking using turnaround time (TAT) as the primary SLI, distinguishing bulk uploads from urgent scans to enable data-driven reliability decisions across 15+ regions
- Implemented end-to-end distributed tracing using Datadog APM with custom trace tags, reducing cross-service debugging time across 3 microservices
- Developed custom Datadog metrics pipeline that routes low-cardinality data to the Series API for dashboards and alerting and high-cardinality data to the Events API for debugging, cutting observability costs while maintaining full debuggability
- Architected Bazel-based smart build system for a monorepo of 39+ packages, with reverse dependency resolution and parallel dispatch via semaphore-based ordering
- Delivered 3-stage deployment pipeline (TEST → PUBLISH → DEPLOY) with CodeDeploy, decoupling application deployments from infrastructure changes
- Owned Jenkinsfiles and CI/CD workflows across 12+ services, standardizing build, test, and release processes using a shared Jenkins library (Hawkeye)
- Engineered self-healing EC2 instances using IMDSv2 failure detection with automatic ASG replacement, reducing recovery time to under 5 minutes with zero manual intervention
- Optimized ASG auto-scaling (CPU thresholds 90%/30%, 10-min evaluation, 5-min cooldown) and reduced ALB deregistration delay from 300s to 30s, accelerating rolling deployments by 10x
- Deployed on-premise medical imaging solutions across 5+ countries, including the UAE (SEHA visa screening, Burjeel hospital CT/X-ray, MOH UAE) and Vietnam, with end-to-end server setup and DICOM modality integration
- Developed "Pulse" — an internal monitoring platform (Django, React, TypeScript, PostgreSQL) providing real-time health dashboards for on-premise services and DICOM gateway infrastructure globally
- Building "Agent-Qurie" — an AI-powered knowledge base portal (LiteLLM, Open-WebUI, Prometheus) enabling vendors to self-serve L1 incident resolution, reducing escalations to the SRE team
Site Reliability Engineer
Mar 2023 – Mar 2025 · 2 yrs
- Created reusable AWS CDK construct library deploying across 15+ production regions (AWS, Huawei Cloud, Alibaba Cloud) with Pydantic-validated configs preventing misconfigurations before production
- Integrated Bandit SAST scanning on every commit with baseline comparison; established split-PR enforcement to prevent cross-service merge conflicts
- Containerized multiple services with Docker and Docker Compose across staging, production, and on-premise environments with environment-specific configurations
- Co-built internal license management platform (Django + React) handling license lifecycle and deployment coordination for cloud and air-gapped hospital environments
- Established per-region p95/p99 TAT dashboards and error rate monitoring, enabling real-time service visibility and SLO-driven deployment decisions
- Authored SRE Knowledge Base with runbooks, incident response procedures, onboarding guides, and operational documentation adopted across the engineering team
- Maintained and operated Fomema — a legacy healthcare client on Alibaba Cloud since 2022, handling ongoing infrastructure management, incident resolution, and platform stability
Technical Operations Engineer
Sep 2022 – Feb 2023 · 6 mos
- Resolved L1-L3 production issues across cloud and on-premise environments — debugging distributed systems, container failures, network misconfigurations, and application-level errors
- Collaborated cross-functionally with backend, frontend, product, QA, and BD teams; provided infrastructure cost analysis for solution pricing and client proposals
- Led incident response across 15+ regions with on-call rotations including weekends; conducted client-facing technical scoping and served as interim TPM for select clients
Tata Consultancy Services
Assistant System Engineer
Jul 2021 – Sep 2022 · 1 yr 3 mos
- Built Jenkins CI/CD pipelines for Java Spring Boot applications deployed on GCP; volunteered for 24x7 on-call
Naas.ai
Open Source Contributor
Dec 2021 – Aug 2022 · 9 mos
- Served as design consultant; contributed to documentation and DevOps
Cognetry Labs
Technical Intern
Nov 2020 – Feb 2021 · 4 mos
- Redesigned company website and designed interfaces for a mobile app and admin panel
Skills
SRE & Observability: SLOs/SLIs, Datadog (APM, Metrics, Tracing), Incident Response, Blameless Postmortems, On-Call
Cloud & Infrastructure: AWS (EC2, ASG, ALB, RDS, EFS, S3, CodeDeploy), Huawei Cloud, Alibaba Cloud, Viettel Cloud
IaC & Containers: AWS CDK (Python), CloudFormation, Ansible, Docker, Docker Compose
CI/CD & Automation: Jenkins, Bazel, GitHub Actions, CodeDeploy, Bandit SAST
Languages & Frameworks: Python, Java, TypeScript, Django, React, SQL, Bash
Tools: Git, Teleport, Jira, Postman, Cloudflare, Claude Code, Cursor
Education
College of Engineering Chengannur
2017 – 2021
B.Tech (Hons.) in Electronics and Communication Engineering · CGPA: 8.1/10
Languages
English · German · Malayalam