Usman Shahid

u.manshahid@gmail.com · +971-52-932-8265 · github.com/codemug · linkedin.com/in/u-manshahid

Open to relocation: EU / UK / Ireland / Netherlands / Germany — visa sponsorship required. Currently Dubai (GMT+4).

Summary

Staff Site Reliability Engineer and Technical Lead with 11+ years scaling resilient, cost-effective cloud infrastructure for high-traffic services. Currently running a self-service AI Agent Platform with Model Context Protocol (MCP) tool-calling, multi-agent orchestration, and human-in-the-loop (HITL) approval flows — ~200K requests/day, ~800 registered tools, ~60 production agents — bridging classic SRE discipline with the agentic-AI era. Track record: 40% EKS cost reduction via Karpenter / ARM / Spot / VPA, 99.99% uptime consolidating 12 API gateways to Kong at ~10k RPS, 300+ microservices migrated to EKS. Polyglot — production Go, Python, Java, and C#. Daily AI-tooling user (Claude, Cursor, Copilot, Gemini).

Work Experience

Staff Site Reliability Engineer · Careem

Jan 2023 — Present

Technical Lead and SRE Architect mentoring a team focused on infrastructure provisioning and developer experience.

  • Designed and built a self-service AI Agent Platform with MCP tool-calling, multi-agent orchestration, and HITL approval flows for internal infrastructure-support workflows — ~200K req/day, ~800 registered tools, ~60 agents in production. Engineering teams contribute their own tools and agents.
  • Slashed EKS cluster costs by 40% through a resource-optimization program: Karpenter consolidation, migration to ARM & Spot nodes, VPA, and custom scheduling (FinOps at scale).
  • Consolidated 12 legacy API gateways (~10k RPS) onto a single Kong gateway with 99.99% uptime during cutover; built a GitOps-driven bespoke control plane for API route management.
  • Designed and deployed a fully automated, GitOps-driven cloud-resource provisioning platform (Terraform + Terragrunt), cutting provisioning time for new services from days to minutes.

Senior Site Reliability Engineer · Careem

Dec 2020 — Jan 2023

Led the migration of 300+ microservices from legacy AWS infrastructure to AWS EKS.

  • Engineered automated provisioning of multiple EKS clusters with Terraform + Terragrunt — repeatable, auditable infrastructure foundation.
  • Provisioned centralized API Gateways using Kong with a GitOps control plane for API route management.
  • Built the observability stack with Prometheus, Thanos, and Grafana — high-cardinality metric storage, SLO dashboards, cost tracking across the EKS fleet.
  • Improved deployment velocity and safety by introducing a Linkerd service mesh with automated canary rollouts driven by Flagger.
  • On-call 24×7 on a bi-weekly rotation for the cloud-infrastructure platform.

Infrastructure Architect & Team Lead · Intech IIS

Jul 2019 — Dec 2020
  • Designed and implemented the infrastructure of a Cloud IIoT Platform while leading a team of 6 polyglot engineers in a Kanban model: workload distribution, technology selection, mentoring, and customer integration.
  • Built a bespoke Kubernetes Ingress Operator for bare-metal Kubernetes clusters using the Red Hat operator-sdk (custom control-plane work, not just operator consumption).
  • Secured & streamlined the software delivery lifecycle by integrating ArgoCD with Keycloak — a secure, auditable GitOps workflow for Helm-based deployments.
  • Implemented Google SRE golden signals (latency / traffic / errors / saturation) on bare-metal Kubernetes clusters using Prometheus.

Senior Software Developer / DevOps Engineer · Intech IIS

Jan 2017 — Jun 2019
  • Containerized all services of a distributed Industrial IoT platform with Docker; orchestrated via Kubernetes and D2iQ stacks.
  • Automated deployment of the IIoT platform using Ansible Playbooks; provisioned production-ready Kubernetes and D2iQ clusters.
  • Built a serverless framework in Python running code on containers via Kubernetes / Marathon / Metronome.
  • Built a parallel metric aggregation publisher for StatsD in Go.
  • Built an OPC → MQTT data pipeline service in C# using Avro RPC and REST.
  • Built a generic client API using Java Streams for CRUD operations on IBM Maximo and SAP EAM systems via Avro schemas, exposed via Avro RPC.
  • Built an OpenID Connect Identity Provider microservice using Spring OAuth2 with JDBC, Active Directory, LDAP, and Mongo storage backends (Spring JPA + Spring LDAP).
  • Implemented dynamic proxying and JWT bearer authentication using OpenResty / LuaJIT.

Software Developer · Diyatech (Alachisoft)

Jun 2014 — Jan 2017

Core developer for NCache Enterprise (Redis-class distributed cache for .NET) and NosDB Enterprise (CouchDB/MongoDB-class NoSQL JSON store for .NET), both in C#.

  • Reimplemented .NET's built-in collection types to prevent Large Object Heap (LOH) leakage.
  • Implemented the ASP.NET Core Session State Management middleware with NCache backend storage and full session-locking support.
  • Implemented a reliable socket protocol for message delivery using Protobuf.
  • Built file-storage-based B+ Tree indexing with support for single-attribute and compound indexes.
  • Built the query execution system using Streams and nested Enumerables in C#.

Skills

Site Reliability / Platform: SRE practices, SLO/SLI/error budgets, on-call & incident response, GitOps, Internal Developer Platforms, FinOps, capacity planning
Kubernetes ecosystem: Kubernetes (managed + bare-metal), Karpenter, VPA, custom scheduling, Linkerd, Flagger, Helm, Kustomize, ArgoCD, operator-sdk (custom Go operators)
Cloud: AWS (EKS, EC2, S3, VPC, Route 53, IAM), bare-metal Kubernetes, multi-cluster patterns
Observability: Prometheus, Thanos, Grafana, Google SRE golden signals, SLO dashboards
Infrastructure & CI/CD: Terraform, Terragrunt, Ansible, Docker, D2iQ, GitLab CI
API & Gateway: Kong (12→1 consolidation, ~10k RPS, 99.99% uptime), OpenResty / LuaJIT, OpenID Connect, JWT, dynamic proxying
AI / LLMOps: Model Context Protocol (MCP), multi-agent orchestration, HITL approval flows, AI Agent Platform engineering; daily use of Claude, Cursor, GitHub Copilot, Gemini
Languages: Go (production), Python (production), Java (production), C# / .NET (production)
Data / Storage internals: B+ Tree indexing (NosDB), distributed-cache internals (NCache), Mongo, LDAP, JDBC, Avro / Protobuf

Awards

Excellence Award, Alachisoft Inc. — outstanding performance recognition (November 2015)

Education

Bachelor's (Honors) in Computer Engineering, National University of Sciences and Technology (NUST). CGPA: 3.40. Graduated 2014.