Usman Shahid
u.manshahid@gmail.com · +971-52-932-8265 · github.com/codemug · linkedin.com/in/u-manshahid
Open to relocation: EU / UK / Ireland / Netherlands / Germany — visa sponsorship required. Currently Dubai (GMT+4).
Summary
Staff Site Reliability Engineer and Technical Lead with 11+ years scaling resilient, cost-effective cloud infrastructure for high-traffic services. Currently running a self-service AI Agent Platform with Model Context Protocol (MCP) tool-calling, multi-agent orchestration, and human-in-the-loop (HITL) approval flows — ~200K requests/day, ~800 registered tools, ~60 production agents — bridging classic SRE discipline with the agentic-AI era. Track record: 40% EKS cost reduction via Karpenter / ARM / Spot / VPA, 99.99% uptime consolidating 12 API gateways to Kong at ~10k RPS, 300+ microservices migrated to EKS. Polyglot — production Go, Python, Java, and C#. Daily AI-tooling user (Claude, Cursor, Copilot, Gemini).
Work Experience
Staff Site Reliability Engineer · Careem
Jan 2023 – Present
Technical Lead and SRE Architect mentoring a team focused on infrastructure provisioning and developer experience.
- Designed and built a self-service AI Agent Platform with MCP tool-calling, multi-agent orchestration, and HITL approval flows for internal infrastructure-support workflows — ~200K req/day, ~800 registered tools, ~60 agents in production. Engineering teams contribute their own tools and agents.
- Cut EKS cluster costs by 40% through a resource-optimization program: Karpenter consolidation, migration to ARM and Spot nodes, VPA, and custom scheduling (FinOps at scale).
- Consolidated 12 legacy API gateways (~10k RPS) onto a single Kong gateway with 99.99% uptime during cutover; built a GitOps-driven bespoke control plane for API route management.
- Designed and deployed a fully automated, GitOps-driven cloud-resource provisioning platform (Terraform + Terragrunt), cutting provisioning time for new services from days to minutes.
Senior Site Reliability Engineer · Careem
Dec 2020 – Jan 2023
Led migration of 300+ microservices from legacy AWS to AWS EKS.
- Engineered automated provisioning of multiple EKS clusters with Terraform + Terragrunt — repeatable, auditable infrastructure foundation.
- Provisioned centralized API Gateways using Kong with a GitOps control plane for API route management.
- Built the observability stack with Prometheus, Thanos, and Grafana — high-cardinality metric storage, SLO dashboards, cost tracking across the EKS fleet.
- Improved deployment velocity and safety by adopting a Linkerd service mesh with automated canary rollouts through Flagger.
- On-call 24×7 on a bi-weekly rotation for the cloud-infrastructure platform.
Infrastructure Architect & Team Lead · Intech IIS
Jul 2019 – Dec 2020
- Designed and implemented the infrastructure of a Cloud IIoT Platform while leading a team of 6 polyglot engineers in a Kanban model, covering workload distribution, technology selection, mentoring, and customer integration.
- Built a bespoke Kubernetes Ingress Operator for bare-metal Kubernetes clusters using Red Hat operator-sdk (custom control-plane work, not just operator consumption).
- Secured & streamlined the software delivery lifecycle by integrating ArgoCD with Keycloak — a secure, auditable GitOps workflow for Helm-based deployments.
- Implemented Google SRE golden signals (latency / traffic / errors / saturation) on bare-metal Kubernetes clusters using Prometheus.
Senior Software Developer / DevOps Engineer · Intech IIS
Jan 2017 – Jun 2019
- Containerized all services of a distributed Industrial IoT platform with Docker; orchestrated via Kubernetes and D2iQ stacks.
- Automated deployment of the IIoT platform using Ansible Playbooks; provisioned production-ready Kubernetes and D2iQ clusters.
- Built a serverless framework in Python running code on containers via Kubernetes / Marathon / Metronome.
- Built a parallel metric aggregation publisher for StatsD in Golang.
- Built an OPC → MQTT data pipeline service in C# using Avro RPC and REST.
- Built a generic client API using Java Streams for CRUD operations on IBM Maximo and SAP EAM systems via Avro schemas, exposed via Avro RPC.
- Built an OpenID Connect Identity Provider microservice using Spring OAuth2 with JDBC, Active Directory, LDAP, and MongoDB storage backends (Spring JPA + Spring LDAP).
- Implemented dynamic proxying and JWT bearer authentication using OpenResty LuaJIT.
Software Developer · Diyatech (Alachisoft)
Jun 2014 – Jan 2017
Core developer for NCache Enterprise (Redis-class distributed cache for .NET) and NosDB Enterprise (CouchDB/MongoDB-class NoSQL JSON store for .NET), both in C#.
- Reimplemented the .NET collection classes to prevent Large Object Heap (LOH) leakage.
- Implemented the ASP.NET Core Session State Management middleware with NCache backend storage and full session-locking support.
- Implemented a reliable socket protocol for message delivery using Protobuf.
- Built file-storage-based B+ Tree indexing with support for single-attribute and compound indexes.
- Built the query execution system using Streams and nested Enumerables in C#.
Skills
- Site Reliability / Platform: SRE practices, SLO/SLI/error budgets, on-call & incident response, GitOps, Internal Developer Platforms, FinOps, capacity planning
- Kubernetes ecosystem: Kubernetes (managed + bare-metal), Karpenter, VPA, custom scheduling, Linkerd, Flagger, Helm, Kustomize, ArgoCD, operator-sdk (custom Go operators)
- Cloud: AWS (EKS, EC2, S3, VPC, Route 53, IAM), bare-metal Kubernetes, multi-cluster patterns
- Observability: Prometheus, Thanos, Grafana, Google SRE golden signals, SLO dashboards
- Infrastructure & CI/CD: Terraform, Terragrunt, Ansible, Docker, D2iQ, GitLab CI
- API & Gateway: Kong (12→1 consolidation, ~10k RPS, 99.99% uptime), OpenResty / LuaJIT, OpenID Connect, JWT, dynamic proxying
- AI / LLMOps: Model Context Protocol (MCP), multi-agent orchestration, HITL approval flows, AI Agent Platform engineering; daily use of Claude, Cursor, GitHub Copilot, Gemini
- Languages: Go (production), Python (production), Java (production), C# / .NET (production)
- Data / Storage internals: B+ Tree indexing (NosDB), distributed-cache internals (NCache), MongoDB, LDAP, JDBC, Avro / Protobuf
Selected Writing
- LLM Tool Calling Series · Part 2 — Enabling MCP on Your Existing Microservices · Medium, October 2025
- LLM Tool Calling Series · Part 1 — Understanding Tool Calling and the Model Context Protocol (MCP) · Medium, June 2025 (published before MCP became mainstream)
- Keycloak Access Control Series (4 parts, August 2020) — token authentication, authentication flows, role-based access control, and Kubernetes integration. Series index on Medium
Awards
Excellence Award, Alachisoft Inc. — outstanding performance recognition (November 2015)
Education
Bachelor's (Honors) in Computer Engineering, National University of Sciences and Technology (NUST). CGPA: 3.40. Graduated 2014.