Dubai · GMT+4 · Open to relocate

Usman Shahid

Staff Site Reliability Engineer and Technical Lead. 11+ years scaling resilient, cost-effective cloud infrastructure for high-traffic services. Currently running a self-service AI Agent Platform at Careem with MCP tool-calling, multi-agent orchestration, and HITL approval flows — bridging classic SRE discipline with the agentic-AI era.

Open to visa-sponsored relocation: EU · UK · Ireland · Netherlands · Germany.

40% EKS cost cut
99.99% Kong uptime
300+ µservices on EKS
200K+ agent req/day
60 prod agents
800 MCP tools

Experience

Where I've shipped infrastructure

Careem

Dubai, UAE

Staff Site Reliability Engineer · Tech Lead

Jan 2023 — Present

Technical Lead and SRE Architect mentoring a team focused on infrastructure provisioning and developer experience.

  • Designed and built a self-service AI Agent Platform with MCP tool-calling, multi-agent orchestration, and HITL approval flows for internal infrastructure-support workflows — ~200K req/day, ~800 registered tools, ~60 agents in production. Engineering teams contribute their own tools and agents.
  • Slashed EKS cluster costs by 40% via Karpenter consolidation, migration to ARM and Spot nodes, VPA, and custom scheduling.
  • Consolidated 12 legacy API gateways onto a single Kong gateway at ~10k RPS with 99.99% uptime during cutover; built a GitOps-driven control plane for API route management.
  • Designed and deployed a fully automated cloud-resource provisioning platform with GitOps (Terraform + Terragrunt), cutting provisioning time for new services from days to minutes.

Careem

Dubai, UAE

Senior Site Reliability Engineer

Dec 2020 — Jan 2023

Led migration of 300+ microservices from legacy AWS infrastructure to AWS EKS.

  • Engineered automated provisioning of multiple EKS clusters with Terraform + Terragrunt — repeatable, auditable infrastructure foundation.
  • Built the observability stack with Prometheus, Thanos, and Grafana — high-cardinality metric storage, SLO dashboards, and cost tracking across the EKS fleet.
  • Improved deployment velocity and safety with a Linkerd service mesh and automated canary rollouts driven by Flagger.
  • On-call 24×7 on a bi-weekly rotation.

Intech IIS

On-prem & Cloud IIoT

Infrastructure Architect & Team Lead

Jul 2019 — Dec 2020

Designed and implemented the infrastructure of a Cloud IIoT Platform while leading a team of 6 polyglot engineers in a Kanban model: workload distribution, technology selection, mentoring, customer integration.

  • Built a bespoke Kubernetes Ingress Operator for bare-metal Kubernetes clusters using the Red Hat Operator SDK (custom control-plane work, not just operator consumption).
  • Secured and streamlined the software delivery lifecycle by integrating ArgoCD with Keycloak — a secure, auditable GitOps workflow for Helm-based deployments.
  • Implemented Google SRE golden signals (latency / traffic / errors / saturation) on bare-metal Kubernetes using Prometheus.

Intech IIS

Senior Software Developer / DevOps Engineer

Jan 2017 — Jun 2019

Containerized all services of a distributed Industrial IoT platform with Docker; orchestrated via Kubernetes and D2iQ. Polyglot delivery across Python, Go, Java, and C#.

  • Built a serverless framework in Python running code on containers via Kubernetes / Marathon / Metronome.
  • Built a parallel metric aggregation publisher for StatsD in Go.
  • Built an OPC → MQTT data pipeline service in C# using Avro RPC and REST.
  • Built an OpenID Connect Identity Provider microservice using Spring OAuth2 with JDBC, Active Directory, LDAP, and Mongo storage backends.
  • Implemented dynamic proxying and JWT bearer authentication using OpenResty LuaJIT.

Diyatech (for Alachisoft)

Software Developer · NCache & NosDB internals

Jun 2014 — Jan 2017

Core developer for NCache Enterprise (Redis-class distributed cache native to .NET) and NosDB Enterprise (CouchDB/MongoDB-class NoSQL JSON store native to .NET), both in C#.

  • Reimplemented Microsoft's Collections to prevent Large Object Heap (LOH) leakage.
  • Implemented the ASP.NET Core Session State Management middleware with NCache backend storage and full session-locking support.
  • Built file-storage-based B+ Tree indexing with support for single-attribute and compound indexes.
  • Built the query execution system using Streams and nested Enumerables.

Excellence Award, Alachisoft Inc. (Nov 2015)

Writing

Selected technical writing

2020-08 · 4 parts · OIDC, RBAC, K8s integration

Keycloak Access Control Series

Token authentication, authentication flows, role-based access control, and Kubernetes integration.

Full archive on Medium.

Contact

Let's talk