Site Reliability Engineer · Gurugram, India

Nishant Mandil.

I build systems that don't break at 3am. I specialize in reliability engineering, intelligent automation, and scalable infrastructure at Colt Technology Services.

3.5+ Years SRE · 99.9% Uptime · 20% Uptime Gain
About

The person behind the pipelines.

I'm a Site Reliability Engineer with 3.5+ years of experience operating and improving production systems at scale. My journey started with curiosity about how complex systems stay alive, and it evolved into a career built around making sure they do.

At Colt Technology Services, I've owned end-to-end service lifecycles, led infrastructure migrations during acquisitions with zero downtime, and built monitoring strategies that caught incidents before users noticed them.

Recently I've been exploring the intersection of AI and SRE workflows — building tools that use LLMs to accelerate root cause analysis and reduce the cognitive load on engineers during production incidents.

AWS · Kubernetes · Jenkins · Prometheus · Grafana · Python · Docker · Terraform · SLIs/SLOs · RCA
Experience

Where I've shipped.

Aug 2022 – Present
Software Engineer / Site Reliability
Colt Technology Services
  • Owned the end-to-end service lifecycle (design, implementation, production deployment, and continuous monitoring) to ensure operational stability
  • Led infrastructure and network integration during a large-scale acquisition, migrating production workloads with zero downtime and no customer impact
  • Defined monitoring strategies aligned with SLIs/SLOs, reducing incident detection and response time by 30%
  • Improved overall service uptime by 20% through proactive capacity planning, failure analysis, and automated remediation
  • Designed and maintained scalable CI/CD systems achieving 98% release reliability while reducing manual intervention
  • Performed deep root cause analysis on production incidents, implementing improvements that sustained 99.9% availability
  • Optimized backend services, improving API response time by 50% through profiling and system-level tuning
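For context on the availability figure above, a 99.9% target leaves only a small error budget. The sketch below is illustrative arithmetic, not tooling from the role: the 30-day window and the function name `allowed_downtime_minutes` are assumptions made for the example.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, that an availability SLO allows."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60  # minutes in the window
    return (1.0 - slo) * total_minutes


if __name__ == "__main__":
    budget = allowed_downtime_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions the budget works out to roughly 43 minutes per 30 days, which is why early detection and automated remediation matter at this availability level.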
Projects

Things I've built.

Real tools solving real problems — from AI-powered incident response to intelligent job outreach automation. Each project reflects an SRE mindset: reliability, observability, and automation at the core.

01 / Featured
SRE Log Analyzer
From raw logs to RCA report in seconds. Powered by free LLMs.

Production incidents cost time. Every minute an engineer spends manually grepping through logs is a minute users are impacted. This tool feeds raw logs into a free LLM and outputs a structured Root Cause Analysis report automatically, turning what used to take 30+ minutes of manual investigation into near-instant insight.

Ingests raw production logs and detects error patterns, anomalies, and failure chains automatically
Uses LLM-based analysis to generate structured RCA reports with probable root cause and remediation suggestions
Built on free LLM APIs — no paid OpenAI credits required, fully open-source stack
Reduces mean time to diagnose by removing most of the manual log triage
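The flow described in the points above can be sketched roughly as follows. This is an illustration only, not the project's actual code: the log format, the `ERROR_LEVELS` set, and the function names are all assumptions, and the call to the LLM itself is left out because the source does not name the API in use.

```python
import re
from collections import Counter

# Assumed log format for this sketch: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")
ERROR_LEVELS = {"ERROR", "CRITICAL", "FATAL"}


def normalize(message: str) -> str:
    """Collapse variable parts (hex values, numbers) so similar errors group together."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def error_patterns(lines):
    """Count normalized error messages, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in ERROR_LEVELS:
            counts[normalize(match.group("msg"))] += 1
    return counts.most_common()


def build_rca_prompt(patterns, top_n=5):
    """Turn the top error patterns into a prompt for whichever LLM is used."""
    bullets = [f"- {count}x {message}" for message, count in patterns[:top_n]]
    return (
        "You are assisting with a production incident.\n"
        "Recurring error patterns from the logs:\n"
        + "\n".join(bullets)
        + "\nGive the probable root cause and suggested remediation."
    )


if __name__ == "__main__":
    sample = [
        "2024-01-01T00:00:01Z ERROR Connection to db-3 timed out after 3000 ms",
        "2024-01-01T00:00:02Z INFO Health check ok",
        "2024-01-01T00:00:03Z ERROR Connection to db-7 timed out after 5000 ms",
    ]
    print(build_rca_prompt(error_patterns(sample)))
```

Grouping errors before prompting keeps the input short, which matters when working within the context limits of free LLM tiers.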
Python · LLM · Log Analysis · RCA · AI/ML · SRE Tooling
~30min manual RCA eliminated · Free LLM APIs, no cost · Auto pattern detection
02 / Infrastructure
Scalable CI/CD — Jenkins + K8s
Reliability-first delivery. Security baked in. Zero guesswork.

Most CI/CD pipelines treat reliability as an afterthought. This one doesn't. Built with SRE principles from day one — every stage has quality gates, security scans, and automated validation before a single byte reaches production. Simulates a real enterprise workflow end-to-end.

End-to-end pipeline: feature commit → secure build → automated validation → controlled K8s deployment → live monitoring
Shift-left security with SonarQube static analysis + Trivy container vulnerability scanning at build time
RBAC-secured Kubernetes deployments with automated rollback on failure detection
Full observability via Prometheus + Grafana across all pipeline stages in real time
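The gate-then-deploy behavior described above can be modeled in a few lines. The sketch below is a simplified illustration in Python, not the actual Jenkins pipeline definition: `Stage`, `run_pipeline`, and `deploy_with_rollback` are hypothetical names, and the callables stand in for real build, scan, and Kubernetes steps.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the quality gate passes


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failed gate."""
    log = []
    for stage in stages:
        passed = stage.run()
        log.append(f"{stage.name}: {'pass' if passed else 'fail'}")
        if not passed:
            break  # later stages, including deployment, never run
    return log


def deploy_with_rollback(
    deploy: Callable[[], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Deploy, check health, and roll back automatically if the check fails."""
    deploy()
    if healthy():
        return "deployed"
    rollback()
    return "rolled back"


if __name__ == "__main__":
    stages = [
        Stage("build", lambda: True),
        Stage("static-analysis", lambda: True),
        Stage("image-scan", lambda: False),  # simulated vulnerability finding
        Stage("deploy", lambda: True),
    ]
    for entry in run_pipeline(stages):
        print(entry)
```

The point of the design is that a failed scan blocks everything after it, and a failed health check triggers rollback without a human in the loop.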
Jenkins · Kubernetes · Docker · AWS · SonarQube · Trivy · Prometheus · Grafana
98% release reliability · 0 manual rollbacks · Multiple security gates
03 / Automation
HR Email Automator
Intelligent outreach. Zero bounces. Full observability.

Built out of necessity — most cold outreach tools send emails blindly and flood your inbox with bounce notices. This tool applies SRE thinking to job outreach: verify before you act, detect failures in real time, log everything. Complete with a live web dashboard, SMTP verification, IMAP bounce detection, and persistent reporting.

3-layer email verification: format check → fake address blocklist → DNS MX record validation before sending
Real-time bounce detection via IMAP: checks the inbox during a 45s sleep window and updates the report automatically
Live web dashboard with stat counters, progress bar, dark log console, and report viewer
Crash-safe persistent Excel report — every result written to disk immediately, never lost mid-run
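The three verification layers listed above can be sketched as a single function that returns the first check that fails. This is a minimal illustration, not the tool's actual code: the regex, the blocklist entries, and the name `verify_email` are assumptions. The MX lookup is passed in as a callable (`has_mx`) so the sketch does not depend on any particular DNS library.

```python
import re
from typing import Callable, Iterable

# Deliberately simple format check; real-world address rules are broader.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

# Hypothetical placeholder addresses that should never be mailed.
DEFAULT_BLOCKLIST = frozenset({"test@example.com", "noreply@example.com"})


def verify_email(
    address: str,
    has_mx: Callable[[str], bool],
    blocklist: Iterable[str] = DEFAULT_BLOCKLIST,
) -> str:
    """Run format, blocklist, and MX checks in order; return the first failure."""
    address = address.strip().lower()
    if not EMAIL_RE.match(address):
        return "invalid_format"
    if address in set(blocklist):
        return "blocklisted"
    domain = address.rsplit("@", 1)[1]
    if not has_mx(domain):  # caller supplies the DNS MX lookup
        return "no_mx"
    return "ok"


if __name__ == "__main__":
    # Stub resolver for demonstration: pretend only one domain has MX records.
    known = {"company.com"}
    for candidate in ["hr@company.com", "hr@nowhere.invalid", "not-an-email"]:
        print(candidate, "->", verify_email(candidate, lambda d: d in known))
```

Ordering the checks from cheapest to most expensive means the DNS query only runs for addresses that already passed the local checks.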
Python · Flask · SMTP · IMAP · DNS · Automation · Web UI
3-layer email verification · Live bounce detection · 0 data loss on crash
Skills

My technical toolkit.

🛡️
Reliability & Operations
SLIs / SLOs / SLAs · Incident Management · RCA · High Availability · Capacity Planning · Performance Tuning
☁️
Cloud & DevOps
AWS EC2 · AWS S3 · IAM · VPC · Jenkins · GitHub Actions · GitLab CI
🐳
Containers & IaC
Docker · Kubernetes · Terraform · Ansible · Helm · RBAC
📊
Monitoring & Observability
Prometheus · Grafana · CloudWatch · AlertManager · PromQL · Log Analysis
Programming & Scripting
Python · Bash · Shell · Perl · Flask · REST APIs
🗄️
Databases & Systems
PostgreSQL · MongoDB · Linux Admin · TCP/IP · DNS · Load Balancing
Contact

Let's build something reliable.

Open to new opportunities.

Whether it's a challenging SRE role, an interesting infrastructure problem, or just a conversation about reliability engineering — I'd love to connect.

Available for opportunities

Based in Gurugram, India. Open to remote, hybrid, or relocation for the right role. Currently focused on SRE, DevOps, and Platform Engineering positions where reliability and automation truly matter.