Nishant Mandil — Site Reliability Engineer

About

The person behind the pipelines.

I'm a Site Reliability Engineer with 3+ years of experience operating and improving production systems at scale. My journey started with curiosity about how complex systems stay alive — and evolved into a career built around making sure they do.

At Colt Technology Services, I've owned end-to-end service lifecycles, led infrastructure migrations during acquisitions with zero downtime, and built monitoring strategies that caught incidents before users noticed them.

Recently I've been exploring the intersection of AI and SRE workflows — building tools that use LLMs to accelerate root cause analysis and reduce the cognitive load on engineers during production incidents.

AWS, Kubernetes, Jenkins, Prometheus, Grafana, Python, Docker, Terraform, SLIs/SLOs, RCA

Experience

Where I've shipped.

Aug 2022 – Present

Software Engineer / Site Reliability

Colt Technology Services

Owned end-to-end service lifecycle — design, implementation, production deployment, and continuous monitoring ensuring operational stability
Led infrastructure and network integration during a large-scale acquisition, migrating production workloads with zero downtime and no customer impact
Defined monitoring strategies aligned with SLIs/SLOs, reducing incident detection and response time by 30%
Improved overall service uptime by 20% through proactive capacity planning, failure analysis, and automated remediation
Designed and maintained scalable CI/CD systems achieving 98% release reliability while reducing manual intervention
Performed deep root cause analysis on production incidents, implementing improvements that sustained 99.9% availability
Optimized backend services improving API response time by 50% through profiling and system-level tuning

Projects

Things I've built.

Real tools solving real problems — from AI-powered incident response to intelligent job outreach automation. Each project reflects an SRE mindset: reliability, observability, and automation at the core.

01 / Featured

SRE Log Analyzer

From raw logs to RCA report in seconds. Powered by free LLMs.

Production incidents cost time. Every minute an engineer spends manually grepping through logs is a minute users are impacted. This tool feeds raw logs into a free LLM and outputs a structured Root Cause Analysis report automatically — turning what used to take 30+ minutes of manual investigation into a near-instant insight.

Ingests raw production logs and detects error patterns, anomalies, and failure chains automatically

Uses LLM-based analysis to generate structured RCA reports with probable root cause and remediation suggestions

Built on free LLM APIs — no paid OpenAI credits required, fully open-source stack

Reduces mean time to diagnose by automating the manual log triage step
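
The pattern-detection and prompt-building steps listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (`normalize`, `top_error_patterns`, `build_rca_prompt`), the regexes, and the prompt wording are all hypothetical, and the call to an LLM API is deliberately left out.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical patterns: replace volatile tokens so similar errors group together.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?")
# 0x-prefixed values, or hex-looking IDs of 8+ characters containing a digit.
_HEX_ID = re.compile(r"\b0x[0-9a-fA-F]+\b|\b(?=[0-9a-f]*\d)[0-9a-f]{8,}\b")
_NUMBER = re.compile(r"\b\d+\b")
_ERROR_LEVEL = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")


def normalize(line: str) -> str:
    """Collapse timestamps, IDs, and numbers into placeholders."""
    line = _TIMESTAMP.sub("<ts>", line)
    line = _HEX_ID.sub("<id>", line)
    line = _NUMBER.sub("<n>", line)
    return line.strip()


def top_error_patterns(lines: Iterable[str], limit: int = 5) -> List[Tuple[str, int]]:
    """Count normalized error lines and return the most frequent patterns."""
    counts = Counter(normalize(l) for l in lines if _ERROR_LEVEL.search(l))
    return counts.most_common(limit)


def build_rca_prompt(patterns: List[Tuple[str, int]]) -> str:
    """Assemble a plain-text prompt; sending it to an LLM is out of scope here."""
    parts = [
        "You are assisting with a production incident.",
        "Error patterns, most frequent first:",
    ]
    for pattern, count in patterns:
        parts.append(f"- {count}x {pattern}")
    parts.append("Suggest a probable root cause and remediation steps.")
    return "\n".join(parts)
```

Grouping by normalized signature before prompting keeps the prompt short, which matters when the model behind a free API has a limited context window.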

Python, LLM, Log Analysis, RCA, AI/ML, SRE Tooling


~30 min: manual RCA eliminated
Free: LLM APIs, no cost
Auto: pattern detection

02 / Infrastructure

Scalable CI/CD — Jenkins + K8s

Reliability-first delivery. Security baked in. Zero guesswork.

Most CI/CD pipelines treat reliability as an afterthought. This one doesn't. Built with SRE principles from day one — every stage has quality gates, security scans, and automated validation before a single byte reaches production. Simulates a real enterprise workflow end-to-end.

End-to-end pipeline: feature commit → secure build → automated validation → controlled K8s deployment → live monitoring

Shift-left security with SonarQube static analysis + Trivy container vulnerability scanning at build time

RBAC-secured Kubernetes deployments with automated rollback on failure detection

Full observability via Prometheus + Grafana across all pipeline stages in real time
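
The control flow described above (ordered stages, a gate on each, rollback when something fails after deployment has started) can be modeled in a few lines. The real project runs on Jenkins and Kubernetes; the Python below is only a sketch of the gating logic, and `Stage`, `run_pipeline`, `touches_cluster`, and the stage names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]        # returns True when the stage's gate passes
    touches_cluster: bool = False  # True for stages that change the live deployment


def run_pipeline(stages: List[Stage], rollback: Callable[[], None]) -> List[str]:
    """Run stages in order and stop at the first failed gate.

    Rollback is triggered only if a cluster-changing stage has already started,
    so a failed build or scan never touches the running deployment.
    """
    events: List[str] = []
    cluster_changed = False
    for stage in stages:
        if stage.touches_cluster:
            cluster_changed = True
        if stage.run():
            events.append(f"{stage.name}: passed")
            continue
        events.append(f"{stage.name}: FAILED")
        if cluster_changed:
            rollback()
            events.append("rollback: triggered")
        break
    return events
```

In this model a failed scan halts the run before deployment, while a failed post-deploy check calls `rollback` automatically, which mirrors the shift-left and automated-rollback points in the list above.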

Jenkins, Kubernetes, Docker, AWS, SonarQube, Trivy, Prometheus, Grafana


98%: release reliability
Zero: manual rollbacks
Multi: security gates

03 / Automation

HR Email Automator

Intelligent outreach. Zero bounces. Full observability.

Built out of necessity — most cold outreach tools send emails blindly and flood your inbox with bounce notices. This tool applies SRE thinking to job outreach: verify before you act, detect failures in real time, log everything. Complete with a live web dashboard, SMTP verification, IMAP bounce detection, and persistent reporting.

3-layer email verification: format check → fake address blocklist → DNS MX record validation before sending

Real-time bounce detection via IMAP: checks the inbox during the 45s sleep window and updates the report automatically

Live web dashboard with stat counters, progress bar, dark log console, and report viewer

Crash-safe persistent Excel report — every result written to disk immediately, never lost mid-run
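
The 3-layer verification listed above might look something like the sketch below. It is an illustration rather than the tool's real code: `verify_email`, the blocklist contents, and the simplified regex are assumptions, and the MX lookup is passed in as a callable (`has_mx`) so the example stays free of third-party dependencies.

```python
import re
from typing import Callable, Tuple

# Deliberately simple format check; not a full RFC 5322 validator.
_EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Hypothetical blocklist of placeholder local parts and domains.
FAKE_LOCAL_PARTS = {"test", "example", "noreply", "no-reply"}
FAKE_DOMAINS = {"example.com", "test.com"}


def verify_email(address: str, has_mx: Callable[[str], bool]) -> Tuple[bool, str]:
    """Apply the three layers in order, cheapest first.

    Returns (is_valid, reason). The DNS lookup runs last so that obviously
    bad addresses never cost a network round trip.
    """
    address = address.strip().lower()
    # Layer 1: format check.
    if not _EMAIL_RE.match(address):
        return False, "invalid format"
    # Layer 2: fake address blocklist.
    local, domain = address.rsplit("@", 1)
    if local in FAKE_LOCAL_PARTS or domain in FAKE_DOMAINS:
        return False, "blocklisted"
    # Layer 3: the domain must publish at least one MX record.
    if not has_mx(domain):
        return False, "no MX record"
    return True, "ok"
```

In real use, `has_mx` could wrap a DNS library such as dnspython (for example, calling `dns.resolver.resolve(domain, "MX")` and treating a resolver exception as "no MX record"). Injecting the lookup also makes the check easy to test with a stub.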

Python, Flask, SMTP, IMAP, DNS, Automation, Web UI


3-layer: email verification
Live: bounce detection
Zero: data loss on crash

Contact

Let's build something reliable.

Open to new opportunities.

Whether it's a challenging SRE role, an interesting infrastructure problem, or just a conversation about reliability engineering — I'd love to connect.

Email: nishantmandil105@gmail.com
LinkedIn: nishant-mandil-07b165159
GitHub: github.com/nishantmandil
Phone: +91 8085569375

Available for opportunities

Based in Gurugram, India. Open to remote, hybrid, or relocation for the right role. Currently focused on SRE, DevOps, and Platform Engineering positions where reliability and automation truly matter.

💻 github.com/nishantmandil

▸ sre-log-analyzer ▸ scalable-cicd-jenkins-k8s ▸ hr-email-automator
