Hi, my name is

Nishant Mandil.

I build reliable systems at scale.

I'm a Site Reliability Engineer based in Gurugram, India, specializing in designing and operating highly available production systems. At Colt Technology Services, I currently focus on automating infrastructure, optimizing performance, and maintaining 99.9%+ uptime for critical services.

About Me

Getting to know me

I'm a Site Reliability Engineer with over 3 years of experience operating and improving production systems. My journey in tech started with a passion for understanding how complex systems work and ensuring they stay reliable.

At Colt Technology Services, I've led infrastructure integrations during acquisitions, designed monitoring strategies aligned with SLIs/SLOs, and improved service uptime by 20% through proactive capacity planning and automated remediation.

I specialize in building scalable CI/CD pipelines, containerized microservices, and cloud-native architectures. My approach combines deep technical expertise with a focus on operational excellence and continuous improvement.

Cloud & DevOps

AWS, Jenkins, GitLab CI, GitHub Actions

Containers & Orchestration

Docker, Kubernetes, Terraform, Ansible

Monitoring & Observability

Prometheus, Grafana, CloudWatch

Programming

Python, Bash, Shell, Perl

3+ Years Experience
99.9% Uptime

Experience

Where I've worked

Software Engineer / Site Reliability

Colt Technology Services
Aug 2022 – Present
  • Owned the end-to-end service lifecycle, including low-level design, implementation, validation testing, production deployment, and continuous monitoring, ensuring system reliability and operational stability
  • Led infrastructure and network integration during a large-scale acquisition, migrating production workloads with zero downtime and no customer impact
  • Defined and optimized monitoring strategies and alerting workflows aligned with SLIs/SLOs, reducing incident detection and response time by 30%
  • Improved overall service uptime by 20% through proactive capacity planning, failure analysis, and automated remediation of critical services
  • Designed and maintained scalable CI/CD systems supporting multi-environment deployments, achieving 98% release reliability while reducing manual intervention
  • Performed deep root cause analysis (RCA) on production incidents, implementing long-term reliability improvements that sustained 99.9% availability
  • Optimized backend services and database performance through profiling and system-level tuning, improving API response time by 50%

Featured Projects

Things I've built

🚀

Production-Grade CI/CD & Reliability Platform

I engineered a secure, end-to-end CI/CD platform around SRE principles, with reliability, observability, and risk mitigation built into every stage of delivery.

This project simulates a real enterprise workflow: Feature request → Secure build → Automated validation → Controlled deployment → Continuous monitoring.

🧱 Core Architecture

Built on isolated cloud infrastructure using:

  • Jenkins for pipeline orchestration
  • Kubernetes for production deployment
  • Docker for immutable workloads
  • SonarQube + Trivy for shift-left security
  • Prometheus + Grafana for observability
  • Amazon Web Services for infrastructure isolation

🔁 Reliability-Driven Delivery Pipeline

The pipeline enforces quality and security gates:

  • Automated build & unit testing
  • Static code analysis with enforced quality thresholds
  • Dependency & container vulnerability scanning
  • Versioned artifact management via Nexus
  • RBAC-secured Kubernetes deployments

🎯 Key Outcomes

  • 98% release reliability with automated validation
  • Reduced deployment risk through security gates
  • Real-time observability across all pipeline stages
Jenkins, Kubernetes, Docker, AWS, SonarQube, Trivy, Prometheus, Grafana
☁️

Cloud-Native Microservices Platform

Architected and deployed a scalable microservices ecosystem with container orchestration, service mesh integration, and comprehensive observability across distributed systems.

🏗️ Platform Architecture

  • Containerized microservices with Docker
  • Kubernetes-based orchestration and auto-scaling
  • Service-to-service communication with load balancing
  • Centralized configuration management
  • API gateway for unified access control

📡 Observability & Reliability

  • Distributed tracing across service boundaries
  • Centralized logging with ELK stack
  • Prometheus metrics collection per service
  • Health checks and automated recovery
  • Circuit breakers for fault isolation
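
To illustrate the circuit-breaker pattern listed above, here is a minimal Python sketch. It is not the platform's actual implementation: the class name, the failure threshold, the reset timeout, and the injectable clock are all assumptions chosen for the example.

```python
import time


class CircuitBreaker:
    """Illustrative breaker (hypothetical): closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # assumed threshold, not from the project
        self.reset_timeout = reset_timeout  # assumed cool-down in seconds
        self.clock = clock                  # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast without touching the dependency.
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each outbound call in `breaker.call(...)` means a failing dependency is rejected quickly during the cool-down instead of tying up callers, which is the fault-isolation effect the bullet refers to.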

🎯 Impact

  • 99.9% platform availability achieved
  • Response time reduced by 50% through horizontal scaling
  • Faster deployment cycles with isolated services
Docker, Kubernetes, Microservices, Prometheus, Grafana, ELK Stack
📊

Enterprise Observability & Monitoring System

Designed and implemented a comprehensive monitoring platform aligned with SRE best practices, providing full-stack visibility from infrastructure to application metrics with intelligent alerting.

📈 Monitoring Layers

  • Infrastructure metrics (CPU, memory, disk, network)
  • Application performance monitoring (APM)
  • Service-level indicators (SLIs) tracking
  • Database query performance analysis
  • Container and pod health monitoring

🔔 Intelligent Alerting

  • SLO-based alerting to reduce noise
  • Multi-channel notifications (Slack, PagerDuty, Email)
  • Alert severity classification and routing
  • Anomaly detection with machine learning
  • Automated runbook integration
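
One common way to make alerting SLO-based is to page on error-budget burn rate rather than on raw error counts. The Python sketch below shows that idea under stated assumptions: the function names, the 99.9% target, and the two-window check are illustrative, and the 14.4 threshold is the commonly cited value for spending 2% of a 30-day budget in one hour. It is not taken from the system described here.

```python
def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target=0.999, threshold=14.4):
    """Page only when BOTH windows burn budget faster than the threshold.

    Each window is an (errors, total) tuple. Requiring the short window to
    agree suppresses pages for spikes that have already recovered.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

For example, with a 99.9% target, 20 failed requests out of 1,000 is a burn rate of about 20, which is above the threshold; the page fires only if the shorter window is also above it.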

📊 Visualization & Reporting

  • Real-time dashboards with custom views
  • Historical trend analysis for capacity planning
  • SLA compliance reporting
  • Incident timeline visualization

🎯 Results

  • 30% faster incident detection and response
  • 60% reduction in alert noise through SLO alignment
  • Proactive issue identification before user impact
Prometheus, Grafana, AWS CloudWatch, Python, Alertmanager, PromQL

Academic Projects

College work and learning

📔

onlineDiary

A secure web application to write and store personal diaries online. Features user authentication, data protection, and a fully responsive design to ensure privacy and accessibility across devices.

HTML, CSS, JavaScript, Bootstrap, PHP
💬

CHATTER

A real-time chat application to connect with strangers globally. Built with ReactJS and Firebase, featuring Google authentication, group chat functionality, and the ability to create custom chat rooms.

ReactJS, Firebase, Google Auth, Real-time DB
🎤

ALENA - Voice Assistant

A Python-based desktop virtual assistant operated entirely by voice commands. Capable of sending emails, opening/closing applications, searching Wikipedia, and performing various tasks through voice interaction.

Python, Voice Recognition, Desktop App, AI Assistant

Skills & Technologies

My technical toolkit

Reliability & Operations

SLIs/SLOs/SLAs, Incident Management, RCA, High Availability, Capacity Planning, Performance Tuning

Cloud & DevOps

AWS (EC2, S3, IAM, VPC), Jenkins, GitHub Actions, GitLab CI, CloudWatch

Containers & IaC

Docker, Kubernetes, Terraform, Ansible

Monitoring & Observability

Prometheus, Grafana, Log Analysis, Alerting

Programming & Scripting

Python, Bash, Shell, Perl

Databases & Systems

PostgreSQL, MongoDB, Linux Administration, TCP/IP, DNS, Load Balancing

Get In Touch

I'm currently open to new opportunities and interesting projects. Whether you have a question or just want to say hi, feel free to reach out!