Nishant Mandil.
I build reliable systems at scale.
I'm a Site Reliability Engineer based in Gurugram, India, specializing in designing and operating highly available production systems. At Colt Technology Services, I'm currently focused on automating infrastructure, optimizing performance, and ensuring 99.9%+ uptime for critical services.
About Me
Getting to know me
I'm a Site Reliability Engineer with over 3 years of experience operating and improving production systems. My journey in tech started with a passion for understanding how complex systems work and ensuring they stay reliable.
At Colt Technology Services, I've led infrastructure integrations during acquisitions, designed monitoring strategies aligned with SLIs/SLOs, and improved service uptime by 20% through proactive capacity planning and automated remediation.
I specialize in building scalable CI/CD pipelines, containerized microservices, and cloud-native architectures. My approach combines deep technical expertise with a focus on operational excellence and continuous improvement.
Cloud & DevOps
AWS, Jenkins, GitLab CI, GitHub Actions
Containers & Orchestration
Docker, Kubernetes, Terraform, Ansible
Monitoring & Observability
Prometheus, Grafana, CloudWatch
Programming
Python, Bash, Shell, Perl
Experience
Where I've worked
Software Engineer / Site Reliability
- Owned end-to-end service lifecycle including low-level design, implementation, validation testing, production deployment, and continuous monitoring, ensuring system reliability and operational stability
- Led infrastructure and network integration during a large-scale acquisition, migrating production workloads with zero downtime and no customer impact
- Defined and optimized monitoring strategies and alerting workflows aligned with SLIs/SLOs, reducing incident detection and response time by 30%
- Improved overall service uptime by 20% through proactive capacity planning, failure analysis, and automated remediation of critical services
- Designed and maintained scalable CI/CD systems supporting multi-environment deployments, achieving 98% release reliability while reducing manual intervention
- Performed deep root cause analysis (RCA) on production incidents, implementing long-term reliability improvements that sustained 99.9% availability
- Optimized backend services and database performance through profiling and system-level tuning, improving API response time by 50%
Featured Projects
Things I've built
Production-Grade CI/CD & Reliability Platform
I engineered a secure, end-to-end CI/CD platform designed around SRE principles, with reliability, observability, and risk mitigation built into every stage of delivery.
This project simulates a real enterprise workflow: Feature request → Secure build → Automated validation → Controlled deployment → Continuous monitoring.
🧱 Core Architecture
Built on isolated cloud infrastructure using:
- Jenkins for pipeline orchestration
- Kubernetes for production deployment
- Docker for immutable workloads
- SonarQube + Trivy for shift-left security
- Prometheus + Grafana for observability
- Amazon Web Services for infrastructure isolation
🔁 Reliability-Driven Delivery Pipeline
The pipeline enforces quality and security gates:
- Automated build & unit testing
- Static code analysis with enforced quality thresholds
- Dependency & container vulnerability scanning
- Versioned artifact management via Nexus
- RBAC-secured Kubernetes deployments
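As a rough illustration of how such gates can be chained, here is a minimal Python sketch. It is not the project's actual pipeline code: the report keys (`coverage_percent`, `critical_vulns`), the thresholds, and every function name are hypothetical, chosen only to show a fail-fast gate sequence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


# Hypothetical thresholds for illustration; not taken from the project.
MIN_COVERAGE_PERCENT = 80.0
MAX_CRITICAL_VULNS = 0

Report = Dict[str, float]
Gate = Callable[[Report], GateResult]


def coverage_gate(report: Report) -> GateResult:
    """Pass only if reported test coverage meets the minimum threshold."""
    coverage = report.get("coverage_percent", 0.0)
    return GateResult(
        "static-analysis", coverage >= MIN_COVERAGE_PERCENT, f"coverage={coverage:.1f}%"
    )


def vulnerability_gate(report: Report) -> GateResult:
    """Pass only if the scan ran and found no more critical issues than allowed."""
    critical: Optional[float] = report.get("critical_vulns")
    passed = critical is not None and critical <= MAX_CRITICAL_VULNS
    return GateResult("vulnerability-scan", passed, f"critical={critical}")


def run_gates(report: Report, gates: List[Gate]) -> List[GateResult]:
    """Evaluate gates in order and stop at the first failure (fail fast)."""
    results: List[GateResult] = []
    for gate in gates:
        result = gate(report)
        results.append(result)
        if not result.passed:
            break
    return results


def deployable(results: List[GateResult]) -> bool:
    """A build is deployable only if at least one gate ran and all of them passed."""
    return bool(results) and all(r.passed for r in results)
```

In this sketch a missing scan result counts as a failure, so a build cannot reach deployment just because a scanner did not run.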
🎯 Key Outcomes
- 98% release reliability with automated validation
- Reduced deployment risk through security gates
- Real-time observability across all pipeline stages
Cloud-Native Microservices Platform
Architected and deployed a scalable microservices ecosystem with container orchestration, service mesh integration, and comprehensive observability across distributed systems.
🏗️ Platform Architecture
- Containerized microservices with Docker
- Kubernetes-based orchestration and auto-scaling
- Service-to-service communication with load balancing
- Centralized configuration management
- API gateway for unified access control
📡 Observability & Reliability
- Distributed tracing across service boundaries
- Centralized logging with ELK stack
- Prometheus metrics collection per service
- Health checks and automated recovery
- Circuit breakers for fault isolation
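To make the circuit-breaker idea concrete, here is a minimal Python sketch of the pattern. It is an illustration under stated assumptions, not the platform's implementation: the class name, the default threshold of 3 failures, and the 30-second reset timeout are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down period has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,       # hypothetical default
        reset_timeout: float = 30.0,      # seconds; hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cool-down elapsed; one trial call is allowed
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Trip when the threshold is reached, or re-open if a half-open trial fails.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for any hypothetical downstream call. Rejecting calls while the circuit is open keeps one failing service from tying up resources in its callers.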
🎯 Impact
- 99.9% platform availability achieved
- Horizontal scaling reduced response time by 50%
- Faster deployment cycles with isolated services
Enterprise Observability & Monitoring System
Designed and implemented a comprehensive monitoring platform aligned with SRE best practices, providing full-stack visibility from infrastructure to application metrics with intelligent alerting.
📈 Monitoring Layers
- Infrastructure metrics (CPU, memory, disk, network)
- Application performance monitoring (APM)
- Service-level indicators (SLIs) tracking
- Database query performance analysis
- Container and pod health monitoring
🔔 Intelligent Alerting
- SLO-based alerting to reduce noise
- Multi-channel notifications (Slack, PagerDuty, Email)
- Alert severity classification and routing
- Anomaly detection with machine learning
- Automated runbook integration
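One common way to implement SLO-based alerting is to page on error-budget burn rate rather than on raw error counts. The Python sketch below shows that arithmetic. The function names are hypothetical, and the 14.4 threshold is a commonly cited value for a one-hour window against a 30-day SLO, used here as an assumption rather than as this system's actual configuration.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was consumed
    return (failed / total) / error_budget(slo_target)


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, which filters out already-recovered spikes.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold
```

Requiring both windows to exceed the threshold is what cuts noise: a brief spike that has already recovered raises the long-window rate but not the short-window one, so nobody is paged.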
📊 Visualization & Reporting
- Real-time dashboards with custom views
- Historical trend analysis for capacity planning
- SLA compliance reporting
- Incident timeline visualization
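Trend analysis for capacity planning can be as simple as fitting a straight line to historical usage and projecting when it crosses a limit. The Python sketch below assumes daily samples of a usage percentage. The function names and the linear-growth assumption are illustrative only and do not describe the system's actual forecasting method.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("x values must not all be identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def days_until_full(days: Sequence[float], usage_percent: Sequence[float],
                    capacity_percent: float = 100.0) -> Optional[float]:
    """Days after the last sample until usage reaches capacity.

    Returns None when usage is flat or shrinking, since the fitted
    line then never reaches the capacity limit.
    """
    slope, intercept = fit_line(days, usage_percent)
    if slope <= 0:
        return None
    crossing_day = (capacity_percent - intercept) / slope
    return max(0.0, crossing_day - days[-1])
```

For example, disk usage of 50%, 52%, 54%, and 56% on days 0 through 3 grows by 2 points per day, so the line reaches 100% on day 25, which is 22 days after the last sample.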
🎯 Results
- 30% faster incident detection and response
- 60% reduction in alert noise through SLO alignment
- Proactive issue identification before user impact
Academic Projects
College work and learning
onlineDiary
A secure web application to write and store personal diaries online. Features user authentication, data protection, and a fully responsive design to ensure privacy and accessibility across devices.
CHATTER
A real-time chat application to connect with strangers globally. Built with ReactJS and Firebase, featuring Google authentication, group chat functionality, and the ability to create custom chat rooms.
ALENA - Voice Assistant
A Python-based desktop virtual assistant operated entirely by voice commands. Capable of sending emails, opening/closing applications, searching Wikipedia, and performing various tasks through voice interaction.
Skills & Technologies
My technical toolkit
Reliability & Operations
Cloud & DevOps
Containers & IaC
Monitoring & Observability
Programming & Scripting
Databases & Systems
Get In Touch
I'm currently open to new opportunities and interesting projects. Whether you have a question or just want to say hi, feel free to reach out!