About the Role:
We are seeking a highly motivated Production Support Engineer with 2+ years of experience to ensure the continuous and efficient operation of our production systems. In this role, you will be responsible for monitoring, troubleshooting, and resolving production issues in real time, as well as improving the overall stability and performance of our services.
You will work closely with development, QA, and operations teams to address incidents, identify root causes, and implement long-term solutions. If you thrive in high-pressure environments and enjoy problem-solving, this could be a perfect fit for you.
Key Responsibilities:
- Monitor the health, performance, and availability of production systems and services
- Diagnose and resolve production issues quickly, minimizing downtime and impact on end-users
- Provide on-call support for production incidents and manage issue escalation as necessary
- Collaborate with development teams to investigate root causes of production issues and propose solutions
- Perform system health checks and regular system maintenance tasks to ensure optimal performance
- Implement monitoring tools and alerting systems to proactively identify potential issues before they impact users
- Deploy bug fixes, patches, and system upgrades in production environments
- Document issues, resolution steps, and operational procedures for knowledge sharing
- Assist in post-incident reviews and implement improvements based on lessons learned
- Help implement change management processes to ensure smooth and controlled deployments
- Ensure adherence to SLAs (Service Level Agreements) for incident resolution and response time
Qualifications:
Required:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field
- 2+ years of experience in production support or operations management in a tech environment
- Familiarity with Linux/Unix or Windows server administration
- Strong experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Nagios, New Relic)
- Ability to work with log aggregation and analysis tools (e.g., ELK Stack, Splunk)
- Proficiency in troubleshooting application, infrastructure, and network issues
- Experience with databases (e.g., MySQL, PostgreSQL, MongoDB)
- Knowledge of incident management tools (e.g., JIRA, ServiceNow)
- Strong understanding of cloud platforms (e.g., AWS, Azure, GCP) and cloud infrastructure
- Familiarity with CI/CD pipelines and deployment automation tools
Preferred:
- Experience in automation and scripting (e.g., Bash, Python, shell scripting)
- Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes
- Experience in load balancing, scaling, and disaster recovery practices
- Knowledge of ITIL or other IT operations frameworks
- Experience in release management and deployment strategies