Site Reliability Engineer / Expert / Specialist (immediately available candidates sought), Delhi, India

Organization: Société Internationale de Télécommunications Aéronautiques (SITA)
Country: India
Field location: New Delhi
Office: SITA in New Delhi

Overview

WELCOME TO SITA

At SITA, we keep airports moving, airlines flying smoothly, and borders open. Our technology and communication innovations power the success of the global air travel industry.

You'll find us in 95% of international airports, working closely with over 2,500 transportation and government clients. Each partnership brings unique challenges, and we thrive on delivering fresh solutions and cutting-edge tech to keep operations running like clockwork. We don't just move the world forward; we're proud to be recognized as a Great Place to Work® by 79% of our employees and certified in most of our growing locations. Here, we feel empowered, supported, and inspired to grow.

Are you ready to love your job?

The adventure begins right here, with you, at SITA.

ABOUT THE ROLE & TEAM

As a Site Reliability Engineer / Expert / Specialist, you will be responsible for the proactive support of products, ensuring high product performance that is continuously improved. You will identify and resolve the root causes of operational incidents and implement solutions that improve stability and prevent recurrence. You will manage the creation and maintenance of the event catalog used to trigger events, and develop both manual remediation approaches and automated workflows to resolve alerts. You will oversee the deployment of IT services and solutions, ensuring successful integration with minimal disruption. The role focuses on operational automation and integration to enhance efficiency and collaboration between development and operations within service operations.

WHAT YOU WILL DO:

Define, build, and maintain support systems to ensure high availability and performance.
Handle complex cases for the PSO.
Implement automation for system provisioning, self-healing, auto-recovery, deployment, and monitoring.
Perform incident response and root cause analysis (RCA) for critical system failures.
Monitor system performance and establish Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs).
Collaborate with Development and Operations to integrate reliability best practices, including zero-downtime architecture.
Proactively identify and remediate performance issues.
Work closely with Product T&E, ICE, and Service Architects on the productization of new products, acting as the SGS technical expert.
Coordinate with internal and external stakeholders to improve service performance and ensure high availability.
Ensure Operations readiness to support new products.
Be accountable within SGS for the availability and performance of in-scope products.

Problem Management

Conduct thorough problem investigations and root cause analyses to diagnose recurring incidents and service disruptions.
Coordinate with Incident Management teams and collaborate with PSOs and Engineering/Product teams to implement permanent solutions.
Monitor effectiveness of problem resolution activities and provide regular reporting to ensure continuous improvement.

Event Management

Define, build, and maintain an event catalog specifying active events, thresholds, and remediation actions; optimize it for efficiency.
Develop event response protocols, provide training, and ensure efficient incident handling.

Customer & Operational Support

Collaborate with Customer Success Managers to implement initiatives that enhance customer satisfaction and retention.
Prepare reports, documentation, and communication materials covering customer metrics, updates, and product changes.
Identify and implement improvements in internal processes and workflows.
Contribute to knowledge management resources such as FAQs and training materials.

Data Steward Responsibilities

Implement data governance policies defined by the Data Owner and ensure adherence to standards.
Monitor data quality, consistency, and compliance on an ongoing basis.
Act as a Subject Matter Expert (SME) for data within the assigned area, providing guidance and answering queries.

Qualifications

EXPERIENCE:

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.

5+ years of experience in IT operations, service management, or infrastructure management, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Manager.
Proven experience managing high-availability systems and ensuring operational reliability.
Extensive experience in root cause analysis (RCA), incident management, and developing permanent solutions for recurring service disruptions.
Hands-on experience with CI/CD pipelines, automation, system performance monitoring, and infrastructure as code (IaC).
Strong background in collaborating with cross-functional teams (Development, Operations, Engineering, etc.) to improve operational processes and service delivery.
Experience managing deployments, conducting risk assessments, and optimizing event and problem management processes.
Familiarity with cloud technologies, containerization, and scalable architectures, including zero-downtime deployment strategies.

Technical Skills (Must-Have):

Strong AKS and on-prem Kubernetes (K8s) skills and experience.
Scripting (Ansible, Bash, Python; any combination would be great).
Automation.
CI/CD pipelines.
Terraform exposure.
Azure or AWS skills.
Basic database skills.
Strong problem-solving skills and the ability to learn quickly.
SRE mindset.