Site Reliability Engineer (SRE) – Platform Reliability & Operational Excellence IRC278895

Requirements

Overview

At Client, we’re scaling a mission‑critical safety and automation platform as we evolve from a monolith into distributed, event‑driven and microservice-based systems. Reliability, latency, and operational efficiency are foundational—not afterthoughts.
We’re seeking a Site Reliability Engineer (SRE) who blends software engineering discipline with systems, infrastructure, and observability expertise. You’ll own availability, performance, scalability, and production readiness across services—driving automation, reducing toil, and enabling fast, safe delivery.
This role is for someone who wants to shape a modern reliability culture while protecting a platform that directly advances road safety through real-time data, analytics, and AI.

Key Responsibilities
· Define, implement, and iterate on SLIs/SLOs (latency, availability, errors, saturation); operationalize error budgets and trigger corrective action.
· Engineer end‑to‑end observability (metrics, logs, traces, events) leveraging Datadog to accelerate detection and root cause analysis.
· Automate infrastructure (Terraform), deployment workflows, self‑healing mechanisms, and progressive delivery (canary / blue‑green).
· Lead incident lifecycle: detection, triage, mitigation, coordination, communication, and high-quality post‑incident reviews that drive systemic fixes.
· Build and optimize CI/CD pipelines (GitHub Actions or equivalent) with reliability, rollback safety, and change quality controls.
· Perform capacity & performance engineering: load modeling, autoscaling policies, cost / efficiency tuning.
· Reduce toil via tooling, runbooks, proactive failure analysis, chaos / fault injection (AWS FIS or similar).
· Partner with development teams on architectural reviews and production readiness (operability, resilience, security, observability).
· Enforce least‑privilege, secrets management, and infrastructure security; integrate policy as code.
· Improve alert quality (noise reduction, actionable context) to lower MTTR and alert fatigue.
· Champion reliability patterns: backpressure, graceful degradation, circuit breaking.
· Support distributed systems debugging (timeouts, partial failures, consistency anomalies) with emphasis on AI.
· Contribute to governance of change management, deployment health gates, and release safety.
· Document playbooks, escalation paths, and evolving reliability standards.
· Treat reliability as a product: roadmap, KPIs, stakeholder alignment, continuous improvement.

Preferred Qualifications

· 3+ years in SRE / Production Engineering / DevOps.
· Proficient in one or more of Go, Python, TypeScript/Node.js, or Ruby for automation, tooling, and services.
· Strong Linux internals and networking fundamentals (DNS, TLS, HTTP, routing, load balancing).
· Hands-on Infrastructure as Code (Terraform) and GitOps workflows.
· Containers & orchestration (AWS ECS) including resource tuning & scaling strategies.
· Production-grade observability: Prometheus, Grafana, OpenTelemetry, ELK, Datadog (preferred).
· CI/CD design (pipelines, promotion strategies, automated verification, rollout / rollback).
· Full incident management lifecycle & quantitative postmortem practices.
· Experience with distributed systems failure modes (latency spikes, retry storms, thundering herds).
· Chaos / fault injection frameworks (AWS FIS preferred).
· Performance / load testing (k6, Locust, Gatling) and profiling for bottleneck isolation.
· BS/MS in Computer Science, Engineering, or equivalent practical expertise.

Mindset & Behaviors

· Bias for automation and measurable reliability outcomes.
· Calm, clear communicator under pressure; drives clarity during ambiguity.
· Sees reliability as a product with customers, SLAs, and iteration cycles.
· Data-driven; prefers leading indicators over reactive firefighting.
· Raises the bar for operational excellence and shared ownership.

Why Join Us
· Shape quality & reliability strategy for a modern, mission-driven safety platform.
· Direct impact: your work protects communities and improves public safety outcomes.
· Work across observability, distributed systems, infrastructure automation, and high-velocity delivery.
· Influence engineering culture: shift-left reliability, proactive resilience, sustainable on-call.
· Collaborate with teams modernizing architecture (microservices, event streaming, serverless, edge).
· Leverage advanced tooling (Datadog, Terraform, progressive delivery frameworks).
· Join a culture focused on learning loops, autonomy, and meaningful impact.

What we offer

Culture of caring. At GlobalLogic, we prioritize a culture of caring. Across every region and department, at every level, we consistently put people first. From day one, you’ll experience an inclusive culture of acceptance and belonging, where you’ll have the chance to build meaningful connections with collaborative teammates, supportive managers, and compassionate leaders.

Learning and development. We are committed to your continuous learning and development. You’ll learn and grow daily in an environment with many opportunities to try new things, sharpen your skills, and advance your career at GlobalLogic. With our Career Navigator tool as just one example, GlobalLogic offers a rich array of programs, training curricula, and hands-on opportunities to grow personally and professionally.

Interesting & meaningful work. GlobalLogic is known for engineering impact for and with clients around the world. As part of our team, you’ll have the chance to work on projects that matter. Each is a unique opportunity to engage your curiosity and creative problem-solving skills as you help clients reimagine what’s possible and bring new solutions to market. In the process, you’ll have the privilege of working on some of the most cutting-edge and impactful solutions shaping the world today.

Balance and flexibility. We believe in the importance of balance and flexibility. With many functional career areas, roles, and work arrangements, you can explore ways of achieving the perfect balance between your work and life. Your life extends beyond the office, and we always do our best to help you integrate and balance the best of work and life, having fun along the way!

High-trust organization. We are a high-trust organization where integrity is key. By joining GlobalLogic, you’re placing your trust in a safe, reliable, and ethical global company. Integrity and trust are a cornerstone of our value proposition to our employees and clients. You will find truthfulness, candor, and integrity in everything we do.

About GlobalLogic

GlobalLogic, a Hitachi Group Company, is a trusted digital engineering partner to the world’s largest and most forward-thinking companies. Since 2000, we’ve been at the forefront of the digital revolution – helping create some of the most innovative and widely used digital products and experiences. Today we continue to collaborate with clients in transforming businesses and redefining industries through intelligent products, platforms, and services.