[Remote] Senior Site Reliability Engineer

Work from home, full-time role

Note: This is a remote position open to candidates in the USA. HavocAI is a leader in collaborative autonomy, focused on solving complex human problems through advanced technology. The company is seeking a Senior Site Reliability Engineer to ensure the availability, performance, and resilience of mission-critical services while collaborating with various teams to improve operational maturity and reliability standards.

Responsibilities

  • Design and evolve reliability architecture for distributed and cloud-hosted systems
  • Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
  • Partner with platform and application teams to design systems for reliability, scalability, and operability
  • Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
  • Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
  • Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
  • Conduct root cause analysis for complex production incidents and drive long-term corrective actions
  • Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
  • Reduce operational toil through tooling, automation, and process improvements
  • Help build a culture of ownership, accountability, and continuous improvement across production systems
  • Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
  • Ensure services and data pipelines are observable, debuggable, and performant in production
  • Drive performance analysis and tuning across infrastructure, application, and service layers
  • Improve alert quality, reduce noise, and ensure operational signals are actionable
  • Partner with engineering teams to define meaningful reliability and performance metrics
  • Build automation to improve system reliability, deployment safety, and recovery processes
  • Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
  • Support and improve Kubernetes-based environments and containerized workloads
  • Contribute to infrastructure-as-code practices and platform automation
  • Help define operational standards for cloud infrastructure, deployment workflows, and production services
  • Collaborate with security teams to ensure secure and resilient system design
  • Participate in disaster recovery planning, backup strategy, and resilience testing
  • Maintain strong operational practices around access control, secrets management, change management, and production access
  • Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases

Skills

  • 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
  • Strong experience operating large-scale distributed production systems
  • Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
  • Hands-on experience with Kubernetes and container orchestration
  • Programming or scripting experience in Go, Python, or similar languages
  • Experience designing and operating observability systems for production environments
  • Proven ability to lead incident response and drive reliability improvements
  • Strong communication skills and ability to collaborate across engineering teams
  • Ability to operate calmly and effectively under pressure
  • Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required
  • Experience supporting autonomy, robotics, simulation, real-time systems, or data-intensive platforms
  • Familiarity with AWS and large-scale cloud infrastructure
  • Experience with chaos engineering, fault injection, or resilience testing
  • Knowledge of CI/CD systems and progressive delivery practices
  • Experience working in high-reliability, safety-critical, defense, or mission-critical environments
  • Experience with Infrastructure as Code tools such as Terraform or Pulumi
  • Experience with Prometheus, Grafana, OpenTelemetry, Datadog, ELK/OpenSearch, or similar observability tools

Benefits

  • 100% Employer-Paid Health, Dental, and Vision Insurance for you and your family
  • Life Insurance (Employer Paid)
  • Ability to participate in the company's 401(k) program (Matching)
  • Unlimited PTO policy with an enforced 2-week minimum
  • Equity Package
  • Work / Home Office Stipend
  • Global Entry
  • 16-Week Paid Parental Leave
  • Monthly Health and Wellness Stipend
