The Opportunity

What's The Role

We are seeking a dedicated Cloud Reliability Engineer to champion the reliability, availability, and security of our production SaaS platform. In this role, you will act as the first line of defense for cloud infrastructure, balancing your time between core day-to-day production operations (incident management, change management, monitoring, and triage) and automation that reduces operational toil. You will play a pivotal role in maintaining customer trust by strictly adhering to SLAs and compliance processes while driving continuous improvement through code.

What You'll Do

Operational Excellence & Incident Management

Monitoring & Triage: Proactively monitor cloud infrastructure health to ensure high availability and performance. Act as the primary owner for production alert monitoring, triage, and swift resolution.
Incident Response: Manage critical incidents and escalations from identification to resolution. Lead root cause analysis (RCA) and post-incident reviews to minimize Mean Time To Recovery (MTTR) and prevent recurrence.
Change & Release Management: Execute and track production upgrades, multi-tenant deployments, and change requests within defined SLAs, ensuring zero-downtime maintenance where possible.
Escalation Support: Handle escalated Support cases and provide infrastructure support for field teams and other environments.
24/7 Availability: Participate in a shift-based schedule and on-call rotation to provide round-the-clock support for critical production systems.

Automation & Continuous Improvement

Task Automation: Utilize Python and Jenkins to script and automate repetitive operational tasks, reducing manual intervention and increasing efficiency.
Tooling Optimization: Assist in maintaining and optimizing monitoring, alerting, and CI/CD tools to streamline workflows.
Process Evolution: Identify opportunities to shift left on operations, transforming manual runbooks into automated self-healing mechanisms over time.

What You Bring

2–5 years of professional experience in Cloud Operations, Site Reliability Engineering (SRE), or Kubernetes administration.
Hands-on experience with public cloud platforms (AWS, GCP, or Azure) in a production environment.
Operational knowledge of Kubernetes (EKS, GKE, or AKS) including troubleshooting and cluster management.
Moderate proficiency in scripting and automation, specifically using Python and Jenkins.
Strong understanding of ITIL processes (Incident, Change, Problem Management).
Demonstrated ability to prioritize tasks under pressure while maintaining strict SLAs.
Excellent collaboration skills to work effectively with Engineering, Product, and Support teams.
Bachelor’s degree in Computer Science, Information Technology, or equivalent work experience.

Preferred Skills

Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation.
Familiarity with cloud-native observability tools (e.g., CloudWatch, Stackdriver, Prometheus, Grafana).
Strong Linux system administration and networking troubleshooting skills.
Background in supporting enterprise-grade SaaS platforms with strict compliance and security requirements.

Working Conditions

Shift-Based Role: This position requires working in defined shifts to ensure global coverage.
On-Call: Regular participation in an on-call rotation is required.
Environment: Fast-paced, collaborative, and process-oriented environment with a strong focus on production stability.

Hybrid Work at ThoughtSpot

This office-assigned role is available as a hybrid position.

Spotters assigned to an office are encouraged to experience the energy of their local office with an in-office requirement of at least three days per week. This approach balances the benefits of in-person collaboration and peer learning with the flexibility needed by individuals and teams.

What makes ThoughtSpot a great place to work?

ThoughtSpot is the experience layer of the modern data stack, leading the industry with our AI-powered analytics and natural language search. We hire people with unique identities, backgrounds, and perspectives; this balance-for-the-better philosophy is key to our success. When paired with our culture of Selfless Excellence and our drive for continuous improvement (2% done), ThoughtSpot cultivates a respectful culture that pushes norms to create world-class products. If you’re excited by the opportunity to work with some of the brightest minds in the business and make your mark on a truly innovative company, we invite you to read more about our mission and apply to the role that’s right for you.

ThoughtSpot for All

At ThoughtSpot, diverse teams build better products. Complex data problems need many perspectives, not just one. We welcome different backgrounds, identities, and experiences, and we work to create a place where everyone can be themselves and do their best work. If this role excites you and you believe you’re a strong match, we encourage you to apply.

Cloud Reliability Engineer II
