This job listing is archived

Customer Site Reliability Engineer


3 months ago

Job type: Full-time

Remote (USA Only)

Hiring from: USA Only

Category: DevOps / Sysadmin

Instana is seeking a Customer Site Reliability Engineer to manage large-scale ingress platforms for our largest accounts. In this role you will assist with the setup, maintenance, optimization, and upgrading of these platforms. You will work directly with customers and Technical Account Managers, collaborate closely with the Technical Support team on escalated issues and on sharing knowledge and best practices, and work with the Engineering and Product Management teams on how these systems can be further improved.


Areas of Responsibility Include:

  • Ensure large-scale on-prem self-hosted Instana backends are functioning optimally
  • Work with new customer SRE teams to set up the platform
  • Closely monitor the platforms to make sure they are running optimally
  • Guide regular updates of the Instana backend 
  • Facilitate migrations and updates of the distributed datastores 
  • Participate in on-call support duty to ensure timely response to critical incidents
  • Troubleshoot priority incidents and perform detailed root cause analysis


Skills and Experience Needed:

  • Strong written and verbal communication skills
  • Experience with components running in Java virtual machines (e.g. Dropwizard, Vert.x) and with debugging Java exceptions in logs
  • Experience with distributed data stores and queues such as Cassandra, Elasticsearch, ZooKeeper, ClickHouse and Apache Kafka is preferred
  • System-level understanding of Linux
  • Familiarity with: Java, Golang
  • Excellent skills debugging cloud-based distributed systems
  • Experience with Docker and container orchestration in microservice architectures, specifically Kubernetes
  • Experience with Jenkins CI/CD pipelines and Git proficiency
  • Broad cloud provider experience (AWS, GCP, IBM Cloud)
  • Experience with Infrastructure as Code, e.g. CloudFormation, Pulumi, or Terraform
  • Experience with configuration management tooling, e.g. Ansible, Chef, Puppet, or Salt
  • Experience and familiarity with APM products and services is highly preferred
  • 2+ years of experience in a Site Reliability Engineering or DevOps environment
  • Bachelor of Science degree in Computer Science or another related technical discipline

Before you apply, please check if any restrictions apply in terms of time zone or country.

This job has a geo-restriction in place: USA Only.
