Site Reliability Engineer
2 weeks ago
Job type: Full-time
Hiring from: USA Only
Category: DevOps / Sysadmin
PubNub powers apps that bring people together in realtime for remote work, play, learning, and health. Thousands of companies use PubNub’s Realtime Communication Platform and its APIs as the foundation for online chat, live events, geolocation, remote control, and live updates, at massive global scale. Since 2010, PubNub has invested in the tools and global infrastructure required to serve customers like Adobe, DocuSign, Peloton, and RingCentral, delivering SOC 2 Type 2 security and reliability while meeting regulatory needs like HIPAA and GDPR. PubNub has raised over $70M from notable investors like Sapphire, Scale, Relay, Cisco, Bosch, Ericsson, and HPE.
We are an all-star technical team made up of people who have been part of successful acquisitions at enterprise and consumer software companies. If you like hyperscale systems and engineering projects that redefine limits, PubNub is for you.
PubNub is proud to be an EEO employer.
As a member of PubNub's Engineering organization, you will work alongside Engineers and Architects to design, develop, operate and scale PubNub’s global Data Stream Network, with a focus on improving its reliability, scale and efficiency. The infrastructure you will manage generates billions of events and produces terabytes of data every day. You will have the unique opportunity to help architect PubNub's infrastructure to solve challenging problems in distributed systems, real-time messaging, and large-scale data management.
Responsibilities:
- Design processes for improving operational stability of PubNub services
- Identify, document and help resolve performance and operational efficiency challenges
- Assist in rationalizing PubNub's infrastructure as code and automation tooling
- Create tooling with documentation to scale our distributed systems
- Ensure and enforce best application and network security practices
- Participate in incident management on-call rotation and drive root cause analysis
- Collaborate with engineering teams, product owners and other stakeholders to develop tooling and CI/CD patterns
- Help define Service Level Objectives to assess release readiness of all services
- Support, monitor and manage cloud infrastructure and environments (AWS EC2, DNS, load balancers, and databases)
Experience & Skills Required:
- 3+ years of cloud platform experience (AWS preferred)
- 3+ years of programming experience (Python, Go, Java, or equivalent)
- Experience with configuration management and automation tools such as Ansible and Terraform
- Experience with CI/CD tools and implementing best practices
- Solid grounding in cloud fundamentals such as networking, load balancing, DNS, and security
- BS or MS in Computer Science or a related technical field
- Containerization experience (Docker, etc.)
- Experience managing container orchestration systems (Kubernetes, etc.)
- Experience developing, supporting or operating large-scale, distributed SaaS products
- Desire to automate tedious tasks and eliminate inefficiencies
- A passion for system stability, performance, scalability or customer success
- Previous participation in Incident Management teams