Site Reliability Engineer - Netflix Labs
- Los Gatos, California
- Partner Devices
Netflix is almost everywhere. Serving 104 million customers in 190 countries on devices such as Smart TVs, set-top boxes, and apps on phones, tablets, and computers, Netflix is striving to deliver the vision of #NetflixEverywhere. The engineers who create this service require a world-class environment and access to thousands of devices on which to build and perfect the Netflix experience.
Netflix requires talented and driven individuals who can take our internal development and certification environment to the next level. Our goal is to simplify and automate everything possible within Lab Engineering and turn it into “a small matter of code”. Finding and permanently resolving pain points for our partners and customers is our mandate. How we do it is where you come in.
The word “lab” often evokes images of white lab coats in a sterile environment. Netflix Labs is a place of experimentation and innovation, of taking risks and taming chaos, where we “get our hands dirty” - in a fun way.
What we are looking for in you:
•Develop effective tooling, alerts, and responses to identify and address reliability risks
•Participate in on-call rotation with other members of the Lab Engineering team
•Drive issue resolution with partner product engineering and certification teams
•Evangelize best practices around collaboration, reliability, security and performance to all partner teams
•Effective root cause identification, triage and mitigation
•Experience with configuration and troubleshooting of Linux, Java, Tomcat, and other middleware technologies
•Thorough understanding of monitoring (Nagios, Zabbix, or other) and the emission, collection, and analysis of metrics, to put together mitigation and remediation plans
•Understanding of mid- to large-scale complex systems from a reliability perspective
•Scripting abilities in Python, Perl, Go, or JVM-based languages
•Strong communication skills and the ability to engage partner teams effectively
•Passion for resolving reliability issues and identifying strategies to mitigate them going forward
•Automation mindset - if you can automate it, do it.
We are looking for individuals with skills in one or more of the following Winning Attributes:
•Mid- to large-scale project tooling development skills in C, C++, and/or Java (embedded/proprietary environments a huge plus)
•Understanding of development and deployment processes
•Effective debugging methodologies in a mixed device environment
•Web UI development for tooling a plus
•Experience with Cisco and/or Quanta OCP switches
•Configuration, monitoring and debugging of networking issues
•Programming against exposed networking APIs for the above vendors