Site Reliability Engineer - Netflix Labs

  • Los Gatos, California
  • Partner Devices

Netflix is almost everywhere. Serving over 100 million customers in 190 countries on devices such as Smart TVs, set-top boxes, and apps on phones, tablets, and computers, Netflix is striving to deliver the vision of #NetflixEverywhere. The engineers who create this service require a world-class environment and access to thousands of devices on which to build and perfect the Netflix experience. Netflix requires talented and driven individuals who can take our internal development and certification environment to the next level.

Labs is responsible for building and improving the environment for the certification and testing of the Netflix app on the devices people use every day. Our goal is to simplify and automate everything possible within this environment and change it to “a small matter of code.” Finding and permanently resolving pain points for our partners and customers is our mandate. How we do it is where you come in.
What we are looking for in you:
Role Responsibilities
• Develop environment and infrastructure tooling to better support the development and certification teams that Labs partners with
• Participate in the on-call rotation with other members of the Lab Engineering team
• Drive issue resolution with partner product engineering and certification teams
• Evangelize best practices around collaboration, reliability, security and performance to all partner teams
Minimum Requirements
• Understanding of mid- to large-scale complex systems from a reliability perspective
• Scripting abilities in Python, Perl, Go, or JVM-based languages
• Effective root cause identification, triage and mitigation
• Experience with configuration and troubleshooting of Linux, Java, Tomcat, and other middleware technologies
• Thorough understanding of monitoring (Nagios, Zabbix, or other) and the emission, collection, and analysis of metrics, to put together mitigation and remediation plans
• Strong communication skills and the ability to engage partner teams effectively
• Passion for resolving reliability issues and identifying strategies to mitigate them going forward
• Automation mindset - if you can automate it, do it.
We are looking for individuals with skills in one or more of the following Winning Attributes:
• Required: Mid- to large-scale project tooling development skills in C, C++, and/or Java (embedded/proprietary environments a huge plus)
• Understanding of development and deployment processes
• Effective debugging methodologies in a mixed device environment
• Web UI development for tooling a plus
• Experience with traditional networking switches (Cisco, Dell, HP, etc.)
• Configuration, monitoring and debugging of networking issues
• Programming against exposed networking APIs for the above vendors
• Experience with modern SDN-centric/hybrid switches (Quanta, Arista, Dell with Cumulus Linux) a plus