Sr. Resilience Engineering Advocate
- Los Gatos, California
- Cloud and Platform Engineering
The goal of the education and outreach function of the SRE team is to improve reliability and resilience of Netflix services by focusing on the people within the company, since it's the normal, everyday work of Netflix employees that creates our availability. To cultivate operational excellence, we reveal risks, identify opportunities, and facilitate the transfer of skills and expertise of our staff by sharing experiences.
Following an operational surprise, we seek to understand what the world looked like from the perspective of the people involved. We facilitate interviews, analyze joint activity, and produce artifacts like written narrative documents. Relationship building is a huge part of this role. Someone advocating for resilience engineering within Netflix will help stakeholders realize when this type of work is most effective.
The nature of our work is interdisciplinary, so we recognize that a successful candidate can come from a wide variety of backgrounds (e.g., software engineering, SRE, human factors, safety science, systems engineering, technical product/program management, UX research, organizational psychology, cultural anthropology). We encourage you to apply even if you feel uncertain that you have the "right" background.
You may also be interested in the Senior Site Reliability Engineer opening on our team.
We think about:
Netflix as a socio-technical system is formed from the interaction of people and software. This system has many components and is constantly undergoing change. Unforeseen interactions are common and operational surprises arise from perfect storms of events.
Surprises over incidents and recovery more than prevention. We encourage highlighting good catches, the things that help make us better, and the capacity we develop to successfully minimize the consequences of encountering inevitable failure. A holistic view of our work involves paying attention to how we are confronted with surprises every day and the actions we take to cope with them.
Discovering new information and actionable outcomes over tracking stagnant action items. We aspire to pursue the ways that help us learn, not chase after numbers. Building a learning organization is a real way that we are able to proactively and continually improve.
- Increase Netflix's capacity to adapt to changes and surprises
- Enhance operational expertise at Netflix
- Advance Netflix as a learning organization
- Change the ways internal tool builders think about how people and tools interact
- Improve team health by empowering teams to balance operational responsibilities with development
- Exploring contributors versus constructing causes
- "I see how that action was reasonable" versus "you shouldn't have done that"
- ‘Human error’ as symptom versus ‘human error’ as cause
- Automation as a team player versus automation as a replacement for humans
- How things went right versus why things went wrong
- Adapting to new surprises over remediating prior incidents
- Narrative descriptions of surprising events versus out-of-context quantitative data
- Deep conversations versus shallow timelines
- Identifying weak signals versus broadly categorizing incidents
- Decisions driven by expert judgment versus decisions driven by superficial metrics
- Influence through developing relationships over exercising authority
- Investigate operational surprises
- Facilitate reviews and conversations to surface risks and opportunities
- Share context and develop holistic techniques that change how people work
- Design and execute on programs to socialize findings and drive operational change
- Educate and train people on identified risks and operational gaps
- Inform product and tooling roadmaps based on findings
- Experiment with new approaches for reaching an audience
- Use qualitative and quantitative data to inform recommendations and decisions
- Familiarity with resilience engineering concepts
- Software and systems engineering
- Technical product/program management in this specific domain
- Experience within systems that encounter complex failure modes
- Proficiency with qualitative research methods
In any given day, you might be:
- Having 1:1 discussions with people involved in operational surprises
- Annotating chat channel history to identify how people coordinate and troubleshoot
- Writing up a narrative description that constructs "how we got here"
- Facilitating a review meeting where the goal is to learn
- Understanding how a team accomplishes everyday work
- Reading a set of existing writeups to discover patterns
- Organizing an internal meetup where people share experiences
- Talking with owners of centralized tools about user interface issues
- Giving a training session on how to do an investigation
Here are some resources that explain more about what we do and how we think:
This role is rewarding for people who can collaborate in a complex environment with a wide variety of groups across Netflix.