Netflix is a data-driven company. Almost all business decisions here are backed by data. Data Platform lies at the heart of the Platform Organization of Netflix, serving all the data needs of the company. We manage state-of-the-art infrastructure, services, and products, and we are constantly innovating to support all our business needs at scale.
We have the following teams in Data Platform, supporting data needs across the entire application lifecycle, from data stores to data movement to data persistence.
Big Data Platform
The Big Data Platform team is responsible for software that enables our business partners to make business decisions efficiently and with ease. The platform provides an abstraction layer to orchestrate hundreds of thousands of big data workflows and jobs every day, executing on compute engines like Presto, Spark, and Druid. The team aspires to build an intelligent data warehouse that can auto-analyze and auto-optimize, while spearheading Iceberg, an industry-standard open table format for analytics data on cloud object stores (AWS S3, in our case).
Big Data Orchestration
The Big Data Orchestration team provides a platform of choice that enables scheduling, orchestrating, and executing big data jobs and workflows in an easy-to-use manner. It owns the existing workflow orchestration platform called “Meson”, which is being replaced by a brand new product developed from the ground up by the team, enabling high throughput, horizontal scalability, and advanced parameterized, event-based scheduling. For job abstraction across various engines like Spark and Trino, the team has created an open source service called Genie. The team is also beginning work on the next-generation orchestration architecture, including automatic and efficient ETL triggering and management, among other services.
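At its core, workflow orchestration means executing jobs in dependency order. The sketch below illustrates that idea on a toy workflow; the step names are invented, and this is not Meson's or Genie's actual API.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: step -> set of upstream dependencies.
workflow = {
    "extract_views": set(),
    "extract_billing": set(),
    "join_datasets": {"extract_views", "extract_billing"},
    "publish_report": {"join_datasets"},
}

def run_workflow(workflow, run_step):
    """Execute steps in dependency order, as a scheduler would."""
    order = list(TopologicalSorter(workflow).static_order())
    for step in order:
        run_step(step)
    return order

executed = run_workflow(workflow, run_step=lambda step: None)
```

A real orchestrator adds retries, parameter passing, and event-based triggering on top of this basic dependency resolution.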
BDO Team Services Overview
Latest blog from the team on Workflow Orchestration
Jun He and Harrington Joseph at QCon Plus - Robust Foundation for Data Pipelines at Scale - Lessons From Netflix
Please check out our recent talks on Workflow Orchestration and Building a Scalable Workflow Scheduler with Netflix Conductor
Big Data Compute and Warehouse
The Big Data Compute team is responsible for providing the cloud-native platform for distributed data processing at Netflix, working with Spark, Presto/Trino, Druid, and Iceberg, as well as other supporting technologies.
This team of 8 people (and growing) is central to batch data processing in Data Platform at Netflix. It supports Spark for ETL-ing data into the petabyte-scale data warehouse, and Spark and Presto/Trino for accessing that data. It also provides sub-second latency for a certain class of queries using Druid. The Iceberg table format project was started in the BDC team and is now a thriving open source project.
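The core idea behind a table format like Iceberg is that table state is a chain of immutable snapshots, each listing the data files it contains, which enables atomic commits and time travel. The sketch below illustrates that idea only; it is not Iceberg's actual implementation or API, and the file paths are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # data file paths included in this version of the table

class Table:
    def __init__(self):
        self.snapshots = []

    def commit(self, added_files):
        """Append-only commit: new snapshot = previous files + added files."""
        prev = self.snapshots[-1].files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots), prev + tuple(added_files))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id=None):
        """Read the latest snapshot, or 'time travel' to an older one."""
        snap = self.snapshots[snapshot_id if snapshot_id is not None else -1]
        return list(snap.files)

table = Table()
table.commit(["s3://warehouse/part-0.parquet"])
table.commit(["s3://warehouse/part-1.parquet"])
```

Because old snapshots are never mutated, readers always see a consistent version of the table even while writers commit new data.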
This is a dream team of highly passionate and intelligent engineers who work really well together. The team works on solving challenging problems at scale that have a huge impact on the Data Platform. The team has PMC members and committers who shape and contribute to open source projects. Support rotations allow team members to grow operational skills and learn about technologies that may not be their primary focus.
This team is responsible for the broader data strategy at Netflix. We aim to provide visibility into the taxonomy and business context of all Netflix datasets (including video assets, unstructured logs, etc.). We have built a Netflix-wide data catalog to capture and infer business metadata across all datasets at Netflix and to track the lineage of data as it moves between different datastores. We are also building a Netflix-wide schema registry to help datasets interoperate across systems and to manage the lifecycle of schemas in different datastores. We are designing a centralized yet customizable data detection framework that can sample, detect, and report violations across all datasets, giving us holistic insight into the risk profiles and quality of our data. We are also building a centralized and extensible policy engine that allows our stakeholders to customize data policy rules for all datasets. We also own Metacat, our foundational and critical operational metadata store that enables big data processing and compute. To learn more, check out our vision doc.
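Lineage tracking boils down to maintaining a graph of which datasets are derived from which, and answering transitive "what feeds this?" queries. The sketch below shows that model in miniature; the class and dataset names are hypothetical, not Metacat's actual data model.

```python
from collections import defaultdict

class LineageGraph:
    def __init__(self):
        self.upstreams = defaultdict(set)

    def record(self, source, target):
        """Record that `target` is derived from `source`."""
        self.upstreams[target].add(source)

    def lineage(self, dataset):
        """Return all transitive upstream datasets of `dataset`."""
        seen, stack = set(), [dataset]
        while stack:
            for up in self.upstreams[stack.pop()]:
                if up not in seen:
                    seen.add(up)
                    stack.append(up)
        return seen

graph = LineageGraph()
graph.record("raw_playback_events", "cleaned_events")
graph.record("cleaned_events", "daily_engagement_report")
```

With edges recorded at write time, impact analysis ("which reports break if this source changes?") is just the same traversal run in the opposite direction.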
Blogs and talks
Data Movement and Processing
The Real Time Data Platform team builds software that enables our business partners to transport, process, and sink all data at Netflix. The platform enables users to make tradeoffs across dimensions such as latency, delivery guarantees, cost, and programming abstraction. To achieve this, the team provides a set of abstracted data movement and processing products (Data Mesh and Keystone) in addition to powerful programming interfaces, a managed platform for the lower-layer processing engines (Flink and Mantis), and the transport engine (Kafka).
Data Movement and Processing Platform
We are a data-driven company that strives to make every aspect of our workflow data-aware. Real-Time Data Infrastructure (RTDI) is responsible for building scalable data platforms that enable users from all backgrounds to have easy and timely access to rich datasets in a streaming fashion. We work on core streaming engines (Flink and Mantis), messaging engines (Kafka), schema infrastructure, and platforms for building robust pipelines connecting various sources to sinks. The team builds products used to build critical data movement and processing pipelines, solving problems ranging from data engineering to machine learning to real-time personalization and many more.
The team is currently focused on building the next-generation Data Movement and Processing Platform to support abstraction on top of complex stream processing concepts such as filters, projections, windowing, sessionization, and deduplication, to name a few. We envision supporting streaming SQL and architecting a high-leverage collaborative platform where users can plug in custom patterns, ranging from a simple UDF to a complex source/sink connector. Abstraction is the core vision of this team: to build a layered, abstracted product that makes real-time stream processing available to a broader set of users through a self-service platform. The team mostly uses Java and other JVM-based technologies, and works on complex distributed systems at scale to offer a low-latency platform for solving business problems.
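Two of the stream-processing concepts named above, deduplication and sessionization, can be sketched on a plain event list. The real platform implements these on engines like Flink over unbounded streams; the function names and the 30-second session gap below are illustrative assumptions.

```python
SESSION_GAP = 30  # seconds of inactivity that closes a session (assumed)

def deduplicate(events):
    """Drop repeated event ids, keeping the first occurrence."""
    seen, out = set(), []
    for ts, event_id in events:
        if event_id not in seen:
            seen.add(event_id)
            out.append((ts, event_id))
    return out

def sessionize(events):
    """Group events into sessions separated by > SESSION_GAP of inactivity."""
    sessions, current = [], []
    for ts, event_id in sorted(events):
        if current and ts - current[-1][0] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append((ts, event_id))
    if current:
        sessions.append(current)
    return sessions

events = [(0, "a"), (10, "b"), (10, "b"), (100, "c")]
sessions = sessionize(deduplicate(events))
```

On a true stream the same logic runs incrementally with watermarks and state, which is exactly the complexity the platform abstracts away from users.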
Blogs and Talks:
- Netflix Data Mesh: Composable Data Processing
- Advancing Data Mesh: Building A Stream Processing EcoSystem of Reusable Processors and Datasets
- Data Movement in Netflix Studio via Data Mesh
- Keystone RealTime Processing Platform
Realtime Data Engines
We are an infrastructure team responsible for the platforms that enable real-time data. We build and operate three main products: Kafka, Flink, and Mantis. These form the building blocks that support analytical, operational, and event-driven use cases. Our value proposition is to provide leverage to other engineering organizations and empower them to focus on business logic. Our fully managed platforms offer a world-class developer experience while abstracting away complexities such as compute resource management, CI/CD, and runtime artifact distribution. And unquestionably, reliability and scalability are always our highest priority.
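The building block underneath a messaging engine like Kafka is an append-only log from which independent consumers read at their own offsets. The toy sketch below shows only that abstraction; it is not Kafka's actual protocol or API, and the record contents are invented.

```python
class Log:
    """Append-only log for one topic partition."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read(self, offset):
        """Return records from `offset` onward; the consumer tracks its own position."""
        return self.records[offset:]

log = Log()
log.append({"event": "play", "title": "Stranger Things"})
log.append({"event": "pause", "title": "Stranger Things"})

# Two independent consumers can read at their own pace.
offset_a = 0
batch = log.read(offset_a)
offset_a += len(batch)
```

Because consumers own their offsets, slow analytical readers and fast operational readers can share the same log without coordinating with each other.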
This is a dream team of highly passionate and intelligent engineers. We work in a collaborative environment that is remote friendly. We have committers on open source Iceberg, Flink and Mantis because we believe in giving back to the community.
Check out some of our talks and blog posts:
- Backfill Flink Data Pipelines with Iceberg Connector
- How Netflix Optimized Flink for Massive Scale on AWS
- Building Netflix’s Distributed Tracing Infrastructure
- Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused Applications
- Stream-processing with Mantis
- SPS: the Pulse of Netflix Streaming
Data Storage Platform
As our studio grows and we continue to make content, our storage systems need to meet those demands. If we shoot in LA, editorial might happen in Vancouver, and other post-production functions could happen in Mumbai. The files need to be able to transfer from one location to another. Currently, the way the industry moves files is physical. What we are doing is similar to what email did to pen and paper. We are putting everything into the cloud and moving large amounts of data in near real time. What takes other studios weeks to months to complete takes us just days. As we continue to innovate in this space, we will get closer and closer to real time, which will further enable our goal of creating content that brings our members joy.
This team is at the heart of that initiative, and without us it would be impossible to meet that goal. This platform works in various infrastructure locations, not just AWS, but also our own infrastructure like Open Connect. We are building storage services to support the vast amount of data and metadata that we ingest, curate, and manage. We are focusing on building value-add data services on top of open platforms.
This platform is a foundational building block to multiple studio technology products that will change the way the entertainment industry makes and produces content. To learn more about the team, here is one of our latest blog posts.
Cloud Data Engineering
The Cloud Data Engineering (CDE) team provides persistence as a service to the rest of Netflix, enabling our business partners to develop applications that bring a delightful streaming content experience to our customers. The team builds high-leverage abstractions to improve developer experience and increase productivity, and operates polyglot storage as a service on top of world-scale Cassandra, EVCache, CockroachDB, Dynomite, and Elasticsearch clusters. Engineers on the CDE team are active in the open source community and are frequent presenters at industry events.
The Online Datastores (Core Data Platform) team provides data storage as a managed service for Netflix by enhancing and operating open source data stores. This includes feature development, tooling, automation, application development, and operations related to the data stores in our portfolio. The portfolio consists of a carefully curated set of both caching and persistence data stores that currently includes Cassandra, Aurora, CockroachDB, EVCache, Elasticsearch, and ZooKeeper.
The impact is huge and very tangible: the datastores cater to all of Netflix's businesses, including Streaming, Studio, Gaming, and Tudum. We touch many Netflix customer interactions, be it sign-up, sign-in, recommendations, playback, viewing history, payments, account settings, My List, and more.
- Cassandra Serving Netflix @ Scale - Vinay Chella, Netflix
- How Netflix Manages Version Upgrades of Cassandra at Scale
- How Netflix Provisions Optimal Cloud Deployments of Cassandra - Joey Lynch
- Apache Cassandra Meetup Hosted by Netflix
Data Access Platform
The Data Access Platform (DACP) team connects Netflix applications to their data. We develop client libraries and services that federate access to the data stores and make it easier for application teams to use our storage platforms. At Netflix scale, the clients need to be built with extreme resiliency and scalability in mind. That means having built-in security, service discovery, auto-configuration, fallback mechanisms, metrics, and support for fault injection, to name a few. Many of our products extend OSS offerings with Netflix-specific features and improvements. We eventually share a lot of that back with the OSS community, but some of it is unique to Netflix simply because we operate at a higher scale and efficiency than most others.
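One of the resiliency patterns mentioned above, a fallback mechanism, can be sketched as a wrapper that retries a datastore call and then serves a degraded default instead of failing. The function and parameter names below are illustrative assumptions, not an actual Netflix client API.

```python
import time

def with_fallback(call, fallback, retries=2, backoff_s=0.0):
    """Wrap `call` so transient failures are retried, then `fallback` answers."""
    def wrapped(*args, **kwargs):
        for attempt in range(retries + 1):
            try:
                return call(*args, **kwargs)
            except Exception:
                if attempt == retries:
                    return fallback(*args, **kwargs)
                time.sleep(backoff_s)  # simple fixed backoff between retries
    return wrapped

attempts = []
def flaky_read(key):
    """Stand-in for a datastore read whose primary is unavailable."""
    attempts.append(key)
    raise TimeoutError("primary store unavailable")

read = with_fallback(flaky_read, fallback=lambda key: "<stale-default>")
result = read("member:42")
```

Production clients layer circuit breaking, per-call metrics, and fault-injection hooks on top of this basic shape so that a struggling datastore degrades gracefully instead of cascading failures.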
Data Abstractions Platform
The Data Abstraction Platform team builds and operates a flexible data gateway that enables data abstractions to operate at sub-millisecond latencies while allowing Netflix microservices to more easily store, consume, and manage their data. This team holds a substantial responsibility in enabling Netflix microservices to satisfy their ever-growing and evolving data needs. The team is passionate about distributed data systems technology such as Cassandra, Elasticsearch, EVCache, Dynomite, and more. We are active in the open source community and believe in operating what we own. We are a small team responsible for business-critical systems, and we are committed to a culture of feedback and engineering excellence.
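A common shape for such a data gateway is a read-through abstraction: check a cache first, fall back to the backing store, and populate the cache for the next read. The sketch below shows only that pattern with plain dicts standing in for the datastores; the class and method names are illustrative, not the gateway's actual API.

```python
class KeyValueGateway:
    def __init__(self, store):
        self.store = store   # stand-in for a persistent store like Cassandra
        self.cache = {}      # stand-in for a cache tier like EVCache
        self.cache_hits = 0

    def get(self, key):
        """Read-through: serve from cache, else from store, then cache it."""
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value  # populate cache for subsequent reads
        return value

gateway = KeyValueGateway(store={"profile:1": {"plan": "premium"}})
first = gateway.get("profile:1")   # store read, fills the cache
second = gateway.get("profile:1")  # served from the cache
```

Hiding this tiering behind one `get` call is what lets application teams change cache policies or even swap datastores without touching service code.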
The Database Solutions Team works with developers across the company to understand their persistence needs and help them use datastore offerings effectively, taking into account their performance and availability requirements. We educate on and evangelize best data practices through educational sessions, labs, and presentations. We build tools that help with capacity planning, cost optimization, and benchmarking. We collaborate with other CDE engineers to drive the feature implementations and abstractions that are most valuable to our customers. We drive data model redesigns and migrations to support the ever-evolving use cases of our customers.