Data Platform
Overview
Netflix is a data-driven company. Almost all business decisions here are backed by data. Data Platform lies at the heart of the Platform Organization of Netflix, serving all the data needs of the company. We manage state-of-the-art infrastructure, services and products, and are constantly innovating to support all our business needs at scale.
The following teams in Data Platform support the data needs of the entire application lifecycle, from data stores to data movement to data persistence.
Big Data Platform
The Big Data Platform team is responsible for software that enables our business partners to make business decisions efficiently and with ease. The platform provides an abstraction layer to orchestrate hundreds of thousands of big data workflows and jobs every day, executing on compute engines like Presto, Spark, or Druid. The team aspires to build an intelligent data warehouse that can auto-analyze and auto-optimize, while spearheading Iceberg, an industry-standard analytics data storage format for cloud object stores (AWS S3, in our case).
Big Data Orchestration
The Big Data Orchestration team provides a platform of choice for scheduling, orchestrating and executing big data jobs and workflows in an easy-to-use manner. It owns the existing workflow orchestration platform called “Meson”, which is being replaced by a brand new product developed from the ground up by the team, enabling high throughput, horizontal scalability and advanced parametrized, event-based scheduling. For job abstraction across various engines like Spark and Trino, the team has created an open source service called Genie. The team is also beginning work on the next-generation orchestration architecture, including automatic and efficient ETL triggering and management, among other services.
BDO Team Services Overview
Latest blog from team on Workflow Orchestration
Jun He and Harrington Joseph at QCon Plus - Robust Foundation for Data Pipelines at Scale - Lessons From Netflix
Please check our recent talk on Workflow Orchestration and Building a Scalable Workflow Scheduler with Netflix Conductor
Big Data Compute and Warehouse
The Big Data Compute team is responsible for providing the cloud-native platform for distributed data processing at Netflix, working with Spark, Presto/Trino, Druid, and Iceberg, as well as other supporting technologies.
This team of 8 people (and growing) is central to batch data processing in Data Platform at Netflix. It supports Spark for ETL into the petabyte-scale data warehouse, and access to that data through Spark and Presto/Trino. It also provides sub-second latency for a certain class of queries using Druid. The Iceberg table format project was started in the Big Data Compute (BDC) team and is now a thriving open source project.
This is a dream team of highly passionate and intelligent engineers who work really well together. The team solves challenging problems at scale that have a huge impact on the Data Platform. The team has PMC members and committers who shape and contribute to open source projects. Support rotations allow team members to grow operational skills and learn about technologies that may not be their primary focus.
Check out some of our talks on Iceberg, Presto and Druid.
Metadata Platform
This team is responsible for the broader data strategy at Netflix. We aim to provide visibility into the taxonomy and business context of all Netflix datasets (including video assets, unstructured logs, etc.). We have built a Netflix-wide data catalog to capture and infer business metadata across all datasets at Netflix and to track the lineage of data as it moves between different datastores. We are also building a Netflix-wide schema registry to help datasets interoperate across systems and to manage the lifecycle of schemas in different datastores. We are designing a centralized yet customizable data detection framework that can sample, detect, and report violations across all datasets, which would give us holistic insights into the risk profiles and quality of our data. We are also building a centralized and extensible policy engine that would allow our stakeholders to customize data policy rules for all datasets. We also own Metacat, our foundational and critical operational metadata store that enables big data processing and compute. To learn more, check out our vision doc.
Blogs and talks
Data Movement and Processing
The Real Time Data Platform team builds software that enables our business partners to transport, process, and sink all data at Netflix. The platform enables users to make tradeoff choices across dimensions such as latency, delivery guarantee, cost and programming abstraction. To achieve this, the team provides a set of abstracted Data Movement and Processing products (Data Mesh and Keystone), in addition to powerful programming interfaces and a managed platform for the lower-layer processing engines (Flink and Mantis) and transport engine (Kafka).
Data Movement and Processing Platform
We are a data-driven company that strives to make every aspect of our workflow data-aware. Real-Time Data Infrastructure (RTDI) is responsible for building scalable data platforms to enable users from all backgrounds to have easy and timely access to these rich datasets in a streaming fashion. We work on core streaming engines (Flink and Mantis), messaging engines (Kafka), schema infrastructure, and platforms to build robust pipelines connecting various sources to sinks. The team builds products that would be used for building critical data movement and processing pipelines to solve a variety of problems ranging from data engineering to machine learning to real-time personalization and many more.
The team is currently focused on building the next-generation Data Movement and Processing Platform to support abstraction on top of complex stream processing concepts such as filters, projections, windowing, sessionization, and deduplication, to name a few. We envision supporting Streaming SQL and architecting a high-leverage collaborative platform where users can plug in custom patterns ranging from a simple UDF to a complex source/sink connector. Abstraction is the core vision of this team: we aim to build a layered, abstracted product that makes real-time stream processing available to a broader set of users through a self-service platform. The team mostly uses Java and other JVM-based technologies and deals with complex distributed systems at scale to offer a low-latency platform for solving business problems.
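As a rough illustration of the stream processing concepts named above, the sketch below applies a filter, deduplication, a projection, a tumbling window and sessionization to a small in-memory list of events. It is plain Python over a finite list, not Flink, Mantis or any Netflix API; the event fields (`id`, `user`, `type`, `ts`, `device`), the function names and the 60-second window and gap values are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List

Event = Dict[str, Any]  # hypothetical event shape, for illustration only


def filter_events(events: Iterable[Event], predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Filter: keep only events that satisfy the predicate."""
    return (e for e in events if predicate(e))


def project(events: Iterable[Event], fields: List[str]) -> Iterator[Event]:
    """Projection: keep only the named fields of each event."""
    return ({f: e[f] for f in fields} for e in events)


def deduplicate(events: Iterable[Event], key: str) -> Iterator[Event]:
    """Deduplication: drop events whose key value was already seen."""
    seen = set()
    for e in events:
        if e[key] not in seen:
            seen.add(e[key])
            yield e


def tumbling_window_counts(events: Iterable[Event], size_s: int) -> Dict[int, int]:
    """Windowing: count events per fixed, non-overlapping window (keyed by window start)."""
    counts: Dict[int, int] = defaultdict(int)
    for e in events:
        counts[e["ts"] // size_s * size_s] += 1
    return dict(counts)


def sessionize(events: Iterable[Event], gap_s: int) -> Dict[str, List[List[int]]]:
    """Sessionization: per user, group timestamps separated by at most gap_s seconds."""
    sessions: Dict[str, List[List[int]]] = defaultdict(list)
    for e in sorted(events, key=lambda ev: ev["ts"]):
        user_sessions = sessions[e["user"]]
        if user_sessions and e["ts"] - user_sessions[-1][-1] <= gap_s:
            user_sessions[-1].append(e["ts"])  # extend the current session
        else:
            user_sessions.append([e["ts"]])  # gap exceeded: start a new session
    return dict(sessions)


# Made-up sample data; "a2" appears twice to exercise deduplication.
EVENTS: List[Event] = [
    {"id": "a1", "user": "u1", "type": "play", "ts": 0, "device": "tv"},
    {"id": "a2", "user": "u1", "type": "play", "ts": 20, "device": "tv"},
    {"id": "a2", "user": "u1", "type": "play", "ts": 20, "device": "tv"},
    {"id": "a3", "user": "u2", "type": "pause", "ts": 30, "device": "phone"},
    {"id": "a4", "user": "u1", "type": "play", "ts": 200, "device": "tv"},
    {"id": "a5", "user": "u2", "type": "play", "ts": 65, "device": "phone"},
]

plays = list(
    project(
        deduplicate(filter_events(EVENTS, lambda e: e["type"] == "play"), "id"),
        ["user", "ts"],
    )
)
window_counts = tumbling_window_counts(plays, size_s=60)
user_sessions = sessionize(plays, gap_s=60)
```

A real streaming engine applies the same operators to unbounded input and must also handle out-of-order events, state expiry and fault tolerance, which this sketch deliberately leaves out.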
Blogs and Talks:
- Netflix Data Mesh: Composable Data Processing
- Advancing Data Mesh: Building A Stream Processing EcoSystem of Reusable Processors and Datasets
- Data Movement in Netflix Studio via Data Mesh
- Keystone RealTime Processing Platform
Realtime Data Engines
We are an infrastructure team responsible for the platforms that enable real-time data. We build and operate three main products: Kafka, Flink and Mantis. These form the building blocks that support analytical, operational and event-driven use cases. Our value proposition is to provide leverage to other engineering organizations and empower them to focus on business logic. Our fully managed platforms offer a world-class developer experience while abstracting away complexities such as compute resource management, CI/CD, runtime artifact distribution, etc. Reliability and scalability are always our highest priorities.
This is a dream team of highly passionate and intelligent engineers. We work in a collaborative environment that is remote friendly. We have committers on open source Iceberg, Flink and Mantis because we believe in giving back to the community.
Check out some of our talks and blog posts:
- Backfill Flink Data Pipelines with Iceberg Connector
- How Netflix Optimized Flink for Massive Scale on AWS
- Building Netflix’s Distributed Tracing Infrastructure
- Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused Applications
- Stream-processing with Mantis
- SPS: the Pulse of Netflix Streaming
Storage and Insights
The Storage and Insights team provides storage services and abstractions to support the vast amount of data and metadata that we produce, ingest and manage. We focus on building value-add data services on top of AWS. We offer consistent mechanisms to create and manage object and block resources, integrate with the Netflix ecosystem for access control, and provide observability into the cost and lifecycle of these resources, by taking ownership of existing tools and shaping a more cohesive strategy.
The impact is large and tangible: we manage storage at exabyte scale and cater to all of the businesses of Netflix, such as Streaming, Studios and Content Engineering.
Online Datastores
The Online Datastores (Core Data Platform) team provides data storage as a managed service for Netflix by enhancing and operating open source data stores. This includes feature development, tooling, automation, application development and operations related to the data stores in our portfolio. The portfolio consists of a carefully curated set of both caching and persistence data stores that currently includes Cassandra, Aurora, CockroachDB, EVCache, Elasticsearch and ZooKeeper.
The impact is large and tangible: the datastores cater to all of the businesses of Netflix, such as Streaming, Studio, Gaming and Tudum. We touch many Netflix customer interactions, including sign-up, sign-in, recommendations, play, viewing history, payments, account settings and My List.
Blogs
- https://netflixtechblog.medium.com/cache-warming-leveraging-ebs-for-moving-petabytes-of-data-adcf7a4a78c3
- https://netflixtechblog.com/announcing-evcache-distributed-in-memory-datastore-for-cloud-c26a698c27f7
- https://netflixtechblog.com/cache-warming-agility-for-a-stateful-service-2d3b1da82642
- https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70
Talks
- Cassandra Serving Netflix @ Scale - Vinay Chella, Netflix
- How Netflix Manages Version Upgrades of Cassandra at Scale
- How Netflix Provisions Optimal Cloud Deployments of Cassandra - Joey Lynch
- Apache Cassandra Meetup Hosted by Netflix
Data Access Team
The Data Access team builds and operates a flexible data gateway that facilitates data abstractions operating at sub-millisecond latencies while allowing Netflix microservices to more easily store, consume, and manage their data. This team holds a substantial responsibility in enabling Netflix microservices to satisfy their ever-growing and evolving data needs. This team is passionate about distributed data systems technology such as Cassandra, Elasticsearch, EVCache, Dynomite, and more. We are active in the open source community and believe in operating what we own. We are a small team responsible for business-critical systems and are committed to a culture of feedback and engineering
Blogs
Data Platform Infrastructure
The Data Platform Infrastructure team acts as a platform for our own data platforms. Our shared infrastructure and tooling enable Netflix to quickly innovate on providing state-of-the-art data and analytics systems to the rest of the company without building bespoke scaffolding for each new system. To do this, we create high-leverage infrastructure, control, and deployment systems that are fine-tuned for the needs of running our data systems at scale.
The team plays an essential role in making it easy and efficient to use the Netflix platform and security products for building the data platform; uniquely, many of our tools and systems are written in Python, so this is a great team to consider if you enjoy working in a variety of languages.