Overview of a State Privacy Act
The California Consumer Privacy Act of 2018 (CCPA) gives consumers a set of rights over the personal information that businesses, including those in the TV industry, collect about them.
These rights include:
- The right to know what personal information a business collects about them and how it is used and shared
- The right to delete personal information
- The right to opt out of the sale or sharing of personal information
In 2020, the CCPA was strengthened by Proposition 24 (the California Privacy Rights Act), which added:
- The right to correct inaccurate personal information
- The right to limit the use and disclosure of sensitive personal information
Because these laws let consumers restrict the collection of their information, and even request that previously collected information be deleted, TV advertising systems must add processes and controls for handling such data.
Let's Look at a Case Study
We engaged with an advertising provider that has an existing pipeline of data collection, data processing and data analysis, which is the lifeblood of the company. They needed a solution that could quickly and easily fit into their existing process and tooling so the new compliance standards would not disrupt their business.
The company receives audience data from a variety of sources and in a variety of formats. This data is initially stored in its raw form in S3, with a prefix structure based on the ID of the vendor providing the data.
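As a minimal sketch of how a job might recover the vendor ID from that prefix structure, the snippet below parses an S3 object key. The exact key layout is not given here, so the `raw/<vendor_id>/...` shape, the `RAW_PREFIX` constant, and the `vendor_id_from_key` helper are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical top-level prefix for raw vendor uploads (an assumption).
RAW_PREFIX = "raw/"


def vendor_id_from_key(key: str) -> Optional[str]:
    """Return the vendor ID encoded in an S3 object key, or None if absent.

    Assumes keys look like 'raw/<vendor_id>/<rest of path to the file>'.
    """
    if not key.startswith(RAW_PREFIX):
        return None
    remainder = key[len(RAW_PREFIX):]
    vendor_id, sep, rest = remainder.partition("/")
    # Require a vendor segment, a separator, and a non-empty file path after it.
    if not vendor_id or not sep or not rest:
        return None
    return vendor_id
```

Keeping this lookup in one small function means a change to the prefix layout only has to be made in one place.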
Each data file contains data that may or may not be affected by various compliance regulations. Certain fields must be redacted while still preserving enough data integrity to perform advertisement personalization. Depending on the particular state regulation, this redaction may be required at data ingest time or upon receipt of a redaction request from a customer. One or more data transformation pipelines are therefore required.
The volume of data to be processed was significant: up to 10,000 files a day, totaling as much as 500 GB. This volume required deep knowledge of Glue and PySpark to achieve reasonable costs and to avoid resource exhaustion. AWS charges an hourly rate based on the number of data processing units (DPUs) used to run an ETL job. A single standard DPU provides 4 vCPU and 16 GB of memory, whereas a high-memory DPU (M-DPU) provides 4 vCPU and 32 GB of memory. AWS bills for jobs and development endpoints in increments of 1 second, rounded up to the nearest second.
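The billing model described above reduces to simple arithmetic, sketched below. The `glue_job_cost` helper is hypothetical and deliberately simplified: it takes the per-DPU-hour rate as an input, rounds the runtime up to whole seconds, and ignores any minimum billing duration that may apply.

```python
import math


def glue_job_cost(dpus: float, runtime_seconds: float, rate_per_dpu_hour: float) -> float:
    """Estimate the cost in USD of one job run under per-second DPU billing.

    Simplified sketch: no minimum billing duration is modeled.
    """
    # Billing is per second, rounded up to the nearest whole second.
    billed_seconds = math.ceil(runtime_seconds)
    dpu_hours = dpus * billed_seconds / 3600
    return dpu_hours * rate_per_dpu_hour
```

For example, a run that uses 10 DPUs for 15 minutes consumes 2.5 DPU-hours; at the standard rate of $0.44 per DPU-hour quoted later in this article, `glue_job_cost(10, 900, 0.44)` comes to about $1.10.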
The Solution Approach
Caylent built a customized data redaction pipeline in AWS Glue to recognize, store, and control sensitive data, so that the customer could meet its compliance requirements without disrupting its existing processes. The Glue pipeline uses customer-defined fields to take incoming files from different vendors, across multiple file types, and determine whether the incoming data is sensitive. Each vendor and file type has its own schema, and the processed data is converted back into the preferred file format, Parquet.
Caylent recommended and implemented specific job types and execution settings to reduce the key cost driver in this process, DPU usage. Leveraging Caylent’s deep PySpark expertise, we were able to modify data pipelines preemptively to avoid hitting resource limits in the existing tooling. The pipeline was also built so that everything is controlled from within the ETL job, which lets the customer customize it and work with variable volumes of data.
The development approach was to first validate the design by redacting a single “sensitive” field, the IP address. Once the approach was validated, the system was generalized to allow per-vendor specification of sensitive fields.
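A per-vendor specification of this kind can be pictured as a simple lookup table. The sketch below is plain Python on a single record and is not the actual Glue job: the vendor IDs, field names, `SENSITIVE_FIELDS` table, and `redact_record` helper are all hypothetical, and blanking a field to `None` is just one possible redaction strategy.

```python
from typing import Any, Dict, Set

# Hypothetical per-vendor list of fields to redact (illustrative names only).
SENSITIVE_FIELDS: Dict[str, Set[str]] = {
    "vendor-a": {"ip_address"},
    "vendor-b": {"ip_address", "device_id"},
}


def redact_record(vendor_id: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with the vendor's sensitive fields set to None.

    Non-sensitive fields are passed through unchanged so the record stays
    usable downstream. Vendors with no entry are left unredacted here.
    """
    sensitive = SENSITIVE_FIELDS.get(vendor_id, set())
    return {
        field: (None if field in sensitive else value)
        for field, value in record.items()
    }
```

In a Spark job the same table would more likely drive column-level transformations on a DataFrame. A production pipeline might also choose to reject files from vendors that have no entry, rather than passing them through as this sketch does.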
To further save costs, the approach made use of the fact that, while the data redaction needs to be done in a timely manner, it does not need to happen in real time. Therefore, the Glue jobs that are created have the EXECUTION_CLASS set to FLEX. This allows the jobs to be run as time-insensitive, which reduces the per-hour DPU cost from $0.44 to $0.29, a saving of about 34%.
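The saving can be checked directly from the two rates, and the job setting itself is small. The sketch below only builds a dictionary of job settings and never calls AWS. It assumes the job is created through boto3, whose Glue `create_job` call accepts an `ExecutionClass` of `FLEX` or `STANDARD`. The job name, role, script location, Glue version, worker type, and worker count are placeholders, not values from this project.

```python
STANDARD_RATE = 0.44  # USD per DPU-hour, as quoted above
FLEX_RATE = 0.29      # USD per DPU-hour, as quoted above


def flex_saving_fraction(standard_rate: float, flex_rate: float) -> float:
    """Fraction of the standard DPU-hour price saved by running as flex."""
    return (standard_rate - flex_rate) / standard_rate


def flex_job_definition(name: str, role_arn: str, script_location: str) -> dict:
    """Build keyword arguments for a flex-class Glue job.

    Everything except ExecutionClass is an illustrative placeholder.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",      # assumption
        "WorkerType": "G.1X",      # assumption
        "NumberOfWorkers": 10,     # assumption
        "ExecutionClass": "FLEX",  # the time-insensitive execution class
    }
```

With a configured client, the dictionary could be passed as `glue_client.create_job(**flex_job_definition(...))`. Flex jobs may wait for capacity before they start, which is why the setting suits work that is timely but not real-time.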
From migrating and modernizing your infrastructure, building cloud native applications & leveraging data for insights, to implementing DevOps practices within your organization, Caylent can help set you up for innovation on the AWS Cloud. Get in touch with our team to discuss how we can help you achieve your goals.