Optimizing AWS Data Pipelines for Compliance in Digital Advertising

Infrastructure & DevOps Modernization

Learn how we helped an advertising customer set up automated, cost-effective pipelines to ensure compliance for sensitive data within their existing processes.

TV advertising is a major source of revenue for the producers of television programs. Although almost all streaming services charge monthly subscription fees, significant revenue still comes from advertising. The process of deciding which ads to present to which viewers at which times is both complex and critical for the advertising ecosystem. Advertisers want to know how many people are watching their ads and which actions those viewers might take in response. Studies suggest that most people try to avoid watching advertisements, but are more willing to watch ads targeted to their interests.

It is therefore in the interest of advertisers and TV providers to know as much as possible about viewers in order to serve the most targeted ads possible. Several states have responded to this increased data collection with legislation that puts guardrails around the collection of such data. California is one such state, and most state privacy regulations follow a similar pattern.


Overview of a State Privacy Act

The California Consumer Privacy Act of 2018 (CCPA) gives consumers a set of rights to protect the personal information gathered about them by businesses, including those in the TV industry.

These rights include:

  • The right to know what personal information a business collects about them and how it is used and shared
  • The right to delete personal information
  • The right to opt out of the sale or sharing of personal information

In 2020, the CCPA was strengthened by Proposition 24, which added:

  • The right to correct inaccurate information
  • The right to limit the use and disclosure of such information

Given these legal requirements to restrict the collection of information, or even to delete previously collected information on request, TV advertising systems must add processes and controls for handling such data.


Let's Look at a Case Study

We engaged with an advertising provider whose existing pipeline of data collection, data processing, and data analysis is the lifeblood of the company. They needed a solution that could quickly and easily fit into their existing process and tooling, so that the new compliance standards would not disrupt their business.

The company receives audience data from a variety of sources, and in a variety of formats. This data is initially stored in its raw form in Amazon S3, under a prefix structure keyed by the ID of the vendor providing the data.
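The exact key layout is not specified in the engagement details, but a vendor-keyed prefix scheme like the one sketched below is typical; the path structure and helper are illustrative assumptions:

```python
# Hypothetical S3 key layout: raw/<vendor_id>/<YYYY>/<MM>/<DD>/<file>
# This helper extracts the vendor ID from an incoming object key so the
# pipeline can look up that vendor's schema and redaction rules.
def vendor_from_key(key: str) -> str:
    parts = key.split("/")
    if len(parts) < 2 or parts[0] != "raw":
        raise ValueError(f"unexpected key layout: {key}")
    return parts[1]

print(vendor_from_key("raw/vendor-123/2023/08/01/audience.csv"))  # vendor-123
```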

Each data file contains data that may or may not be affected by various compliance regulations. Certain fields must be redacted while still maintaining enough data integrity to perform advertisement personalization. Depending on the particular state regulation, this redaction may be required at data ingest time or upon receipt of a redaction request from a consumer. Therefore, one or more data transformation pipelines are required.
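A minimal, stdlib-only sketch of that idea: each vendor maps to a list of sensitive fields, and redaction replaces each such value with a deterministic hash so records remain joinable for personalization. The vendor IDs, field names, and choice of SHA-256 are illustrative assumptions; the production pipeline ran this logic in Glue/PySpark.

```python
import hashlib

# Hypothetical per-vendor redaction rules: which fields are sensitive.
SENSITIVE_FIELDS = {
    "vendor-123": ["ip_address", "email"],
    "vendor-456": ["device_id"],
}

def redact(record: dict, vendor_id: str) -> dict:
    """Replace sensitive field values with a deterministic SHA-256 digest.

    A deterministic hash removes the raw value while keeping equal inputs
    equal, so downstream joins and frequency counts still work.
    """
    out = dict(record)
    for field in SENSITIVE_FIELDS.get(vendor_id, []):
        if field in out and out[field] is not None:
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()
    return out

row = {"ip_address": "203.0.113.7", "zip": "02134"}
clean = redact(row, "vendor-123")
```

Non-sensitive fields such as the postal code pass through untouched, which is what preserves enough signal for personalization.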

The volume of data to be processed was significant: on the order of 10,000 files a day, totaling as much as 500 GB. This volume required deep knowledge of Glue and PySpark to achieve reasonable costs and to avoid resource exhaustion. AWS charges an hourly rate based on the number of data processing units (DPUs) used to run an ETL job. A single standard DPU provides 4 vCPU and 16 GB of memory, whereas a high-memory DPU (M-DPU) provides 4 vCPU and 32 GB of memory. AWS bills jobs and development endpoints per second, rounded up to the nearest second.
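To make the cost model concrete, here is the billing arithmetic for a hypothetical run; the job size and duration are illustrative, not the customer's actual numbers:

```python
# Glue charges per DPU-hour, billed by the second. Example: a job using
# 10 standard DPUs that runs for 15 minutes at the standard rate.
dpus = 10
runtime_hours = 15 / 60          # 0.25 h
rate_per_dpu_hour = 0.44         # standard execution class, USD

cost = dpus * runtime_hours * rate_per_dpu_hour
print(f"${cost:.2f}")  # $1.10
```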

The Solution Approach

Caylent built a customized data redaction pipeline in AWS Glue to recognize, store, and control the ETL flows for sensitive data, ensuring compliance while fitting into the customer's existing processes. The Glue pipeline uses customer-defined field lists to take incoming files from different vendors, across multiple file types, and determine whether the incoming data is sensitive. Unique schemas were handled for each vendor and file type, with the output converted to the customer's preferred format, Parquet.

Caylent recommended and implemented specific job types and execution settings to lower the key cost driver (DPU hours) in this process. Leveraging deep PySpark expertise, we proactively modified the data pipelines to avoid hitting resource limits in the existing tooling. The pipeline was also built to control everything within the ETL job itself, which gives the customer the ability to customize it and work with variable volumes of data.
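Controlling behavior from within the ETL job usually means driving it with job arguments. In a real Glue script these would arrive via awsglue.utils.getResolvedOptions; the stdlib parser below only sketches the idea, and the argument names are assumptions:

```python
import argparse

# Sketch of parsing job arguments that control an ETL run. In an actual
# Glue job, these values would come from getResolvedOptions(sys.argv, [...]).
def parse_job_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--vendor_id", required=True)
    parser.add_argument("--sensitive_fields", default="",
                        help="comma-separated list of fields to redact")
    args = parser.parse_args(argv)
    fields = [f for f in args.sensitive_fields.split(",") if f]
    return args.vendor_id, fields

vendor, fields = parse_job_args(
    ["--vendor_id", "vendor-123", "--sensitive_fields", "ip_address,email"])
```

Keeping these knobs in job arguments is what lets the same job definition handle different vendors and variable data volumes without code changes.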

The development approach was to first validate the design by redacting a single sensitive field, the IP address. Once the approach was validated, the system was generalized to allow per-vendor specification of sensitive fields.

To further reduce costs, the approach took advantage of the fact that while data redaction must be done in a timely manner, it does not need to be real-time. The Glue jobs are therefore created with the EXECUTION_CLASS set to FLEX. This marks the jobs as time-insensitive, which reduces the per-hour DPU cost from $0.44 to $0.29, a saving of roughly 34%.
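The FLEX setting is applied when the job is created or started. The sketch below shows the boto3 job-creation parameters; the job name, role ARN, and script location are placeholders, and the actual AWS call is left commented out:

```python
# Parameters for creating a time-insensitive Glue job with the FLEX
# execution class (supported on Glue 3.0+ for glueetl jobs).
job_params = {
    "Name": "vendor-redaction-job",                       # placeholder
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole", # placeholder
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/redact.py",
    },
    "GlueVersion": "4.0",
    "ExecutionClass": "FLEX",  # time-insensitive: $0.29/DPU-hour vs $0.44
}

# import boto3
# boto3.client("glue").create_job(**job_params)

savings = 1 - 0.29 / 0.44
print(f"{savings:.0%}")  # 34%
```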

From migrating and modernizing your infrastructure, to building cloud native applications and leveraging data for insights, to implementing DevOps practices within your organization, Caylent can help set you up for innovation on the AWS Cloud. Get in touch with our team to discuss how we can help you achieve your goals.

Brian Tarbox


Brian is an AWS Community Hero, Alexa Champion, runs the Boston AWS User Group, has ten US patents and a bunch of certifications. He's also part of the New Voices mentorship program where Heros teach traditionally underrepresented engineers how to give presentations. He is a private pilot, a rescue scuba diver and got his Masters in Cognitive Psychology working with bottlenosed dolphins.
