By Andrew Moon
June 11, 2021
Utilizing data while protecting the privacy of individuals is a common challenge across organizations. Privacy engineering is an increasingly popular approach to balancing data collection with respectful data use. Within the discipline, differential privacy is one potential technique for analyzing large datasets while preserving the privacy of individuals within the dataset.
Ryan Rogers, Staff Software Engineer at LinkedIn, joined our Privacy_Infra() event to talk about how they built an audience engagement API that leverages differential privacy to protect user information while providing data insights that enable marketing analytics-related applications.
You can watch a recording of Ryan’s talk below, starting at 38:25.
According to Ryan, the project was a collaborative effort among multiple teams including data science applied research, backend infrastructure, and marketing solutions.
He noted that compliance was only one reason they chose to invest in this privacy system. Other reasons included a commitment to prioritize the experience of LinkedIn members and to defend their privacy against potential attacks such as reconstruction, differencing, and inference attacks.
“Differential privacy introduces noise or randomness into the computation so that the result that you get of your algorithm or of your computation is randomized,” Ryan explained. “So you get an actual distribution of all possible outcomes.”
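As a rough illustration of what introducing noise into a computation can look like, here is a minimal Python sketch that adds Laplace noise to a count. This is a common textbook mechanism, not necessarily what LinkedIn uses. The function names, the example count of 1000, and the assumption that one person changes the count by at most 1 are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return the count plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0  # assumption: one person changes the count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Each call returns a different randomized answer around the true count.
    for _ in range(5):
        print(round(noisy_count(1000, epsilon=0.5, rng=rng), 2))
```

Running the same query repeatedly yields a spread of answers centered on the true count, which is the "distribution of all possible outcomes" described in the quote. A smaller epsilon in this sketch widens that spread.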
The goal of differential privacy is to protect the identity of individuals within the dataset by providing computational results that don’t depend significantly on whether any specific individual’s data is included. It does this by bounding how far apart the output distributions can be when the same computation is run with and without a single person’s record.
“In mathematics we say that a randomized algorithm, one that introduces noise, is differentially private if for any two neighboring datasets, differing in [at most] one person’s record, the output distributions are close to one another,” Ryan continued.
Two parameters determine closeness. One is epsilon, commonly referred to as the “privacy loss” parameter. The smaller the epsilon, the closer together the distributions must be, and the smaller your privacy loss is. A large epsilon allows the distributions to be far apart, which means less privacy for the individuals in the dataset. The second parameter is delta, an additive factor that, roughly speaking, lets the epsilon bound fail with small probability, so the privacy loss is bounded most of the time.
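For reference, the standard textbook form of this definition can be written as the inequality below. The symbols are notation introduced here rather than taken from the talk: M is the randomized algorithm, D and D' are any two neighboring datasets, and S is any set of possible outputs.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta
```

When epsilon is close to zero, the factor e^ε is close to 1, so the two output distributions must be nearly identical. Delta is the additive slack on top of that multiplicative bound.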
LinkedIn’s use case includes their audience engagement API, which provides insights on content and audience data to external marketing partners. It’s built on top of Pinot, an open source project for fast, real-time data analytics. According to Ryan, the questions they needed to answer for this project included how much a single user can impact the outcome of analytics queries and how many queries an advertiser should be allowed to ask.
To address these questions, Ryan explained, LinkedIn built a privacy system with a budget management service to enforce a differential privacy budget on the returned results. This keeps analysts from reconstructing the dataset by running a large number of queries against it.
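To make the idea of a privacy budget concrete, here is a minimal Python sketch of a budget tracker. It is not LinkedIn’s budget management service. It assumes the simplest accounting rule, basic composition, where the epsilon values of successive queries add up. The class, method, and analyst names are hypothetical.

```python
from typing import Dict


class BudgetExceeded(Exception):
    """Raised when a query would push an analyst past the epsilon budget."""


class PrivacyBudget:
    """Tracks epsilon spent per analyst, assuming basic (additive) composition."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent: Dict[str, float] = {}

    def charge(self, analyst_id: str, epsilon: float) -> float:
        """Deduct epsilon for one query and return the remaining budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        used = self.spent.get(analyst_id, 0.0)
        if used + epsilon > self.total_epsilon:
            raise BudgetExceeded(f"{analyst_id} has no budget left for this query")
        self.spent[analyst_id] = used + epsilon
        return self.total_epsilon - self.spent[analyst_id]


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    print(budget.charge("advertiser-a", 0.5))  # 0.5
    print(budget.charge("advertiser-a", 0.5))  # 0.0
    try:
        budget.charge("advertiser-a", 0.5)  # third query is refused
    except BudgetExceeded as err:
        print("rejected:", err)
```

In this sketch, once an analyst’s budget is spent, further queries are refused instead of answered, which caps how much any sequence of queries can reveal about the underlying data.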
Note: This post reflects information and opinions shared by speakers at Transcend’s ongoing privacy_infra() event series, which features industry-wide tech talks highlighting new thinking in data privacy engineering every other month. If you’re working on solving universal privacy challenges and interested in speaking about it, submit a proposal here.