As we enter the era of big data, data privacy has become a hot topic. Tech giants such as Facebook, Apple, Amazon and Google pervade users' personal lives and social interactions, accumulating a vast pool of data at every moment, often at the cost of privacy.
So, how should privacy be protected in an environment where data is stored and shared at an escalating pace and with growing ingenuity? And is preserving privacy on the basis of traditional laws and regulations alone sufficient? No; it demands the support of dedicated privacy-protection techniques.
Various privacy-preservation techniques allow us to perform big data analysis, in the form of statistical estimation, statistical learning, data (text) mining, and so on, while guaranteeing the privacy of individual participants. One such approach is differential privacy.
Differential privacy is a modern approach to data privacy whose proponents claim it protects personal data far better than traditional methods. Let's explore this privacy-preservation technique.
“Data is the pollution problem of the information age, and protecting privacy is the environmental challenge.” – Bruce Schneier
Definition of Differential Privacy
Differential privacy is a technique that enables researchers and database analysts to obtain useful information from databases containing people's personal information without divulging the identities of individuals.
This is achieved by introducing a minimal distortion into the information provided by the database. The distortion is large enough to protect privacy, yet limited enough that the information supplied to analysts remains useful.
Put simply, differential privacy anonymizes data by carefully injecting noise into the dataset. It allows data experts to perform any useful statistical analysis without identifying personal information. Such datasets contain the information of thousands of individuals, helping to solve public issues while withholding information about the individuals themselves.
Differential privacy can be applied to everything from recommendation systems and social networks to location-based services. For example:
Apple employs differential privacy to accumulate anonymous usage insights from devices such as iPhones, iPads and Macs.
Amazon uses differential privacy to access users' personalized shopping preferences while concealing sensitive information about their past purchases.
Facebook uses it to gather behavioural data for targeted advertising campaigns without violating any nation's privacy policies.
Variants of differentially private algorithms are employed in machine learning, game theory and economic mechanism design, statistical estimation, and many other areas.
Imagine you have two otherwise identical databases, one with your information in it and one without it. Differential privacy ensures that the probability that a statistical query will produce a given result is (nearly) the same whether it is run on the first or the second database.
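As a concrete sketch (the data and function names here are hypothetical), the Python snippet below answers a simple count query with the Laplace mechanism on two such neighbouring databases; for a small epsilon, the noisy answers are hard to tell apart:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(db, predicate, epsilon: float) -> float:
    """Answer 'how many records satisfy predicate?' with epsilon-DP.
    Adding or removing one record changes a count by at most 1
    (sensitivity = 1), so Laplace noise of scale 1/epsilon suffices."""
    true_count = sum(1 for row in db if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)

# Two neighbouring databases: identical except for one record ("you").
db_without_you = [{"age": 34}, {"age": 51}, {"age": 29}]
db_with_you = db_without_you + [{"age": 42}]

over_40 = lambda row: row["age"] > 40
print(private_count(db_without_you, over_40, epsilon=0.5))
print(private_count(db_with_you, over_40, epsilon=0.5))
```

With epsilon = 0.5 the noise scale is 2, so a single observed answer gives an observer very little evidence about which of the two databases produced it.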
Differentially Private Algorithms
“Differential privacy is a formal mathematical definition of privacy.”
For example, consider an algorithm that analyzes a dataset and computes statistics such as its mean, median and mode. Such an algorithm is differentially private only if, by examining its output, one cannot tell whether any particular individual's data was included in the original dataset.
In its simplest form, a differentially private algorithm guarantees that its behaviour hardly changes when a single individual joins or leaves the dataset. In other words, the output it produces on a database containing some individual's information is almost the same as the output it produces on a database without that individual's information. This guarantee holds for any individual and any dataset.
Thus, regardless of how distinctive an individual's information is, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds, giving a formal assurance that individual-level information about participants in the database will not be leaked.
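As a sketch of such an algorithm (the names and data are illustrative, and this is one common construction rather than the only one), the snippet below computes a differentially private mean by clipping each value to a known range and adding Laplace noise scaled to the query's sensitivity:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """One sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_mean(values, lo, hi, epsilon: float) -> float:
    """Epsilon-DP mean of n values assumed to lie in [lo, hi].
    Replacing one record changes the clipped sum by at most (hi - lo),
    so the mean's sensitivity is (hi - lo) / n, and Laplace noise of
    scale sensitivity / epsilon suffices."""
    n = len(values)
    clipped = [min(max(v, lo), hi) for v in values]
    sensitivity = (hi - lo) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon)

salaries = [48_000, 52_000, 61_000, 75_000, 49_000]
print(private_mean(salaries, lo=0, hi=100_000, epsilon=1.0))
```

Because the sensitivity shrinks as 1/n, larger datasets need less noise for the same epsilon, which is why aggregate statistics over many people can stay accurate while individuals stay protected.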
What does it guarantee?
Differential privacy mathematically guarantees that anyone observing the outcome of a differentially private analysis will draw essentially the same inference about an individual's private information, whether or not that individual's data was included in the input to the analysis.
It also provides a proven mathematical assurance of protection against a wide range of privacy attacks, such as differencing attacks, linkage attacks and others.
What doesn’t it guarantee?
Differential privacy cannot guarantee that what one considers a secret will remain secret. To benefit from differentially private algorithms and to limit loss, it is important to recognize which information is general and which is private.
Since differential privacy protects individual-specific information, it cannot protect a secret that can be inferred from general information alone.
Characteristics of Differential Privacy
Differential privacy has worthwhile characteristics that make it a rich framework for analyzing sensitive personal information while preserving privacy. Some of them follow.
Quantifying the privacy loss
Under differentially private mechanisms and algorithms, privacy loss can be measured, which enables comparisons between different techniques. Privacy loss is also controllable, establishing a trade-off between privacy loss and the accuracy of the aggregate information.
Quantifying loss permits the control and analysis of cumulative privacy losses across multiple computations; likewise, understanding the behaviour of differentially private mechanisms under composition permits the design and analysis of complex differentially private algorithms from simpler differentially private building blocks.
Differential privacy also allows the control and analysis of the privacy loss incurred by groups (such as families).
Closure under post-processing
Differential privacy is immune to post-processing: a data professional cannot compute a function of the output of a differentially private algorithm and thereby make it less differentially private without additional knowledge about the private database.
Benefits of Differential Privacy
Differential privacy has various advantages over traditional privacy techniques:
By assuming all available information is identifying information, differential privacy removes the challenging task of accounting for every identifying element of the data.
Differential privacy is resistant to privacy attacks based on auxiliary information, so it can effectively impede the linkage attacks that are possible against de-identified data.
Differential privacy is compositional: one can compute the privacy loss of conducting two differentially private analyses over the same data by summing the individual privacy losses of the two analyses.
Here, compositionality means that meaningful privacy guarantees can still be made when multiple analysis results are released from the same data. Techniques such as de-identification are not compositional, and multiple releases under those approaches can lead to a catastrophic loss of privacy.
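A minimal sketch of this bookkeeping (the class name and API are hypothetical, and basic sequential composition is only the simplest of the known composition theorems):

```python
class PrivacyAccountant:
    """Tracks cumulative privacy loss under basic sequential composition:
    running an eps1-DP analysis followed by an eps2-DP analysis on the
    same data is (eps1 + eps2)-DP."""

    def __init__(self, budget: float):
        self.budget = budget   # total privacy loss we are willing to incur
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        """Record one epsilon-DP analysis, refusing it over budget."""
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

accountant = PrivacyAccountant(budget=1.0)
accountant.charge(0.4)   # first differentially private analysis
accountant.charge(0.5)   # second analysis on the same data
print(accountant.spent)  # prints 0.9; a further 0.2-DP query would be refused
```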
These advantages are the essential reasons differential privacy is chosen over other data-privacy techniques.
That said, as a relatively new tool, differential privacy's standards and best practices are not readily available outside the research community.
However, this limitation is expected to be overcome over time, owing to the rising demand for robust and easy-to-implement data-privacy solutions.
How does Differential Privacy Work?
Conventional data-protection techniques treated privacy as a characteristic of an analysis's output. Differential privacy instead treats it as an attribute of the analysis itself.
Differential privacy preserves an individual's privacy by adding random noise to the data while conducting the analysis. Because of the introduced noise, it is not possible to identify an individual's information from the analysis's outcome.
After adding noise, however, the output of the analysis is an approximation rather than the exact result that would have been obtained from the actual dataset. Moreover, if a differentially private analysis is performed multiple times, it may yield a different outcome each time, because the noise introduced into the computation is random.
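For instance (a hypothetical sketch), asking the same epsilon-DP count query five times returns a different approximation of the true answer on each run:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """One sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

ages = [34, 51, 29, 42]
true_count = sum(1 for a in ages if a > 40)  # the exact answer is 2

# The same epsilon = 1 query, asked five times: each answer is a fresh
# noisy approximation of 2, because the noise is drawn independently.
for _ in range(5):
    print(true_count + laplace_noise(1.0))
```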
Ɛ (epsilon): the privacy-loss parameter, which determines how much noise is introduced. The noise is drawn from a probability distribution known as the Laplace distribution, and epsilon bounds how much the computation can deviate when any one individual's data is excluded from the dataset.
The smaller the epsilon, the smaller the deviation in the computation when any one user's data is removed from the dataset. In other words, higher values of epsilon yield more accurate but less private results, while lower values of epsilon produce highly randomized results from which attackers cannot learn much at all.
Thus, a small value of epsilon leads to stronger data protection, even though the computation's results will be less accurate. However, no optimal value of epsilon has yet been determined that guarantees the required level of both data protection and accuracy; each deployment must set it according to the trade-off between privacy and accuracy that its users are willing to make.
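The trade-off is visible directly in the noise scale. For a query of sensitivity 1, the Laplace mechanism adds noise of scale 1/Ɛ, so shrinking Ɛ by a factor of ten makes the answers ten times noisier (a small illustrative calculation):

```python
import math

# Laplace noise of scale b has standard deviation sqrt(2) * b.
# For a sensitivity-1 query the scale is 1/epsilon, so a smaller
# epsilon means a noisier, more private answer.
for epsilon in (0.01, 0.1, 1.0, 10.0):
    scale = 1.0 / epsilon
    stddev = math.sqrt(2.0) * scale
    print(f"epsilon={epsilon:<5} noise scale={scale:<6.1f} std dev={stddev:.2f}")
```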
Those are the fundamentals of how differential privacy works. Knowing that, how can we ensure that the data remains valuable while individuals' privacy is preserved?
With data-driven approaches, a data analyst has to make good decisions about how to analyze data while protecting personally identifiable information, and differential privacy allows us to do exactly that.
Differential privacy can be achieved by adding randomized noise to an aggregate query's result, protecting individual entries without significantly altering the result.
Differentially private algorithms guarantee that attackers can learn virtually nothing more about an individual than they would learn if that individual's record were absent from the dataset.
Differentially private algorithms are already implemented in privacy-preserving analytics products and remain an active field of research.