Conciliating AI & Privacy: Rules and Good Practices


2. The 4 pillars to protect personal data in AI systems

Ensuring privacy in AI systems and regulating the use of consumer data, therefore, requires a paradigm shift. This new vision approaches privacy in the context of AI and the associated risks in a more holistic way, with measures designed to regulate the processing of personal data and identify cases of breach.

A number of organizations, companies, and legislators offer different variations, but they all revolve around 4 pillars: explainability, transparency, risk assessment, and auditing.

Explainability: Understanding how the data was used to arrive at a particular decision, and which characteristics played an important role in the conclusion, is no easy task. This requires:

  1. Identifying decisions made by AI
  2. Analyzing specific decisions
  3. Establishing means by which an individual can seek an explanation

Reverse engineering machine learning models can be difficult, if not impossible, and the difficulty increases even further for deep learning models.

The GDPR requires that, for any automated decision “where it produces legal effects concerning him or her or similarly significantly affects him or her” (credit, insurance coverage, etc.) the data subject can have recourse to a human being able to review the decision and explain its logic [2].

Incorporating a human element into the decision-making loop adds a step to the development process, as well as a significant regulatory burden.

Transparency: Having a clear answer to the question “What does the company do with your data?” greatly facilitates the company’s accountability and dispels the concerns of users and partners.

The “privacy policy” is a concrete example. These traditionally long documents, which most consumers never read, can be replaced by statements:

  1. That provide a full description of the nature and manner in which data is collected, used, and protected,
  2. That identify the important use cases of personal data by AI and the various decision-making algorithms implemented in a product.

Risk assessment: This is another requirement of the GDPR for new technologies and for high-risk uses of data. In this context, it is about assessing and mitigating privacy risks in advance, including potential biases in the design of an AI system and in the data provided to that system, as well as the potential impact on users. For instance, Twitter created controversy last September when users demonstrated that its photo cropping algorithm was “racially” biased [3].

Auditing: Whether internal (self-audits) or external (by third-party organizations), audits remain necessary for monitoring and compliance. Because audits are retrospective in nature, a good strategy is to combine their results for AI decisions with risk assessments, which are proactive. This can better inform the company’s position on AI and privacy, although, as with explainability, auditing machine learning algorithms is difficult and still under development.

3. AI and privacy: major technical trends

Satisfying all the previously listed constraints while preserving the performance of the AI system may seem difficult, if not impossible, but there are several solutions to explore, in particular Federated Learning and Differential Privacy.

Federated Learning is an innovative approach to training machine learning models in a decentralized manner. Decoupling the training process from the need to store data centrally allows companies to unlock more capacity while reducing storage costs and privacy risks.

This learning method goes beyond the use of local models that make predictions on the user’s devices, towards a real collaboration between the devices that are members of this architecture:

  • The same machine learning model is first deployed on all devices,
  • Each device then runs the training phase on its local data and updates the model’s weights locally,
  • The devices send these weight updates separately to the cloud or central server over encrypted communication,
  • The updates are then aggregated and averaged to produce the new weights of the shared model.

Most importantly, all training data remains on the user’s device, and no individual updates are stored in an identifiable way in the cloud.
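The round described above can be sketched in a few lines. This is a minimal, hypothetical FedAvg-style simulation on one machine (simulated devices, a toy linear model, unweighted averaging of weights), not the actual implementation used by Gboard or any production system:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One device trains the shared model on its local data.
    The raw data (X, y) never leaves this function, i.e. the device."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

def federated_round(global_weights, device_datasets):
    """Each device trains locally; the server only sees and averages weights."""
    updates = [local_update(global_weights, X, y) for X, y in device_datasets]
    return np.mean(updates, axis=0)  # simple unweighted average of the updates

# Simulate 3 devices, each holding private data from the same underlying task
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
devices = []
for _ in range(3):
    X = rng.normal(size=(20, 2))
    devices.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(50):  # 50 communication rounds
    w = federated_round(w, devices)
print(w)  # approaches true_w = [2, -1]
```

A production system would additionally weight each update by the device's sample count and secure the aggregation step, but the core loop — deploy, train locally, send weights, average — is the same.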

Communication in a federated learning architecture [4]

A flagship example of this type of application is the “Gboard” keyboard developed for Android smartphones. When Gboard displays a suggestion, the phone locally stores information about the current context and whether the suggestion was clicked (was it helpful or not?). The federated learning model then trains locally on this history, and millions of Android devices share their results (new weights) to improve the next iteration of Gboard’s AI model.

How the Google Keyboard service works [5]

Differential Privacy transforms data from its raw state into a format that allows organizations to learn from the majority of that data while simultaneously ensuring that the results will not distinguish or re-identify the data of an individual.

Since the early 2000s, research has shown that 87% of the American population can be uniquely identified by the combination {date of birth, gender, zip code} [6]. In 2007 an anonymized database published by Netflix was reverse-engineered by researchers who unmasked the preferences and opinions of 500,000 users of the platform [7].
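To see why such quasi-identifiers are dangerous, consider a toy, entirely fabricated table with names removed. Counting how many records share each {date of birth, gender, zip code} combination reveals which rows single out exactly one person:

```python
from collections import Counter

# Hypothetical "anonymized" records: names are gone, but the
# quasi-identifier {dob, gender, zip} remains.
anonymized = [
    {"dob": "1985-03-12", "gender": "F", "zip": "02139"},
    {"dob": "1990-07-01", "gender": "M", "zip": "02139"},
    {"dob": "1985-03-12", "gender": "F", "zip": "94103"},
    {"dob": "1990-07-01", "gender": "M", "zip": "02139"},
]

# Count how many records share each quasi-identifier combination
counts = Counter((r["dob"], r["gender"], r["zip"]) for r in anonymized)
unique = [combo for combo, n in counts.items() if n == 1]
print(len(unique), "of", len(counts), "combinations match exactly one person")
```

Any combination matching exactly one record can be linked against a public source (a voter roll, a social profile) to re-identify that individual, which is precisely the attack used against the Netflix dataset.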

Major players such as Microsoft, Google, and Apple have turned to differential privacy to help ensure the confidentiality of sensitive data. This attention from large tech companies has helped push differential privacy out of research labs and integrate it into application design and product development.

Differential privacy is now also adopted by SMEs (small and medium-sized enterprises) and by software startups, because of the value it adds.

Example of an anonymized dataset

The differential privacy mechanism essentially adds noise (typically Gaussian or Laplacian) to the raw data to achieve a quantifiable level of confidentiality. By knowing this level, we can estimate the maximum amount of information that can be disclosed in our dataset. There are two main methods:

Local differential privacy = noise is added to each individual data point in the dataset (either by a company employee once the data has been collected, or by the individuals themselves before making their data available to the company).

Diagram of Local Differential Privacy

Global Differential Privacy = Noise needed to protect individual privacy is added to the output of the raw data query.

Diagram of Global Differential Privacy

In general, Global Differential Privacy can produce more accurate results than Local Differential Privacy at the same level of privacy. On the other hand, with Global Differential Privacy, individuals providing their data must trust the receiving entity to add the noise needed to protect their privacy.
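Both variants can be sketched with the Laplace mechanism on a simple counting query. This is a minimal illustration with an assumed epsilon and a made-up dataset, not a production-grade DP library:

```python
import math
import random

random.seed(0)

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling (stdlib only)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Each record is 1 ("has the attribute") or 0; the query counts the 1s.
data = [1, 0, 1, 1, 0, 1, 0, 1]

def global_dp_count(data, epsilon):
    """Global DP: noise is added once, to the output of the query.
    A counting query has sensitivity 1, so the Laplace scale is 1/epsilon."""
    return sum(data) + laplace_noise(1.0 / epsilon)

def local_dp_count(data, epsilon):
    """Local DP: each individual perturbs their own value before sharing it."""
    return sum(x + laplace_noise(1.0 / epsilon) for x in data)

print("true count:", sum(data))
print("global DP estimate:", global_dp_count(data, epsilon=1.0))
print("local DP estimate:", local_dp_count(data, epsilon=1.0))
```

Note that for the same epsilon, the local estimate sums one noise sample per individual, so its error variance grows with the number of participants, while the global estimate adds a single noise sample — which is exactly the accuracy gap described above.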


The alliance between the performance of AI and respect for data privacy is a difficult quest, but one full of opportunities and promise. It is clear that no measure applied in isolation can be completely effective in preventing abuses. So, where the decisions of intelligent algorithms are consequential, it makes sense to combine measures at several levels and make them work together.

Note: This article was originally published in its French version here

References:


