Web scraping and personal data protection: the limits of the GDPR's application

Web scraping: is it legal or illegal? Let's start with the latest news, which comes from the United States: the Court of Appeals for the Ninth Circuit has confirmed, in favor of hiQ Labs Inc., a company active in the field of artificial intelligence, the preliminary injunction against LinkedIn that it had already obtained in 2019. Basically, the social network provider cannot prevent its data-hungry counterpart from collecting the contents of its users' profiles en masse. Why are we interested in all this?

For three reasons. First, because these personal data are ours, whether literally or by empathy. Second, because in the European Union the activity would in principle be illegal, which concerns us in a global market. Finally, because the story is part of a broader phenomenon of data accumulation, selection and intelligence, with which we have to reckon.

Recall the recent Clearview AI case, on which several European supervisory authorities have ruled: the authority of the Land of Hamburg (decision 545/2020); the French CNIL (Decision no. MED 2021-134 of 1 November 2021); IMY, the Swedish authority (decision of 10 February 2021, DI-2020-2719); and the Italian Guarantor, which by order of 10 February 2022 [9751362] imposed an injunction order for twenty million euros.

The effects of hiQ Labs' activity seem, at first glance, less intrusive in the sphere of data subjects, but the underlying mechanism is the same. The company enriches the data, creates profiles and then markets the results. In particular, as we learn from the ruling, it collects a wide range of information about the social network's users, including work history and skills, which it processes through a proprietary predictive algorithm, ultimately a "black box".

Look not so much at the data as at the processing

As in the Clearview AI case, we would miss the decisive point if we stopped at the self-descriptive element of the word scrape, i.e. "scratching", in short at the extraction of information as such. Of course, mass collection is already in itself a processing operation which, under European Union legislation, would first of all need a legal basis whose soundness must be documented and proven. But we must not fix our attention on the data alone, even though they are the most visible element. What really matters, as a rule and here in particular, is what is done with the data, in other words the intelligence applied to them. And the underlying purpose.


It is no coincidence that the information is obtained, as we again read in the ruling, "in a structured form, which allows subsequent manipulation and analysis". Put bluntly, the decision of the Court of Appeals depicts a scenario, dystopian for us, of a system in which the fundamental body of rules and principles established by the GDPR does not operate. Absences are exactly what give us back the meaning and value of what we have. The EU's comprehensive legislation in fact brings about a Copernican revolution in legal relations, building them around the person.

Not that the decision of the Ninth Circuit fails to address the issue of privacy, but it does so with what, seen from our shore, looks like utter lightness: from the fact that the data are publicly available on the internet it concludes that they can therefore be lawfully acquired by anyone and for any purpose, arriving at the strange conclusion that if something is available you can make it your own. It is a conclusion we have also seen at work in the Italian public sector, even though strict legislation here has been teaching the opposite for a quarter of a century.

This approach reflects a concept of personal data protection that is entirely flattened onto confidentiality, one that has not matured into the legal recognition of an individual's power of information, control and decision over his or her data; in a word, one that has not developed into self-determination. That the information is available to the public matters little; what counts is that it too falls within rules that respect the data subject. Ultimately, what is at stake is the basic principle of transparency and, above all, purpose limitation, enshrined in art. 5.1.b) GDPR.

The Guarantor's precedents on web scraping

The principle of purpose limitation is so crucial that it has characterized the legislation in this field since at least Directive 95/46. It is not surprising that we find it invoked in, or simply underlying, a series of provisions of our Guarantor, some of them dated and some explicitly devoted to web scraping. This reveals the head start, measurable in years, that EU legislation has over non-European experiences.

One example is the GPDP provision of 14 January 2016 [6053915], which declared unlawful the creation of a telephone directory built not from the DBU, i.e. the single database of electronic communications providers, but from information collected automatically through scripts launched against certain online sources.

Likewise, with provision [9105201] of 2019, the Authority censured the collection of addresses for electoral communications gathered by scraping data from the web. The reasoning was summed up as follows: "The easy availability of personal data on the Internet […] does not imply their free availability nor authorizes the processing of such data for any purpose, but – in accordance with the principles of fairness and purpose limitation (see Art. 5(1)(a) and (b), Regulation) – only for the purposes of their publication". In short, we are at the polar opposite of the hiQ Labs ruling.

That said, it would be wrong to overstate the legal significance of the Court of Appeals' decision, as each case is shaped by the procedural strategy that produced it. In this case, LinkedIn chose to rely on one specific statute, the CFAA (Computer Fraud and Abuse Act), alleging conduct comparable, if one wants to draw a parallel, to that punished by our art. 615-ter c.p., unauthorized access to a computer system. In essence, it complained that hiQ Labs' bots had violated its servers. Framing the same complaint differently, for example under the CCPA, if and when applicable, could lead to a different result.

From a European perspective of effective protection, however, what interests us is the extraterritorial application of the GDPR. The key here is to look at the nearest precedent, namely the Clearview AI case: there the company, which specializes in artificial intelligence solutions based on web scraping, was brought within the scope of EU law. Let us see how and within what limits.

The critical problem of applying the GDPR beyond the EU

Here, the most delicate step is the difficulty of finding, in the face of a gross violation of Europeans' rights, a legal connecting factor that makes our protective rules enforceable. In the case of Clearview AI, establishing this connection proved particularly difficult, and it is not certain that the solution found will withstand judicial review. We are truly at the outer edge of the territorial application of the legislation.

The key point is that the company is based in the US and has no European establishments, at least as regards the web scraping we are interested in; it therefore falls well outside the scope of EU law under the rules set out in paragraph 1 of Article 3 GDPR. That leaves the second paragraph, the boldest one, the one that projects our rules onto a global scale.

But only ostensibly: it is true that this company collects, combines, compiles and enriches personal data of subjects located in the Union, and it is true that this is a very intrusive activity in terms of rights, but this may not be enough. As is well known, the above provision sets out two criteria for anchoring a situation to the Regulation: the offering of goods or services directed at the Union market (though, technically, the offer must be addressed to data subjects in the Union and not to controllers) and the monitoring of their behaviour. The first hook appeared very fragile in the Clearview AI case, while the concept of monitoring allowed a more stable, though not necessarily uncontested, construction.

The same logic could be applied to a hypothetical hiQ Labs case brought before our supervisory authority or its counterparts. In these cases too, leaving aside the slim possibility of identifying offers addressed to data subjects in the Union, it is precisely the concept of monitoring that seems crucial.

Now, as is well known, the recitals of the Regulation play a fundamental role in interpreting, or rather in complementing, its provisions.

In particular, recital 24 emphasizes, for the concept of monitoring, the "potential subsequent use of personal data processing techniques which consist of profiling a natural person, particularly in order to take decisions concerning her or him or for analysing or predicting her or his personal preferences, behaviours and attitudes".

The wording in effect draws a distinction between activities that do not create profiles or predictions and those that do, and in principle it brings applications of artificial intelligence within the scope of art. 3.2 GDPR. But only in principle: verification in the concrete case is necessary, and general criteria cannot be formulated.


Web scraping techniques find their natural application in artificial intelligence settings and, more generally, wherever sophisticated analysis techniques are applied to the vast, objectively existing pool of publicly accessible information. This is the case, for example, with OSINT, an acronym for Open Source INTelligence. It is easy to predict that we will return to these issues with increasing interest.

Looking ahead, the need arises to define the fine line between protection of the person and lawful processing activities, in a complex context with an inherently global outlook.

A fundamental task then becomes building a shared framework of protection beyond the Union.

Conversely, the unilateral assertion of techniques that expand the territorial scope of our legislation runs into unavoidable limits, both in its conceptual construction and in the effectiveness of legal protection.

It is illusory to expect, with the forces of this part of the world alone, that Chinese, American and Indian companies drawing value from public information will comply with our rules and be accountable to us.


