WEBVTT

00:00.000 --> 00:25.440
We're switching the topic a bit now. We're back in 2020. It's on OI here. At least not

00:26.360 --> 00:50.080
we're back in 2023 and we're talking about the Digital Services Act, which is, well, new legislation,

00:50.080 --> 00:57.920
but that kicked in at the end of 2023. We are specifically focusing on a package that

00:57.920 --> 01:07.040
me and Luca, who is over there, developed to analyze the transparency database of the Digital

01:07.040 --> 01:15.760
Services Act. A quick overview: the Digital Services Act is a legal framework that basically promotes

01:15.760 --> 01:22.640
transparency online for platforms, and especially for very large platforms. It has

01:22.640 --> 01:30.320
many layers of transparency and, let's say, rights that have been given to the users of online

01:30.320 --> 01:36.720
platforms. For example, there is now the obligation to have transparent terms and

01:36.720 --> 01:42.720
conditions, which are clear to read, and to explain and risk-assess the algorithms that

01:42.720 --> 01:50.720
the platform uses, such as the recommender system. Platforms have to explain the content

01:50.720 --> 01:57.760
moderation policies they apply to user content. Then there are additional consumer protection

01:57.760 --> 02:04.000
rights. For example, there cannot be targeted advertising for minors. There is a mechanism now,

02:04.000 --> 02:09.920
which is mandatory for platforms for users to report illegal or incompatible content online.

02:10.480 --> 02:15.440
And then there are new transparency and data access provisions, which we are going to focus on.

02:16.240 --> 02:23.120
For example, now, every time you go to an online shop, there is the obligation to share the

02:23.120 --> 02:29.360
seller information, which was not mandatory before, quite interestingly, and you can also have

02:29.360 --> 02:38.480
the details of the advertisement that is proposed to you by the platform. There are many transparency

02:38.480 --> 02:44.640
provisions. I will just quickly show them to you. There are the transparency reports, which are

02:44.640 --> 02:50.880
biannual reports about the content moderation activities of the platforms. The transparency

02:50.880 --> 02:56.880
database, which I will present later in detail. Then, as we said, the terms and conditions that we are

02:56.880 --> 03:02.240
tracking in collaboration with the Open Terms Archive team. So we are tracking the changes that

03:02.240 --> 03:07.440
platforms apply to them, how they treat user data, how they present content to the user,

03:08.400 --> 03:14.480
and then there are advertisement libraries, so every platform, every large platform,

03:14.480 --> 03:19.840
has to present to the user a repository where all the information about all the advertisements

03:19.840 --> 03:25.440
that have been run on the platform can be freely seen by everyone at any time. And then there are

03:25.440 --> 03:32.480
others, which are very technical. I won't go into details, just know that for researchers, there is

03:32.480 --> 03:40.000
a new data access provision, meaning that vetted researchers can get access to closed data or

03:40.000 --> 03:47.600
private data from the companies under very strict conditions, but it is an unprecedented measure

03:47.600 --> 03:58.320
to scrutinize the activities of the platforms. What do the provisions do? For example,

03:58.320 --> 04:04.240
the transparency reports gave for the first time an overview of the content moderation

04:05.520 --> 04:11.840
human resources that platforms are allocating to moderate the content online, so we have a

04:11.840 --> 04:17.920
breakdown by language. We have the accuracy of the content moderation activities, etc. And so far,

04:17.920 --> 04:26.560
we have had three rounds of these reports, and one is due next spring. And then we focus on the transparency

04:26.640 --> 04:34.240
database. The transparency database collects the anonymized version of all the content moderation

04:34.240 --> 04:40.320
decisions that platforms take against user content. So the data life cycle is: the user

04:40.320 --> 04:45.200
of course creates content on the platform, and then you have the platform that either by

04:46.160 --> 04:54.000
proactive decision or by notice from a user, takes down or moderates the content, and then the

04:54.000 --> 05:04.320
platform, under the DSA, is obliged to notify the user of the causes and the reasons

05:04.320 --> 05:11.920
that caused the content to be moderated. And then it sends it to the user and an anonymized version

05:11.920 --> 05:18.000
to the transparency database. And what it looks like, it's like a very big JSON with all the

05:18.000 --> 05:24.320
information about the statement of reasons for the content moderation, specifically defining the category,

05:24.320 --> 05:31.840
the automation that was used in the process, the free text outlining, for example, the legal

05:31.840 --> 05:39.200
grounds, etc. And then once it's in the database, you currently have three ways to look at this data.

05:39.200 --> 05:44.880
There is a website search that we offer, which is kind of real time, but it's very limited in

05:44.880 --> 05:50.960
scope and only covers the last six months of data. And then there is an online dashboard

05:50.960 --> 05:57.360
that gives us an aggregate view of the data, but it's very limited in functionalities if you

05:57.360 --> 06:02.720
want to go deeper in the analysis. And then there are these daily dumps, which are basically the

06:02.720 --> 06:10.880
daily volume of statements received by the database in a CSV dump, which are very big and

06:10.880 --> 06:17.840
require a lot of pre- and post-processing. So our package basically focuses on this part of the

06:20.240 --> 06:29.040
pipeline and tries to optimize and streamline this kind of analysis.

06:29.040 --> 06:35.760
So the content of the database is quite big, and our package cannot work

06:36.400 --> 06:44.320
miracles. For example, now we are at about 25 billion statements in the database.

06:44.320 --> 06:51.840
And as you can see, the biggest share of it is by one specific player. So even if you want to

06:51.840 --> 06:57.520
analyze them, you still need the package, you still need a very big machine, let's say,

06:59.120 --> 07:05.040
with a lot of throughput, if you want to analyze the daily dumps. And even the aggregated view of it

07:05.440 --> 07:13.760
by the categorical variables, it's about two gigabytes in the end.

07:13.760 --> 07:19.280
So if we just remove the biggest player, you see that we have a breakdown of the content, which is

07:19.280 --> 07:29.760
like more heterogeneous. And so that said, that was just to say that this database is quite big,

07:30.080 --> 07:36.480
if you also account, for example, for the free text data that are in it. So, coming back to our

07:36.480 --> 07:42.720
package, it's a package that you can install in different ways. We provide different

07:43.760 --> 07:49.760
avenues. There is a Python package that you can directly install. We also provide out of the box,

07:50.480 --> 07:56.240
well, a Docker container image that exposes different ways to interact with it. I will show

07:56.240 --> 08:02.960
it in the last one. One of these is the dashboarding capabilities. And we also offer, of course,

08:02.960 --> 08:08.800
interactive online documentation. As said, there are three ways to interact with the package,

08:08.800 --> 08:16.400
if you, for example, run the container images. There is an API interface, which offers a standardized

08:16.400 --> 08:26.480
FastAPI interactive interface to try out the different queries that you can perform,

08:26.480 --> 08:30.960
which are basically the download, the filtering and the aggregation of the data that are

08:30.960 --> 08:37.120
found in the database. The same functionalities are mirrored by a command-line interface,

08:37.120 --> 08:44.720
which is easily configured with a configuration file. And then you have an interactive way.

08:45.600 --> 08:50.240
Just to say that we will be in the workshop later. So if you want more details or you want a

08:50.240 --> 08:58.800
small demo, you can stay and we will be happy to provide one. So coming back to the third

08:58.800 --> 09:05.360
way to interact, there is also dashboarding built on Superset, the Apache dashboarding system,

09:06.240 --> 09:14.560
framework. And we just show some of the possible solutions, like breakdowns, that you might

09:14.640 --> 09:24.880
be interested in. For example, I'm sorry, but the default font of Superset is quite small,

09:24.880 --> 09:31.120
I have to say, but you can have, for example, a breakdown, very easily, across the platforms

09:31.120 --> 09:37.680
that moderate the content. For example, you have TikTok, Amazon, Pinterest, Facebook, etc.

09:37.680 --> 09:44.080
And the category of the content that they were moderating. So for example, for most of the platforms,

09:44.160 --> 09:50.640
this is just "scope of platform service", which is kind of a part two for them. And then you have

09:50.640 --> 10:00.400
other categories. You can also have breakdowns as a time series, or compare the manual

10:00.400 --> 10:06.240
or automated content moderation from different platforms. So you can see like daily patterns

10:06.240 --> 10:12.480
and where platforms are using automated content moderation or not.

10:12.560 --> 10:17.680
There are other breakdowns that I can show you later in the workshop, just to mention that

10:17.680 --> 10:25.120
there is a flourishing community around this in research, in the academic research

10:25.120 --> 10:33.840
community. And there will be an update of the database, late in July 2025, which mainly

10:33.840 --> 10:40.240
will introduce content identifiers for illegal products that are moderated. These are our

10:40.240 --> 10:45.280
coordinates, if you are interested, and as I said, stay around for the workshop and the

10:45.280 --> 10:48.480
panel if you want more information about this. Thank you very much.

10:55.040 --> 10:57.120
I don't think there is time for questions.

11:10.640 --> 11:16.640
Yes. I forgot to mention that I'm Enrico from the European Commission. I

11:16.640 --> 11:20.800
work at DG Connect in the DSA enforcement team as a data scientist.

