WEBVTT

00:00.000 --> 00:10.000
Thank you so much for having me.

00:10.000 --> 00:15.000
I don't have that much shape.

00:15.000 --> 00:17.000
Hi, everyone.

00:17.000 --> 00:18.000
My name is Sulean.

00:18.000 --> 00:28.000
I am a software developer at the Naturalis Biodiversity Center in Leiden in the Netherlands.

00:28.000 --> 00:35.000
I'm really excited to talk to you about a project that we've been working on called DiSSCo, which is the Distributed System of Scientific Collections.

00:35.000 --> 00:38.000
Yeah, so let's get started.

00:38.000 --> 00:41.000
Here is what I'm going to talk about today.

00:41.000 --> 00:49.000
First, I'm going to talk a little bit about what a natural history collection is or natural science collection is and why they are important.

00:49.000 --> 00:52.000
And the obstacles that we might face with digitizing them.

00:53.000 --> 01:03.000
Then I'm going to start talking about DiSSCo, my project, and the kinds of problems that we can address to help this process and make data more accessible.

01:03.000 --> 01:10.000
Then I'm going to talk about enriching the data first by humans through annotations and then by machines.

01:10.000 --> 01:15.000
Lastly, I'm just going to give a little bit of what's next for DiSSCo.

01:15.000 --> 01:18.000
So what is a natural history collection?

01:18.000 --> 01:28.000
If you've ever been to a natural history museum or maybe a botanical garden, sometimes a zoo or an aquarium, with these kinds of institutions, often

01:28.000 --> 01:40.000
the public-facing part is only the tip of the iceberg, and these institutions host these really rich and valuable collections of plants and animals and fungi and geological collections.

01:40.000 --> 01:46.000
And it's this really rich historical record going back hundreds of years, from when they were collected

01:46.000 --> 02:04.000
up until the present. A specimen will have important information like what is it, so the taxonomy; where was it found, so locality data; who collected it and when it was collected, so you get that historical record.

02:04.000 --> 02:12.000
Researchers in biodiversity rely on these kinds of collections to identify organisms or specimens that they found.

02:12.000 --> 02:20.000
And so these kinds of collections, they allow them to determine what it is or if it's something entirely new.

02:20.000 --> 02:37.000
And so these natural history collections, they become this really rich reservoir, really important resource for researchers, not only helping us understand the past, but also inform policy decisions about really important issues like climate change and biodiversity loss.

02:37.000 --> 02:42.000
And just dipping our toes into why it's important that these are digitized:

02:42.000 --> 02:53.000
I've got two example articles here, one is about a research project where they looked at digitized moss from across institutions.

02:53.000 --> 03:03.000
And they used that data to confirm a hundred-year-old hypothesis about moss coloration.

03:03.000 --> 03:12.000
Yeah, and the second article is from the Museum für Naturkunde in Berlin, excuse my German.

03:12.000 --> 03:22.000
And they looked into their collection and they found 40 specimens, sorry, 40 species previously unknown to science.

03:22.000 --> 03:26.000
These were just lying around in a box somewhere.

03:26.000 --> 03:35.000
And so it's really important that not only are they accessible to researchers to study, but we also need to know what is there.

03:35.000 --> 03:38.000
And so let's talk about digitizing.

03:38.000 --> 03:46.000
Unfortunately, huge parts of the natural science collections across Europe and the world are not digitized.

03:47.000 --> 04:00.000
This means that researchers often have to go to the physical institutions, potentially several institutions, and that costs time, that costs money, that costs carbon, which is something that we want to be aware of in the biodiversity sector.

04:00.000 --> 04:08.000
And if things are digitized, often these institutions have been doing things differently for hundreds of years.

04:08.000 --> 04:14.000
So there's some data disparities, data quirks that each institution has.

04:14.000 --> 04:21.000
So that means data wrangling and harmonization, which is a waste of researchers' time.

04:21.000 --> 04:29.000
These digitization efforts, when they do happen, are very difficult to scale up and look at on a European scale.

04:30.000 --> 04:37.000
Expertise, for one, is limited. Like I said, one of the most important things about a specimen is what is it, what is the taxonomy.

04:37.000 --> 04:48.000
But we are facing a taxonomist shortage right now: we don't have enough experts to go around to identify all of what we have, and things are going unidentified, even if we are digitizing them.

04:48.000 --> 04:51.000
The second thing, of course, it's expensive.

04:51.000 --> 04:59.000
These large digitization projects are possible for the larger institutions, but for smaller collections it's a much bigger barrier.

04:59.000 --> 05:02.000
And finally, collaboration can be really difficult.

05:02.000 --> 05:08.000
Often, like I said, you have these data differences between institutions.

05:08.000 --> 05:15.000
That means any tools that you make that are designed to help this digitization process are siloed to one institution.

05:15.000 --> 05:20.000
And so yeah, we don't get to collaborate as much.

05:20.000 --> 05:24.000
So here is where I finally get to the thing that I am working on.

05:24.000 --> 05:33.000
DiSSCo stands for the Distributed System of Scientific Collections, and we are a European-funded, in-development research infrastructure.

05:33.000 --> 05:47.000
So what we aim to do is increase access to natural science collections across Europe, make them FAIR, and support digitization at scale.

05:47.000 --> 05:54.000
So I'm going to talk a little bit about where DiSSCo fits into this biodiversity data landscape.

05:54.000 --> 06:02.000
So I have my data providers, and that is the CMS, which I realize is not a well-known acronym, but it stands for collection management system.

06:02.000 --> 06:11.000
These are the digital systems that different institutions store their data in.

06:11.000 --> 06:14.000
And then on the right there, we have data consumers.

06:14.000 --> 06:22.000
So we have organizations that aggregate specimen data as well as other kinds of biodiversity data.

06:22.000 --> 06:31.000
And so what we envision is DiSSCo sitting in the middle, enriching the data before they get to the data consumers.

06:31.000 --> 06:41.000
So what do we do? First, we harmonize all incoming data to OpenDS, which is our in-house data specification based on existing data models.

06:41.000 --> 06:47.000
There's a couple of different big data models that you would see in the biodiversity sector.

06:47.000 --> 06:55.000
And we say no, we're going to do one. And of course, we can talk about that XKCD article or that XKCD comic.

06:55.000 --> 06:57.000
But we do our best.

06:57.000 --> 07:01.000
We assign every individual specimen a digital object identifier.

07:01.000 --> 07:07.000
That means that it is a unique identifier for a specimen that is resolvable.

07:07.000 --> 07:10.000
It'll always resolve back to DiSSCo.

07:13.000 --> 07:21.000
And that means that they can be cited individually as part of a research article about an expedition or something like that.

07:21.000 --> 07:27.000
We capture provenance. So what has changed in the source system data or beyond?

07:27.000 --> 07:34.000
We link data and extend it. So we link it to other institutions or sorry, other infrastructures.

07:34.000 --> 07:40.000
Because the specimen is interesting, but it's even more interesting if you can link it to the genomic sequences.

07:40.000 --> 07:42.000
Or the environmental data that it was found in.

07:42.000 --> 07:49.000
And so by linking all of these, you get this like extended digital object that is really rich.

07:49.000 --> 07:57.000
And finally, annotations, which are made by humans or machines, which is the main thing I'm excited to talk to you about.

07:57.000 --> 08:04.000
And so this is the core DiSSCo platform, and on top of it you can build services that are machine-facing.

08:04.000 --> 08:07.000
You can also build services that are human facing.

08:07.000 --> 08:09.000
That's sort of where the distributed comes from.

08:09.000 --> 08:15.000
We develop this core data infrastructure and then our partners are developing different services on top.

08:16.000 --> 08:23.000
So, annotations. Our model is based on the W3C model.

08:23.000 --> 08:31.000
And essentially, it's its own digital object, attached to or associated with the digital specimen or a media object.

08:31.000 --> 08:35.000
You can add information, delete information, assess, comment.

08:35.000 --> 08:43.000
You say all of that, and your opinion becomes a separate digital object, and then it will be ready for review.

08:44.000 --> 08:48.000
You can annotate the entire specimen or specific part like the taxonomy.

08:48.000 --> 08:55.000
If you have a media object, you can annotate a region of interest, such as indicating that this is the antenna.

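NOTE
Editorial sketch, not part of the talk: a rough Python illustration of the annotation shape described above. The top-level keys follow the W3C Web Annotation Data Model; the identifiers, the field path, and the helper name make_annotation are hypothetical placeholders and are not DiSSCo's actual schema.
```python
import json
def make_annotation(target_id, field_path, new_value, creator):
    # Build a dict shaped like a W3C Web Annotation (assumed layout, not DiSSCo's schema).
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "editing",
        "creator": creator,
        "body": {"type": "TextualBody", "value": new_value},
        # The selector narrows the target to one part of the object, e.g. the taxonomy.
        "target": {"source": target_id, "selector": {"type": "FragmentSelector", "value": field_path}},
    }
annotation = make_annotation("https://example.org/specimen/123", "taxonomy.genus", "Bombus", "https://example.org/people/jane")
print(json.dumps(annotation, indent=2))
```
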
08:55.000 --> 09:00.000
Yeah, and so the process, this is how we envision the process to be going.

09:00.000 --> 09:08.000
So anything that is an adding, a deleting or an editing annotation, even though that information is stored in its own digital object,

09:08.000 --> 09:11.000
eventually we want it to modify the target.

09:11.000 --> 09:16.000
So we don't want an annotation saying, hey, this field needs to be changed to this.

09:16.000 --> 09:20.000
We want to actually transform the object.

09:20.000 --> 09:24.000
So a user would select a specimen and annotate it.

09:24.000 --> 09:29.000
The annotation is discussed and approved by the collection manager or an expert.

09:29.000 --> 09:34.000
Then the specimen becomes updated with new information and that new information is published.

09:34.000 --> 09:41.000
Both back to the source system, the institution, and also to the data aggregators on the other side.

09:41.000 --> 09:43.000
Now, this looks great.

09:43.000 --> 09:48.000
The red box is in progress. This is our main goal for 2026.

09:48.000 --> 09:54.000
So in DiSSCo, we can capture the annotations from experts right now.

09:54.000 --> 10:03.000
So we have been partnering with researchers at Naturalis, and this is a really good example that really helped us streamline

10:03.000 --> 10:07.000
Our annotation process for users.

10:07.000 --> 10:14.000
So the Naturalis papillotes project was a digitization project at our institution of 300,000 butterflies stored in papillotes,

10:14.000 --> 10:18.000
which are the triangular envelopes that you saw.

10:18.000 --> 10:22.000
So we had papillons in papillotes.

10:22.000 --> 10:24.000
It was powered by volunteers.

10:24.000 --> 10:31.000
So these people, they extracted these very delicate, often centuries old butterflies with tweezers.

10:31.000 --> 10:34.000
They photographed them with the scale.

10:34.000 --> 10:38.000
They transcribed the date, the location, the collection ID,

10:38.000 --> 10:43.000
and any other information that was with the butterfly.

10:43.000 --> 10:45.000
But there's an important piece missing.

10:45.000 --> 10:46.000
It's the taxonomy.

10:46.000 --> 10:50.000
Now, if we had an army of taxonomists during our digitization,

10:50.000 --> 10:52.000
we would be a very lucky institution.

10:52.000 --> 10:57.000
But instead, unfortunately, our collection managers

10:57.000 --> 11:01.000
were sending spreadsheets back and forth by email,

11:01.000 --> 11:06.000
saying, please fill out the taxonomy for this specimen,

11:06.000 --> 11:09.000
which obviously is so prone to human error.

11:09.000 --> 11:14.000
You want to make sure that you're identifying the right specimen,

11:14.000 --> 11:16.000
and what if you make a typo?

11:16.000 --> 11:21.000
and then people would take those spreadsheets and then manually insert them

11:21.000 --> 11:23.000
back into our collection management system.

11:23.000 --> 11:28.000
So this process has much to be improved.

11:28.000 --> 11:34.000
So working with our target users, we identified a couple of ways

11:34.000 --> 11:36.000
that DiSSCo can help.

11:36.000 --> 11:39.000
The first was reducing human error with taxonomy.

11:39.000 --> 11:43.000
So it would be really helpful if we could automatically fill out taxonomy

11:43.000 --> 11:48.000
instead of relying on people not only getting the species name right,

11:48.000 --> 11:50.000
but also the family, the genus, the order, the class,

11:50.000 --> 11:53.000
all of the higher levels of taxonomy.

11:53.000 --> 11:55.000
We also want to capture transparency.

11:55.000 --> 11:57.000
So in the future when this annotation is accepted,

11:57.000 --> 12:01.000
we still want to know who made the annotation and who accepted it,

12:01.000 --> 12:03.000
why is this specimen like this now?

12:03.000 --> 12:05.000
And we want an unambiguous process.

12:05.000 --> 12:10.000
So we want to really be sure of what specimen you're actually identifying

12:10.000 --> 12:15.000
so the annotation is attached to the specimen.

12:15.000 --> 12:18.000
Okay, so here I have, I should have a video,

12:18.000 --> 12:21.000
but I thought, you know, it's FOSDEM, let's do a live demo.

12:21.000 --> 12:23.000
Let's live a little bit.

12:23.000 --> 12:24.000
So here we are.

12:24.000 --> 12:27.000
Hang on, I'm going to go back.

12:27.000 --> 12:31.000
Of course, I tried this like 10 minutes ago,

12:31.000 --> 12:34.000
but of course, it's the Wi-Fi.

12:34.000 --> 12:36.000
And again.

12:40.000 --> 12:43.000
We're living too much.

12:48.000 --> 12:52.000
Okay.

12:52.000 --> 12:55.000
Well, we're going to go back to, okay.

12:55.000 --> 12:57.000
I'm sorry, too.

12:57.000 --> 12:58.000
Yeah.

13:06.000 --> 13:08.000
I think it might be a Wi-Fi issue.

13:08.000 --> 13:09.000
Change your Wi-Fi.

13:09.000 --> 13:10.000
Yeah.

13:19.000 --> 13:23.000
Okay, so this is the homepage of our sandbox environment,

13:23.000 --> 13:27.000
for DiSSCover, which is the human interface for DiSSCo.

13:27.000 --> 13:31.000
So I can choose from a couple of different domains.

13:31.000 --> 13:34.000
I'm just going to click search and this will show us,

13:34.000 --> 13:36.000
if you want to go back to the website.

13:36.000 --> 13:38.000
Okay.

13:47.000 --> 13:50.000
It'll show us a couple of top specimens.

13:50.000 --> 13:53.000
And so what I'm interested in, if I'm a taxonomist,

13:53.000 --> 13:55.000
and a collection manager has asked me,

13:55.000 --> 14:00.000
hey, can you please identify these butterflies from this data set?

14:00.000 --> 14:03.000
I can go into source system,

14:03.000 --> 14:05.000
and I click on Naturalis Biodiversity Center,

14:05.000 --> 14:07.000
Lepidoptera, the butterflies.

14:07.000 --> 14:10.000
That is the name of the data set we want to identify.

14:10.000 --> 14:14.000
And maybe because we're only interested in the,

14:15.000 --> 14:18.000
oh well.

14:18.000 --> 14:21.000
We're only interested in specimens that are missing a taxonomy.

14:21.000 --> 14:24.000
So let's go no genus.

14:24.000 --> 14:29.000
And let's click has media because the pictures are pretty.

14:29.000 --> 14:35.000
Okay.

14:35.000 --> 14:38.000
So this is the main specimen landing page.

14:38.000 --> 14:42.000
You have basic specimen information here.

14:42.000 --> 14:45.000
Sometimes if there are coordinates in the data,

14:45.000 --> 14:48.000
you have a geographical map that is rendered.

14:48.000 --> 14:50.000
It has information about the specimen host.

14:50.000 --> 14:53.000
This is all data that we've transformed from the source system.

14:53.000 --> 14:55.000
So we don't generally add anything new.

14:55.000 --> 14:57.000
You see, I am logged in here.

14:57.000 --> 15:01.000
I need to be logged in with an ORCID iD to make an annotation.

15:01.000 --> 15:03.000
And then I click annotate on,

15:03.000 --> 15:04.000
accepted identification.

15:04.000 --> 15:06.000
This is the taxonomy block.

15:06.000 --> 15:10.000
So we're interested in adding some more information.

15:10.000 --> 15:14.000
Now unfortunately, I am not a butterfly expert.

15:14.000 --> 15:17.000
And I'm trying to think of like a butterfly scientific name.

15:17.000 --> 15:19.000
The only thing I know is like Bombus,

15:19.000 --> 15:20.000
which is a bumblebee.

15:20.000 --> 15:24.000
So don't accept this annotation.

15:24.000 --> 15:27.000
But let's say I am an expert.

15:27.000 --> 15:30.000
And I'm identifying this as a bumblebee.

15:30.000 --> 15:32.000
And so what you can see here.

15:32.000 --> 15:34.000
Oh, this is actually, this is Aves.

15:34.000 --> 15:36.000
This is a bird actually.

15:36.000 --> 15:37.000
I'm not very good at my job.

15:37.000 --> 15:40.000
I'm a bad taxonomist.

15:40.000 --> 15:44.000
So, but what you've seen there is that it's automatically filled out

15:44.000 --> 15:45.000
the genus and the family.

15:45.000 --> 15:49.000
And then also the class and the phylum as well.

15:49.000 --> 15:52.000
So it's filled out all of the higher levels of taxonomy.

15:52.000 --> 15:55.000
All I've had to do is fill out the scientific name.

15:55.000 --> 15:57.000
I click review on the annotation.

15:57.000 --> 16:00.000
And I can see all of the different things that have changed.

16:00.000 --> 16:06.000
So these are the differences between what the existing data is

16:06.000 --> 16:08.000
and what I am saying it is.

16:08.000 --> 16:10.000
And I submit the annotation.

16:10.000 --> 16:13.000
And then I have something that is marked pending.

16:13.000 --> 16:16.000
And eventually, a collection manager should come and see this.

16:16.000 --> 16:17.000
And say, that's crazy.

16:17.000 --> 16:19.000
That's not a bird, reject.

16:19.000 --> 16:23.000
So that is my demo.

16:23.000 --> 16:26.000
Okay.

16:26.000 --> 16:29.000
So it's great that we are doing.

16:29.000 --> 16:32.000
It's great that we're doing annotations with humans.

16:32.000 --> 16:35.000
What if we could get machines to do the work as well?

16:35.000 --> 16:41.000
We call these services that make annotations machine annotation services,

16:41.000 --> 16:42.000
or MAS.

16:42.000 --> 16:46.000
So if I start saying MAS in this presentation, you know what I mean now.

16:46.000 --> 16:51.000
So it's great to use machines to do the work because they get to do the boring stuff.

16:51.000 --> 16:57.000
We save our precious, precious taxonomists for the complicated edge cases.

16:57.000 --> 17:01.000
Like I said, DiSSCo has one single data model.

17:01.000 --> 17:11.000
That means a service adapted for DiSSCo can be applied to any institution that we partner with.

17:11.000 --> 17:13.000
It lets us reuse work.

17:13.000 --> 17:16.000
So researchers aren't constantly developing their own tools. Instead,

17:16.000 --> 17:21.000
they can select from a wide, hopefully wide, platform of services.

17:21.000 --> 17:24.000
And we have a modular design.

17:24.000 --> 17:29.000
So DiSSCo was designed with the idea that other services should plug into the platform.

17:29.000 --> 17:34.000
So that means that anyone can adapt an existing service to DiSSCo.

17:34.000 --> 17:37.000
Which I can show you very briefly here.

17:37.000 --> 17:42.000
What is in blue here in the box is the core DiSSCo architecture.

17:42.000 --> 17:46.000
So when a user schedules a MAS, a machine annotation service,

17:46.000 --> 17:49.000
that spins up what we call a wrapper service.

17:49.000 --> 17:53.000
And all this does is it takes the job and then it extracts important information.

17:53.000 --> 17:56.000
And it sends it to a value service.

17:56.000 --> 17:57.000
And this could be anything.

17:57.000 --> 18:02.000
This could be a complicated and sophisticated AI model that identifies species or reads labels

18:02.000 --> 18:04.000
or it could be something as simple as a spell checker.

18:04.000 --> 18:07.000
That spelling is important.

18:07.000 --> 18:11.000
But so it receives that request.

18:11.000 --> 18:13.000
It spits out a result.

18:13.000 --> 18:19.000
And then this wrapper service takes those results and formats them as a DiSSCo annotation.

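NOTE
Editorial sketch, not part of the talk: the wrapper flow described above (take the job, extract what the service needs, call the service, format the result as an annotation) could look roughly like this in Python. All names here (run_wrapper, spell_check_service, the job layout) are hypothetical; the toy spell checker stands in for any external service.
```python
def spell_check_service(text):
    # Stand-in for the external service; a real one could be an AI model behind an API.
    known_fixes = {"Lydon": "Leiden"}
    return known_fixes.get(text, text)
def run_wrapper(job, service):
    # 1. Extract only the information the service needs from the scheduled job.
    value = job["specimen"][job["field"]]
    # 2. Send it to the service and collect the result.
    result = service(value)
    # 3. Format the result as an annotation that targets the original field.
    return {"target": job["specimen"]["id"], "field": job["field"], "old": value, "new": result, "status": "pending"}
job = {"specimen": {"id": "specimen-1", "locality": "Lydon"}, "field": "locality"}
annotation = run_wrapper(job, spell_check_service)
print(annotation)
```
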
18:20.000 --> 18:23.000
So we designed this system.

18:23.000 --> 18:26.000
But we had only ever developed MASs ourselves.

18:26.000 --> 18:28.000
So we partnered with Senckenberg to see,

18:28.000 --> 18:30.000
hey, is this documentation good enough?

18:30.000 --> 18:32.000
Can you do this as well?

18:32.000 --> 18:35.000
And so in December 2024,

18:35.000 --> 18:40.000
they were able to integrate a service that they developed, an AI model

18:40.000 --> 18:45.000
that captured pixel-level masks on herbarium sheets.

18:45.000 --> 18:50.000
And we integrated that with DiSSCo in our acceptance environment.

18:50.000 --> 18:53.000
And then we were feeling really confident.

18:53.000 --> 18:57.000
And we had a hackathon in 2025 with three days,

18:57.000 --> 19:00.000
three teams, and three machine annotation services

19:00.000 --> 19:04.000
were able to be integrated in DiSSCo at the time.

19:04.000 --> 19:07.000
What we were really testing was the integration process.

19:07.000 --> 19:10.000
So how was the documentation, what bumps did people hit?

19:11.000 --> 19:16.000
These MASs are hackathon quality.

19:16.000 --> 19:20.000
But we were really happy with the results that people were able to so quickly

19:20.000 --> 19:22.000
integrate into our system.

19:22.000 --> 19:27.000
Finally, the most recent hackathon we had was in April 2025.

19:27.000 --> 19:33.000
This was a project based on the juubov capacity building project of AI

19:33.000 --> 19:38.000
for specimen labels, which uses AI to read specimen labels.

19:38.000 --> 19:41.000
And now, if we can get this into our pipeline,

19:41.000 --> 19:44.000
it really helps, it can really help this digitization process

19:44.000 --> 19:49.000
because all you need is a photo and an AI model reading the labels

19:49.000 --> 19:51.000
and creating these annotations.

19:51.000 --> 19:55.000
And then it's up to a human to correct any mistakes that it might make

19:55.000 --> 19:58.000
and accept or decline those annotations.

19:58.000 --> 20:03.000
I did record this one because the API is a little slow.

20:03.000 --> 20:11.000
But here I have a media object that I'm looking at.

20:11.000 --> 20:14.000
There's a label that we want to read.

20:14.000 --> 20:17.000
And so I go to the top button there.

20:17.000 --> 20:20.000
That's a pity.

20:20.000 --> 20:25.000
It's a little cut off, but I am selecting the machine annotation

20:25.000 --> 20:27.000
service I am interested in running.

20:27.000 --> 20:31.000
It's called Splat, the specimen automated label transcription.

20:31.000 --> 20:35.000
And then I run it.

20:35.000 --> 20:38.000
And then this wrapper service gets spun up in the DiSSCo architecture.

20:38.000 --> 20:42.000
And then that wrapper service calls the Splat API.

20:42.000 --> 20:44.000
And then the Splat API returns a result.

20:44.000 --> 20:47.000
And the wrapper service gives us an annotation.

20:47.000 --> 20:52.000
So if I go back to the specimen and I click on annotate here,

20:52.000 --> 20:56.000
I can see that it has created several annotations

20:56.000 --> 20:59.000
about what is read from the label.

20:59.000 --> 21:02.000
And with this one, results may vary.

21:02.000 --> 21:04.000
But it was a really good proof of concept.

21:04.000 --> 21:09.000
And we want to partner further with other such services.

21:09.000 --> 21:14.000
So lastly, I just want to talk about what is next for DiSSCo.

21:14.000 --> 21:17.000
We want to work on accepting annotations.

21:17.000 --> 21:20.000
That's a huge part of what we are.

21:20.000 --> 21:22.000
It's a huge part of our value offering.

21:22.000 --> 21:25.000
We want to be able to export the annotations and publish them.

21:25.000 --> 21:28.000
We want to continue our collaboration with researchers.

21:28.000 --> 21:34.000
And we want to become an ERIC, which is a European Research Infrastructure

21:34.000 --> 21:38.000
Consortium, which essentially means that we are a legal entity.

21:38.000 --> 21:41.000
And then we can make SLAs with other MAS providers.

21:41.000 --> 21:46.000
I want to thank the people on the slide, the development team, the collaborators.

21:46.000 --> 21:52.000
Anyone at Naturalis that I've ever gotten a coffee with, my colleagues for coming here at 9:30 in the morning.

21:52.000 --> 21:55.000
And if you are interested in what we do, we would give,

21:55.000 --> 21:58.000
we have example MASs and documentation.

21:58.000 --> 22:01.000
For more information, the QR code just goes to the slides.

22:01.000 --> 22:03.000
So you can click on the links.

22:03.000 --> 22:05.000
Thank you so much for your attention.

22:05.000 --> 22:06.000
Yeah, that's it.

22:07.000 --> 22:12.000
Thank you for your work.

22:12.000 --> 22:16.000
You can take them as you just repeat this.

22:16.000 --> 22:17.000
Yes.

22:17.000 --> 22:19.000
This is really cool.

22:19.000 --> 22:20.000
I'm taking a PhD at all.

22:20.000 --> 22:24.000
Thank you for your work.

22:24.000 --> 22:29.000
I ask a question that's like most general, which is annotations are often wrong.

22:29.000 --> 22:31.000
Even when people do the work.

22:31.000 --> 22:37.000
Do you have a way of saying that an annotation could be improved?

22:37.000 --> 22:39.000
That's okay.

22:39.000 --> 22:48.000
So the question, thank you, was: are we developing a service where people can reply to annotations, essentially, and say,

22:48.000 --> 22:52.000
Oh, this could be improved?

22:52.000 --> 22:55.000
In some regards, yes, you can.

22:55.000 --> 22:58.000
So any object in our service can be annotated.

22:58.000 --> 23:02.000
So you can annotate an annotation as well.

23:02.000 --> 23:07.000
It's also annotations aren't.

23:07.000 --> 23:14.000
Who can accept an annotation is still something that everybody wants to know. We're working on a trust model: who has control over the data,

23:14.000 --> 23:15.000
who owns the data.

23:15.000 --> 23:18.000
But that is something that we're keeping in mind.

23:18.000 --> 23:21.000
Also for machine annotation services.

23:21.000 --> 23:23.000
Oh, this machine made a slight mistake.

23:23.000 --> 23:29.000
Let's correct it instead of doing it ourselves.

23:29.000 --> 23:30.000
Hi.

23:30.000 --> 23:34.000
This is a question from the labs.

23:34.000 --> 23:37.000
So the taxonomy changes, right?

23:37.000 --> 23:38.000
New papers come out.

23:38.000 --> 23:41.000
They redefine the phylogeny, that kind of thing.

23:41.000 --> 23:44.000
Does that change get echoed in your architecture?

23:44.000 --> 23:46.000
I'm so glad you asked that question.

23:46.000 --> 23:49.000
The question was, taxonomy is always changing.

23:49.000 --> 23:51.000
How do we capture these changes in the architecture?

23:51.000 --> 23:56.000
So we use Catalogue of Life as a backbone for that.

23:56.000 --> 24:04.000
For those that don't know, it is a service that captures taxonomy.

24:04.000 --> 24:09.000
And we harmonize everyone to the Catalogue of Life's taxonomy.

24:09.000 --> 24:12.000
So other institutions might have slight differences in opinions,

24:12.000 --> 24:15.000
but we like to have one standard.

24:15.000 --> 24:18.000
And so as Catalogue of Life puts out

24:18.000 --> 24:20.000
new releases,

24:20.000 --> 24:23.000
We want to stay up to date with that.

24:23.000 --> 24:25.000
I'm saying there's a taxonomist shortage,

24:25.000 --> 24:27.000
but the last two people have been taxonomists.

24:27.000 --> 24:29.000
Where have you been?

24:33.000 --> 24:34.000
When you show the demo,

24:34.000 --> 24:36.000
there was something in the filter.

24:36.000 --> 24:38.000
It mentioned something like MIDS,

24:38.000 --> 24:39.000
a level or something?

24:39.000 --> 24:40.000
Oh, what is that?

24:40.000 --> 24:43.000
So, in my demo,

24:43.000 --> 24:46.000
I showed something called MIDS level in the filters.

24:46.000 --> 24:47.000
What is MIDS?

24:47.000 --> 24:51.000
MIDS is an in-development standard.

24:51.000 --> 24:52.000
Yeah, it's still in development.

24:52.000 --> 24:56.000
It stands for Minimum Information about a Digital Specimen.

24:56.000 --> 24:57.000
MIDS.

24:57.000 --> 25:01.000
And essentially that describes the completeness of a specimen.

25:01.000 --> 25:05.000
So if it has an institution ID and a catalog number,

25:05.000 --> 25:06.000
it's MIDS zero.

25:06.000 --> 25:08.000
That's the minimum that we'll accept.

25:08.000 --> 25:11.000
If it has a little bit more information about locality,

25:11.000 --> 25:14.000
a little more rich data, it can get to level one.

25:14.000 --> 25:15.000
And then if it's really complete,

25:15.000 --> 25:16.000
it'll get to level two.

25:16.000 --> 25:18.000
And so the standard describes,

25:18.000 --> 25:22.000
it's essentially a completeness indicator for a specimen.

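NOTE
Editorial sketch, not part of the talk: the completeness levels described above can be illustrated as a cumulative check, where a specimen reaches a level only if it also satisfies every lower one. The field names and the contents of levels 1 and 2 are assumptions for illustration only; the MIDS standard itself defines the real requirements.
```python
LEVEL_0 = ("institution_id", "catalog_number")
LEVEL_1 = ("locality",)
LEVEL_2 = ("scientific_name", "collector", "collection_date")
def completeness_level(specimen):
    # Return -1 if even the minimum fields are missing, else the highest level reached.
    level = -1
    for required in (LEVEL_0, LEVEL_1, LEVEL_2):
        if not all(specimen.get(field) for field in required):
            break
        level += 1
    return level
specimen = {"institution_id": "NL-1", "catalog_number": "RMNH.123", "locality": "Leiden"}
print(completeness_level(specimen))
```
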
25:22.000 --> 25:24.000
So through the annotations,

25:24.000 --> 25:28.000
you can bump the MIDS level from one to the next?

25:28.000 --> 25:30.000
So the question was,

25:30.000 --> 25:33.000
can annotations increase MIDS?

25:33.000 --> 25:37.000
Yeah, if you're annotating the targeted fields.

25:37.000 --> 25:42.000
Yeah, no, that's the idea behind DiSSCo: getting more enriched data.

25:42.000 --> 25:45.000
So increasing the completeness of the data.

25:45.000 --> 25:52.000
Yeah, so you mentioned the fact that you're getting data out of institutions,

25:52.000 --> 25:53.000
you just think collections,

25:53.000 --> 25:59.000
and that you plan to get those data or annotations back to the original system.

25:59.000 --> 26:01.000
So my question is large.

26:01.000 --> 26:04.000
You can take whatever detail on that,

26:04.000 --> 26:06.000
like how institutions,

26:06.000 --> 26:10.000
organizations are considering your work on the DiSSCo platform.

26:10.000 --> 26:12.000
If there is tension there,

26:12.000 --> 26:15.000
between the original system and yours.

26:15.000 --> 26:17.000
So the question is,

26:17.000 --> 26:21.000
is there tension between the source data system,

26:21.000 --> 26:24.000
so these institutions and what we're doing to the data?

26:24.000 --> 26:26.000
I would say no,

26:26.000 --> 26:29.000
we're trying to offer them a service and not

26:30.000 --> 26:32.000
replace them necessarily.

26:32.000 --> 26:36.000
The idea is that we don't want 300 different institutions to make changes.

26:36.000 --> 26:39.000
We want to do the changes for them once.

26:39.000 --> 26:41.000
I wouldn't say tension,

26:41.000 --> 26:44.000
but the question that we still do need to answer is,

26:44.000 --> 26:46.000
who can accept annotations and who can,

26:46.000 --> 26:49.000
yeah, who can make these changes.

26:49.000 --> 26:54.000
And this is something that we're working with the community to discover,

26:54.000 --> 26:58.000
to define a solution that really makes the most sense for the biodiversity research.

26:58.000 --> 27:01.000
Community.

27:01.000 --> 27:02.000
Yes?

27:02.000 --> 27:03.000
Yes?

27:03.000 --> 27:05.000
Do you do anything with Wikidata,

27:05.000 --> 27:07.000
or would you disambiguate,

27:07.000 --> 27:09.000
like, persons or...

27:09.000 --> 27:12.000
The question is, are we doing any work with Wikimedia

27:12.000 --> 27:15.000
or Wikidata to disambiguate persons?

27:15.000 --> 27:17.000
Yes.

27:17.000 --> 27:19.000
So there are projects that are happening,

27:19.000 --> 27:20.000
I mean, all over,

27:20.000 --> 27:23.000
but specifically at our institution as well.

27:23.000 --> 27:27.000
And we would like to use their work and just sort of steal it,

27:27.000 --> 27:29.000
adapt it to DiSSCo,

27:29.000 --> 27:31.000
and then make it accessible to everyone.

27:31.000 --> 27:32.000
Because it is really important,

27:32.000 --> 27:36.000
that not just the main collector gets credit,

27:36.000 --> 27:39.000
but also all of the people who worked with them.

27:39.000 --> 27:42.000
Very quick question.

27:42.000 --> 27:43.000
You mentioned,

27:43.000 --> 27:44.000
Hackathon?

27:44.000 --> 27:45.000
Yes.

27:45.000 --> 27:47.000
Is it something regular?

27:47.000 --> 27:49.000
Is it closed?

27:49.000 --> 27:50.000
Is it open?

27:50.000 --> 27:51.000
Hackathons?

27:51.000 --> 27:52.000
The question was,

27:52.000 --> 27:53.000
Hackathons?

27:53.000 --> 27:55.000
are they regular,

27:56.000 --> 27:57.000
are they closed or are they open?

27:57.000 --> 27:58.000
I would say they are open.

27:58.000 --> 27:59.000
They are not regular.

27:59.000 --> 28:01.000
We are, oh, this is recorded.

28:01.000 --> 28:04.000
But we are hoping to have one this year,

28:04.000 --> 28:07.000
but nothing has been set in stone.

28:07.000 --> 28:10.000
But, yeah, they are, yeah.

28:10.000 --> 28:12.000
Thank you.

28:12.000 --> 28:13.000
Thank you.
