Metadata on LaRS: designing metadata files for event datasets
ShareTIGR
10 October 2024
Last week we examined the in-built metadata scheme of the LaRS repository and came to the conclusion that it was advisable to add separate metadata files to our datasets in order to complete the set of properties provided by the scheme. More specifically, it appeared that such an additional file would be helpful in event datasets, i.e. in those datasets that group video recordings, audio recordings and transcripts of one particular recorded interaction. It could be used not only to specify additional file properties (for example the type of de-identification measure taken in an audio file), but also to describe event properties inherited by all files of the dataset (e.g. the place and setting of the recording or a participant list) and to relate the files to questionnaire data about individual speakers. But how exactly should such a metadata file be structured and in which format should it be provided?
Looking for answers, we first explored the central registry for the Component Metadata Infrastructure (CMDI) of CLARIN. It contains metadata profiles composed of components, which are sets of unique or repeatable elements, which in turn can have attributes. These various parts form a hierarchically structured set of XML tags. An XML editor (CLARIN mentions Oxygen) can be used to fill in the desired text between each pair of tags, so as to create a metadata record that instantiates the profile. The available metadata schemes have been created by authors who manage their own or others’ linguistic resources, typically in a digital infrastructure that has the status of a ‘CLARIN center’. They are suitable for various types of data, depending on the authors’ needs. Authors can register both components and entire profiles, either by designing them from scratch or by reusing and modifying existing components and profiles. A dedicated CLARIN guide (CMDI Best Practices) gives design recommendations, one of them being not to refer to specific corpora and projects, in order to make components more generally applicable and to favor their reuse. It is nevertheless possible to arrange components and profiles in a group and to identify a specific corpus or project via the name of the group.
Each CMDI element or component can be explained by choosing an intuitive name, by inserting a description into the dedicated ‘Documentation’ field, and by inserting a hyperlink to a CLARIN ‘Concept’. CLARIN concepts are stored in the CLARIN Concept Registry (CCR), which is a terminological dictionary (or ontology). They cover various domains of linguistics; those useful for the design of CMDI profiles are grouped under the heading ‘Metadata’.
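The hierarchy of components, elements and attributes described above can be pictured with a small sketch. It uses Python's standard library to build a nested XML record. All tag names, attribute names and values are hypothetical placeholders chosen for illustration; they are not taken from any registered CMDI profile, and the result is not a valid CMDI record.

```python
import xml.etree.ElementTree as ET

# NOTE: every tag, attribute and value below is a hypothetical placeholder,
# not part of a real CMDI profile.
record = ET.Element("EventRecord")  # stands in for the root of a profile

# A component that groups related elements.
setting = ET.SubElement(record, "Setting")
ET.SubElement(setting, "Place").text = "placeholder place"  # a unique element

# A component containing a repeatable element that carries an attribute.
participants = ET.SubElement(record, "Participants")
for code, role in [("P1", "role-a"), ("P2", "role-b")]:
    participant = ET.SubElement(participants, "Participant", {"role": role})
    participant.text = code

# Serialise the tree to an XML string.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Filling in a profile in an XML editor amounts to the same thing done by hand: the profile fixes which tags may appear and how they nest, and the record supplies the text inside them.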
Any author is free to design a profile, and CLARIN provides an online CMDI editor that makes this task very easy. We decided to try it out. Browsing the CMDI registry, we found several ‘Event’ profiles that more or less closely matched our data. We chose one in particular, ‘DGDEvent’, to derive a new profile for TIGR events, checking the CLARIN concept links the authors had provided and adapting the categories to our needs. The result was slightly more complex than was strictly necessary for the TIGR corpus, because we tried to achieve a compromise between accurate data modeling, reuse of existing components wherever possible, and applicability beyond the specific needs of our own corpus. When the new profile seemed sufficiently advanced to be tested as a template for producing single metadata records, we published it as a draft in the registry’s ‘Development’ section.
We tried to figure out how to proceed, but realised that we needed support. That’s when we contacted the LaRS @ SWISSUbase team. The team immediately scheduled a meeting with us and gave us further advice via email later on. We learnt that SWISSUbase was currently engaged in a longer process aimed at updating the LaRS metadata scheme so as to better account for multimodal data. Our interlocutors took note of our requests and promised they would consider them while revising the repository’s metadata scheme. Very good news for the scientific community! For TIGR, however, a solution had to be found in the short run. According to the LaRS team, using a CMDI profile to generate metadata was not the best choice. They pointed out that CMDI’s XML format was not very user-friendly, its main advantage being readability by larger corpus platforms and metadata catalogues rather than readability by human users of a repository. They recommended CSV tables or JSON files instead.
Based on that valuable advice, we transformed our metadata template into a set of six related CSV tables. Their overall structure and categories are inspired by the CMDI profile created earlier and include (1) a main record that describes the event dataset in general; (2) a record that gives information about the TIGR corpus the event is part of; (3) a table that lists the event participants and indicates their roles in the interaction as well as associated clip-on microphones, where applicable; (4)-(6) tables that list the video files, audio files, and transcripts, and describe their main properties, including any de-identification procedure applied. Each table can be read separately. But the tables have data points in common, which makes it possible to link them to each other, either mentally (when interpreted by a human reader) or by building a proper relational database, for example in SQL. Also, records representing single events can easily be joined to form a set of overview tables of all TIGR events, which will be deposited in LaRS as a dataset of its own at study level. Following the same logic, we decided to place our questionnaire data (basic sociolinguistic information about the study participants) and approximate information about the events’ geographical locations in two further tables, to be deposited as datasets at study level. Each speaker and each location is uniquely identified by an ID. Speaker IDs are associated with participant IDs in the events’ participant lists, whereas location IDs are referenced in the main event record. This organisation of the data makes it possible to correctly represent the fact that some speakers participate in more than one event and, similarly, that in some locations we made more than one recording.
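To illustrate how shared IDs link such tables, here is a minimal sketch in Python using only the standard library. The column names, IDs and values are invented for the example and do not reproduce the actual TIGR tables; the point is only that a participant list and a speaker table can be joined on a common speaker ID, and that one speaker can appear in several events.

```python
import csv
import io

# Hypothetical table contents: column names, IDs and values are
# illustrative assumptions, not the real TIGR metadata.
PARTICIPANTS_CSV = """event_id,participant_id,speaker_id,role
EV01,EV01_P1,SPK001,role-a
EV01,EV01_P2,SPK002,role-b
EV02,EV02_P1,SPK001,role-c
"""

SPEAKERS_CSV = """speaker_id,age_group
SPK001,group-x
SPK002,group-y
"""


def read_table(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on(left, right, key):
    """Inner join: extend each left row with the right row sharing `key`."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def events_per_speaker(participants):
    """Map each speaker ID to the list of events the speaker takes part in."""
    result = {}
    for row in participants:
        result.setdefault(row["speaker_id"], []).append(row["event_id"])
    return result


participants = read_table(PARTICIPANTS_CSV)
speakers = read_table(SPEAKERS_CSV)

joined = join_on(participants, speakers, "speaker_id")
by_speaker = events_per_speaker(participants)

for row in joined:
    print(row["event_id"], row["participant_id"], row["speaker_id"], row["age_group"])
```

Because the questionnaire data live in a single speaker table, a speaker who takes part in two events is described once and referenced twice, which is what the ID-based design is meant to achieve.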
The path we took to produce our metadata template was not exactly linear, as we lingered over the CMDI registry before getting help from the LaRS team and deciding to go for a set of tables. But the detour was instructive: it made us discover profiles and concepts adopted by colleagues, forced us to compare several formats (XML, JSON, CSV, relational databases), and raised our awareness of the functions and addressees of metadata.
Johanna Miecznikowski
Website:
CMDI Component Registry. https://catalog.clarin.eu/ds/ComponentRegistry/#/
Reference:
CMDI and Metadata Curation task forces of the Standing Committee on CLARIN Technical Centres: CMDI Best Practices Guide, v. 1.2.0.
https://www.clarin.eu/content/cmdi-best-practices-guide