News (09/03/2022): The SemTab 2021 proceedings are out. Results and ground truths are available.

SemTab 2021: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching

Tabular data in the form of CSV files is a common input format in data analytics pipelines. However, a lack of understanding of the semantic structure and meaning of the content may hinder the data analytics process. Gaining this semantic understanding is therefore very valuable for data integration, data cleaning, data mining, machine learning and knowledge discovery tasks. For example, understanding what the data is about helps assess which transformations are appropriate for it.

Tables on the Web can also be a source of highly valuable data. Adding semantic information to Web tables can enhance a wide range of applications, such as web search, question answering, and knowledge base (KB) construction.

Tabular data to Knowledge Graph (KG) matching is the process of assigning semantic tags from Knowledge Graphs (e.g., Wikidata or DBpedia) to the elements of a table. In practice, however, this task is often difficult because table metadata (e.g., table and column names) is missing, incomplete or ambiguous.
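To make the task concrete, here is a minimal sketch of the typical first step of a matching system: retrieving candidate KG entities for a raw table cell. The Wikidata wbsearchentities API used below is real; the cell value and the surrounding commentary are illustrative only.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def candidate_entities(cell_value: str, limit: int = 5) -> list[dict]:
    """Look up candidate Wikidata entities for a raw table cell value."""
    params = {
        "action": "wbsearchentities",
        "search": cell_value,
        "language": "en",
        "format": "json",
        "limit": limit,
    }
    response = requests.get(WIKIDATA_API, params=params, timeout=10)
    response.raise_for_status()
    # Each result carries an entity id (e.g. "Q90") plus a label and description.
    return [
        {"id": r["id"], "label": r.get("label"), "description": r.get("description")}
        for r in response.json().get("search", [])
    ]

# A cell reading "Paris" is ambiguous: the French capital, the mythological
# figure, towns in Texas, etc. Disambiguating among these candidates using
# the rest of the row and column is what makes the matching task hard.
print(candidate_entities("Paris"))
```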

The SemTab challenge aims at benchmarking systems that tackle the tabular data to KG matching problem, so as to facilitate their comparison on a common basis and the reproducibility of results.

The 2021 edition of this challenge will be co-located with the 20th International Semantic Web Conference (ISWC 2021) and the 16th International Workshop on Ontology Matching (OM 2021).


Datasets and Ground Truths

The ground truths are now publicly available:

Target Knowledge Graphs: Schema.org (version: May 2021), DBpedia (version: 2016-10), Wikidata (version: 20210828)

The code of the AICrowd evaluators is also available here.


SemTab @ ISWC 2021

See the full ISWC program here, with links to the relevant sessions. Material from the SemTab sessions: posters and recorded oral presentations.

Results and Challenge Prizes

The results of all three rounds are available here. A summary of the SemTab 2021 results is here.

Prizes sponsored by IBM Research:

Papers

SemTab 2021 papers have been published as Volume 3103 of the CEUR-WS proceedings.

ISWC oral presentations

The results of the challenge will be presented on October 27 (Wednesday). Three teams will also present their systems.

October 27, Session 4D: 10:20-11:20 EDT (US) / 16:20-17:20 CET (EU) / 22:20-23:20 CST (China):

ISWC poster presentations

SemTab will be present during the ISWC Posters & Demos/Social sessions. We will use wonder.me together with the other ISWC Semantic Web challenges.

Posters:

Ontology Matching workshop poster presentations

SemTab will also be present at the Ontology Matching (OM) workshop on October 25 (14:30-15:30 CET). See full OM program here. We will also use wonder.me for the OM poster session (note that the wonder.me rooms are different).

Posters:

Participation: forum and registration

We have a discussion group for the challenge where we share the latest news with participants and discuss issues raised during the evaluation rounds.

Please register your system using this Google form.

Note that participants can join SemTab at any Round for any of the tasks/tracks.


Challenge Tasks

Accuracy Track

As in previous editions, SemTab includes the following tasks, organised into several evaluation rounds:
  1. CEA (Cell Entity Annotation): matching individual table cells to KG entities.
  2. CTA (Column Type Annotation): assigning a KG class or type to a table column.
  3. CPA (Columns Property Annotation): assigning a KG property to the relationship between two columns.

The challenge will be run with the support of the AICrowd platform and the STILTool system.
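To illustrate how such benchmarking typically works, the sketch below computes the precision, recall and F1 scores commonly used in SemTab evaluation rounds: precision over the annotations a system submits, recall over the annotations in the ground truth. This is a simplified reading of the evaluation protocol, not the exact code of the published AICrowd evaluators (linked above), and the example keys and QIDs are invented.

```python
def score(submitted: dict, ground_truth: dict) -> dict:
    """Precision/recall/F1 for table annotations.

    Keys identify an annotation target, e.g. (table_id, row, col) for CEA
    or (table_id, col) for CTA; values are KG identifiers such as "Q90".
    """
    correct = sum(
        1 for target, annotation in submitted.items()
        if ground_truth.get(target) == annotation
    )
    precision = correct / len(submitted) if submitted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

gt = {("tab1", 1, 0): "Q90", ("tab1", 2, 0): "Q64"}   # two target cells
sub = {("tab1", 1, 0): "Q90"}                          # only one cell annotated
print(score(sub, gt))  # perfect precision, recall 0.5
```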

Datasets and tasks per round

Round 1:


Round 2:


Round 3:

Usability Track

This new track addresses a pain point in the community: the lack of publicly available, easy-to-use and generic solutions that address the needs of a variety of applications and settings. We will devise a clear scoring mechanism to rank each participant's solution in terms of several usability criteria, as judged by a review panel; for example (see also the sketch after this list):
  1. Is the solution open-source?
  2. Does the solution require a specific platform that could limit its use in common settings?
  3. Does the solution require extensive training and tuning for a new application/domain?
  4. Is the solution offered as a public service?
  5. Does the solution include a well-designed user interface?
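Since the exact scoring mechanism is still to be devised, the following is only a hypothetical sketch of how panel judgements on criteria like those above could be aggregated into a ranking. The criterion names and weights are invented for illustration.

```python
# Hypothetical usability scoring: each panel member rates each criterion
# in [0, 1]; ratings are averaged per criterion and combined with
# illustrative (invented) weights.
CRITERIA_WEIGHTS = {
    "open_source": 2.0,
    "platform_independent": 1.0,
    "low_training_effort": 1.5,
    "public_service": 1.0,
    "user_interface": 1.0,
}

def usability_score(panel_ratings: list) -> float:
    """Weighted average of per-criterion ratings across panel members."""
    total = 0.0
    for criterion, weight in CRITERIA_WEIGHTS.items():
        avg = sum(r[criterion] for r in panel_ratings) / len(panel_ratings)
        total += weight * avg
    return total / sum(CRITERIA_WEIGHTS.values())

panel = [
    {"open_source": 1.0, "platform_independent": 0.5, "low_training_effort": 0.8,
     "public_service": 0.0, "user_interface": 1.0},
    {"open_source": 1.0, "platform_independent": 0.7, "low_training_effort": 0.6,
     "public_service": 0.0, "user_interface": 0.9},
]
print(f"{usability_score(panel):.2f}")
```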

Applications Track

This new track aims to address applications in real-world settings that take advantage of the output of the matching systems. Proposals for challenging datasets are also more than welcome.

Bio-Track: Due to advances in biological research techniques, new data is constantly being produced in the biomedical domain, and it is commonly published in unstructured or tabular formats. This data is not trivial to integrate semantically, due not only to its sheer amount but also to the complexity of the biological relations between entities. For tabular data annotation in particular, the representation of the data can have a significant impact on performance, since each entity may be represented by alphanumeric codes (e.g., chemical formulas or gene names) or have multiple synonyms. The domain would therefore greatly benefit from automated methods that map entities, entity types and properties to existing datasets, speeding up the integration of new data.
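As a small illustration of the synonym problem, the sketch below normalises several surface forms of the same biomedical entity to one KG identifier before matching. The synonym table is hypothetical and the Wikidata QID is shown for illustration only.

```python
# Hypothetical synonym table for one compound: the same entity can appear
# as a trade name, a systematic name, an abbreviation, or a formula.
SYNONYMS = {
    "aspirin": "Q18216",
    "acetylsalicylic acid": "Q18216",
    "asa": "Q18216",
    "c9h8o4": "Q18216",  # molecular formula as it may appear in a cell
}

def normalise(cell_value: str):
    """Map a raw table cell to a canonical KG identifier, if known."""
    key = " ".join(cell_value.lower().split())  # case/whitespace folding
    return SYNONYMS.get(key)

for cell in ["Aspirin", "Acetylsalicylic  acid", "C9H8O4", "ibuprofen"]:
    print(cell, "->", normalise(cell))  # the last lookup returns None
```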


Important Dates (tentative)


System Papers

We encourage participants to submit a system paper via EasyChair. Papers should be no more than 12 pages long (excluding references) and formatted using the LNCS style. System papers will be reviewed by one or two challenge organisers.

Accepted system papers will be published as a volume of CEUR-WS. By submitting a paper, the authors accept the CEUR-WS publishing rules.


Organisation

This challenge is organised by Kavitha Srinivas (IBM Research), Ernesto Jiménez-Ruiz (City, University of London; University of Oslo), Oktie Hassanzadeh (IBM Research), Jiaoyan Chen (University of Oxford), Vasilis Efthymiou (FORTH - ICS), Vincenzo Cutrona (University of Milano-Bicocca), Juan Sequeda (data.world), Daniela Oliveira (Universidade de Lisboa), Catia Pesquita (Universidade de Lisboa), Nora Abdelmageed (University of Jena), and Madelon Hulsebos (University of Amsterdam). If you have any problems working with the datasets or any suggestions related to this challenge, do not hesitate to contact us via the discussion group.


Acknowledgements

The challenge is currently supported by the SIRIUS Centre for Research-driven Innovation and IBM Research.

BiodivTab is credited to Nora Abdelmageed, Sirko Schindler and Birgitta König-Ries (Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Germany). The tables provided in this challenge are based on real biodiversity research datasets but have been adapted for the challenge. In the form provided here, they may be used for the challenge only. Any publication on challenge results must cite the underlying datasets; these citations will be made available after the challenge deadline.