Workshop on modern NLP through large pre-trained language models

EMBEDDIA partners from the Faculty of Computer and Information Science (University of Ljubljana) organized a workshop on modern NLP through large pre-trained language models on September 29th, 2020, in Ljubljana, Slovenia.

The workshop was aimed primarily at data scientists (academics, professionals, or students) who know some Python programming and want to learn the basics of modern natural language processing. It was taught by EMBEDDIA’s technical manager Marko Robnik-Šikonja and covered the following subjects:

  • text preprocessing,
  • text representations,
  • basics of neural networks for text processing,
  • neural language models,
  • BERT and transformers,
  • hands-on (a downstream task with transformers): sentiment analysis, named entity recognition, text generation, etc.

EMBEDDIA tools standing out on international challenges

The EMBEDDIA team is glad to announce that our tools are performing well in international challenges.

Our semantic enrichment tools for multilingual and social information recently outperformed all other participants in the official rankings, in all languages, in the following challenges:

HIPE (Identifying Historical People, Places and other Entities) is an evaluation campaign on named entity processing in historical newspapers in French, German and English, organized in the context of the impresso project and run as a CLEF 2020 Evaluation Lab.

FinNum is a task for fine-grained numeral understanding in financial social media data – to identify the linking between the target cashtag and the target numeral.

In addition, our multilingual fake news spreader model (for English and Spanish) came third out of 66 participants at this year’s PAN. You can find out more about the fake news spreader model at this link.

EMBEDDIA at LREC 2020

We are pleased to present six EMBEDDIA publications accepted at this year’s Language Resources & Evaluation Conference (LREC 2020). Details of the submitted papers are presented below (they will be updated with the final versions at the beginning of May).

Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift by Matej Martinc, Petra Kralj-Novak, and Senja Pollak

We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time-specific word representations from BERT embeddings. The results of our experiments on the domain-specific LiverpoolFC corpus suggest that the proposed method has performance comparable to the current state of the art without requiring any time-consuming domain adaptation on large corpora. The results on the newly created Brexit news corpus suggest that the method can be successfully used for the detection of a short-term yearly semantic shift. Lastly, the model also shows promising results in a multilingual setting, where the task was to detect differences and similarities between diachronic semantic shifts in different languages.
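To make the idea of time-specific word representations more concrete, here is a minimal sketch in Python. It rests on assumptions that are not stated in the abstract: that a word's representation for a period is the average of its contextual vectors from that period, and that shift is measured as the cosine distance between period averages. The vectors are made-up toy values rather than real BERT output, and all names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy data: contextual vectors of ONE word, grouped by period.
# In a real setting each vector would come from a contextual model such as
# BERT, one vector per occurrence of the word in the corpus slice.
occurrences_by_period = {
    "period_1": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "period_2": [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]],
}

# Assumption: one time-specific representation per period, by averaging.
period_vectors = {
    period: mean_vector(vectors)
    for period, vectors in occurrences_by_period.items()
}

# A larger distance suggests a larger change in how the word is used.
shift = cosine_distance(period_vectors["period_1"], period_vectors["period_2"])
print(f"estimated shift: {shift:.3f}")
```

Words can then be ranked by this score to surface candidates for semantic shift. How well such a ranking matches human judgement is exactly what the experiments in the paper evaluate.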

Dataset for Temporal Analysis of English-French Cognates by Esteban Frossard, Mickael Coustaty, Antoine Doucet, Adam Jatowt, and Simon Hengchen

Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a database to investigate the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. To analyze this evolution, we select a set of cognates in both languages and study their temporal changes and correlations. We propose a new database for computational approaches to the synchronized diachronic investigation of language pairs, and present novel findings stemming from the temporal comparison of cognates in the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language temporal analysis.

A Dataset for Multi-lingual Epidemiological Event Extraction by Stephen Mutuvi, Antoine Doucet, Gael Lejeune, and Moses Odeo

This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can be used not only for Information Extraction, but also for other Natural Language Processing tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (PROMED) platform, which provides current information about outbreaks of infectious disease globally. Among the key pieces of information present in the articles is the Uniform Resource Locator (URL) of the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language (DANIEL) system. DANIEL is a multilingual news surveillance system that leverages unique attributes associated with news reporting, namely repetition and saliency, to extract events. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between the epidemic-related and unrelated news articles that constitute the corpus.

Multilingual Culture-Independent Word Analogy Datasets by Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja

In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
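As an illustration of how a word analogy benchmark can score a set of embeddings, the following Python sketch solves "a is to b as c is to ?" by vector arithmetic and reports accuracy. The vector-offset rule used here is a common formulation and is an assumption on our part, not necessarily the exact procedure from the paper. The toy embeddings and the function name `solve_analogy` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Vector-offset rule: d should lie close to b - a + c.
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The three question words are excluded from the candidates.
    candidates = [word for word in embeddings if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(embeddings[word], target))


# Hypothetical 3-dimensional toy embeddings; real ones (for example fastText)
# have hundreds of dimensions and are learned from large corpora.
toy_embeddings = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.9],
    "apple": [0.0, 0.2, 0.0],
}

# Each entry is (a, b, c, expected d).
analogies = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

correct = sum(
    solve_analogy(toy_embeddings, a, b, c) == expected
    for a, b, c, expected in analogies
)
accuracy = correct / len(analogies)
print(f"analogy accuracy: {accuracy:.2f}")
```

On a real benchmark, the same loop would run over thousands of analogy entries per language, and the resulting accuracy gives one number for comparing embedding models.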

High Quality ELMo Embeddings for Seven Less-Resourced Languages by Matej Ulčar and Marko Robnik-Šikonja

Recent results show that deep neural networks using contextual embeddings significantly outperform non-contextual embeddings on a majority of text classification tasks. We offer precomputed embeddings from the popular contextual ELMo model for seven languages: Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, and Swedish. We demonstrate that the quality of embeddings strongly depends on the size of the training set and show that the existing publicly available ELMo embeddings for the listed languages should be improved. We train new ELMo embeddings on much larger training sets and show their advantage over baseline non-contextual FastText embeddings. In evaluation, we use two benchmarks, the analogy task and the NER task.

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context by Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešič, Marko Robnik-Šikonja, Mark Granroth-Wilding, and Kristiina Vaik

State-of-the-art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise similarity task of SimLex-999, it provides context-dependent similarity measures; covers not only discrete differences in word sense but more subtle, graded changes in meaning; and covers not only a well-resourced language (English) but a number of less-resourced languages. We define the task and evaluation metrics, outline the dataset collection methodology, and describe the status of the dataset so far.

Submit to our SemEval Task until the 12th of March

Our SemEval 2020 Task 3: Predicting the (Graded) Effect of Context in Word Similarity is open for submissions. For this task, we ask participants to build systems that predict the effect that context has on human perception of the similarity of words. Participants can submit their results until the 12th of March.

In order to be able to look at these effects, we built several datasets where we asked annotators to score how similar a pair of words are after they have read a short paragraph (which contains the two words). Each pair is scored within two of these paragraphs, allowing us to look at changes in similarity ratings due to context. We built datasets, containing these contextual similarity ratings, in four different languages:

  • Croatian: HR
  • English: EN
  • Finnish: FI
  • Slovenian: SL

The pairs of words come from the well-known SimLex-999 dataset. The contexts are chosen so as to encourage different perceptions of similarity. Polysemy plays a role; however, we are especially interested in more subtle, graded changes in meaning. All data and examples are available at this link: https://competitions.codalab.org/competitions/20905 and more details are available here: https://arxiv.org/abs/1912.05320
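For readers thinking about how a system might approach the task, the sketch below shows one simple baseline idea in Python: take contextual vectors for the two words in each paragraph, compute their cosine similarity, and report the difference between the two contexts. This is only an illustration under our own assumptions, not the official evaluation procedure or a recommended system. The vectors are made-up toy values and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_change(pair_in_context_1, pair_in_context_2):
    """Predicted change in similarity of one word pair between two contexts.

    Each argument is a (vector_of_word_1, vector_of_word_2) tuple holding the
    contextual vectors of the two words inside one paragraph.
    """
    sim_1 = cosine(*pair_in_context_1)
    sim_2 = cosine(*pair_in_context_2)
    return sim_2 - sim_1


# Hypothetical contextual vectors for the same word pair in two paragraphs.
# A real system would obtain them from a contextual embedding model.
context_1 = ([1.0, 0.2], [0.9, 0.3])  # the words are used in a similar way
context_2 = ([1.0, 0.2], [0.1, 1.0])  # the words are used quite differently

change = similarity_change(context_1, context_2)
print(f"predicted change in similarity: {change:+.3f}")
```

A negative value means the pair is predicted to seem less similar in the second paragraph than in the first. Such predictions can then be compared with the changes observed in the annotators' ratings.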

EMBEDDIA at IFAM 2020

The EMBEDDIA project was promoted at the International Trade Fair for Automation and Mechatronics (IFAM) 2020, which took place on February 11-13 in Ljubljana, Slovenia. EMBEDDIA was featured at the fair stand of the Jožef Stefan Institute and at the seminar on Artificial Intelligence in Industry and Society, where two members of the EMBEDDIA consortium, prof. dr. Nada Lavrač and prof. dr. Marko Robnik-Šikonja, gave speeches.

Written by: Martin Marzidovšek (JSI)

EMBEDDIA members’ visit to 24sata and Večernji list in Zagreb

On December 16th, 2019, members of the EMBEDDIA consortium visited the newsrooms of the two most popular newspapers in Croatia, 24sata and Večernji list.

The visit included a tour of the offices and discussions with representatives of both newspapers. We discussed some of the hurdles that editors and other employees face on a daily basis. The visit was an excellent opportunity for members of the EMBEDDIA consortium to familiarize themselves with the processes in a newspaper. Researchers and editors explored possible artificial intelligence tasks that could address some of the challenges in the newsroom.

Presentation of EMBEDDIA at the Klagenfurt University

Dr. Petra Kralj Novak from the Jožef Stefan Institute gave a TEWI KOLLOQUIUM talk at the Klagenfurt University titled “Applications and Challenges of Sentiment and Stance Analysis” on Monday, December 9th. She presented some EMBEDDIA results, including hate speech detection and contextual embeddings for detecting diachronic semantic shift.

EMBEDDIA at the Naprej/Forward media festival

EMBEDDIA was presented at the 8th Naprej/Forward media festival in Ljubljana, Slovenia, on November 21. Naprej/Forward is an annual festival organized by the Slovene Association of Journalists (Društvo novinarjev Slovenije). Each year the festival program contains a section called the Media harvest, which includes presentations of projects, stories, or any other media content that stood out in the country in the past year. EMBEDDIA, which is coordinated by the Slovene Jožef Stefan Institute (Institut Jožef Stefan), was among this year’s chosen projects.

Our project’s coordinator Senja Pollak presented some general facts about EMBEDDIA, our team, and our partners, and spoke about cross-lingual embeddings and the ways we are working on using them. She also showed a demo of a multilingual offensive speech detector, which was trained only on English data but worked for over 90 languages thanks to cross-lingual embeddings. After the presentation, she also answered questions from the audience, which mainly consisted of journalists and media content creators.

Senja Pollak (JSI) presenting EMBEDDIA at the Naprej/Forward media festival. Photo: Kaja Brezočnik.
After the presentation, Senja answered questions from the audience, consisting mainly of journalists and media content creators. Photo: Kaja Brezočnik.

EMBEDDIA at the META-FORUM 2019

Author: Marko Robnik-Šikonja (UL)

META-FORUM is a series of annual conferences on language technologies with a focus on EU languages. The 2019 edition of the conference focused on the emerging European Language Grid (ELG) platform, which intends to become the yellow pages for language resources and technologies.

The two-day conference (8th and 9th of October) gathered representatives of EU language-related projects, companies, and public institutions. Besides panel presentations and discussion forums, many projects prepared demonstrations of their ideas and progress. The EMBEDDIA project was represented by Matt Purver from the Queen Mary University of London, Andraž Pelicon from the Jožef Stefan Institute, and Marko Robnik-Šikonja from the University of Ljubljana. The team showed a demo of a multilingual offensive speech detector, which was trained only on English data but worked for over 90 languages thanks to cross-lingual embeddings. The demo was warmly received by visitors.

As part of the pre-conference meetings, EU H2020 projects financed under the ICT-29 call discussed closer cooperation with ELG. The ELG team intends to integrate different language services as Docker images. The EMBEDDIA project presented the resources and technologies it will build and the challenges of integrating them into the ELG.

Matthew Purver (QMUL) presenting EMBEDDIA at the META-FORUM 2019. Photo source: European Language Grid (ELG).
Matthew Purver (QMUL) – EMBEDDIA Data Manager. Photo source: European Language Grid (ELG).
Andraž Pelicon (JSI) presenting a demo of a multilingual offensive speech detector. The detector was trained only on English data but worked for over 90 languages thanks to cross-lingual embeddings. Photo source: European Language Grid (ELG).

EMBEDDIA at META-FORUM 2019 (photos)

EMBEDDIA was presented at the META-FORUM 2019 on October 8-9, 2019, in Brussels, Belgium. Below are some photos from the event – more impressions to follow.


From left to right: Andraž Pelicon (JSI) and Matthew Purver (QMUL).