EMBEDDIA Hackashop, a recapitulation

On April 19, we wrapped up the EACL Hackashop on News Media Content Analysis and Automated Report Generation. The aim of Hackashop 2021 was to foster discussion and research on the combination of language technology and news media content. It provided a forum both for discussing scientific advances in the analysis of news stories and their reader comments and in the automated generation of reports, and for experimental work on identifying interesting phenomena in reader comments and reporting on them.

The hackashop was organized in a dual format. A traditional track consisted of the submission of scientific papers, their reviews, and finally paper presentations. It was complemented by an active, experimentation-based track consisting of an online hackathon preceding the workshop, with the results presented in the joint workshop event. Both tracks shared the same topic, news media analysis and generation, and the participants in the two tracks overlapped considerably.

In the workshop track, we encouraged submissions of long and short papers. Based on three expert reviews for each submission, weighing the contributions of the submission against its length, 13 papers were selected for presentation in the workshop event.

The online hackathon was organized during a three-week period in February 2021, with six participating teams. The challenges they addressed covered a broad range, as each team had the freedom to define its own aims. In the spirit of providing a joint forum for discussing both scientific advances and experimental work, five hackathon teams submitted short reports to be included in the proceedings.

We were very happy to see several cross-disciplinary and cross-sector collaborations involving, e.g., computer scientists, social scientists, and the media industry, both in workshop papers and hackathon contributions. We were also happy to have numerous contributions that address multilingual settings and low-resource languages.

The workshop event on 19 April 2021 brought both tracks together, with presentations of both scientific workshop papers and empirical hackathon reports. We concluded the Hackashop with an excellent presentation by our keynote speaker, Professor Neil Maiden.

We would once again like to thank all workshop paper authors and hackathon participants for their contributions to the hackashop! We are thankful to the programme committee members for their insightful reviews of the workshop papers. We are equally thankful to the large number of experts who made tools, models, data, and challenges available for the hackathon and provided support for the participants.

Authors: Hannu Toivonen and Michele Boggia

EMBEDDIA Hackathon wrap-up

On Friday, February 19, we wrapped up the EMBEDDIA hackathon. In an online event, the hackathon participants presented their results and their views on the EMBEDDIA tools and identified challenges.

We would like to extend our gratitude to the hackathon participants and the EMBEDDIA staff for making the hackathon a success. It was a great opportunity for the EMBEDDIA consortium to see the tools we developed being used outside the consortium, both for similar NLP challenges and for newly identified ones.

Below are snapshots of the wrap-up meeting.

EMBEDDIA Hackashop: Hackathon halfway get-together

On February 10, the EMBEDDIA consortium organized a halfway get-together for hackathon participants and EMBEDDIA staff. We used this event to check in with the teams and present the expectations and challenges of our media partner, the Finnish News Agency (STT).

The interaction with the hackathon teams took place in the Gather.town application, in the form of a tool/model/data/challenge support session. We chose Gather.town to make the event less formal and more social. Participants were able to wander around, meet other participants and see what they were working on, or chat with other researchers from EMBEDDIA!

Below are snapshots of today’s event.

Kick-off of the Hackashop on news media content analysis and automated report generation

Today the EMBEDDIA consortium officially kicked off the Hackashop on news media content analysis and automated report generation. Project partners presented the projects, challenges, and data to be used in the course of the hackashop. Due to the pandemic, the hackashop will be an online event. The hackathon part of the hackashop will run from February 1 to 21, 2021.

Below are some snapshots from today’s event.

Hackashop on news media content analysis and automated report generation – Call for workshop papers

The EMBEDDIA consortium is proud to announce the organization of the Hackashop on news media content analysis and automated report generation in conjunction with EACL 2021.

The Call for workshop papers is now published — more details are available here.

We welcome work broadly in the area of natural language processing of news media, addressing a range of needs: from readers who consume news of personal interest to journalists who keep track of what is going on in the world, try to understand what their readers think of various topics, or want to automate routine reporting.

Workshop on modern NLP through large pre-trained language models

EMBEDDIA partners from the Faculty of Computer and Information Science (University of Ljubljana) organized a workshop on modern NLP through large pre-trained language models on September 29th, 2020 in Ljubljana, Slovenia.

The workshop was primarily aimed at data scientists (academics, professionals, or students) who know some programming in Python and want to learn the basics of modern natural language processing. It was taught by EMBEDDIA’s technical manager Marko Robnik-Šikonja and covered the following subjects:

  • text preprocessing,
  • text representations,
  • basics of neural networks for text processing,
  • neural language models,
  • BERT and transformers,
  • hands-on (a downstream task with transformers): sentiment analysis, named entity recognition, text generation, etc.

EMBEDDIA tools standing out on international challenges

The EMBEDDIA team is glad to announce that our tools are performing very well in international challenges.

Our semantic enrichment tools for multilingual and social information recently outperformed all other participants in the official rankings, in all languages, in the following challenges:

HIPE (Identifying Historical People, Places and other Entities) is an evaluation campaign on named entity processing on historical newspapers in French, German and English, organized in the context of the impresso project and run as a CLEF 2020 Evaluation Lab.

FinNum is a task for fine-grained numeral understanding in financial social media data, where the goal is to identify the link between the target cashtag and the target numeral.

Also! Our multilingual fake news spreader model (in English and Spanish) came in third (out of 66 participants) at this year’s PAN. You can find out more about the fake news spreader model on this link.

EMBEDDIA at LREC 2020

We are pleased to present 6 EMBEDDIA publications accepted at this year’s Language Resources & Evaluation Conference (LREC 2020). Details of the submitted papers are presented below (will be updated with final versions in the beginning of May).

Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift by Matej Martinc, Petra Kralj-Novak, and Senja Pollak

We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time-specific word representations from BERT embeddings. The results of our experiments on the domain-specific LiverpoolFC corpus suggest that the proposed method has performance comparable to the current state-of-the-art without requiring any time-consuming domain adaptation on large corpora. The results on the newly created Brexit news corpus suggest that the method can be successfully used for the detection of a short-term yearly semantic shift. Lastly, the model also shows promising results in a multilingual setting, where the task was to detect differences and similarities between diachronic semantic shifts in different languages.
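To make the idea of time-specific word representations concrete, here is a minimal sketch. It assumes that each occurrence of a target word already has a contextual embedding, that the representation for a time period is the average of those vectors, and that the shift is measured as cosine distance between periods. The abstract does not spell out these details, so the aggregation, the distance measure, and the toy vectors below are illustrative assumptions, not the authors' implementation.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical contextual embeddings of ONE target word, one vector per
# occurrence, grouped by time period. Real vectors would come from a
# contextual model such as BERT; these numbers are made up.
occurrences_by_period = {
    "2015": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "2018": [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]],
}

# One time-specific representation per period (assumed: plain averaging).
period_vectors = {
    period: mean_vector(vectors)
    for period, vectors in occurrences_by_period.items()
}

shift = cosine_distance(period_vectors["2015"], period_vectors["2018"])
print(f"semantic shift 2015 -> 2018: {shift:.3f}")
```

A larger distance suggests that the word is used in more different contexts in the two periods; ranking words by this score gives a list of candidates for semantic shift.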

Dataset for Temporal Analysis of English-French Cognates by Esteban Frossard, Mickael Coustaty, Antoine Doucet, Adam Jatowt, and Simon Hengchen

Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a database to investigate the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. To analyze this evolution, we select a set of cognates in both languages and study their temporal changes and correlations. We propose a new database for computational approaches to the synchronized diachronic investigation of language pairs, and present novel findings stemming from the temporal comparison of cognates in the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language temporal analysis.

A Dataset for Multi-lingual Epidemiological Event Extraction by Stephen Mutuvi, Antoine Doucet, Gael Lejeune, and Moses Odeo

This paper proposes a corpus for development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can be used not only for Information Extraction, but also for other Natural Language Processing tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (PROMED) platform, which provides current information about outbreaks of infectious disease globally. Among the key pieces of information present in the articles is the Uniform Resource Locator (URL) to the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language (DANIEL) system. DANIEL is a multilingual news surveillance system that leverages unique attributes associated with news reporting, repetition and saliency, to extract events. The system has a wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between epidemic-related and non-related news articles that constitute the corpus.

Multilingual Culture-Independent Word Analogy Datasets by Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja

In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
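For readers unfamiliar with the word analogy task, the sketch below shows one common way to score embeddings on it: the vector-offset method, which answers "a is to b as c is to ?" by looking for the word closest to b - a + c. The abstract does not state which scoring procedure was used, so this is an assumption, and the tiny three-dimensional embeddings and the helper names `solve_analogy` and `analogy_accuracy` are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The three question words are excluded from the candidate answers.
    candidates = [word for word in embeddings if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(embeddings[word], target))


def analogy_accuracy(embeddings, questions):
    """Share of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        solve_analogy(embeddings, a, b, c) == expected
        for a, b, c, expected in questions
    )
    return correct / len(questions)


# Toy embeddings, invented for this example only.
toy_embeddings = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.1, 0.0],
}

questions = [("man", "woman", "king", "queen")]
print(solve_analogy(toy_embeddings, "man", "woman", "king"))
print(analogy_accuracy(toy_embeddings, questions))
```

With real embeddings, the same accuracy figure computed over a full benchmark file gives a single number for comparing embedding models.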

High Quality ELMo Embeddings for Seven Less-Resourced Languages by Matej Ulčar and Marko Robnik-Šikonja

Recent results show that deep neural networks using contextual embeddings significantly outperform non-contextual embeddings on a majority of text classification tasks. We offer precomputed embeddings from the popular contextual ELMo model for seven languages: Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, and Swedish. We demonstrate that the quality of embeddings strongly depends on the size of the training set and show that existing publicly available ELMo embeddings for the listed languages can be improved. We train new ELMo embeddings on much larger training sets and show their advantage over baseline non-contextual fastText embeddings. In evaluation, we use two benchmarks, the analogy task and the NER task.

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context by Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešič, Marko Robnik-Šikonja, Mark Granroth-Wilding, and Kristiina Vaik

State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise similarity task of SimLex-999, it provides context-dependent similarity measures; covers not only discrete differences in word sense but more subtle, graded changes in meaning; and covers not only a well-resourced language (English) but a number of less-resourced languages. We define the task and evaluation metrics, outline the dataset collection methodology, and describe the status of the dataset so far.

Submit to our SemEval Task until the 12th of March

Our SemEval 2020 Task 3: Predicting the (Graded) Effect of Context in Word Similarity is open for submissions. For this task, we ask participants to build systems that predict the effect context has on the human perception of the similarity of words. Participants can submit their results until the 12th of March.

In order to be able to look at these effects, we built several datasets where we asked annotators to score how similar a pair of words are after they have read a short paragraph (which contains the two words). Each pair is scored within two of these paragraphs, allowing us to look at changes in similarity ratings due to context. We built datasets, containing these contextual similarity ratings, in four different languages:

  • Croatian: HR
  • English: EN
  • Finnish: FI
  • Slovenian: SL

The pairs of words come from the well-known SimLex-999 dataset. The contexts are chosen so as to encourage different perceptions of similarity. Polysemy plays a role; however, we are especially interested in more subtle, graded changes in meaning. All data and examples are available on this link: https://competitions.codalab.org/competitions/20905 and more details here: https://arxiv.org/abs/1912.05320
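As a rough illustration of what a system for this task has to produce, the sketch below scores the same word pair in two contexts and reports the change in similarity. It assumes that a system represents each word with a context-dependent vector and uses cosine similarity as its score; both are assumptions made for this example, the vectors are invented, and the official evaluation details are described at the links above rather than here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def context_effect(pair_in_context_1, pair_in_context_2):
    """Change in predicted similarity of one word pair between two contexts.

    Each argument is a (vector_of_word_1, vector_of_word_2) tuple holding
    the context-dependent vectors of the two words in that paragraph.
    """
    similarity_1 = cosine(*pair_in_context_1)
    similarity_2 = cosine(*pair_in_context_2)
    return similarity_2 - similarity_1


# Hypothetical context-dependent vectors for the same word pair,
# read in two different paragraphs.
context_1 = ([1.0, 0.0], [1.0, 1.0])  # the words look less alike here
context_2 = ([1.0, 0.0], [1.0, 0.0])  # the words look identical here

change = context_effect(context_1, context_2)
print(f"change in similarity: {change:+.3f}")
```

A positive value means the pair is judged more similar in the second context than in the first; a system's predicted changes can then be compared against the changes observed in the human ratings.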