Software and Datasets
- Software is available from:
- EMBEDDIA GitHub organisation: github.com/EMBEDDIA
- EMBEDDIA Docker registry: git.texta.ee/texta
- Pre-trained ELMo models:
- ELMo embeddings for 7 languages (Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish): hdl.handle.net/11356/1277
- ELMo embeddings, Slovenian: hdl.handle.net/11356/1257
- Pre-trained BERT models:
- All EMBEDDIA BERT models available at Hugging Face: huggingface.co/EMBEDDIA
- BERT for Croatian/Slovenian/English via CLARIN: hdl.handle.net/11356/1317
- BERT for Finnish/Estonian/English via CLARIN: urn.fi/urn:nbn:fi:lb-2020061201
- News article datasets:
- Ekspress Meedia news archive (c.1.4M articles in Estonian and Russian): hdl.handle.net/11356/1408
- Latvian Delfi Article Archive (c.180k articles in Latvian and Russian): hdl.handle.net/11356/1409
- Styria 24sata news archive (c.650k articles in Croatian): hdl.handle.net/11356/1410
- STT news archive (c.2.8M articles in Finnish): urn.fi/urn:nbn:fi:lb-2019041501
- News comment datasets:
- Ekspress Meedia Comment Archive (c.31M comments in Estonian and Russian): hdl.handle.net/11356/1401
- Latvian Delfi Comment Archive (c.12M comments in Latvian and Russian): hdl.handle.net/11356/1407
- Styria 24sata Comment Archive (c.20M comments in Croatian): hdl.handle.net/11356/1399
- Other datasets:
- Multi-lingual culture-independent word analogy dataset: hdl.handle.net/11356/1261
- CoSimLex context-dependent similarity dataset: hdl.handle.net/11356/1308
- Slovenian SimLex dataset: hdl.handle.net/11356/1309
Publications
Journal papers
- Stephen McGregor, Kat Agres, Karolina Rataj, Matthew Purver, and Geraint Wiggins (2019). Re-Representing Metaphor: Modelling Metaphor Perception Using Dynamically Contextual Distributional Semantics. Frontiers in Psychology, to appear.
- Blaž Škrlj, Jan Kralj, Nada Lavrač, and Senja Pollak (2019). Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture. Machine Learning and Knowledge Extraction 1(2): 575-589.
- Tadej Škvorc, Simon Krek, Senja Pollak, Špela Arhar Holdt, and Marko Robnik-Šikonja (2019). Predicting Slovene Text Complexity Using Readability Measures. Contributions to Contemporary History 59.1.
- Matej Martinc and Senja Pollak (2019). Combining n-grams and deep convolutional features for language variety classification. Natural Language Engineering: 1-26.
- Andraž Repar, Vid Podpečan, Anže Vavpetič, Nada Lavrač, and Senja Pollak (2019). TermEnsembler: An ensemble learning approach to bilingual term extraction and alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 25(1):93-120.
- Andraž Repar, Matej Martinc, and Senja Pollak (2019). Replication, analysis and adaptation of a term alignment approach. Language resources and evaluation. https://doi.org/10.1007/s10579-019-09477-1.
- Marko Milosavljević, Melita Poler Kovačič, and Rok Čeferin (2020). In the name of the right to be forgotten: new legal and policy issues and practices regarding unpublishing requests in Slovenian online news media. Digital Journalism. https://doi.org/10.1080/21670811.2020.1747942.
- Marko Milosavljević and Igor Vobič (2019). “Our task is to demystify fears”: analysing newsroom management of automation in journalism. Journalism. https://doi.org/10.1177/1464884919861598.
- Damjan Vavpotič, Marko Robnik-Šikonja, and Tomaž Hovelja (2019). Exploring the relations between net benefits of IT projects and CIOs’ perception of quality of software development disciplines. Business & Information Systems Engineering. https://doi.org/10.1007/s12599-019-00612-4.
- Igor Vobič, Marko Robnik-Šikonja, and Monika Kalin Golob (2019). Back to the Future: Automation and the Transformation of Journalism Epistemology (in Slovene) / Nazaj v prihodnost: avtomatizacija in preobrazba novinarske epistemologije. Javnost 26:sup1:S41-S61.
- Matteo Cinelli, Mauro Conti, Livio Finos, Francesco Grisolia, Petra Kralj Novak, Antonio Peruzzi, Maurizio Tesconi, Fabiana Zollo, and Walter Quattrociocchi (2019). (Mis)Information Operations: An Integrated Perspective. Journal of Information Warfare 18(3).
- Ester Appelgren and Carl-Gustav Linden (2020). Data Journalism as a Service: Digital Native Data Journalism Expertise and Product Development. Media and Communication. http://dx.doi.org/10.17645/mac.v8i2.2757.
- Saturnino Luz and Shane Sheehan (2020). Methods and visualization tools for the analysis of medical, political and scientific concepts in Genealogies of Knowledge. Palgrave Communications 6(49). https://doi.org/10.1057/s41599-020-0423-6.
- Khalid Alnajjar and Hannu Toivonen (2020). Computational Generation of Slogans. Natural Language Engineering. https://doi.org/10.1017/S1351324920000236.
- Jey Han Lau, Carlos Santos Armendariz, Matthew Purver, and Shalom Lappin (2020). How Furiously Can Colourless Green Ideas Sleep: Sentence Acceptability in Context. Transactions of the Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00315.
- Elvys Linhares Pontes, Stephane Huet, Juan-Manuel Torres Moreno, Thiago Gouveia da Silva, and Andrea Carneiro Linhares (2020). Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming. Computación y Sistemas 24(2).
- Leo Leppänen, Hanna Tuulonen, and Stefanie Sirén-Heikel (2020). Automated Journalism as a Source of and a Diagnostic Device for Bias in Reporting. Media and Communication 8(3):39-49. http://dx.doi.org/10.17645/mac.v8i3.3022.
- Blaž Škrlj, Matej Martinc, Jan Kralj, Nada Lavrač, and Senja Pollak (2020). tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification. Computer Speech & Language 65. https://doi.org/10.1016/j.csl.2020.101104.
- Ravi Shekhar, Marko Pranjić, Senja Pollak, Andraž Pelicon, and Matthew Purver (2020). Automating News Comment Moderation with Limited Resources: Benchmarking in Croatian and Estonian. Journal for Language Technology and Computational Linguistics 34(1):49-79.
- Lauri Haapanen and Leo Leppänen (2020). Recycling a genre for news automation: The production of Valtteri the Election Bot. AILA Review 33.1:67-85. https://doi.org/10.1075/aila.00030.haa.
- Nada Lavrač, Matej Martinc, Senja Pollak, Maruša Pompe Novak, and Bojan Cestnik (2020). Bisociative Literature‑Based Discovery: Lessons Learned and New Word Embedding Approach. New Generation Computing 38:773-800.
- Sebastian Mežnar, Nada Lavrač, and Blaž Škrlj (2020). SNoRe: Scalable Unsupervised Learning of Symbolic Node Representations. IEEE Access 8:212568-212588. https://doi.org/10.1109/ACCESS.2020.3039541.
Conference papers
- Andraž Pelicon, Matej Martinc and Petra Kralj Novak (2019). Embeddia at SemEval-2019 Task 6: Detecting hate with neural network and transfer learning approaches. In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval).
- Matej Martinc and Senja Pollak (2019). Pooled LSTM for Dutch cross-genre gender classification. In Proceedings of the Shared Task on Cross-Genre Gender Detection in Dutch at Computational Linguistics in the Netherlands (CLIN 2019) conference.
- Matej Martinc, Blaž Škrlj and Senja Pollak (2019). Who is hot and who is not? Profiling celebs on Twitter. In the Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum.
- Matej Martinc, Blaž Škrlj and Senja Pollak (2019). Fake or not: Distinguishing between bots, males and females. In the Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum.
- Khalid Alnajjar, Leo Leppänen, and Hannu Toivonen (2019). No Time Like the Present: Methods for Generating Colourful and Factual Multilingual News Headlines. In Proceedings of the 10th International Conference on Computational Creativity (pp. 258-265). Association for Computational Creativity.
- Jose G. Moreno, Elvys Linhares Pontes, Mickael Coustaty, and Antoine Doucet (2019). TLR at BSNLP2019: A multilingual named entity recognition system. Proceedings of the BSNLP-2019 Workshop, ACL 2019. pp: 83-88.
- Shamila Nasreen, Matthew Purver, and Julian Hough (2019). A Corpus Study on Questions, Responses and Misunderstanding Signals in Conversations with Alzheimer’s Patients. In Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue – Full Papers.
- Blaž Škrlj, Andraž Repar, and Senja Pollak (2019). RaKUn: Rank-based Keyword extraction via Unsupervised learning and Meta vertex aggregation. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019). pp: 311-323.
- Blaž Škrlj and Senja Pollak (2019). Language comparison via network topology. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019). pp: 112-123.
- Kristian Miok, Dong Nguyen-Doan, Blaž Škrlj, Daniela Zaharie, and Marko Robnik-Šikonja (2019). Prediction Uncertainty Estimation for Hate Speech Classification. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019). pp: 286-298.
- Shamila Nasreen, Matthew Purver, and Julian Hough (2019). Interaction Patterns in Conversations with Alzheimer’s Patients. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019).
- Senja Pollak, Andraž Repar, Matej Martinc, and Vid Podpečan (2019). Karst exploration: Extracting terms and definitions from karst domain corpus. In Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019. pp: 934-956.
- Dragana Miljkovic, Jan Kralj, Uroš Stepišnik, and Senja Pollak (2019). Communities of related terms in Karst terminology co-occurrence network. In Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019. pp: 358-373.
- Shane Sheehan and Saturnino Luz (2019). Text Visualization for the Support of Lexicography-Based Scholarly Work. In Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019. pp: 694-725.
- Morteza Rohanian, Julian Hough, and Matthew Purver (2019). Detecting Depression with Word-Level Multimodal Fusion. In Proceedings of Interspeech 2019. pp: 1443-1447.
- Kristian Miok, Dong Nguyen-Doan, Daniela Zaharie, and Marko Robnik-Šikonja (2019). Generating Data using Monte Carlo Dropout. In Proceedings of 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP 2019).
- Jani Marjanen, Lidia Pivovarova, Elaine Zosa, and Jussi Kurunmäki (2019). Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings. In Proceedings of the 5th International Workshop on Computational History.
- Lidia Pivovarova, Elaine Zosa, and Jussi Kurunmäki (2019). Word Clustering for Historical Newspapers Analysis. In Proceedings of the Workshop on Language Technology for Digital Historical Archives.
- Shane Sheehan, Pierre Albert, Masood Masoodian, and Saturnino Luz (2019). TeMoCo: A visualization tool for temporal analysis of multi-party dialogues in clinical settings. In Proceedings of the 32nd IEEE International Symposium on Computer-Based Medical Systems (CBMS).
- Anka Supej, Marko Plahuta, Matthew Purver, Michael Mathioudakis, and Senja Pollak (2019). Gender, language, and society: word embeddings as a reflection of social inequalities in linguistic corpora. In Znanost in družbe prihodnosti, Slovensko sociološko srečanje [Annual meeting of the Slovenian Sociological Association: Science and future societies], Bled, 18-19 October 2019. Ljubljana: Slovensko sociološko društvo, pp. 75-83.
- Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, and Daniela Zaharie (2019). Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders. The 7th IEEE International Conference on E-Health and Bioengineering – EHB 2019.
- Blaž Škrlj, Nada Lavrač, and Jan Kralj (2019). Symbolic Graph Embedding using Frequent Pattern Mining. In Proceedings of the International Conference on Discovery Science (DS2019).
- Matej Ulčar and Marko Robnik-Šikonja (2020). High-Quality ELMo Embeddings for Seven Less-Resourced Languages. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja (2020). Multilingual Culture-Independent Word Analogy Datasets. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Matej Martinc, Petra Kralj Novak, and Senja Pollak (2020). Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Stephen Mutuvi, Antoine Doucet, Gael Lejeune, and Moses Odeo (2020). A Dataset for Multi-lingual Epidemiological Event Extraction. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Esteban Frossard, Mickael Coustaty, Antoine Doucet, Adam Jatowt, and Simon Hengchen (2020). A dataset for Temporal Analysis of English-French Cognates. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, and Kristiina Vaik (2020). CoSimLex: A Resource for Evaluating Graded Word Similarity in Context. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Špela Vintar, Larisa Grčić Simeunović, Matej Martinc, Senja Pollak, and Uroš Stepišnik (2020). Mining semantic relations from comparable corpora through intersections of word embeddings. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora at LREC2020. pp: 29-34.
- Elaine Zosa and Mark Granroth-Wilding (2019). Multilingual dynamic topic model. In Proceedings of RANLP 2019.
- Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova (2020). Capturing Evolution in Word Usage: Just Add More Clusters? In Companion Proceedings of the Web Conference 2020.
- Senja Pollak, Vid Podpečan, Dragana Miljkovic, Uroš Stepišnik, and Špela Vintar (2020). The NetViz terminology visualization tool and the use cases in karstology domain modeling. In Proceedings of the 6th International workshop on computational terminology.
- Elaine Zosa, Mark Granroth-Wilding, and Lidia Pivovarova (2020). A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval. Workshop on cross-lingual search collocated with 12th Language Resources and Evaluation Conference (LREC2020). pp: 32-37.
- Elvys Linhares Pontes, Antoine Doucet, and José G. Moreno (2020). Linking Named Entities across Languages using Multilingual Word Embeddings. In Proceedings of the Joint Conference on Digital Libraries (JCDL 2020).
- Anka Supej, Matej Ulčar, Marko Robnik-Šikonja, and Senja Pollak (2020). Dimenzija spola v slovenskih vektorskih vložitvah besed: primerjava modelov prek analogij poklicev. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 93-100.
- Marko Pranjić, Vid Podpečan, Marko Robnik-Šikonja, and Senja Pollak (2020). Evaluation of related news recommendations using document similarity methods. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 81-86.
- Marko Robnik-Šikonja, Kristjan Reba, and Igor Mozetič (2020). Cross-lingual Transfer of Twitter Sentiment Models Using a Common Vector Space. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 87-92.
- Špela Arhar Holdt, Senja Pollak, Marko Robnik-Šikonja, and Simon Krek (2020). Referenčni seznam pogostih splošnih besed za slovenščino. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 10-15.
- Boško Koloski, Senja Pollak, and Blaž Škrlj (2020). Know your Neighbors: Efficient Author Profiling via Follower Tweets. Notebook for PAN at CLEF 2020.
- Boško Koloski, Senja Pollak, and Blaž Škrlj (2020). Multilingual Detection of Fake News Spreaders via Sparse Matrix Factorization. Notebook for PAN at CLEF 2020.
- Emanuela Boros, Elvys Linhares Pontes, Luis Adrian Cabrera-Diego, Ahmed Hamdi, Jose G. Moreno, Nicolas Sidère, and Antoine Doucet (2020). Robust Named Entity Recognition and Linking on Historical Multilingual Documents. Working Notes of CLEF 2020 – Conference and Labs of the Evaluation Forum (CLEF-HIPE 2020).
- Nela Petrželková, Blaž Škrlj, and Nada Lavrač (2020). Knowledge graph aware text classification. In Proceedings of the 23rd International Multiconference – IS2020.
- Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, and Senja Pollak (2020). COVID-19 Therapy Target Discovery with Context-Aware Literature Mining. In Proceedings of the 23rd International Conference on Discovery Science (DS 2020). pp: 109-123.
- Kristiina Vaik, Marit Asula, and Raul Sirel (2020). Hybrid Tagger – An Industry-driven Solution for Extreme Multi-label Text Classification. In Proceedings of the LREC2020 Industry Track. pp: 26-30.
- Carlos Santos Armendariz, Matthew Purver, Senja Pollak, Nikola Ljubešić, Matej Ulčar, Marko Robnik-Šikonja, Ivan Vulić, and Mohammed Taher Pilehvar (2020). SemEval2020 Task 3: Graded Word Similarity in Context. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval 2020). pp: 36-49.
Deliverables
Below is a list of submitted public deliverables of EMBEDDIA.
WP2 public deliverables | Due date |
D2.1: Datasets, benchmarks and evaluation metrics for advanced crosslingual NLP technology (T2.4) | 30/09/2019 |
D2.2: Initial cross-lingual semantic enrichment technology (T2.1) | 31/12/2019 |
D2.3: Initial keyword extraction techniques (T2.2) | 31/12/2019 |
D2.4: Multilingual language generation approach (T2.3) | 30/06/2020 |
WP5 public deliverables | Due date |
D5.1: Datasets, benchmarks and evaluation metrics for multilingual text generation (T5.4) | 30/09/2019 |
D5.2: Initial news generation technology (T5.1) | 30/06/2020 |
D5.3: Initial dynamic news generation technology (T5.2) | 30/06/2020 |
WP7 public deliverables | Due date |
D7.1: Project website and social media accounts (T7.1) | 31/03/2019 |