Software and Datasets
- Software is available from:
  - EMBEDDIA Github organisation: github.com/EMBEDDIA
  - EMBEDDIA Docker registry: git.texta.ee/texta
- Pre-trained ELMo models:
  - ELMo embeddings for 7 languages (Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish): hdl.handle.net/11356/1277
  - ELMo embeddings, Slovenian: hdl.handle.net/11356/1257
- Pre-trained BERT models:
  - All EMBEDDIA BERT models available at Huggingface: huggingface.co/EMBEDDIA
  - BERT for Croatian/Slovenian/English via CLARIN: hdl.handle.net/11356/1317
  - BERT for Finnish/Estonian/English via CLARIN: urn.fi/urn:nbn:fi:lb-2020061201
- News article datasets:
  - Ekspress Meedia news archive (c.1.4M articles in Estonian and Russian): hdl.handle.net/11356/1408
  - Latvian Delfi Article Archive (c.180k articles in Latvian and Russian): hdl.handle.net/11356/1409
  - Styria 24sata news archive (c.650k articles in Croatian): hdl.handle.net/11356/1410
  - STT news archive (c.2.8M articles in Finnish): urn.fi/urn:nbn:fi:lb-2019041501
- News comment datasets:
  - Ekspress Meedia Comment Archive (c.31M comments in Estonian and Russian): hdl.handle.net/11356/1401
  - Latvian Delfi Comment Archive (c.12M comments in Latvian and Russian): hdl.handle.net/11356/1407
  - Styria 24sata Comment Archive (c.20M comments in Croatian): hdl.handle.net/11356/1399
- Other datasets:
  - Multi-lingual culture-independent word analogy dataset: hdl.handle.net/11356/1261
  - CoSimLex context-dependent similarity dataset: hdl.handle.net/11356/1308
  - Slovenian SimLex dataset: hdl.handle.net/11356/1309
  - Keyword extraction datasets for Croatian, Estonian, Latvian & Russian: hdl.handle.net/11356/1403
Publications
Journal papers
- Stephen McGregor, Kat Agres, Karolina Rataj, Matthew Purver, and Geraint Wiggins (2019). Re-Representing Metaphor: Modelling Metaphor Perception Using Dynamically Contextual Distributional Semantics. Frontiers in Psychology, to appear.
- Blaž Škrlj, Jan Kralj, Nada Lavrač, and Senja Pollak (2019). Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture. Machine Learning and Knowledge Extraction 1(2): 575-589.
- Tadej Škvorc, Simon Krek, Senja Pollak, Špela Arhar Holdt, and Marko Robnik-Šikonja (2019). Predicting Slovene Text Complexity Using Readability Measures. Contributions to Contemporary History 59.1.
- Matej Martinc and Senja Pollak (2019). Combining n-grams and deep convolutional features for language variety classification. Natural Language Engineering: 1-26.
- Andraž Repar, Vid Podpečan, Anže Vavpetič, Nada Lavrač, and Senja Pollak (2019). TermEnsembler: An ensemble learning approach to bilingual term extraction and alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 25(1):93-120.
- Andraž Repar, Matej Martinc, and Senja Pollak (2019). Replication, analysis and adaptation of a term alignment approach. Language resources and evaluation. doi: https://doi.org/10.1007/s10579-019-09477-1.
- Marko Milosavljević, Melita Poler Kovačič, and Rok Čeferin (2020). In the name of the right to be forgotten: new legal and policy issues and practices regarding unpublishing requests in Slovenian online news media. Digital Journalism. doi: https://doi.org/10.1080/21670811.2020.1747942.
- Marko Milosavljević and Igor Vobič (2019). “Our task is to demystify fears”: analysing newsroom management of automation in journalism. Journalism. doi: https://doi.org/10.1177/1464884919861598.
- Damjan Vavpotič, Marko Robnik-Šikonja, and Tomaž Hovelja (2019). Exploring the relations between net benefits of IT projects and CIOs’ perception of quality of software development disciplines. Business & Information Systems Engineering. doi: https://doi.org/10.1007/s12599-019-00612-4.
- Igor Vobič, Marko Robnik-Šikonja, and Monika Kalin Golob (2019). Back to the Future: Automation and the Transformation of Journalism Epistemology (in Slovene) / Nazaj v prihodnost: avtomatizacija in preobrazba novinarske epistemologije. Javnost 26:sup1:S41-S61. https://doi.org/10.1007/s12599-019-00612-4.
- Matteo Cinelli, Mauro Conti, Livio Finos, Francesco Grisolia, Petra Kralj Novak, Antonio Peruzzi, Maurizio Tesconi, Fabiana Zollo, and Walter Quattrociocchi (2019). (Mis)Information Operations: An Integrated Perspective. Journal of Information Warfare 18(3).
- Ester Appelgren and Carl-Gustav Linden (2020). Data Journalism as a Service: Digital Native Data Journalism Expertise and Product Development. Media and Communication. doi: http://dx.doi.org/10.17645/mac.v8i2.2757.
- Saturnino Luz and Shane Sheehan (2020). Methods and visualization tools for the analysis of medical, political and scientific concepts in Genealogies of Knowledge. Palgrave Communications 6(49). doi: https://doi.org/10.1057/s41599-020-0423-6.
- Khalid Alnajjar and Hannu Toivonen (2020). Computational Generation of Slogans. Natural Language Engineering. doi: https://doi.org/10.1017/S1351324920000236.
- Jey Han Lau, Carlos Santos Armendariz, Matthew Purver, and Shalom Lappin (2020). How Furiously Can Colourless Green Ideas Sleep: Sentence Acceptability in Context. Transactions of the Association for Computational Linguistics. doi: https://doi.org/10.1162/tacl_a_00315.
- Elvys Linhares Pontes, Stephane Huet, Juan-Manuel Torres Moreno, Thiago Gouveia da Silva, and Andrea Carneiro Linhares (2020). Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming. Computación y Sistemas 24(2).
- Leo Leppänen, Hanna Tuulonen, and Stefanie Sirén-Heikel (2020). Automated Journalism as a Source of and a Diagnostic Device for Bias in Reporting. Media and Communication 8(3):39-49. doi: http://dx.doi.org/10.17645/mac.v8i3.3022.
- Blaž Škrlj, Matej Martinc, Jan Kralj, Nada Lavrač, and Senja Pollak (2020). tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification. Computer Speech & Language 65. doi: https://doi.org/10.1016/j.csl.2020.101104.
- Ravi Shekhar, Marko Pranjić, Senja Pollak, Andraž Pelicon, and Matthew Purver (2020). Automating News Comment Moderation with Limited Resources: Benchmarking in Croatian and Estonian. Journal for Language Technology and Computational Linguistics 34(1):49-79.
- Lauri Haapanen and Leo Leppänen (2020). Recycling a genre for news automation: The production of Valtteri the Election Bot. AILA Review 33.1:67-85. doi: https://doi.org/10.1075/aila.00030.haa.
- Nada Lavrač, Matej Martinc, Senja Pollak, Maruša Pompe Novak, and Bojan Cestnik (2020). Bisociative Literature‑Based Discovery: Lessons Learned and New Word Embedding Approach. New Generation Computing 38:773-800.
- Sebastian Mežnar, Nada Lavrač, and Blaž Škrlj (2020). SNoRe: Scalable Unsupervised Learning of Symbolic Node Representations. IEEE Access (8): 212568-212588. doi: 10.1109/ACCESS.2020.3039541
- Eleni Gregoromichelaki, Gregory James Mills, Christine Howes, Arash Eshghi, Stergios Chatzikyriakidis, Matthew Purver, Ruth Kempson, Ronnie Cann, and Patrick GT Healey (2020). Completability vs (In)completeness. Acta Linguistica Hafniensia 52(2): 260-284, doi: 10.1080/03740463.2020.1795549.
- Kristian Miok, Blaž Škrlj, Daniela Zaharie, and Marko Robnik-Šikonja (2021). To BAN or Not to BAN: Bayesian Attention Networks for Reliable Hate Speech Detection. Cognitive Computation. doi: https://doi.org/10.1007/s12559-021-09826-9
- Blaž Škrlj, Matej Martinc, Nada Lavrač, and Senja Pollak (2021). autoBOT: evolving neuro-symbolic representations for explainable low resource text classification. Machine Learning. doi: https://doi.org/10.1007/s10994-021-05968-x
- Matej Martinc, Blaž Škrlj, and Senja Pollak (2021). TNT-KID: Transformer-based neural tagger for keyword identification. Natural Language Engineering. doi: https://doi.org/10.1017/S1351324921000127
- Andraž Pelicon, Ravi Shekhar, Blaž Škrlj, Matthew Purver, and Senja Pollak (2021). Investigating Cross-lingual Training for Offensive Language Detection. PeerJ Computer Science. doi: https://doi.org/10.7717/peerj-cs.559
- Matej Ulčar, Anka Supej, Marko Robnik-Šikonja, and Senja Pollak (2021). Slovene and Croatian word embeddings in terms of gender occupational analogies. Slovenščina 2.0: empirical, applied and interdisciplinary research. doi: https://doi.org/10.4312/slo2.0.2021.1.26-59
- Aleš Žagar and Marko Robnik-Šikonja (2021). Cross-lingual transfer of abstractive summarizer to less-resource language. Journal of Intelligent Information Systems. doi: https://doi.org/10.1007/s10844-021-00663-8
- Matthew Purver, Mehrnoosh Sadrzadeh, Ruth Kempson, Gijs Wijnholds, and Julian Hough (2021). Incremental Composition in Distributional Semantics. Journal of Logic, Language & Information 30:379-406. doi: https://doi.org/10.1007/s10849-021-09337-8
- Elvys Linhares Pontes, Luis Adrian Cabrera-Diego, Jose G. Moreno, Emanuela Boros, Ahmed Hamdi, Antoine Doucet, Nicolas Sidere, and Mickael Coustaty (2021). MELHISSA: A Multilingual Entity Linking Architecture for Historical Press Articles. International Journal on Digital Libraries. doi: https://doi.org/10.1007/s00799-021-00319-6
- Marko Robnik-Šikonja, Kristjan Reba, and Igor Mozetič (2021). Cross-lingual Transfer of Sentiment Classifiers. Slovenščina 2.0. doi: https://doi.org/10.4312/slo2.0.2021.1.1-25.
- Tadej Škvorc, Polona Gantar, and Marko Robnik-Šikonja (2022). MICE: Mining Idioms with Contextual Embeddings. Knowledge-Based Systems 235. doi: https://doi.org/10.1016/j.knosys.2021.107606.
- Matej Klemen, Luka Krsnik, and Marko Robnik-Šikonja (2022). Enhancing deep neural networks with morphological information. Natural Language Engineering. doi: https://doi.org/10.1017/S1351324922000080
- Nada Lavrač, Blaž Škrlj, and Marko Robnik-Šikonja (2020). Propositionalization and embeddings: two sides of the same coin. Machine Learning 109.7. doi: https://doi.org/10.1007/s10994-020-05890-8
- Tadej Škvorc, Nada Lavrač, and Marko Robnik-Šikonja (2022). NeSyChair: Automatic Conference Scheduling Combining Neuro-Symbolic Representations and Constrained Clustering. IEEE Access 10. doi: https://doi.org/10.1109/ACCESS.2022.3144932
- Matej Martinc, Senja Pollak, and Marko Robnik-Šikonja (2021). Supervised and unsupervised neural approaches to text readability. Computational Linguistics 47.1. doi: https://doi.org/10.1162/coli_a_00398
- Boshko Koloski, Timen Stepišnik-Perdih, Marko Robnik-Šikonja, Senja Pollak, and Blaž Škrlj (2022). Knowledge Graph informed Fake News Classification via Heterogeneous Representation Ensembles. Neurocomputing. doi: https://doi.org/10.1016/j.neucom.2022.01.096
- Carl-Gustav Lindén (2020). What makes a reporter human? A Research Agenda for Augmented Journalism. Questions de communication. https://doi.org/10.4000/questionsdecommunication.23301
- Carl-Gustav Lindén, Katja Lehtisaari, Mikko Grönlund, and Mikko Villi (2021). Journalistic Passion as Commodity: A Managerial Perspective. Journalism Studies 22(12), pp: 1701-1719. doi: https://doi.org/10.1080/1461670X.2021.1911672.
- Marit Asula, Jane Makke, Linda Freienthal, Hele-Andra Kuulmets and Raul Sirel (2021). Kratt: Developing an Automatic Subject Indexing Tool for the National Library of Estonia. Cataloging & Classification Quarterly 59:8, pp: 775-793. doi: https://doi.org/10.1080/01639374.2021.1998283.
- Matej Ulčar and Marko Robnik-Šikonja (2022). Cross-lingual alignments of ELMo contextual embeddings. Neural Computing and Applications. doi: https://doi.org/10.1007/s00521-022-07164-x.
Conference papers
- Andraž Pelicon, Matej Martinc and Petra Kralj Novak (2019). Embeddia at SemEval-2019 Task 6: Detecting hate with neural network and transfer learning approaches. In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval).
- Matej Martinc and Senja Pollak (2019). Pooled LSTM for Dutch cross-genre gender classification. In Proceedings of the Shared Task on Cross-Genre Gender Detection in Dutch at Computational Linguistics in the Netherlands (CLIN 2019) conference.
- Matej Martinc, Blaž Škrlj and Senja Pollak (2019). Who is hot and who is not? Profiling celebs on Twitter. In the Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum.
- Matej Martinc, Blaž Škrlj and Senja Pollak (2019). Fake or not: Distinguishing between bots, males and females. In the Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum.
- Khalid Alnajjar, Leo Leppänen, and Hannu Toivonen (2019). No Time Like the Present: Methods for Generating Colourful and Factual Multilingual News Headlines. In Proceedings of the 10th International Conference on Computational Creativity (pp. 258-265). Association for Computational Creativity.
- Jose G. Moreno, Elvys Linhares Pontes, Mickael Coustaty, and Antoine Doucet (2019). TLR at BSNLP2019: A multilingual named entity recognition system. Proceedings of the BSNLP-2019 Workshop, ACL 2019. pp: 83-88.
- Shamila Nasreen, Matthew Purver, and Julian Hough (2019). A Corpus Study on Questions, Responses and Misunderstanding Signals in Conversations with Alzheimer’s Patients. In Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue – Full Papers.
- Blaž Škrlj, Andraž Repar, and Senja Pollak (2019). RaKUn: Rank-based Keyword extraction via Unsupervised learning and Meta vertex aggregation. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019). pp: 311-323.
- Blaž Škrlj and Senja Pollak (2019). Language comparison via network topology. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019). pp: 112-123.
- Kristian Miok, Dong Nguyen-Doan, Blaž Škrlj, Daniela Zaharie, and Marko Robnik-Šikonja (2019). Prediction Uncertainty Estimation for Hate Speech Classification. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019). pp: 286-298.
- Shamila Nasreen, Matthew Purver, and Julian Hough (2019). Interaction Patterns in Conversations with Alzheimer’s Patients. In Proceedings of the 7th International Conference on Statistical Language and Speech Processing (SLSP2019).
- Senja Pollak, Andraž Repar, Matej Martinc, and Vid Podpečan (2019). Karst exploration: Extracting terms and definitions from karst domain corpus. In Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019. pp: 934-956.
- Dragana Miljkovic, Jan Kralj, Uroš Stepišnik, and Senja Pollak (2019). Communities of related terms in Karst terminology co-occurrence network. In Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019. pp: 358-373.
- Shane Sheehan and Saturnino Luz (2019). Text Visualization for the Support of Lexicography-Based Scholarly Work. In Proceedings of the 6th biennial conference on electronic lexicography, eLex 2019. pp: 694-725.
- Morteza Rohanian, Julian Hough, and Matthew Purver (2019). Detecting Depression with Word-Level Multimodal Fusion. In Proceedings of Interspeech 2019. pp: 1443-1447.
- Kristian Miok, Dong Nguyen-Doan, Daniela Zaharie, and Marko Robnik-Šikonja (2019). Generating Data using Monte Carlo Dropout. In Proceedings of 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP 2019).
- Jani Marjanen, Lidia Pivovarova, Elaine Zosa, and Jussi Kurunmäki (2019). Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings. In Proceedings of the 5th International Workshop on Computational History.
- Lidia Pivovarova, Elaine Zosa, and Jussi Kurunmäki (2019). Word Clustering for Historical Newspapers Analysis. In Proceedings of the Workshop on Language Technology for Digital Historical Archives.
- Shane Sheehan, Pierre Albert, Masood Masoodian, and Saturnino Luz (2019). TeMoCo: A visualization tool for temporal analysis of multi-party dialogues in clinical settings. In Proceedings of the 32nd IEEE International Symposium on Computer-Based Medical Systems (CBMS).
- Anka Supej, Marko Plahuta, Matthew Purver, Michael Mathioudakis, and Senja Pollak (2019). Gender, language, and society: word embeddings as a reflection of social inequalities in linguistic corpora. In Znanost in družbe prihodnosti, Slovensko sociološko srečanje [Annual meeting of the Slovenian Sociological Association: Science and future societies], Bled, 18.-19. October 2019. Ljubljana: Slovensko sociološko društvo, p. 75-83.
- Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, and Daniela Zaharie (2019). Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders. The 7th IEEE International Conference on E-Health and Bioengineering – EHB 2019.
- Blaž Škrlj, Nada Lavrač, and Jan Kralj (2019). Symbolic Graph Embedding using Frequent Pattern Mining. In Proceedings of the International Conference on Discovery Science (DS2019).
- Matej Ulčar and Marko Robnik-Šikonja (2020). High-Quality ELMo Embeddings for Seven Less-Resourced Languages. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja (2020). Multilingual Culture-Independent Word Analogy Datasets. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Matej Martinc, Petra Kralj-Novak, and Senja Pollak (2020). Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Stephen Mutuvi, Antoine Doucet, Gael Lejeune, and Moses Odeo (2020). A Dataset for Multi-lingual Epidemiological Event Extraction. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Esteban Frossard, Mickael Coustaty, Antoine Doucet, Adam Jatowt, and Simon Hengchen (2020). A dataset for Temporal Analysis of English-French Cognates. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, and Kristiina Vaik (2020). CoSimLex: A Resource for Evaluating Graded Word Similarity in Context. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020).
- Špela Vintar, Larisa Grčić Simeunović, Matej Martinc, Senja Pollak, and Uroš Stepišnik (2020). Mining semantic relations from comparable corpora through intersections of word embeddings. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora at LREC2020. pp: 29-34.
- Elaine Zosa and Mark Granroth-Wilding (2019). Multilingual dynamic topic model. In Proceedings of RANLP 2019.
- Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova (2020). Capturing Evolution in Word Usage: Just Add More Clusters? In Companion Proceedings of the Web Conference 2020.
- Senja Pollak, Vid Podpečan, Dragana Miljkovic, Uroš Stepišnik, and Špela Vintar (2020). The NetViz terminology visualization tool and the use cases in karstology domain modeling. In Proceedings of the 6th International workshop on computational terminology.
- Elaine Zosa, Mark Granroth-Wilding, and Lidia Pivovarova (2020). A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval. Workshop on cross-lingual search collocated with 12th Language Resources and Evaluation Conference (LREC2020). pp: 32-37.
- Elvys Linhares Pontes, Antoine Doucet, and José G. Moreno (2020). Linking Named Entities across Languages using Multilingual Word Embeddings. In Proceedings of the Joint Conference on Digital Libraries (JCDL 2020).
- Anka Supej, Matej Ulčar, Marko Robnik-Šikonja, and Senja Pollak (2020). Dimenzija spola v slovenskih vektorskih vložitvah besed: primerjava modelov prek analogij poklicev. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 93-100.
- Marko Pranjić, Vid Podpečan, Marko Robnik-Šikonja, and Senja Pollak (2020). Evaluation of related news recommendations using document similarity methods. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 81-86.
- Marko Robnik-Šikonja, Kristjan Reba, and Igor Mozetič (2020). Cross-lingual Transfer of Twitter Sentiment Models Using a Common Vector Space. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 87-92.
- Špela Arhar Holdt, Senja Pollak, Marko Robnik-Šikonja, and Simon Krek (2020). Referenčni seznam pogostih splošnih besed za slovenščino. In Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH2020). pp: 10-15.
- Boško Koloski, Senja Pollak, and Blaž Škrlj (2020). Know your Neighbors: Efficient Author Profiling via Follower Tweets. Notebook for PAN at CLEF 2020.
- Boško Koloski, Senja Pollak, and Blaž Škrlj (2020). Multilingual Detection of Fake News Spreaders via Sparse Matrix Factorization. Notebook for PAN at CLEF 2020.
- Emanuela Boros, Elvys Linhares Pontes, Luis Adrian Cabrera-Diego, Ahmed Hamdi, Jose G. Moreno, Nicolas Sidère, and Antoine Doucet (2020). Robust Named Entity Recognition and Linking on Historical Multilingual Documents. Working Notes of CLEF 2020 – Conference and Labs of the Evaluation Forum (CLEF-HIPE 2020).
- Nela Petrželková, Blaž Škrlj, and Nada Lavrač (2020). Knowledge graph aware text classification. In Proceedings of the 23rd International Multiconference – IS2020.
- Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, and Senja Pollak (2020). COVID-19 Therapy Target Discovery with Context-Aware Literature Mining. In Proceedings of the 23rd International Conference on Discovery Science (DS 2020). pp: 109-123.
- Kristiina Vaik, Marit Asula, and Raul Sirel (2020). Hybrid Tagger – An Industry-driven Solution for Extreme Multi-label Text Classification. In Proceedings of the LREC2020 Industry Track. pp: 26-30.
- Carlos Santos Armendariz, Matthew Purver, Senja Pollak, Nikola Ljubešić, Matej Ulčar, Marko Robnik-Šikonja, Ivan Vulić, and Mohammed Taher Pilehvar (2020). SemEval2020 Task 3: Graded Word Similarity in Context. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval 2020). pp: 36-49.
- Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova (2020). Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings not Always Better Than Static for Semantic Change Detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval 2020). pp: 67-73.
- George A. Wright and Matthew Purver (2020). Creative Language Generation in a Society of Engagement and Reflection. In Proceedings of the Eleventh International Conference on Computational Creativity (ICCC2020).
- Stephen Mutuvi, Emanuela Boros, Antoine Doucet, Gaël Lejeune, Adam Jatowt, and Moses Odeo (2020). Multilingual Epidemiological Text Classification: A Comparative Study. In Proceedings of the 28th International Conference on Computational Linguistics (COLING2020).
- Jose G. Moreno, Emanuela Boros, and Antoine Doucet (2020). TLR at the NTCIR-15 FinNum-2 Task: Improving Text Classifiers for Numeral Attachment in Financial Social Data. In Proceedings of the 15th NTCIR Conference on Evaluation of Information Access Technologies. pp: 8-11.
- Emanuela Boroş, Ahmed Hamdi, Elvys Linhares Pontes, Luis-Adrián Cabrera-Diego, José G. Moreno, Nicolas Sidere, and Antoine Doucet (2020). Alleviating Digitization Errors in Named Entity Recognition for Historical Documents. In Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL 2020). pp: 431-441.
- Jose G. Moreno, Elvys Linhares Pontes, and Gaël Dias (2020). CTLR@WiC-TSV: Target Sense Verification using Marked Inputs and Pre-trained Models. In 6th Workshop on Semantic Deep Learning (SemDeep-6) associated to 29th International Joint Conference on Artificial Intelligence and 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020).
- Kristian Miok, Gregor Pirš, and Marko Robnik-Šikonja (2020). Bayesian Methods for Semi-supervised Text Annotation. In Proceedings of the 14th Linguistic Annotation Workshop Co-located with COLING 2020.
- Nhu Khoa Nguyen, Emanuela Boroş, Gaël Lejeune, and Antoine Doucet (2020). Impact Analysis of Document Digitization on Event Extraction. In 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020) co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2020). CEUR Workshop Proceedings 2735, pp: 17-28.
- Jorge Del-Bosque-Trevino, Julian Hough and Matthew Purver (2020). Investigating the Semantic Wave in Tutorial Dialogues: An Annotation Scheme and Corpus Study on Analogy Components. In Proceedings of the 24th SemDial Workshop on the Semantics and Pragmatics of Dialogue.
- Morteza Rohanian, Julian Hough, and Matthew Purver (2020). Multi-modal Fusion with Gating using Audio, Lexical and Disfluency Features for Alzheimer’s Dementia Recognition from Spontaneous Speech. In Proceedings of Interspeech 2020.
- Tom Tabak and Matthew Purver (2020). Temporal Mental Health Dynamics on Social Media. In Proceedings of the 1st Workshop on NLP for COVID-19 at EMNLP 2020.
- Yujian Gan, Matthew Purver, and John Woodward (2020). A Review of Cross-Domain Text-to-SQL Models. In the Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop.
- Matej Ulčar and Marko Robnik-Šikonja (2020). FinEst BERT and CroSloEngual BERT: less is more in multilingual models. In the Proceedings of the 23rd International Conference on Text, Speech and Dialogue. https://doi.org/10.1007/978-3-030-58323-1_11
- Elvys Linhares Pontes, Luis Adrian Cabrera-Diego, Jose G. Moreno, Emanuela Boros, Ahmed Hamdi, Nicolas Sidere, Mickael Coustaty, and Antoine Doucet (2020). Entity Linking for Historical Documents: Challenges and Solutions. In Proceedings of the 22nd International Conference on Asia-Pacific Digital Libraries (ICADL 2020). https://doi.org/10.1007/978-3-030-64452-9_19
- Shane Sheehan, Saturnino Luz, Pierre Albert, and Masood Masoodian (2021). TeMoCo-Doc: A visualization for supporting temporal and contextual analysis of dialogues and associated documents in linguistic tasks. In Proceedings of the International Conference on Advanced Visual Interfaces. https://doi.org/10.1145/3399715.3399956.
- Luis Adrián Cabrera-Diego, Jose G. Moreno, and Antoine Doucet (2021). Simple ways to improve NER in every language using markup. In Proceedings of ECIR 2021.
- Emanuela Boros and Antoine Doucet (2021). Transformer-based Methods for Recognizing Ultra Fine-grained Entities (RUFES). In Proceedings of the Thirteenth Text Analysis Conference (TAC 2020).
- Boshko Koloski, Timen Stepišnik-Perdih, Senja Pollak, and Blaž Škrlj (2021). Identification of COVID-19 Related Fake News via Neural Stacking. Combating Online Hostile Posts in Regional Languages during Emergency Situation (Constraint@AAAI).
- Andraž Pelicon, Ravi Shekhar, Matej Martinc, Blaž Škrlj, Senja Pollak, and Matthew Purver (2021). Zero-shot cross-lingual content filtering: offensive language and hate speech detection. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Aleš Žagar and Marko Robnik-Šikonja (2021). Unsupervised Approach to Cross-Lingual User Comments Summarization. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Miia Rämö and Leo Leppänen (2021). Using contextual and cross-lingual word embeddings to improve variety in template-based NLG for automated journalism. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Matej Martinc, Nina Perger, Andraž Pelicon, Matej Ulčar, Andreja Vezovnik, and Senja Pollak (2021). EMBEDDIA hackathon report: Automatic sentiment and viewpoint analysis of Slovenian news corpus on the topic of LGBTIQ+. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Boshko Koloski, Elaine Zosa, Timen Stepišnik-Perdih, Blaž Škrlj, Tarmo Paju, and Senja Pollak (2021). Interesting cross-border news discovery using cross-lingual article linking and document similarity. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Enja Kokalj, Blaž Škrlj, Nada Lavrač, Senja Pollak, and Marko Robnik-Šikonja (2021). BERT meets Shapley: Extending SHAP Explanations to Transformer-based Classifiers. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Shane Sheehan, Saturnino Luz, and Masood Masoodian (2021). TeMoTopic: Temporal Mosaic Visualisation of Topic Distribution, Keywords, and Context. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Andraž Repar and Andrej Shumakov (2021). Aligning Estonian and Russian news industry keywords with the help of subtitle translations and an environmental thesaurus. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Blaž Škrlj, Shane Sheehan, Nika Eržen, Marko Robnik-Šikonja, Saturnino Luz, and Senja Pollak (2021). Exploring Neural Language Models via Analysis of Local and Global Self-Attention Spaces. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Senja Pollak, Marko Robnik-Šikonja, Matthew Purver, Michele Boggia, Ravi Shekhar, Marko Pranjić, Salla Salmela, Ivar Krustok, Tarmo Paju, Carl-Gustav Linden, Leo Leppänen, Elaine Zosa, Matej Ulčar, Linda Freienthal, Silver Traat, Luis Adrián Cabrera-Diego, Matej Martinc, Nada Lavrač, Blaž Škrlj, Martin Žnidaršič, Andraž Pelicon, Boshko Koloski, Vid Podpečan, Janez Kranjc, Shane Sheehan, Emanuela Boros, Jose G. Moreno, Antoine Doucet, and Hannu Toivonen (2021). EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions. In the Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL2021).
- Anita Valmarska, Luis Adrian Cabrera-Diego, Elvys Linhares Pontes, and Senja Pollak (2021). Exploratory analysis of news sentiment using subgroup discovery. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing in conjunction with EACL2021.
- Luis Adrian Cabrera-Diego, Jose G. Moreno, and Antoine Doucet (2021). Using a Frustratingly Easy Domain and Tagset Adaptation for Creating Slavic Named Entity Recognition Systems. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing in conjunction with EACL2021.
- Jakub Piskorski, Bogdan Babych, Zara Kancheva, Olga Kanishcheva, Maria Lebedeva, Michał Marcinczuk, Preslav Nakov, Petya Osenova, Lidia Pivovarova, Senja Pollak, Pavel Pribán, Ivaylo Radev, Marko Robnik-Šikonja, Vasyl Starko, Josef Steinberger, and Roman Yangarber (2021). Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing in conjunction with EACL2021.
- José G. Moreno, Antoine Doucet, and Brigitte Grau (2021). Relation Classification via Relation Validation. Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6).
- Emanuela Boros, Ahmed Hamdi, Elvys Linhares Pontes, Luis Adrián Cabrera-Diego, Jose G. Moreno, Nicolas Sidere, and Antoine Doucet (2021). Atténuer les erreurs de numérisation dans la reconnaissance d’entités nommées pour les documents historiques. In the Proceedings of CORIA 2021.
- Stephen Mutuvi, Emanuela Boros, Antoine Doucet, Gaël Lejeune, Adam Jatowt, and Moses Odeo (2021). Étude comparative de méthodes de classification multilingue appliquées à l’épidémiologie. In the Proceedings of CORIA 2021.
- Nhu Khoa Nguyen, Emanuela Boros, Gaël Lejeune, Antoine Doucet, and Thierry Delahaut (2021). L3i_LBPAM at the FinSim-2 task: Learning Financial Semantic Similarities with Siamese Transformers. In Proceedings of the 1st Workshop on Financial Technology on the Web (FinWeb) with FinSim-2 and FinSBD-3 Shared Task, in conjunction with WWW ’21: The Web Conference 2021.
- Leo Leppänen and Hannu Toivonen (2021). A Baseline Document Planning Method for Automated Journalism. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).
- Emanuela Boros, Romaric Besançon, Olivier Ferret, and Brigitte Grau (2021). Intérêt des modèles de caractères pour la détection d’événements. In Proceedings of TALN 2021.
- Andraž Repar, Matej Martinc, Matej Ulčar, and Senja Pollak (2021). Word-embedding based bilingual terminology alignment. In the Proceedings of eLex 2021.
- Marko Pranjić, Marko Robnik-Šikonja, and Senja Pollak (2021). An evaluation of BERT and Doc2Vec model on the IPTC Subject Codes prediction dataset. In the Proceedings of the 24th International Multiconference – IS2021 (SiKDD).
- Matej Ulčar and Marko Robnik-Šikonja (2021). SloBERTa: Slovene monolingual large pretrained masked language model. In the Proceedings of the 24th International Multiconference – IS2021 (SiKDD).
- Mojca Brglez, Senja Pollak, and Špela Vintar (2021). Simple discovery of COVID IS WAR Metaphors Using Word Embeddings. In the Proceedings of the 24th International Multiconference – IS2021 (SiKDD).
- Quan Duong, Lidia Pivovarova, and Elaine Zosa (2021). Benchmarks for Unsupervised Discourse Change Detection. In the Proceedings of the Histoinformatics workshop 2021.
- Stephen Mutuvi, Emanuela Boros, Antoine Doucet, Gaël Lejeune, Adam Jatowt, and Moses Odeo (2021). Token-level Multilingual Epidemic Dataset for Event Extraction. In the Proceedings of the 25th International Conference on Theory and Practice of Digital Libraries, TPDL2021.
- Andraž Pelicon, Blaž Škrlj, and Petra Kralj Novak (2021). Automated Hate Speech Target Identification. In the Proceedings of the 24th International Multiconference – IS2021 (Slovenian Conference on Artificial Intelligence).
- Aleš Žagar, Matic Kavaš, and Marko Robnik-Šikonja (2021). Corpus KAS 2.0: Cleaner and with New Datasets. In the Proceedings of the 24th International Multiconference – IS2021 (Slovenian Conference on Artificial Intelligence).
- Khalid Alnajjar and Mika Hämäläinen (2021). When a Computer Cracks a Joke: Automated Generation of Humorous Headlines. In the Proceedings of the 12th International Conference on Computational Creativity (ICCC21).
- Syrielle Montariol, Matej Martinc, and Lidia Pivovarova (2021). Scalable and Interpretable Semantic Change Detection. In the Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Emanuela Boros, Jose G. Moreno, and Antoine Doucet (2021). Event Detection with Entity Markers. In the Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021).
- Andrey Kutuzov and Lidia Pivovarova (2021). Three-part diachronic semantic change dataset for Russian. In the Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021.
- Emiel van Miltenburg, Miruna Clinciu, Ondřej Dušek, Dimitra Gkatzia, Stephanie Inglis, Leo Leppänen, Saad Mahamood, Emma Manning, Stephanie Schoch, Craig Thomson, and Luou Wen (2021). Underreporting of errors in NLG output, and what to do about it. In the Proceedings of the 14th International Conference on Natural Language Generation.
- Elaine Zosa, Ravi Shekhar, Mladen Karan, and Matthew Purver (2021). Not all comments are equal: Insights into comment moderation from a topic-aware model. In the Proceedings of Recent Advances in Natural Language Processing (RANLP2021).
- Senja Pollak, Matej Martinc, Andraž Pelicon, Matej Ulčar, and Andreja Vezovnik (2021). COVID-19 v slovenskih spletnih medijih: analiza s pomočjo računalniške obdelave jezika. In Pandemična družba: slovensko sociološko srečanje.
- Jani Marjanen, Elaine Zosa, Simon Hengchen, Lidia Pivovarova, and Mikko Tolonen (2021). Topic modelling discourse dynamics in historical newspapers. In the Post-Proceedings of the DHN2020 Conference: the 5th conference on Digital Humanities in the Nordic Countries.
- George A. Wright and Matthew Purver (2021). Parsing Text in a Workspace for Language Generation. In the Proceedings of the 2021 Society for Text & Discourse Annual Conference.
- George A. Wright and Matthew Purver (2021). Evaluating Natural Language Descriptions Generated in a Workspace-Based Architecture. In the Proceedings of the 12th International Conference on Computational Creativity, ICCC2021.
- Blaž Škrlj, Marko Jukič, Nika Eržen, Senja Pollak, and Nada Lavrač (2021). Prioritization of COVID-19-related literature via unsupervised keyphrase extraction and document representation learning. In the Proceedings of the 24th International Conference on Discovery Science (DS2021).
- Ilija Tavchioski, Boshko Koloski, Blaž Škrlj, and Senja Pollak (2021). Multi-label classification of COVID-19-related articles with an autoML approach. In Proceedings of the BioCreative VII Challenge Evaluation Workshop.
- Tran Thi Hong Hanh, Antoine Doucet, Nicolas Sidere, Jose G. Moreno, and Senja Pollak (2021). Named Entity Recognition Architecture Combining Contextual and Global Features. In the Proceedings of the 23rd International Conference on Asia-Pacific Digital Libraries (ICADL 2021). doi: https://doi.org/10.1007/978-3-030-91669-5_21.
- Matej Ulčar and Marko Robnik-Šikonja (2021). Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages. In the Proceedings of the 10th International Conference on Analysis of Images, Social Networks and Texts (AIST 2021).
- Mario Giulianelli, Andrey Kutuzov, and Lidia Pivovarova (2021). Grammatical Profiling for Semantic Change Detection. In the Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021).
- Elaine Zosa, Stephen Mutuvi, Mark Granroth-Wilding, and Antoine Doucet (2021). Evaluating the Robustness of Embedding-based Topic Models to OCR Noise. In the Proceedings of the 23rd International Conference on Asia-Pacific Digital Libraries (ICADL 2021). doi: https://doi.org/10.1007/978-3-030-91669-5_30.
- Anka Supej, Matej Ulčar, Marko Robnik-Šikonja, and Senja Pollak (2021). Primerjava slovenskih besednih vektorskih vložitev z vidika spola na analogijah poklicev. In the Proceedings of the Conference on Language Technologies and Digital Humanities (JTDH 2021).
- Stephen Mutuvi, Emanuela Boros, Antoine Doucet, Gaël Lejeune, Adam Jatowt, and Moses Odeo (2021). Multilingual Epidemic Event Extraction. In the Proceedings of ICADL 2021. https://doi.org/10.1007/978-3-030-91669-5_12
- Timen Stepišnik Perdih, Nada Lavrač, and Blaž Škrlj (2021). Semantic Reasoning from Model-Agnostic Explanations. In the Proceedings of the 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI). https://doi.org/10.1109/SAMI50585.2021.9378668
- Lidia Pivovarova and Elaine Zosa (2021). Visual Topic Modelling for NewsImage Task at MediaEval 2021. MediaEval 2021 Multimedia Benchmark Workshop: Working Notes Proceedings of the MediaEval 2021 Workshop.
- Jan Štihec, Senja Pollak, and Martin Žnidaršič (2021). Preliminary experimentation with combinations and extensions of forward-looking sentence detection wordlists. In Proceedings of the 3rd Financial Narrative Processing Workshop.
- Emanuela Boros, Romaric Besançon, Olivier Ferret, and Brigitte Grau (2021). The Importance of Character-Level Information in an Event Detection Model. In the Proceedings of NLDB 2021.
- Kristian Miok, Blaž Škrlj, Daniela Zaharie, and Marko Robnik-Šikonja (2021). Bayesian BERT for Trustful Hate Speech Detection. ICML 2020 Workshop on Uncertainty & Robustness in Deep Learning.
- Larisa Grčić Simeunović, Matej Martinc, and Špela Vintar (2020). A bilingual approach to specialised adjectives through word embeddings in the karstology domain. In the Proceedings of TOTH 2020.
Book/Monograph
- Carl-Gustav Linden (2020). Silicon Valley och makten över medierna [Silicon Valley and the power over media]. Nordicom.
- Hannu Toivonen and Michele Boggia (2021). Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation. Association for Computational Linguistics. ISBN 978-1-954085-13-8
Thesis
- David Hargrave (2021). Mitigating Gender Bias in Word Embeddings using Explicit Gender Free Corpus. Master's thesis, School of Electronic Engineering and Computer Science, Queen Mary University of London.
Deliverables
Below is a list of submitted public deliverables of EMBEDDIA.
WP7 public deliverables | Due date |
D7.1: Project website and social media accounts (T7.1) | 31/03/2019 |
D7.4: Selected EMBEDDIA components in ClowdFlows (T7.4) | 31/12/2020 |
D7.6: Reusable EMBEDDIA components available through the ClowdFlows web interface (T7.4) | 28/02/2022 |