ECIR2023

November 8, 2025

Conference Paper List

This conference has 169 papers in total.

No.	Title	Abstract	Authors	Organization
1	Investigating Conversational Search Behavior for Domain Exploration	Conversational search has evolved as a new information retrieval paradigm, marking a shift from traditional search systems towards interactive dialogues with intelligent search agents. This change especially affects exploratory information-seeking contexts, where conversational search systems can guide the discovery of unfamiliar domains. In these scenarios, users often find it difficult to express their information goals due to insufficient background knowledge. Conversational interfaces can provide assistance by eliciting information needs and narrowing down the search space. However, due to the complexity of information-seeking behavior, the design of conversational interfaces for retrieving information remains a great challenge. Although prior work has employed user studies to empirically ground the system design, most existing studies are limited to well-defined search tasks or known domains, thus being less exploratory in nature. Therefore, we conducted a laboratory study to investigate open-ended search behavior for navigation through unknown information landscapes. The study comprised 26 participants who were restricted in their search to a text chat interface. Based on the collected dialogue transcripts, we applied statistical analyses and process mining techniques to uncover general information-seeking patterns across five different domains. We not only identify core dialogue acts and their interrelations that enable users to discover domain knowledge, but also derive design suggestions for conversational search systems.	Anum Afzal, Daniel Braun, Florian Matthes, Juraj Vladika, Phillip Schneider	Technical University of Munich; University of Twente
2	COILcr: Efficient Semantic Matching in Contextualized Exact Match Retrieval	Lexical exact match systems that use inverted lists are a fundamental text retrieval architecture. A recent advance in neural IR, COIL, extends this approach with contextualized inverted lists from a deep language model backbone and performs retrieval by comparing contextualized query-document term representation, which is effective but computationally expensive. This paper explores the effectiveness-efficiency tradeoff in COIL-style systems, aiming to reduce the computational complexity of retrieval while preserving term semantics. It proposes COILcr, which explicitly factorizes COIL into intra-context term importance weights and cross-context semantic representations. At indexing time, COILcr further maps term semantic representations to a smaller set of canonical representations. Experiments demonstrate that canonical representations can efficiently preserve term semantics, reducing the storage and computational cost of COIL-based retrieval while maintaining model performance. The paper also discusses and compares multiple heuristics for canonical representation selection and looks into its performance in different retrieval settings.	Jamie Callan, Luyu Gao, Rohan Jha, Zhen Fan	Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
3	Item Graph Convolution Collaborative Filtering for Inductive Recommendations	Graph Convolutional Networks (GCN) have been recently employed as a core component in the construction of recommender system algorithms, interpreting user-item interactions as the edges of a bipartite graph. However, in the absence of side information, the majority of existing models adopt an approach of randomly initialising the user embeddings and optimising them throughout the training process. This strategy makes these algorithms inherently transductive, curtailing their ability to generate predictions for users that were unseen at training time. To address this issue, we propose a convolution-based algorithm, which is inductive from the user perspective, while at the same time, depending only on implicit user-item interaction data. We propose the construction of an item-item graph through a weighted projection of the bipartite interaction network and to employ convolution to inject higher order associations into item embeddings, while constructing user representations as weighted sums of the items with which they have interacted. Despite not training individual embeddings for each user, our approach achieves state-of-the-art recommendation performance with respect to transductive baselines on four real-world datasets, showing at the same time robust inductive performance.	Aonghus Lawlor, Barry Smyth, Edoardo D'Amico, Elias Z. Tragos, Khalil Muhammad, Neil Hurley	Insight Centre for Data Analytics
4	Dynamic Exploratory Search for the Information Retrieval Anthology	This paper presents dynamic exploratory search technology for the analysis of scientific corpora. The unique dynamic features of the system allow users to analyze quantitative corpus statistics beyond document counts, and to switch between corpus exploration and corpus filtering. To demonstrate the innovation of our approach, we apply our technology to the IR Anthology, a comprehensive corpus of information retrieval publications. We showcase, among others, how to query for potential PC members and the “Salton number” of an author.	Benno Stein, Jason Brockmeyer, Martin Potthast, Tim Gollub	Bauhaus-Universität Weimar; Leipzig University
5	A Study of Term-Topic Embeddings for Ranking	Contextualized representations from transformer models have significantly improved the performance of neural ranking models. Late interactions popularized by ColBERT and recently compressed with clustering in ColBERTv2 deliver state-of-the-art quality on many benchmarks. ColBERTv2 uses centroids along with occurrence-specific delta vectors to approximate contextualized embeddings without reducing ranking effectiveness. Analysis of this work suggests that these centroids are “term-topic embeddings”. We examine whether term-topic embeddings can be created in a differentiable end-to-end way, finding that this is a viable strategy for removing the separate clustering step. We investigate the importance of local context for contextualizing these term-topic embeddings, analogous to refining centroids with delta vectors. We find this end-to-end approach is sufficient for matching the effectiveness of the original contextualized embeddings.	Andrew Yates, Lila Boualili	Max Planck Institute for Informatics
6	De-biasing Relevance Judgements for Fair Ranking	The objective of this paper is to show that it is possible to significantly reduce stereotypical gender biases in neural rankers without modifying the ranking loss function, which is the current approach in the literature. We systematically de-bias gold standard relevance judgement datasets with a set of balanced and well-matched query pairs. Such a de-biasing process will expose neural rankers to comparable queries from across gender identities that have associated relevant documents with compatible degrees of gender bias. Therefore, neural rankers will learn not to associate varying degrees of bias to queries from certain gender identities. Our experiments show that our approach is able to (1) systematically reduce gender biases associated with different gender identities, and (2) at the same time maintain the same level of retrieval effectiveness.	Amin Bigdeli, Bhaskar Mitra, Ebrahim Bagheri, Morteza Zihayat, Negar Arabzadeh, Shirin Seyedsalehi	Microsoft Res, Montreal, PQ, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada; Univ Waterloo, Waterloo, ON, Canada
7	ColBERT-FairPRF: Towards Fair Pseudo-Relevance Feedback in Dense Retrieval	Pseudo-relevance feedback mechanisms have been shown to be useful in improving the effectiveness of search systems for retrieving the most relevant items in response to a user's query. However, there has been little work investigating the relationship between pseudo-relevance feedback and fairness in ranking. Indeed, using the feedback from an initial retrieval to revise a query can in principle also allow optimising objectives beyond relevance, such as the fairness of the search results. In this work, we show how a feedback mechanism based on the successful ColBERT-PRF model can be used for retrieving fairer search results. Therefore, we propose a novel fair feedback mechanism for multiple representation dense retrieval (ColBERT-FairPRF), which enhances the distribution of exposure over groups of documents in the search results by fairly extracting the feedback embeddings that are added to the user's query representation. To fairly extract representative embeddings, we apply a clustering approach since traditional methods based on counting are not applicable in the dense retrieval space. Our results on the 2021 TREC Fair Ranking Track test collection demonstrate the effectiveness of our method compared to ColBERT-PRF, with statistically significant improvements of up to approximately 19% in AttentionWeighted Ranked Fairness. To the best of our knowledge, ColBERT-FairPRF is the first query expansion method for fairness in multiple representation dense retrieval.	Graham McDonald, Iadh Ounis, Thomas Jänich	Univ Glasgow, Glasgow, Scotland
8	Keyword Embeddings for Query Suggestion	Nowadays, search engine users commonly rely on query suggestions to improve their initial inputs. Current systems are very good at recommending lexical adaptations or spelling corrections to users’ queries. However, they often struggle to suggest semantically related keywords given a user’s query. The construction of a detailed query is crucial in some tasks, such as legal retrieval or academic search. In these scenarios, keyword suggestion methods are critical to guide the user during the query formulation. This paper proposes two novel models for the keyword suggestion task trained on scientific literature. Our techniques adapt the architecture of Word2Vec and FastText to generate keyword embeddings by leveraging documents’ keyword co-occurrence. Along with these models, we also present a specially tailored negative sampling approach that exploits how keywords appear in academic publications. We devise a ranking-based evaluation methodology following both known-item and ad-hoc search scenarios. Finally, we evaluate our proposals against the state-of-the-art word and sentence embedding models showing considerable improvements over the baselines for the tasks.	Javier Parapar, Jorge Gabín, M. Eduardo Ares	Linknovate Science; University of A Coruña
9	Contrasting Neural Click Models and Pointwise IPS Rankers	Inverse-propensity scoring and neural click models are two popular methods for learning rankers from user clicks that are affected by position bias. Despite their prevalence, the two methodologies are rarely directly compared on equal footing. In this work, we focus on the pointwise learning setting to compare the theoretical differences of both approaches and present a thorough empirical comparison on the prevalent semi-synthetic evaluation setup in unbiased learning-to-rank. We show theoretically that neural click models, similarly to IPS rankers, optimize for the true document relevance when the position bias is known. However, our work also finds small but significant empirical differences between both approaches indicating that neural click models might be affected by position bias when learning from shared, sometimes conflicting, features instead of treating each document separately.	Maarten de Rijke, Onno Zoeter, Philipp Hager	Booking.com; University of Amsterdam
10	CoSPLADE: Contextualizing SPLADE for Conversational Information Retrieval	Conversational search is a difficult task as it aims at retrieving documents based not only on the current user query but also on the full conversation history. Most of the previous methods have focused on a multi-stage ranking approach relying on query reformulation, a critical intermediate step that might lead to a sub-optimal retrieval. Other approaches have tried to use a fully neural IR first-stage, but are either zero-shot or rely on full learning-to-rank based on a dataset with pseudo-labels. In this work, leveraging the CANARD dataset, we propose an innovative lightweight learning technique to train a first-stage ranker based on SPLADE. By relying on SPLADE sparse representations, we show that, when combined with a second-stage ranker based on T5Mono, the results are competitive on the TREC CAsT 2020 and 2021 tracks. The source code is available at https://github.com/nam685/cosplade.git .	Benjamin Piwowarski, JianYun Nie, Laure Soulier, Nam Le Hai, Thibault Formal, Thomas Gerald	Sorbonne Université, CNRS, ISIR; University of Montreal; Université Paris-Saclay, CNRS, SATT Paris Saclay, LISN
11	Investigating Conversational Agent Action in Legal Case Retrieval	Legal case retrieval is a specialized IR task aiming to retrieve supporting cases given a query case. Existing work has shown that the conversational search paradigm can improve users' search experience in legal case retrieval with humans as intermediary agents. To move further towards a practical system, it is essential to decide what action a computer agent should take in conversational legal case retrieval. Existing works try to finish this task through Transformer-based models based on semantic information in open-domain scenarios. However, these methods ignore search behavioral information, which is one of the most important signals for understanding the information-seeking process and improving legal case retrieval systems. Therefore, we investigate the conversational agent action in legal case retrieval from the behavioral perspective. Specifically, we conducted a lab-based user study to collect user and agent search behavior while using agent-mediated conversational legal case retrieval systems. Based on the collected data, we analyze the relationship between historical search interaction behaviors and current agent actions in conversational legal case retrieval. We find that, with the increase of agent-user interaction behavioral indicators, agents are increasingly inclined to return results rather than clarify users' intent, but the probability of collecting candidates does not change significantly. With the increase of the interactions between the agent and the system, agents are more inclined to collect candidates than clarify users' intent and are more inclined to return results than collect candidates. We also show that the agent action prediction performance can be improved with both semantic and behavioral features. We believe that this work can contribute to a better understanding of agent action and useful guidance for developing practical systems for conversational legal case retrieval.	Bulou Liu, Chenliang Li, Fan Zhang, Min Zhang, Shaoping Ma, Weixing Shen, Yiqun Liu, Yiran Hu, Yueyue Wu	Institute for Internet Judiciary, Tsinghua University; Tsinghua University; Wuhan University
12	Entity Embeddings for Entity Ranking: A Replicability Study	Knowledge Graph embeddings model semantic and structural knowledge of entities in the context of the Knowledge Graph. A nascent research direction has been to study the utilization of such graph embeddings for the IR-centric task of entity ranking. In this work, we replicate the GEEER study of Gerritse et al. [9] which demonstrated improvements of Wiki2Vec embeddings on entity ranking tasks on the DBpediaV2 dataset. We further extend the study by exploring additional state-of-the-art entity embeddings ERNIE [27] and E-BERT [19], and by including another test collection, TREC CAR, with queries not about person, location, and organization entities. We confirm the finding that entity embeddings are beneficial for the entity ranking task. Interestingly, we find that Wiki2Vec is competitive with ERNIE and E-BERT. Our code and data to aid reproducibility and further research are available at https://github.com/poojahoza/E3R-Replicability .	Laura Dietz, Pooja Oza	University of New Hampshire
13	Conversational Search for Multimedia Archives	The growth of media archives (including text, speech, video and audio) has led to significant interest in developing search methods for multimedia content. An ongoing challenge of multimedia search is user interaction during the search process, including specification of search queries, presentation of retrieved content and user feedback. In parallel with this, recent years have seen increasing interest in conversational search methods enabling users to engage in a dialogue with an AI agent that supports their search activities. Conversational search seeks to enable users to find useful content more easily, quickly and reliably. To date, research in conversational search has focused on text archives. This project explores the integration of conversational search methods within multimedia search.	Anastasia Potyagalova	Dublin City Univ, Sch Comp, ADAPT Ctr, Dublin 9, Ireland
14	Designing Useful Conversational Interfaces for Information Retrieval in Career Decision-Making Support	The proposal is an interdisciplinary problem-focused study to explore the usefulness of conversational information retrieval (CIR) in a complex domain. A research-through-design methodology will be used to identify the informational, practical, affective, and ethical requirements for a CIR system in the specific context of Career Education, Information, Advice & Guidance (CEIAG) services for young people in Scotland. Later phases of the research will use these criteria to identify appropriate techniques in the literature, and design and evaluate artefacts intended to meet these. This research will use an interdisciplinary approach to further understanding on the use and limitations of dialogue systems as intermediaries for information retrieval where there are a wide range of possible information tasks and specific users’ needs may be ambiguous.	Marianne Wilson	Edinburgh Napier University
15	ImageCLEF 2023 Highlight: Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications	In this paper, we provide an overview of the upcoming ImageCLEF campaign. ImageCLEF is part of the CLEF Conference and Labs of the Evaluation Forum since 2003. ImageCLEF, the Multimedia Retrieval task in CLEF, is an ongoing evaluation initiative that promotes the evaluation of technologies for annotation, indexing, and retrieval of multimodal data with the aim of providing information access to large collections of data in various usage scenarios and domains. In its 21st edition, ImageCLEF 2023 will have four main tasks: (i) a Medical task addressing automatic image captioning, synthetic medical images created with GANs, Visual Question Answering for colonoscopy images, and medical dialogue summarization; (ii) an Aware task addressing the prediction of real-life consequences of online photo sharing; (iii) a Fusion task addressing late fusion techniques based on the expertise of a pool of classifiers; and (iv) a Recommending task addressing cultural heritage content-recommendation. In 2022, ImageCLEF received the participation of over 25 groups submitting more than 258 runs. These numbers show the impact of the campaign. With the COVID-19 pandemic now over, we expect that the interest in participating, especially at the physical CLEF sessions, will increase significantly in 2023.	Adrian Popescu, Ahmad IdrissiYaghir, Alba García Seco de Herrera, Alexandra Andrei, Alexandru Stan, AnaMaria Claudia Dragulinescu, Andrea M. Storås, Asma Ben Abacha, Bogdan Ionescu, Christoph M. Friedrich, George Ioannidis, Griffin Adams, Henning Müller, Henning Schäfer, Hugo Manguinhas, Ihar Filipovich, Ioan Coman, Johanna Schöler, Johannes Rückert, Jérôme Deshayes, LiviuDaniel Stefan, Louise Bloch, Meliha Yetisgen, Michael A. Riegler, Mihai Dogariu, Mihai Gabriel Constantin, Neal Snider, Nikolaos Papachrysos, Pål Halvorsen, Raphael Brüngel, Serge Kozlovski, Steven Hicks, Thomas de Lange, Vajira Thambawita, Vassili Kovalev, Wenwai Yim	Belarus State University; Belarusian Academy of Sciences; CEA LIST; Columbia University; Europeana Foundation; IN2 Digital Innovations; Microsoft; Politehnica University of Bucharest; Sahlgrenska University Hospital; SimulaMet; University Hospital Essen; University of Applied Sciences Western Switzerland (HES-SO); University of Applied Sciences and Arts Dortmund; University of Essex; University of Washington
16	Parameter-Efficient Sparse Retrievers and Rerankers Using Adapters	Parameter-efficient transfer learning with Adapters has been studied in Natural Language Processing (NLP) as an alternative to full fine-tuning. Adapters are memory-efficient and scale well with downstream tasks by training small bottleneck layers added between transformer layers while keeping the large pretrained language models (PLMs) frozen. In spite of showing promising results in NLP, these methods are under-explored in Information Retrieval. While previous studies have only experimented with dense retrievers or in a cross-lingual retrieval scenario, in this paper we aim to complete the picture on the use of adapters in IR. First, we study adapters for SPLADE, a sparse retriever, for which adapters not only retain the efficiency and effectiveness otherwise achieved by finetuning, but are memory-efficient and orders of magnitude lighter to train. We observe that Adapters-SPLADE not only optimizes just 2% of training parameters, but outperforms its fully fine-tuned counterpart and existing parameter-efficient dense IR models on IR benchmark datasets. Secondly, we address domain adaptation of neural retrieval thanks to adapters on cross-domain BEIR datasets and TripClick. Finally, we also consider knowledge sharing between rerankers and first stage rankers. Overall, our study completes the examination of adapters for neural IR. (The code can be found at: https://github.com/naver/splade/tree/adapter-splade .)	Carlos Lassance, Hervé Déjean, Stéphane Clinchant, Vaishali Pal	Naver Labs Europe; University of Amsterdam
17	Topic-Enhanced Personalized Retrieval-Based Chatbot	Building a personalized chatbot has drawn much attention recently. A personalized chatbot is considered to have a consistent personality. There are two types of methods to learn the personality. The first mainly models the personality from explicit user profiles (e.g., manually created persona descriptions). The second learns implicit user profiles from the user’s dialogue history, which contains rich, personalized information. However, a user’s dialogue history can be long and noisy as it contains long-time, multi-topic historical dialogue records. Such data noise and redundancy impede the model’s ability to thoroughly and faithfully learn a consistent personality, especially when applied with models that have an input length limit (e.g., BERT). In this paper, we propose deconstructing the long and noisy dialogue history into topic-dependent segments. We only use the topically related dialogue segment as context to learn the topic-aware user personality. Specifically, we design a Topic-enhanced personalized Retrieval-based Chatbot, TopReC. It first deconstructs the dialogue history into topic-dependent dialogue segments and filters out segments irrelevant to the current query via a Heter-Merge-Reduce framework. It then measures the matching degree between the response candidates and the current query conditioned on each topic-dependent segment. We consider the matching degree between the response candidate and the cross-topic user personality. The final matching score is obtained by combining the topic-dependent and cross-topic matching scores. Experimental results on two large datasets show that TopReC outperforms all previous state-of-the-art methods.	Hongjin Qian, Zhicheng Dou	Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
18	Investigating the Impact of Query Representation on Medical Information Retrieval	This study investigates the effect that various patient-related information extracted from unstructured clinical notes has on two different tasks, i.e., patient allocation in clinical trials and medical literature retrieval. Specifically, we combine standard and transformer-based methods to extract entities (e.g., drugs, medical problems), disambiguate their meaning (e.g., family history, negations), or expand them with related medical concepts to synthesize diverse query representations. The empirical evaluation showed that certain query representations positively affect retrieval effectiveness for patient allocation in clinical trials, but no statistically significant improvements have been identified in medical literature retrieval. Across the queries, it has been found that removing negated entities using a domain-specific pre-trained transformer model has been more effective than a standard rule-based approach. In addition, our experiments have shown that removing information related to family history can further improve patient allocation in clinical trials.	Arjen P. de Vries, Daria Alexander, Gabriella Pasi, Georgios Peikos	Radboud Univ Nijmegen, Nijmegen, Netherlands; Univ Milano Bicocca, Milan, Italy
19	Learning Query-Space Document Representations for High-Recall Retrieval	Recent studies have shown that significant performance improvements reported by neural rankers do not necessarily extend to a diverse range of queries. There is a large set of queries that cannot be effectively addressed by neural rankers primarily because relevant documents to these queries are not identified by first-stage retrievers. In this paper, we propose a novel document representation approach that represents documents within the query space, and hence increases the likelihood of recalling a higher number of relevant documents. Based on experiments on the MS MARCO dataset as well as the hardest subset of its queries, we find that the proposed approach shows synergistic behavior to existing neural rankers and is able to increase recall both on MS MARCO dev set queries as well as the hardest queries of MS MARCO.	Ebrahim Bagheri, Fattane Zarrinkalam, Morteza Zihayat, Negar Arabzadeh, Sara Salamat	Toronto Metropolitan Univ, Toronto, ON, Canada; Univ Guelph, Guelph, ON, Canada; Univ Waterloo, Waterloo, ON, Canada
20	CPR: Cross-Domain Preference Ranking with User Transformation	Data sparsity is a well-known challenge in recommender systems. One way to alleviate this problem is to leverage knowledge from relevant domains. In this paper, we focus on an important real-world scenario in which some users overlap two different domains but items of the two domains are distinct. Although several studies leverage side information (e.g., user reviews) for cross-domain recommendation, side information is not always available or easy to obtain in practice. To this end, we propose cross-domain preference ranking (CPR) with a simple yet effective user transformation that leverages only user interactions with items in the source and target domains to transform the user representation. Given the proposed user transformation, CPR not only successfully enhances recommendation performance for users having interactions with target-domain items but also yields superior performance for cold-start users in comparison with state-of-the-art cross-domain recommendation approaches. Extensive experiments conducted on three pairs of cross-domain recommendation datasets demonstrate the effectiveness of the proposed method in comparison with existing cross-domain recommendation approaches. Our codes are available at https://github.com/cnclabs/codes.crossdomain.rec .	ChiaYu Yeh, ChuanJu Wang, HsienHao Chen, JingKai Lou, MingFeng Tsai, TungLin Wu, YuTing Huang	Academia Sinica; KKStream Limited; National Chengchi University; National Taiwan University and Academia Sinica
21	User Requirement Analysis for a Real-Time NLP-Based Open Information Retrieval Meeting Assistant	Meetings are recurrent organizational tasks intended to drive progress in an interdisciplinary and collaborative manner. They are, however, prone to inefficiency due to factors such as differing knowledge among participants. The research goal of this paper is to design a recommendation-based meeting assistant that can improve the efficiency of meetings by helping to contextualize the information being discussed and reduce distractions for listeners. Following a Wizard-of-Oz setup, we gathered user feedback by thematically analyzing focus group discussions and identifying this kind of system’s key challenges and requirements. The findings point to shortcomings in contextualization and raise concerns about distracting listeners from the main content. Based on the findings, we have developed a set of design recommendations that address context, interactivity and personalization issues. These recommendations could be useful for developing a meeting assistant that is tailored to the needs of meeting participants, thereby helping to optimize the meeting experience.	Amro Najjar, Benoît Alcaraz, Kerstin BongardBlanchy, Nina HosseiniKivanani	Luxembourg Inst Sci & Technol LIST, Esch Sur Alzette, Luxembourg; Univ Luxembourg, Esch Sur Alzette, Luxembourg
22	Self-supervised Contrastive BERT Fine-tuning for Fusion-Based Reviewed-Item Retrieval	As natural language interfaces enable users to express increasingly complex natural language queries, there is a parallel explosion of user review content that can allow users to better find items such as restaurants, books, or movies that match these expressive queries. While Neural Information Retrieval (IR) methods have provided state-of-the-art results for matching queries to documents, they have not been extended to the task of Reviewed-Item Retrieval (RIR), where query-review scores must be aggregated (or fused) into item-level scores for ranking. In the absence of labeled RIR datasets, we extend Neural IR methodology to RIR by leveraging self-supervised methods for contrastive learning of BERT embeddings for both queries and reviews. Specifically, contrastive learning requires a choice of positive and negative samples, where the unique two-level structure of our item-review data combined with meta-data affords us a rich structure for the selection of these samples. For contrastive learning in a Late Fusion scenario (where we aggregate query-review scores into item-level scores), we investigate the use of positive review samples from the same item and/or with the same rating, selection of hard positive samples by choosing the least similar reviews from the same anchor item, and selection of hard negative samples by choosing the most similar reviews from different items. We also explore anchor sub-sampling and augmenting with meta-data. For a more end-to-end Early Fusion approach, we introduce contrastive item embedding learning to fuse reviews into single item embeddings. Experimental results show that Late Fusion contrastive learning for Neural RIR outperforms all other contrastive IR configurations, Neural IR, and sparse retrieval baselines, thus demonstrating the power of exploiting the two-level structure in Neural RIR approaches as well as the importance of preserving the nuance of individual review content via Late Fusion methods.	Ali Pesaranghader, Anton Korikov, Armin Toroghi, Borislav Mavrin, Manasa Bharadwaj, Mohammad Mahdi Abdollah Pour, Parsa Farinneya, Scott Sanner, Touqir Sajed	LG Electronics, Toronto AI Lab; University of Toronto
23	Exploiting Graph Structured Cross-Domain Representation for Multi-domain Recommendation	Multi-domain recommender systems benefit from cross-domain representation learning and positive knowledge transfer. Both can be achieved by introducing a specific modeling of input data (i.e. disjoint history) or trying dedicated training regimes. At the same time, treating domains as separate input sources becomes a limitation as it does not capture the interplay that naturally exists between domains. In this work, we efficiently learn multi-domain representation of sequential users' interactions using graph neural networks. We use temporal intra- and inter-domain interactions as contextual information for our method called MAGRec (short for Multi-domAin Graph-based Recommender). To better capture all relations in a multi-domain setting, we learn two graph-based sequential representations simultaneously: domain-guided for recent user interest, and general for long-term interest. This approach helps to mitigate the negative knowledge transfer problem from multiple domains and improve overall representation. We perform experiments on publicly available datasets in different scenarios where MAGRec consistently outperforms state-of-the-art methods. Furthermore, we provide an ablation study and discuss further extensions of our method.	Alejandro ArizaCasabona, Bartlomiej Twardowski, Tri Kurniawan Wijaya	Huawei Ireland Res Ctr, Dublin, Ireland; UAB, Comp Vis Ctr, Barcelona, Spain; Univ Barcelona, Barcelona, Spain
24	Graph-Based Recommendation for Sparse and Heterogeneous User Interactions	Recommender system research has oftentimes focused on approaches that operate on large-scale datasets containing millions of user interactions. However, many small businesses struggle to apply state-of-the-art models due to their very limited availability of data. We propose a graph-based recommender model which utilizes heterogeneous interactions between users and content of different types and is able to operate well on small-scale datasets. A genetic algorithm is used to find optimal weights that represent the strength of the relationship between users and content. Experiments on two real-world datasets (which we make available to the research community) show promising results (up to 7% improvement), in comparison with other state-of-the-art methods for low-data environments. These improvements are statistically significant and consistent across different data samples.	Christina Lioma, Kacper Kenji Lesniak, Maria Maistro, Mirko Biasini, Panagiotis Filianos, Simone Borg Bruun, Vittorio Carmignani	FullBrain; University of Copenhagen
25	Listwise Explanations for Ranking Models Using Multiple Explainers	This paper proposes a novel approach towards better interpretability of a trained text-based ranking model in a post-hoc manner. A popular approach to post-hoc interpretability of text ranking models is based on locally approximating the model behavior using a simple ranker. Since rankings have multiple relevance factors and are aggregations of predictions, existing approaches that use a single ranker might not be sufficient to approximate a complex model, resulting in low fidelity. In this paper, we overcome this problem by considering multiple simple rankers to better approximate the entire ranking list from a black-box ranking model. We pose the problem of local approximation as a GENERALIZED PREFERENCE COVERAGE (GPC) problem that incorporates multiple simple rankers towards the listwise explanation of ranking models. Our method MULTIPLEX uses a linear programming approach to judiciously extract the explanation terms, so as to explain the entire ranking list. We conduct extensive experiments on a variety of ranking models and report fidelity improvements of 37%-54% over existing competitors. We finally compare explanations in terms of multiple relevance factors and topic aspects to better understand the logic of ranking decisions, showcasing our explainers' practical utility.	Avishek Anand, Lijun Lyu	Delft Univ Technol, Delft, Netherlands; Leibniz Univ Hannover, L3S Res Ctr, Hannover, Germany
26	Understanding and Mitigating Gender Bias in Information Retrieval Systems	Recent studies have shown that information retrieval systems may exhibit stereotypical gender biases in outcomes which may lead to discrimination against minority groups, such as different genders, and impact users' decision making and judgements. In this tutorial, we inform the audience of studies that have systematically reported the presence of stereotypical gender biases in Information Retrieval (IR) systems and different pre-trained Natural Language Processing (NLP) models. We further classify existing work on gender biases in IR systems and NLP models as being related to (1) relevance judgement datasets, (2) structure of retrieval methods, (3) representations learnt for queries and documents, (4) and pre-trained embedding models. Based on the aforementioned categories, we present a host of methods from the literature that can be leveraged to measure, control, or mitigate the existence of stereotypical biases within IR systems and different NLP models that are used for down-stream tasks. Besides, we introduce available datasets and collections that are widely used for studying the existence of gender biases in IR systems and NLP models, the evaluation metrics that can be used for measuring the level of bias and utility of the models, and de-biasing methods that can be leveraged to mitigate gender biases within those models.	Amin Bigdeli, Ebrahim Bagheri, Morteza Zihayat, Negar Arabzadeh, Shirin Seyedsalehi	Toronto Metropolitan Univ, Toronto, ON, Canada; Univ Waterloo, Waterloo, ON, Canada
27	Investigation of Bias in Web Search Queries	The dissertation investigates the correlations and effects between biases in search queries and search query suggestions, search results, and users’ states of knowledge. Search engines are an important factor in opinion formation, while search queries determine the information a user is exposed to in information search. Search query suggestions play a crucial role in what users search for [22]. Biased query suggestions can be especially problematic if a user’s information need is not set and the interaction with query suggestions is likely. Only recently, research has started to investigate the general assumption that biased search queries lead to biased search results, focusing on political stance bias [17]. However, the correlation between biases in search queries and biases in search results has not been sufficiently investigated. Sparse context and limited data access pose challenges in detecting biases in search queries. This dissertation thus contributes datasets and methodological approaches that enable media bias research in the field of search queries and search query suggestions.	Fabian Haak	TH Koln, Gustav Heinemann Ufer 54, D-50678 Cologne, Germany
28	User Privacy in Recommender Systems	Recommender systems have become an integral part of many social networks and extract knowledge from a user's personal and sensitive data both explicitly, with the user's knowledge, and implicitly. This trend has created major privacy concerns as users are mostly unaware of what data and how much data is being used and how securely it is used. In this context, several works have been done to address privacy concerns for usage in online social network data and by recommender systems. This paper surveys the main privacy concerns, measurements and privacy-preserving techniques used in large-scale online social networks and recommender systems. It is based on historical works on security, privacy-preserving, statistical modeling, and datasets to provide an overview of the technical difficulties and problems associated with privacy preserving in online social networks.	Peter Müllner	Univ Tasmania, Sch Technol Environm & Design, Discipline ICT, Hobart, Tas, Australia
29	CLEF 2023 SimpleText Track - What Happens if General Users Search Scientific Texts?	The general public tends to avoid reliable sources such as scientific literature due to their complex language and lacking background knowledge. Instead, they rely on shallow and derived sources on the web and in social media - often published for commercial or political incentives, rather than the informational value. Can text simplification help to remove some of these access barriers? This paper presents the CLEF 2023 SimpleText track tackling technical and evaluation challenges of scientific information access for a general audience. We provide appropriate reusable data and benchmarks for scientific text simplification, and promote novel research to reduce barriers in understanding complex texts. Our overall use-case is to create a simplified summary of multiple scientific documents based on a popular science query which provides a user with an accessible overview on this specific topic. The track has the following three concrete tasks. Task 1 (What is in, or out?): selecting passages to include in a simplified summary. Task 2 (What is unclear?): difficult concept identification and explanation. Task 3 (Rewrite this!): text simplification - rewriting scientific text. The three tasks together form a pipeline of a scientific text simplification system.	Eric SanJuan, Hosein Azarbonyad, Jaap Kamps, Liana Ermakova, Olivier Augereau, Stéphane Huet	Avignon Univ, LIA, Avignon, France; ENIB, Lab STICC UMR CNRS 6285, Brest, France; Elsevier, Amsterdam, Netherlands; Univ Amsterdam, Amsterdam, Netherlands; Univ Bretagne Occidentale, HCTI, Brest, France
30	Uptrendz: API-Centric Real-Time Recommendations in Multi-domain Settings	In this work, we tackle the problem of adapting a real-time recommender system to multiple application domains, and their underlying data models and customization requirements. To do that, we present Uptrendz, a multi-domain recommendation platform that can be customized to provide real-time recommendations in an API-centric way. We demonstrate (i) how to set up a real-time movie recommender using the popular MovieLens-100k dataset, and (ii) how to simultaneously support multiple application domains based on the use-case of recommendations in entrepreneurial start-up founding. For that, we differentiate between domains on the item- and system-level. We believe that our demonstration shows a convenient way to adapt, deploy and evaluate a recommender system in an API-centric way. The source-code and documentation that demonstrates how to utilize the configured Uptrendz API is available on GitHub.	Dieter Theiler, Dominik Kowald, Emanuel Lacic, Leon Fadljevic, Tomislav Duricic	Know-Center GmbH
31	Feature Differentiation and Fusion for Semantic Text Matching	Semantic Text Matching (STM for short) stands for the task of automatically determining the semantic similarity for a pair of texts. It has been widely applied in a variety of downstream tasks, e.g., information retrieval and question answering. The most recent works of STM leverage Pre-trained Language Models (abbr., PLMs) due to their remarkable capacity for representation learning. Accordingly, significant improvements have been achieved. However, our findings show that PLMs fail to capture task-specific features that signal hardly-perceptible changes in semantics. To overcome the issue, we propose a two-channel Feature Differentiation and Fusion network (FDF). It utilizes a PLM-based encoder to extract features separately from the unabridged texts and those abridged by deduplication. On this basis, gated feature fusion and interaction are conducted across the channels to expand text representations with attentive and distinguishable features. Experiments on the benchmarks QQP, MRPC and BQ show that FDF obtains substantial improvements compared to the baselines and outperforms the state-of-the-art STM models.	Guodong Zhou, Jianmin Yao, Rui Peng, Yu Hong, Zhiling Jin	Soochow Univ, Sch Comp Sci & Technol, Suzhou, Peoples R China
32	Privacy-Preserving Fair Item Ranking	Users worldwide access massive amounts of curated data in the form of rankings on a daily basis. The societal impact of this ease of access has been studied and work has been done to propose and enforce various notions of fairness in rankings. Current computational methods for fair item ranking rely on disclosing user data to a centralized server, which gives rise to privacy concerns for the users. This work is the first to advance research at the conjunction of producer (item) fairness and consumer (user) privacy in rankings by exploring the incorporation of privacy-preserving techniques; specifically, differential privacy and secure multi-party computation. Our work extends the equity of amortized attention ranking mechanism to be privacy-preserving, and we evaluate its effects with respect to privacy, fairness, and ranking quality. Our results using real-world datasets show that we are able to effectively preserve the privacy of users and mitigate unfairness of items without making additional sacrifices to the quality of rankings in comparison to the ranking mechanism in the clear.	Golnoosh Farnadi, Jia Ao Sun, Martine De Cock, Sikha Pentyala	Mila - Quebec AI Institute; University of Washington
33	Domain Adaptation for Anomaly Detection on Heterogeneous Graphs in E-Commerce	Anomaly detection models have been the indispensable infrastructure of e-commerce platforms. However, existing anomaly detection models on e-commerce platforms face the challenges of “cold-start” and heterogeneous graphs which contain multiple types of nodes and edges. The scarcity of labeled anomalous training samples on heterogeneous graphs hinders the training of reliable models for anomaly detection. Although recent work has made great efforts on using domain adaptation to share knowledge between similar domains, none of them considers the problem of domain adaptation between heterogeneous graphs. To this end, we propose a Domain Adaptation method for heterogeneous GRaph Anomaly Detection in E-commerce (DAGrade). Specifically, DAGrade is designed as a domain adaptation approach to transfer our knowledge of anomalous patterns from label-rich source domains to target domains without labels. We apply a heterogeneous graph attention neural network to model complex heterogeneous graphs collected from e-commerce platforms and use an adversarial training strategy to ensure that the generated node vectors of each domain lie in the common vector space. Experiments on real-life datasets show that our method is capable of transferring knowledge across different domains and achieves satisfactory results for online deployment.	Chuan Zhou, Jia Wu, Jun Gao, Li Zheng, Zhao Li, Zhenpeng Li	Alibaba Grp, Hangzhou, Peoples R China; Chinese Acad Sci, Acad Math & Syst Sci, Beijing, Peoples R China; Macquarie Univ, Sch Comp, Sydney, NSW, Australia; Minist Educ, Key Lab High Confidence Software Technol, Beijing, Peoples R China; Peking Univ, Sch Comp Sci, Beijing, Peoples R China; Zhejiang Univ, Hangzhou, Peoples R China
34	The Impact of a Popularity Punishing Hyperparameter on ItemKNN Recommendation Performance	Collaborative filtering techniques have a tendency to amplify popularity biases present in the training data if no countermeasures are taken. The ItemKNN algorithm with conditional probability-inspired similarity function has a hyperparameter $\alpha$ that allows one to counteract this popularity bias. In this work, we perform a deep dive into the effects of this hyperparameter in both online and offline experiments, with regard to both accuracy metrics and equality of exposure. Our experiments show that the hyperparameter can indeed counteract popularity bias in a dataset. We also find that there exists a trade-off between countering popularity bias and the quality of the recommendations: Reducing popularity bias too much results in a decrease in click-through rate, but some counteracting of popularity bias is required for optimal online performance.	Bart Goethals, Jeroen Craps, Lien Michiels, Robin Verachtert	Froomle NV
35	Injecting the BM25 Score as Text Improves BERT-Based Re-rankers	In this paper we propose a novel approach for combining first-stage lexical retrieval models and Transformer-based re-rankers: we inject the relevance score of the lexical model as a token in the middle of the input of the cross-encoder re-ranker. It was shown in prior work that interpolation between the relevance score of lexical and BERT-based re-rankers may not consistently result in higher effectiveness. Our idea is motivated by the finding that BERT models can capture numeric information. We compare several representations of the BM25 score and inject them as text in the input of four different cross-encoders. We additionally analyze the effect for different query types, and investigate the effectiveness of our method for capturing exact matching relevance. Evaluation on the MSMARCO Passage collection and the TREC DL collections shows that the proposed method significantly improves over all cross-encoder re-rankers as well as the common interpolation methods. We show that the improvement is consistent for all query types. We also find an improvement in exact matching capabilities over both BM25 and the cross-encoders. Our findings indicate that cross-encoder re-rankers can efficiently be improved without additional computational burden and extra steps in the pipeline by explicitly adding the output of the first-stage ranker to the model input, and this effect is robust for different models and query types.	Amin Abolghasemi, Arian Askari, Gabriella Pasi, Suzan Verberne, Wessel Kraaij	Leiden University; University of Milano-Bicocca
36	Auditing Consumer- and Producer-Fairness in Graph Collaborative Filtering	To date, graph collaborative filtering (CF) strategies have been shown to outperform pure CF models in generating accurate recommendations. Nevertheless, recent works have raised concerns about fairness and potential biases in the recommendation landscape since unfair recommendations may harm the interests of Consumers and Producers (CP). Acknowledging that the literature lacks a careful evaluation of graph CF on CP-aware fairness measures, we initially evaluated the effects on CP-aware fairness measures of eight state-of-the-art graph models with four pure CF recommenders. Unexpectedly, the observed trends show that graph CF solutions do not ensure a large item exposure and user fairness. To disentangle this performance puzzle, we formalize a taxonomy for graph CF based on the mathematical foundations of the different approaches. The proposed taxonomy shows differences in node representation and neighbourhood exploration as dimensions characterizing graph CF. Under this lens, the experimental outcomes become clear and open the doors to a multi-objective CP-fairness analysis (Codes are available at: https://github.com/sisinflab/ECIR2023-Graph-CF .).	Claudio Pomo, Daniele Malitesta, Tommaso Di Noia, Vincenzo Paparella, Vito Walter Anelli, Yashar Deldjoo	Politecnico di Bari
37	Market-Aware Models for Efficient Cross-Market Recommendation	We consider the cross-market recommendation (CMR) task, which involves recommendation in a low-resource target market using data from a richer, auxiliary source market. Prior work in CMR utilised meta-learning to improve recommendation performance in target markets; meta-learning however can be complex and resource intensive. In this paper, we propose market-aware (MA) models, which directly model a market via market embeddings instead of meta-learning across markets. These embeddings transform item representations into market-specific representations. Our experiments highlight the effectiveness and efficiency of MA models both in a pairwise setting with a single target-source market, as well as a global model trained on all markets in unison. In the former pairwise setting, MA models on average outperform market-unaware models in 85% of cases on nDCG@10, while being time-efficient - compared to meta-learning models, MA models require only 15% of the training time. In the global setting, MA models outperform market-unaware models consistently for some markets, while outperforming meta-learning-based methods for all but one market. We conclude that MA models are an efficient and effective alternative to meta-learning, especially in the global setting.	Evangelos Kanoulas, Mohammad Aliannejadi, Samarth Bhargav	University of Amsterdam
38	Recommendation Algorithm Based on Deep Light Graph Convolution Network in Knowledge Graph	Recently, recommendation algorithms based on Graph Convolution Network (GCN) have achieved many surprising results thanks to the ability of GCN to learn more efficient node embeddings. However, although GCN shows powerful feature extraction capability in user-item bipartite graphs, the GCN-based methods appear powerless for knowledge graph (KG) with complex structures and rich information. In addition, all of the existing GCN-based recommendation systems suffer from the over-smoothing problem, which results in the models not being able to utilize higher-order neighborhood information, and thus these models always achieve their best performance at shallower layers. In this paper, we propose a Deep Light Graph Convolution Network for Knowledge Graph (KDL-GCN) to alleviate the above limitations. Firstly, the User-Entity Bipartite Graph approach (UE-BP) is proposed to simplify knowledge graph, which leverages entity information by constructing multiple interaction graphs. Secondly, a Deep Light Graph Convolution Network (DLGCN) is designed to make full use of higher-order neighborhood information. Finally, experiments on three real-world datasets show that the KDL-GCN proposed in this paper achieves substantial improvement compared to the state-of-the-art methods.	Nanfeng Xiao, Xiaobin Chen	South China University of Technology
39	Query Performance Prediction for Neural IR: Are We There Yet?	Evaluation in Information Retrieval (IR) relies on post-hoc empirical procedures, which are time-consuming and expensive operations. To alleviate this, Query Performance Prediction (QPP) models have been developed to estimate the performance of a system without the need for human-made relevance judgements. Such models, usually relying on lexical features from queries and corpora, have been applied to traditional sparse IR methods – with various degrees of success. With the advent of neural IR and large Pre-trained Language Models, the retrieval paradigm has significantly shifted towards more semantic signals. In this work, we study and analyze to what extent current QPP models can predict the performance of such systems. Our experiments consider seven traditional bag-of-words and seven BERT-based IR approaches, as well as nineteen state-of-the-art QPPs evaluated on two collections, Deep Learning ’19 and Robust ’04. Our findings show that QPPs perform statistically significantly worse on neural IR systems. In settings where semantic signals are prominent (e.g., passage retrieval), their performance on neural models drops by as much as 10% compared to bag-of-words approaches. On top of that, in lexical-oriented scenarios, QPPs fail to predict performance for neural IR systems on those queries where they differ from traditional approaches the most.	Benjamin Piwowarski, Guglielmo Faggioli, Nicola Ferro, Stefano Marchesin, Stéphane Clinchant, Thibault Formal	Naver Labs Europe, Meylan, France; Sorbonne Univ, ISIR, Paris, France; Univ Padua, Padua, Italy
40	Viewpoint Diversity in Search Results	The way pages are ranked in search results influences whether the users of search engines are exposed to more homogeneous, or rather to more diverse viewpoints. However, this viewpoint diversity is not trivial to assess. In this paper, we use existing and novel ranking fairness metrics to evaluate viewpoint diversity in search result rankings. We conduct a controlled simulation study that shows how ranking fairness metrics can be used for viewpoint diversity, how their outcome should be interpreted, and which metric is most suitable depending on the situation. This paper lays out important groundwork for future research to measure and assess viewpoint diversity in real search result rankings.	Alisa Rieger, Benjamin Timmermans, Mehmet Orcun Yalcin, Nava Tintarev, Nirmal Roy, Oana Inel, Rishav Hada, Tim Draws	IBM; TU Delft
41	Sentence Retrieval for Open-Ended Dialogue Using Dual Contextual Modeling	We address the task of retrieving sentences for an open domain dialogue that contain information useful for generating the next turn. We propose several novel neural retrieval architectures based on dual contextual modeling: the dialogue context and the context of the sentence in its ambient document. The architectures utilize contextualized language models (BERT), fine-tuned on a large-scale dataset constructed from Reddit. We evaluate the models using a recently published dataset. The performance of our most effective model is substantially superior to that of strong baselines.	Hagai Taitelbaum, Idan Szpektor, Itay Harel, Oren Kurland	Google Res, Tel Aviv, Israel; TSG IT Adv Syst Ltd, Tel Aviv, Israel; Technion Israel Inst Technol, Haifa, Israel
42	Neural Approaches to Multilingual Information Retrieval	Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross Language IR (CLIR) where queries are expressed in one language and documents in another, the multilingual (MLIR) task to create a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether advances in neural document translation and pretrained multilingual neural language models enable improvements in the state of the art over earlier MLIR techniques. The results show that although combining neural document translation with neural ranking yields the best Mean Average Precision (MAP), 98% of that MAP score can be achieved with an 84% reduction in indexing time by using a pretrained XLM-R multilingual language model to index documents in their native language, and that 2% difference in effectiveness is not statistically significant. Key to achieving these results for MLIR is to fine-tune XLM-R using mixed-language batches from neural translations of MS MARCO passages.	Dawn J. Lawrie, Douglas W. Oard, Eugene Yang, James Mayfield	Johns Hopkins Univ, HLTCOE, Baltimore, MD 21211 USA
43	Automatic and Analytical Field Weighting for Structured Document Retrieval	Probabilistic models such as BM25 and LM have established themselves as the standard in atomic retrieval. In structured document retrieval (SDR), BM25F could be considered the most established model. However, without optimization BM25F does not benefit from the document structure. The main contribution of this paper is a new field weighting method, denoted Information Content Field Weighting (ICFW). It applies weights over the structure without optimization and overcomes issues faced by some existing SDR models, most notably the issue of saturating term frequency across fields. ICFW is similar to BM25 and LM in its analytical grounding and transparency, making it a potential new candidate for a standard SDR model. For an optimised retrieval scenario ICFW does as well, or better than baselines. More interestingly, for a non-optimised retrieval scenario we observe a considerable increase in performance. Extensive analysis is performed to understand and explain the underlying reasons for this increase.	Thomas Roelleke, Tuomas Ketola	Queen Mary Univ London, London, England
44	SR-CoMbEr: Heterogeneous Network Embedding Using Community Multi-view Enhanced Graph Convolutional Network for Automating Systematic Reviews	Systematic reviews (SRs) are a crucial component of evidence-based clinical practice. Unfortunately, SRs are labor-intensive and unscalable with the exponential growth in literature. Automating evidence synthesis using machine learning models has been proposed but solely focuses on the text and ignores additional features like citation information. Recent work demonstrated that citation embeddings can outperform the text itself, suggesting that better network representation may expedite SRs. Yet, how to utilize the rich information in heterogeneous information networks (HIN) for network embeddings is understudied. Existing HIN models fail to produce a high-quality embedding compared to simply running state-of-the-art homogeneous network models. To address existing HIN model limitations, we propose SR-CoMbEr, a community-based multi-view graph convolutional network for learning better embeddings for evidence synthesis. Our model automatically discovers article communities to learn robust embeddings that simultaneously encapsulate the rich semantics in HINs. We demonstrate the effectiveness of our model to automate 15 SRs.	Eric Wonhee Lee, Joyce C. Ho	Emory Univ, Atlanta, GA 30322 USA
45	MS-Shift: An Analysis of MS MARCO Distribution Shifts on Neural Retrieval	Pre-trained Language Models have recently emerged in Information Retrieval as providing the backbone of a new generation of neural systems that outperform traditional methods on a variety of tasks. However, it is still unclear to what extent such approaches generalize in zero-shot conditions. The recent BEIR benchmark provides partial answers to this question by comparing models on datasets and tasks that differ from the training conditions. We aim to address the same question by comparing models under more explicit distribution shifts. To this end, we build three query-based distribution shifts within MS MARCO (query-semantic, query-intent, query-length), which are used to evaluate the three main families of neural retrievers based on BERT: sparse, dense, and late-interaction - as well as a monoBERT re-ranker. We further analyse the performance drops between the train and test query distributions. In particular, we experiment with two generalization indicators: the first one based on train/test query vocabulary overlap, and the second based on representations of a trained bi-encoder. Intuitively, those indicators verify that the further away the test set is from the train one, the worse the drop in performance. We also show that models respond differently to the shifts - dense approaches being the most impacted. Overall, our study demonstrates that it is possible to design more controllable distribution shifts as a tool to better understand generalization of IR models. Finally, we release the MS MARCO query subsets, which provide an additional resource to benchmark zero-shot transfer in Information Retrieval.	Simon Lupart, Stéphane Clinchant, Thibault Formal	Naver Labs Europe, Meylan, France
46	Improving Video Retrieval Using Multilingual Knowledge Transfer	Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs. We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space based on pretrained multilingual models. We evaluate our proposed approach on a diverse set of retrieval datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades and MSRVTT multilingual) and two image retrieval datasets (Flickr30k and Multi30k). Experimental results demonstrate that our approach achieves state-of-the-art results on all video retrieval datasets, outperforming previous models. Additionally, our framework MuMUR significantly outperforms other approaches on the multilingual video retrieval dataset. We also observe that MuMUR exhibits strong performance on image retrieval. This demonstrates the universal ability of MuMUR to perform retrieval across all visual inputs (image and video) and text inputs (monolingual and multilingual).	Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Gedas Bertasius, ShaoYen Tseng, Vasudev Lal	Intel Labs; University of North Carolina at Chapel Hill
47	HADA: A Graph-Based Amalgamation Framework in Image-text Retrieval	Many models have been proposed for vision and language tasks, especially the image-text retrieval task. State-of-the-art (SOTA) models in this challenge contain hundreds of millions of parameters. They also were pretrained on large external datasets that have been proven to significantly improve overall performance. However, it is not easy to propose a new model with a novel architecture and intensively train it on a massive dataset with many GPUs to surpass many SOTA models already available to use on the Internet. In this paper, we propose a compact graph-based framework named HADA, which can combine pretrained models to produce a better result rather than starting from scratch. Firstly, we created a graph structure in which the nodes were the features extracted from the pretrained models and the edges connecting them. The graph structure was employed to capture and fuse the information from every pretrained model. Then a graph neural network was applied to update the connection between the nodes to get the representative embedding vector for an image and text. Finally, we employed cosine similarity to match images with their relevant texts and vice versa to ensure a low inference time. Our experiments show that, although HADA contained a tiny number of trainable parameters, it could increase baseline performance by more than 3.6% in terms of evaluation metrics on the Flickr30k dataset. Additionally, the proposed model did not train on any external dataset and only required a single GPU to train due to the small number of parameters required. The source code is available at https://github.com/m2man/HADA .	Binh T. Nguyen, Cathal Gurrin, ManhDuy Nguyen	Dublin City University; VNU-HCM, University of Science
48	Knowledge is Power, Understanding is Impact: Utility and Beyond Goals, Explanation Quality, and Fairness in Path Reasoning Recommendation	Path reasoning is a notable recommendation approach that models high-order user-product relations, based on a Knowledge Graph (KG). This approach can extract reasoning paths between recommended products and already experienced products and, then, turn such paths into textual explanations for the user. Unfortunately, evaluation protocols in this field appear heterogeneous and limited, making it hard to contextualize the impact of the existing methods. In this paper, we replicated three state-of-the-art relevant path reasoning recommendation methods proposed in top-tier conferences. Under a common evaluation protocol, based on two public data sets and in comparison with other knowledge-aware methods, we then studied the extent to which they meet recommendation utility and beyond objectives, explanation quality, and consumer and provider fairness. Our study provides a picture of the progress in this field, highlighting open issues and future directions. Source code: https://github.com/giacoballoccu/rep-path-reasoning-recsys .	Christian Cancedda, Giacomo Balloccu, Gianni Fenu, Ludovico Boratto, Mirko Marras	Polytechnic University of Turin; University of Cagliari
49	Scene-Centric vs. Object-Centric Image-Text Cross-Modal Retrieval: A Reproducibility Study	Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each document depicts or describes a single object, or on scene-centric datasets, meaning that each image depicts or describes a complex scene that involves multiple objects and relations between them. We posit that a robust CMR model should generalize well across both dataset types. Despite recent advances in CMR, the reproducibility of the results and their generalizability across different dataset types has not been studied before. We address this gap and focus on the reproducibility of the state-of-the-art CMR results when evaluated on object-centric and scene-centric datasets. We select two state-of-the-art CMR models with different architectures: (i) CLIP; and (ii) X-VLM. Additionally, we select two scene-centric datasets, and three object-centric datasets, and determine the relative performance of the selected models on these datasets. We focus on reproducibility, replicability, and generalizability of the outcomes of previously published CMR experiments. We discover that the experiments are not fully reproducible and replicable. Besides, the relative performance results partially generalize across object-centric and scene-centric datasets. On top of that, the scores obtained on object-centric datasets are much lower than the scores obtained on scene-centric datasets. For reproducibility and transparency we make our source code and the trained models publicly available.	Ernst Kuiper, Maarten de Rijke, Mariya Hendriksen, Svitlana Vakulenko	Amazon, Madrid, Spain; Bol com, Utrecht, Netherlands; Univ Amsterdam, AIRLab, Amsterdam, Netherlands; Univ Amsterdam, Amsterdam, Netherlands
50	A Reproducibility Study of Question Retrieval for Clarifying Questions	The use of clarifying questions within a search system can have a key role in improving retrieval effectiveness. The generation and exploitation of clarifying questions is an emerging area of research in information retrieval, especially in the context of conversational search. In this paper, we attempt to reproduce and analyse a milestone work in this area. Through close communication with the original authors and data sharing, we were able to identify a key issue that impacted the original experiments and our independent attempts at reproduction; this issue relates to data preparation. In particular, the clarifying questions retrieval task consists of retrieving clarifying questions from a question bank for a given query. In the original data preparation, such question bank was split into separate folds for retrieval – each split contained (approximately) a fifth of the data in the full question bank. This setting does not resemble that of a production system; in addition, it also was only applied to learnt methods, while keyword matching methods used the full question bank. This created inconsistency in the reporting of the results and overestimated findings. We demonstrate this through a set of empirical experiments and analyses.	Ahmed Mourad, Guido Zuccon, Sebastian Cross	Univ Queensland, St Lucia, Australia
51	Index-Based Batch Query Processing Revisited	Large scale web search engines provide sub-second response times to interactive user queries. However, not all search traffic arises interactively – cache updates, internal testing and prototyping, generation of training data, and web mining tasks all contribute to the workload of a typical search service. If these non-interactive query components are collected together and processed as a batch, the overall execution cost of query processing can be significantly reduced. In this reproducibility study, we revisit query batching in the context of large-scale conjunctive processing over inverted indexes, considering both on-disk and in-memory index arrangements. Our exploration first verifies the results reported in the reference work [Ding et al., WSDM 2011], and then provides novel approaches for batch processing which give rise to better time–space trade-offs than have been previously achieved.	Alistair Moffat, Joel Mackenzie	The University of Melbourne; The University of Queensland
52	A Unified Framework for Learned Sparse Retrieval	Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been recently introduced, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, many LSR methods show substantial differences in effectiveness and efficiency. Differences in the experimental setups and configurations used make it difficult to compare the methods and derive insights. In this work, we analyze existing LSR methods and identify key components to establish an LSR framework that unifies all LSR methods under the same perspective. We then reproduce all prominent methods using a common codebase and re-train them in the same environment, which allows us to quantify how components of the framework affect effectiveness and efficiency. We find that (1) including document term weighting is most important for a method’s effectiveness, (2) including query weighting has a small positive impact, and (3) document expansion and query expansion have a cancellation effect. As a result, we show how removing query expansion from a state-of-the-art model can reduce latency significantly while maintaining effectiveness on MSMarco and TripClick benchmarks. Our code is publicly available (Code: https://github.com/thongnt99/learned-sparse-retrieval ).	Andrew Yates, Sean MacAvaney, Thong Nguyen	Univ Amsterdam, Amsterdam, Netherlands; Univ Glasgow, Glasgow, Scotland
53	Do the Findings of Document and Passage Retrieval Generalize to the Retrieval of Responses for Dialogues?	A number of learned sparse and dense retrieval approaches have recently been proposed and proven effective in tasks such as passage retrieval and document retrieval. In this paper we analyze with a replicability study if the lessons learned generalize to the retrieval of responses for dialogues, an important task for the increasingly popular field of conversational search. Unlike passage and document retrieval where documents are usually longer than queries, in response ranking for dialogues the queries (dialogue contexts) are often longer than the documents (responses). Additionally, dialogues have a particular structure, i.e. multiple utterances by different users. With these differences in mind, we here evaluate how generalizable the following major findings from previous works are: (F1) query expansion outperforms a no-expansion baseline; (F2) document expansion outperforms a no-expansion baseline; (F3) zero-shot dense retrieval underperforms sparse baselines; (F4) dense retrieval outperforms sparse baselines; (F5) hard negative sampling is better than random sampling for training dense models. Our experiments, based on three different information-seeking dialogue datasets, reveal that four out of five findings (F2–F5) generalize to our domain (code: https://github.com/Guzpenha/transformer_rankers/tree/full_rank_retrieval_dialogues ).	Claudia Hauff, Gustavo Penha	Delft Univ Technol, Delft, Netherlands
54	From Baseline to Top Performer: A Reproducibility Study of Approaches at the TREC 2021 Conversational Assistance Track	This paper reports on an effort of reproducing the organizers’ baseline as well as the top performing participant submission at the 2021 edition of the TREC Conversational Assistance track. TREC systems are commonly regarded as reference points for effectiveness comparison. Yet, the papers accompanying them have less strict requirements than peer-reviewed publications, which can make reproducibility challenging. Our results indicate that key practical information is indeed missing. While the results can be reproduced within a 19% relative margin with respect to the main evaluation measure, the relative difference between the baseline and the top performing approach shrinks from the reported 18% to 5%. Additionally, we report on a new set of experiments aimed at understanding the impact of various pipeline components. We show that end-to-end system performance can indeed benefit from advanced retrieval techniques in either stage of a two-stage retrieval pipeline. We also measure the impact of the dataset used for fine-tuning the query rewriter and find that employing different query rewriting methods in different stages of the retrieval pipeline might be beneficial. Moreover, these results are shown to generalize across the 2020 and 2021 editions of the track. We conclude our study with a list of lessons learned and practical suggestions.	Krisztian Balog, Weronika Lajewska	University of Stavanger
55	The System for Efficient Indexing and Search in the Large Archives of Scanned Historical Documents	The paper introduces software capable of indexing and searching large archives of scanned historical documents. The system capabilities are demonstrated on the collection containing documents from the archives of the post-Soviet security services. The backend of the system was designed with a focus on flexibility (it is actually already being used for other related tasks) and scalability to larger volumes of data. The graphical user interface design has been consulted with historians interested in using the archived documents and was developed in several iterations, gradually including the changes induced both by the user’s requests and by our improving knowledge about the nature of the processed data.	Jan Svec, Martin Bulín, Pavel Ircing	University of West Bohemia
56	Public News Archive: A Searchable Sub-archive to Portuguese Past News Articles	Over the past few decades, the amount of information generated turned the Web into the largest knowledge infrastructure existing to date. Web archives have been at the forefront of data preservation, preventing the loss of data significant to humankind. Different snapshots of the web are saved every day, enabling users to surf the past web and to travel through it over time. Despite these efforts, many people are not aware that the web is being preserved, often finding these infrastructures to be unattractive or difficult to use when compared to common search engines. In this paper, we take a step towards making use of this preserved information by developing “Public Archive”, an intuitive interface that enables end-users to search and analyze a large-scale collection of 67,242 past preserved news articles belonging to a Portuguese reference newspaper (“Jornal Público”). The collection was obtained by scraping 10,976 versions of the homepage of “Jornal Público” preserved by the Portuguese web archive infrastructure (Arquivo.pt) during the period from 2010 to 2021. By doing this, we aim not only to take a stand on making use of this preserved information, but also to come up with an easy-to-follow solution, the Public Archive python package, which lays the groundwork to be used (with minor adaptations) by other news source providers interested in offering their readers access to past news articles.	Adam Jatowt, Diogo Correia, Ricardo Campos	LIAAD INESCTEC, Porto, Portugal; Polytech Inst Tomar, Ci2 Smart Cities Res Ctr, Tomar, Portugal; Univ Innsbruck, Innsbruck, Austria
57	Which Country Is This? Automatic Country Ranking of Street View Photos	In this demonstration, we present Country Guesser, a live system that guesses the country that a photo is taken in. In particular, given a Google Street View image, our federated ranking model uses a combination of computer vision, machine learning and text retrieval methods to compute a ranking of likely countries of the location shown in a given image from Street View. Interestingly, using text-based features to probe large pre-trained language models can assist to provide cross-modal supervision. We are not aware of previous country guessing systems informed by visual and textual features.	Florian Mittag, Jochen L. Leidner, Tim Menzner	Coburg Univ Appl Sci & Arts, Friedrich Streib Str 2, D-96450 Coburg, Germany
58	ECIR 23 Tutorial: Neuro-Symbolic Approaches for Information Retrieval	This tutorial will provide an overview of recent advances on neuro-symbolic approaches for information retrieval. A decade ago, knowledge graphs and semantic annotations technology led to active research on how to best leverage symbolic knowledge. At the same time, neural methods have demonstrated to be versatile and highly effective. From a neural network perspective, the same representation approach can service document ranking or knowledge graph reasoning. End-to-end training allows to optimize complex methods for downstream tasks. We are at the point where both the symbolic and the neural research advances are coalescing into neuro-symbolic approaches. The underlying research questions are how to best combine symbolic and neural approaches, what kind of symbolic/neural approaches are most suitable for which use case, and how to best integrate both ideas to advance the state of the art in information retrieval.	Arjen P. de Vries, Edgar Meij, Hannah Bast, Jeff Dalton, Laura Dietz, Shubham Chatterjee	Bloomberg; Radboud University; University of Freiburg; University of Glasgow; University of New Hampshire
59	Crowdsourcing for Information Retrieval	In our tutorial, we will share more than six years of our crowdsourced data labeling experience and bridge the gap between crowdsourcing and information retrieval communities by showing how one can incorporate human-in-the-loop into their retrieval system to gather the real human feedback on the model predictions. Most of the tutorial time is devoted to a hands-on practice, when the attendees will, under our guidance, implement an end-to-end process for information retrieval from problem statement and data labeling to machine learning model training and evaluation.	Alisa Smirnova, Dmitry Ustalov, Natalia Fedorova, Nikita Pavlichenko	Toloka
60	Deep Learning Methods for Query Auto Completion	Query Auto Completion (QAC) aims to help users reach their search intent faster and is a gateway to search for users. Every day, billions of keystrokes across hundreds of languages are served by Bing Autosuggest in less than 100 ms. The expected suggestions could differ depending on user demography, previous search queries and current trends. In general, the suggestions in the AutoSuggest block are expected to be relevant, personalized, fresh and diverse, and need to be guarded against being defective, hateful, adult or offensive in any way. In this tutorial, we will first discuss various critical components in QAC systems. Further, we will discuss details about traditional machine learning and deep learning architectures proposed for four main components: ranking in QAC, personalization, spell corrections and natural language generation for QAC.	Manish Gupta, Meghana Joshi, Puneet Agrawal	Microsoft
61	Trends and Overview: The Potential of Conversational Agents in Digital Health	With the COVID-19 pandemic serving as a trigger, 2020 saw an unparalleled global expansion of tele-health [23]. Tele-health successfully lowers the need for in-person consultations and, thus, the danger of contracting a virus. While the COVID-19 pandemic sped up the adoption of virtual healthcare delivery in numerous nations, it also accelerated the creation of a wide range of other different technology-enabled systems and procedures for providing virtual healthcare to patients. Rightly so, the COVID-19 has brought many difficulties for patients ( https://www.who.int/news/item/02-03-2022-covid-19-pandemic-triggers-25-increase-in-prevalence-of-anxiety-and-depression-worldwide ) who need continuing care and monitoring for mental health issues and/or other chronic diseases.	Abhishek Tiwari, Sriparna Saha, Tulika Saha	Indian Inst Technol Patna, Daulatpur, India; Univ Liverpool, Liverpool, Merseyside, England
62	QPP++ 2023: Query-Performance Prediction and Its Evaluation in New Tasks	Query-Performance Prediction (QPP) is currently primarily applied to ad-hoc retrieval tasks. The Information Retrieval (IR) field is reaching new heights thanks to recent advances in large language models and neural networks, as well as emerging new ways of searching, such as conversational search. Such advancements are quickly spreading to adjacent research areas, including QPP, necessitating a reconsideration of how we perform and evaluate QPP. This workshop sought to elicit discussion on three topics related to the future of QPP: exploiting advances in IR to improve QPP, instantiating QPP on new search paradigms, and evaluating QPP on new tasks.	Fiana Raiber, Guglielmo Faggioli, Josiane Mothe, Nicola Ferro	Univ Padua, Padua, Italy; Univ Toulouse, CNRS, INSPE, IRIT UMR5505, Toulouse, France; Yahoo Res, Haifa, Israel
63	Text Information Retrieval in Tetun	Tetun is one of Timor-Leste’s official languages alongside Portuguese. It is a low-resource language with over 932,000 speakers that started developing when Timor-Leste restored its independence in 2002. Newspapers mainly use Tetun and more than ten national online news websites actively broadcast news in Tetun every day. However, since information retrieval-based solutions for Tetun do not exist, finding Tetun information on the internet and digital platforms is challenging. This work aims to investigate and develop solutions that can enable the application of information retrieval techniques to develop search solutions for Tetun using Tetun INL and focus on the ad-hoc text retrieval task. As a result, we expect to have effective search solutions for Tetun and contribute to the innovation in information retrieval for low-resource languages, including making Tetun datasets available for future researchers.	Gabriel de Jesus	Univ Porto FEUP, INESC TEC, Rua Dr Roberto Frias, P-4200465 Porto, Portugal
64	Overview of Touché 2023: Argument and Causal Retrieval - Extended Abstract		Alexander Bondarenko, Benno Stein, Brian Ravenet, Ferdinand Schlatt, Jan Heinrich Reimer, Johannes Kiesel, Léo Hemamou, Maik Fröbe, Martin Potthast, Matthias Hagen, Simon Luck, Valentin Barrière
65	Improving the Generalizability of the Dense Passage Retriever Using Generated Datasets	Dense retrieval methods have surpassed traditional sparse retrieval methods for open-domain retrieval. While these methods, such as the Dense Passage Retriever (DPR), work well on datasets or domains they have been trained on, there is a noticeable loss in accuracy when tested on out-of-distribution and out-of-domain datasets. We hypothesize that this may be, in large part, due to the mismatch in the information available to the context encoder and the query encoder during training. Most training datasets commonly used for training dense retrieval models contain an overwhelming majority of passages where there is only one query from a passage. We hypothesize that this imbalance encourages dense retrieval models to overfit to a single potential query from a given passage leading to worse performance on out-of-distribution and out-of-domain queries. To test this hypothesis, we focus on a prominent dense retrieval method, the dense passage retriever, build generated datasets that have multiple queries for most passages, and compare dense passage retriever models trained on these datasets against models trained on single query per passage datasets. Using the generated datasets, we show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets.	Maarten de Rijke, Thilina Rajapakse	University of Amsterdam
66	SegmentCodeList: Unsupervised Representation Learning for Human Skeleton Data Retrieval	Recent progress in pose-estimation methods enables the extraction of sufficiently-precise 3D human skeleton data from ordinary videos, which offers great opportunities for a wide range of applications. However, such spatio-temporal data are typically extracted in the form of a continuous skeleton sequence without any information about semantic segmentation or annotation. To make the extracted data reusable for further processing, there is a need to access them based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving low-dimensional code representation that is learned using the Variational AutoEncoder principle in an unsupervised way. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms the state-of-the-art approaches in effectiveness and reaches speed-ups up to 64x on common skeleton sequence datasets.	Fabio Carrara, Giuseppe Amato, Jan Sedmidubský	ISTI CNR, Pisa, Italy; Masaryk Univ, Brno, Czech Republic
67	DeCoDE: Detection of Cognitive Distortion and Emotion Cause Extraction in Clinical Conversations	Despite significant evidence linking mental health to almost every major development issue, individuals with mental disorders are among those most at risk of being excluded from development programs. We outline a novel task of detection of Cognitive Distortion and Emotion Cause extraction of associated emotions in conversations. Cognitive distortions are inaccurate thought patterns, beliefs, or perceptions that contribute to negative thinking, which subsequently elevates the chances of several mental illnesses. This work introduces a novel multi-modal mental health conversational corpus manually annotated with emotion, emotion causes, and the presence of cognitive distortion at the utterance level. We propose a multitasking framework that uses multi-modal information as inputs and uses both external commonsense knowledge and factual knowledge from the dataset to learn both tasks at the same time. This is because commonsense knowledge is a key part of understanding how and why emotions are implied. We achieve commendable performance gains on the cognitive distortion detection task (+3.91 F1%) and the emotion cause extraction task (+3 ROS points) when compared to the existing state-of-the-art model.	Asif Ekbal, Gopendra Vikram Singh, Pushpak Bhattacharyya, Soumitra Ghosh	Indian Institute of Technology Bombay; Indian Institute of Technology Patna
68	Topics in Contextualised Attention Embeddings	Contextualised word vectors obtained via pre-trained language models encode a variety of knowledge that has already been exploited in applications. Complementary to these language models are probabilistic topic models that learn thematic patterns from the text. Recent work has demonstrated that conducting clustering on the word-level contextual representations from a language model emulates word clusters that are discovered in latent topics of words from Latent Dirichlet Allocation. The important question is how such topical word clusters are automatically formed, through clustering, in the language model when it has not been explicitly designed to model latent topics. To address this question, we design different probe experiments. Using BERT and DistilBERT, we find that the attention framework plays a key role in modelling such word topic clusters. We strongly believe that our work paves way for further research into the relationships between probabilistic topic models and pre-trained language models.	Alba García Seco de Herrera, Mozhgan Talebpour, Shoaib Jameel	University of Essex; University of Southampton
69	New Metrics to Encourage Innovation and Diversity in Information Retrieval Approaches	In evaluation campaigns, participants often explore variations of popular, state-of-the-art baselines as a low-risk strategy to achieve competitive results. While effective, this can lead to local “hill climbing” rather than a more radical and innovative departure from standard methods. Moreover, if many participants build on similar baselines, the overall diversity of approaches considered may be limited. In this work, we propose a new class of IR evaluation metrics intended to promote greater diversity of approaches in evaluation campaigns. Whereas traditional IR metrics focus on user experience, our two “innovation” metrics instead reward exploration of more divergent, higher-risk strategies finding relevant documents missed by other systems. Experiments on four TREC collections show that our metrics do change system rankings by rewarding systems that find such rare, relevant documents. This result is further supported by a controlled, synthetic data experiment, and a qualitative analysis. In addition, we show that our metrics achieve higher evaluation stability and discriminative power than the standard metrics we modify. To support reproducibility, we share our source code.	Matthew Lease, Mehmet Deniz Türkmen, Mücahid Kutlu	TOBB Univ Econ & Technol, Dept Comp Engn, Ankara, Turkiye; Univ Texas Austin, Sch Informat, Austin, TX USA
70	Probing BERT for Ranking Abilities	Contextual models like BERT are highly effective in numerous text-ranking tasks. However, it is still unclear as to whether contextual models understand well-established notions of relevance that are central to IR. In this paper, we use probing, a recent approach used to analyze language models, to investigate the ranking abilities of BERT-based rankers. Most of the probing literature has focussed on linguistic and knowledge-aware capabilities of models or axiomatic analysis of ranking models. In this paper, we fill an important gap in the information retrieval literature by conducting a layer-wise probing analysis using four probes based on lexical matching, semantic similarity as well as linguistic properties like coreference resolution and named entity recognition. Our experiments show an interesting trend that BERT-rankers better encode ranking abilities at intermediate layers. Based on our observations, we train a ranking model by augmenting the ranking data with the probe data to show initial yet consistent performance improvements (The code is available at https://github.com/yolomeus/probing-search/ ).	Abhijit Anand, Avishek Anand, Fabian Beringer, Jonas Wallat	L3S Research Center
71	Graph Contrastive Learning with Positional Representation for Recommendation	Recently, graph neural networks have become the state-of-the-art in collaborative filtering, since the interactions between users and items essentially have a graph structure. However, a major issue with the user-item interaction graph in recommendation is the absence of the positional information of users/items, which limits the expressive power of graph recommenders in distinguishing the users/items with the same neighbours after propagating several graph convolution layers. Such a phenomenon further induces the well-known over-smoothing problem. We hypothesise that we can obtain a more expressive graph recommender through graph positional encoding (e.g., Laplacian eigenvector) thereby also alleviating the over-smoothing problem. Hence, we propose a novel model named Positional Graph Contrastive Learning (PGCL) for top-K recommendation, which aims to explicitly enhance graph representation learning with graph positional encoding in a contrastive learning manner. We show that concatenating the learned graph positional encoding and the pre-existing users/items' features in each feature propagation layer can achieve significant effectiveness gains. To further have sufficient representation learning from the graph positional encoding, we use contrastive learning to jointly learn the correlation between the pre-existing users/items' features and the positional information. Our extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed PGCL model over existing state-of-the-art graph-based recommendation approaches in terms of both effectiveness and alleviating the over-smoothing problem.	Craig Macdonald, Iadh Ounis, Zixuan Yi	Univ Glasgow, Glasgow, Scotland
72	Is Cross-Modal Information Retrieval Possible Without Training?	Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. Embeddings for a particular modality of data occupy a high-dimensional space of its own, but it can be semantically aligned to another by a simple mapping without training a deep neural net. In this paper, we take a simple mapping computed from the least squares and singular value decomposition (SVD) for a solution to the Procrustes problem to serve as a means to cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image. Using off-the-shelf pretrained deep learning models, we have experimented with the aforementioned simple cross-modal mappings in tasks of text-to-image and image-to-text retrieval. Despite their simplicity, our mappings perform reasonably well, reaching the highest accuracy of 77% on recall@10, which is comparable to those requiring costly neural net training and fine-tuning. We have improved the simple mappings by contrastive learning on the pretrained models. Contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the cross-modal mapping quality. We have further improved the performance by multilayer perceptron with gating (gMLP), a simple neural architecture.	Hyunjae Lee, Hyunjin Choi, Seongho Joe, Youngjune Gwon	Samsung SDS
73	Doc2Query-: When Less is More	Doc2Query—the process of expanding the content of a document before indexing using a sequence-to-sequence model—has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to “hallucinating” content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 30% and cutting the index size by 48%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration ( https://github.com/terrierteam/pyterrier_doc2query ).	Craig Macdonald, Mitko Gospodinov, Sean MacAvaney	University of Glasgow
74	Leveraging Comment Retrieval for Code Summarization	Open-source code often suffers from mismatched or missing comments, leading to difficult code comprehension, and burdening software development and maintenance. In this paper, we design a novel code summarization model CodeFiD to address this laborious challenge. Inspired by retrieval-augmented methods for open-domain question answering, CodeFiD first retrieves a set of relevant comments from code collections for a given code, and then aggregates presentations of code and these comments to produce a natural language sentence that summarizes the code behaviors. Different from current code summarization works that focus on improving code representations, our model resorts to external knowledge to enhance code summarizing performance. Extensive experiments on public code collections demonstrate the effectiveness of CodeFiD by outperforming state-of-the-art counterparts across all programming languages.	Lingwei Chen, Mingxuan Ju, Shifu Hou, Yanfang Ye	Univ Notre Dame, Notre Dame, IN 46556 USA; Wright State Univ, Dayton, OH 45435 USA
75	C2LIR: Continual Cross-Lingual Transfer for Low-Resource Information Retrieval	This paper proposes a method to train an information retrieval (IR) model for a low-resource language (LRL) with a small corpus and no parallel sentences. Although neural IR models based on pretrained language models (PLMs) have shown high performance in high-resource languages (HRLs), building PLMs for LRLs is challenging. We propose C$^2$LIR, a method to build a high-performing neural IR model for an LRL, with dictionary-based pretraining objectives for cross-lingual transfer from an HRL. Experiments on monolingual and cross-lingual IR in diverse low-resource scenarios show the effectiveness and data efficiency of C$^2$LIR.	Dohyeon Lee, Jaeseong Lee, Jongho Kim, Seungwon Hwang	Seoul National University
76	Joint Extraction and Classification of Danish Competences for Job Matching	The matching of competences, such as skills, occupations or knowledges, is a key desideratum for candidates to be fit for jobs. Automatic extraction of competences from CVs and jobs can greatly promote recruiters' productivity in locating relevant candidates for job vacancies. This work presents the first model that jointly extracts and classifies competences from Danish job postings. Different from existing works on skill extraction and skill classification, our model is trained on a large volume of annotated Danish corpora and is capable of extracting a wide range of Danish competences, including skills, occupations and knowledges of different categories. More importantly, as a single BERT-like architecture for joint extraction and classification, our model is lightweight and efficient at inference. On a real-scenario job matching dataset, our model beats the state-of-the-art models in the overall performance of Danish competence extraction and classification, and saves over 50% time at inference.	Christina Lioma, Qiuchi Li	Univ Copenhagen, Univ Pk 1, DK-2100 Copenhagen, Denmark
77	A Study on FGSM Adversarial Training for Neural Retrieval	Neural retrieval models have acquired significant effectiveness gains over the last few years compared to term-based methods. Nevertheless, those models may be brittle when faced with typos or distribution shifts, or vulnerable to malicious attacks. For instance, several recent papers demonstrated that such variations severely impacted model performance, and then tried to train more resilient models. Usual approaches include synonym replacement or typo injection - as data augmentation - and the use of more robust tokenizers (characterBERT, BPE-dropout). To further complement the literature, we investigate in this paper adversarial training as another possible solution to this robustness issue. Our comparison includes the two main families of BERT-based neural retrievers, i.e. dense and sparse, with and without distillation techniques. We then demonstrate that one of the simplest adversarial training techniques - the Fast Gradient Sign Method (FGSM) - can improve the robustness and effectiveness of first-stage rankers. In particular, FGSM increases model performance on both in-domain and out-of-domain distributions, and also on queries with typos, for multiple neural retrievers.	Simon Lupart, Stéphane Clinchant	Naver Labs Europe, Meylan, France
78	Time-Dependent Next-Basket Recommendations	There are various real-world applications for next-basket recommender systems. One of them is guiding a website user who wants to buy anything toward a collection of items. Recent works demonstrate that methods based on the frequency of prior purchases outperform other deep learning algorithms in terms of performance. These techniques, however, do not consider timestamps and time intervals between interactions. Additionally, they often miss the time period that passes between the last known basket and the prediction time. In this study, we explore whether such knowledge could improve current state-of-the-art next-basket recommender systems. Our results on three real-world datasets show how such enhancement may increase prediction quality. These findings might pave the way for important research directions in the field of next-basket recommendations.	Dmitry I. Ignatov, Marina Ananyeva, Oleg Lashinin, Sergey Kolesnikov, Sergey Naumov	National Research University Higher School of Economics; Tinkoff
79	Dialogue-to-Video Retrieval	Recent years have witnessed an increasing amount of dialogue/conversation on the web especially on social media. That inspires the development of dialogue-based retrieval, in which retrieving videos based on dialogue is of increasing interest for recommendation systems. Different from other video retrieval tasks, dialogue-to-video retrieval uses structured queries in the form of user-generated dialogue as the search descriptor. We present a novel dialogue-to-video retrieval system, incorporating structured conversational information. Experiments conducted on the AVSD dataset show that our proposed approach using plain-text queries improves over the previous counterpart model by 15.8% on R@1. Furthermore, our approach using dialogue as a query, improves retrieval performance by 4.2%, 6.2%, 8.6% on R@1, R@5 and R@10 and outperforms the state-of-the-art model by 0.7%, 3.6% and 6.0% on R@1, R@5 and R@10 respectively.	Cathal Gurrin, Chenyang Lyu, Jennifer Foster, Liting Zhou, ManhDuy Nguyen, VanTu Ninh	Dublin City Univ, Sch Comp, Dublin, Ireland
80	Visconde: Multi-document QA with GPT-3 and Neural Reranking	This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art search engine is used to retrieve candidate passages from a large collection for each decomposed question. In the final step, we use the LLM in a few-shot setting to aggregate the contents of the passages into the final answer. The system is evaluated on three datasets: IIRC, Qasper, and StrategyQA. Results suggest that current retrievers are the main bottleneck and that readers are already performing at the human level as long as relevant passages are provided. The system is also shown to be more effective when the model is induced to give explanations before answering a question. Code is available at https://github.com/neuralmind-ai/visconde .	Jayr Alencar Pereira, Roberto de Alencar Lotufo, Robson do Nascimento Fidalgo, Rodrigo Frassetto Nogueira	NeuralMind; Universidade Federal de Pernambuco
81	Towards Linguistically Informed Multi-objective Transformer Pre-training for Natural Language Inference	We introduce a linguistically enhanced combination of pre-training methods for transformers. The pre-training objectives include POS-tagging, synset prediction based on semantic knowledge graphs, and parent prediction based on dependency parse trees. Our approach achieves competitive results on the Natural Language Inference task, compared to the state of the art. Specifically for smaller models, the method results in a significant performance boost, emphasizing the fact that intelligent pre-training can make up for fewer parameters and help building more efficient models. Combining POS-tagging and synset prediction yields the overall best results.	Lisa Pucknat, Maren Pielka, Rafet Sifa, Svetlana Schmidt	Fraunhofer IAIS, Schloss Birlinghoven
82	Don't Raise Your Voice, Improve Your Argument: Learning to Retrieve Convincing Arguments	The Information Retrieval community has made strides in developing neural rankers, which have shown strong retrieval effectiveness on large-scale gold standard datasets. The focus of existing neural rankers has primarily been on measuring the relevance of a document or passage to the user query. However, other considerations such as the convincingness of the content are not taken into account when retrieving content. We present a large gold standard dataset, referred to as CoRe, which focuses on enabling researchers to explore the integration of the concepts of convincingness and relevance to allow for the retrieval of relevant yet persuasive content. Through extensive experiments on this dataset, we report that there is a close association between convincingness and relevance that can have practical value in how convincing content is presented and retrieved in practice.	Amin Bigdeli, Ebrahim Bagheri, Morteza Zihayat, Negar Arabzadeh, Sara Salamat, Shirin Seyedsalehi	Toronto Metropolitan University; University of Waterloo
83	Neural Ad-Hoc Retrieval Meets Open Information Extraction	This paper presents the idea of systematically integrating relation triples derived from Open Information Extraction (OpenIE) with neural rankers in order to improve the performance of the ad-hoc retrieval task. This is motivated by two reasons: (1) to capture longer-range semantic associations between keywords in documents, which would not otherwise be immediately identifiable by neural rankers; and (2) identify closely mentioned yet semantically unrelated content in the document that could lead to a document being incorrectly considered to be relevant for the query. Through our extensive experiments on three widely used TREC collections, we show that our idea consistently leads to noticeable performance improvements for neural rankers on a range of metrics.	Ba Pham, DucThuan Vo, Ebrahim Bagheri, Fattane Zarrinkalam, Negar Arabzadeh, Sara Salamat	Toronto Metropolitan Univ, Toronto, ON, Canada; Univ Guelph, Guelph, ON, Canada; Univ Toronto, Toronto, ON, Canada; Univ Waterloo, Waterloo, ON, Canada
84	Evolution of Filter Bubbles and Polarization in News Recommendation	Recent work in news recommendation has demonstrated that recommenders can over-expose users to articles that support their pre-existing opinions. However, most existing work focuses on a static setting or on a short time window, leaving open questions about the long-term and dynamic impacts of news recommendations. In this paper, we explore these dynamic impacts through a systematic study of three research questions: 1) How do the news reading behaviors of users change after repeated long-term interactions with recommenders? 2) How do the inherent preferences of users change over time in such a dynamic recommender system? 3) Can the existing SOTA static method alleviate the problem in the dynamic environment? Concretely, we conduct a comprehensive data-driven study through simulation experiments of political polarization in news recommendations based on 40,000 annotated news articles. We find that users are rapidly exposed to more extreme content as the recommender evolves. We also find that a calibration-based intervention can slow down this polarization, but leaves open significant opportunities for future improvements.	Han Zhang, James Caverlee, Ziwei Zhu	George Mason University; Texas A&M University
85	Augmenting Graph Convolutional Networks with Textual Data for Recommendations	Graph Convolutional Networks have recently shown state-of-the-art performance for collaborative filtering-based recommender systems. However, many systems use a pure user-item bipartite interaction graph, ignoring available additional information about the items and users. This paper proposes an effective and general method, TextGCN, that utilizes rich textual information about the graph nodes, specifically user reviews and item descriptions, using pre-trained text embeddings. We integrate those reviews and descriptions into item recommendations to augment graph embeddings obtained using LightGCN, a SOTA graph network. Our model achieves a 7–23% statistically significant improvement over this SOTA baseline when evaluated on several diverse large-scale review datasets. Furthermore, our method captures semantic signals from the text, which are not available when using graph connections alone.	Eugene Agichtein, Marcus D. Collins, Oleg Rokhlenko, Sergey Volokhin	Amazon; Emory University
86	BioASQ at CLEF2023: The Eleventh Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge		Anastasia Krithara, Anastasios Nentidis, Eulàlia FarréMaduell, Georgios Paliouras, Martin Krallinger, Salvador LimaLópez
87	TourismNLG: A Multi-lingual Generative Benchmark for the Tourism Domain	The tourism industry is important for the benefits it brings and due to its role as a commercial activity that creates demand and growth for many more industries. Yet there is not much work on data science problems in tourism. Unfortunately, there is not even a standard benchmark for evaluation of tourism-specific data science tasks and models. In this paper, we propose a benchmark, TourismNLG, of five natural language generation (NLG) tasks for the tourism domain and release corresponding datasets with standard train, validation and test splits. Further, previously proposed data science solutions for tourism problems do not leverage the recent benefits of transfer learning. Hence, we also contribute the first rigorously pretrained mT5 and mBART model checkpoints for the tourism domain. The models have been pretrained on four tourism-specific datasets covering different aspects of tourism. Using these models, we present initial baseline results on the benchmark tasks. We hope that the dataset will promote active research for natural language generation for travel and tourism. ( https://drive.google.com/file/d/1tux19cLoXc1gz9Jwj9VebXmoRvF9MF6B/ .)	Manish Gupta, Manish Shrivastava, Omkar Gurjar, Sahaj Agarwal, Sahil Manoj Bhatt	IIIT-Hyderabad; Microsoft
88	An Interpretable Knowledge Representation Framework for Natural Language Processing with Cross-Domain Application	Data representation plays a crucial role in natural language processing (NLP), forming the foundation for most NLP tasks. Indeed, NLP performance highly depends upon the effectiveness of the preprocessing pipeline that builds the data representation. Many representation learning frameworks, such as Word2Vec, encode input data based on local contextual information that interconnects words. Such approaches can be computationally intensive, and their encoding is hard to explain. We here propose an interpretable representation learning framework utilizing Tsetlin Machine (TM). The TM is an interpretable logic-based algorithm that has exhibited competitive performance in numerous NLP tasks. We employ the TM clauses to build a sparse propositional (boolean) representation of natural language text. Each clause is a class-specific propositional rule that links words semantically and contextually. Through visualization, we illustrate how the resulting data representation provides semantically more distinct features, better separating the underlying classes. As a result, the following classification task becomes less demanding, benefiting simple machine learning classifiers such as Support Vector Machine (SVM). We evaluate our approach using six NLP classification tasks and twelve domain adaptation tasks. Our main finding is that the accuracy of our proposed technique significantly outperforms the vanilla TM, approaching the competitive accuracy of deep neural network (DNN) baselines. Furthermore, we present a case study showing how the representations derived from our framework are interpretable. (We use an asynchronous and parallel version of Tsetlin Machine: available at https://github.com/cair/PyTsetlinMachineCUDA ).	Bimal Bhattarai, Lei Jiao, OleChristoffer Granmo	Univ Agder, Ctr AI Res, Grimstad, Norway
89	Bootstrapped nDCG Estimation in the Presence of Unjudged Documents	Retrieval studies often reuse TREC collections after the corresponding tracks have passed. Yet, a fair evaluation of new systems that retrieve documents outside the original judgment pool is not straightforward. Two common ways of dealing with unjudged documents are to remove them from a ranking (condensed lists), or to treat them as non- or highly relevant (naïve lower and upper bounds). However, condensed list-based measures often overestimate the effectiveness of a system, and naïve bounds are often very “loose”—especially for nDCG when some top-ranked documents are unjudged. As a new alternative, we employ bootstrapping to generate a distribution of nDCG scores by sampling judgments for the unjudged documents using run-based and/or pool-based priors. Our evaluation on four TREC collections with real and simulated cases of unjudged documents shows that bootstrapped nDCG scores yield more accurate predictions than condensed lists, and that they are able to strongly tighten upper bounds at a negligible loss of accuracy.	Lukas Gienapp, Maik Fröbe, Martin Potthast, Matthias Hagen	Friedrich-Schiller-Universität Jena; Leipzig University and ScaDS.AI
90	Domain-Driven and Discourse-Guided Scientific Summarisation	Scientific articles tend to follow a standardised discourse that enables a reader to quickly identify and extract useful or important information. We hypothesise that such structural conventions are strongly influenced by the scientific domain (e.g., Computer Science, Chemistry, etc.) and explore this through a novel extractive algorithm that utilises domain-specific discourse information for the task of abstract generation. In addition to being both simple and lightweight, the proposed algorithm constructs summaries in a structured and interpretable manner. In spite of these factors, we show that our approach outperforms strong baselines on the arXiv scientific summarisation dataset in both automatic and human evaluations, confirming that a scientific article’s domain strongly influences its discourse structure and can be leveraged to effectively improve its summarisation. Our code can be found at: https://github.com/TGoldsack1/DodoRank .	Carolina Scarton, Chenghua Lin, Tomas Goldsack, Zhihao Zhang	Beihang Univ, Beijing, Peoples R China; Univ Sheffield, Sheffield, S Yorkshire, England
91	Intention-Aware Neural Networks for Question Paraphrase Identification	We tackle Question Paraphrasing Identification (QPI), a task of determining whether a pair of interrogative sentences (i.e., questions) are paraphrases of each other, which is widely applied in information retrieval and question answering. It is challenging to identify the distinctive instances which are similar in semantics though holding different intentions. In this paper, we propose an intention-aware neural model for QPI. Question words (e.g., “when”) and blocks (e.g., “what time”) are extracted as features for revealing intentions. They are utilized to regulate pairwise question encoding explicitly and implicitly, within Conditional Variational AutoEncoder (CVAE) and multi-task VAE frameworks, respectively. We conduct experiments on the benchmark corpora QQP, LCQMC and BQ, towards both English and Chinese QPI tasks. Experimental results show that our method yields generally significant improvements compared to a variety of PLM-based baselines (BERT, RoBERTa and ERNIE), and it outperforms the state-of-the-art QPI models. It is also proven that our method doesn’t severely reduce the overall efficiency, which merely extends the training time by 12.5% on a RTX3090. All the models and source codes will be made publicly available to support reproducible research.	Guodong Zhou, Jianmin Yao, Rui Peng, Yu Hong, Zhiling Jin	Soochow University
92	Document-Level Relation Extraction with Distance-Dependent Bias Network and Neighbors Enhanced Loss	Document-level relation extraction (DocRE), in contrast to sentence-level, requires additional context to be considered. Recent studies, when extracting contextual information about entities, treat information about the whole document equally, which inevitably suffers from irrelevant information. This has been demonstrated to make the model not robust: it predicts correctly when an entire document is fed but errs when non-evidence sentences are removed. In this work, we propose three novel components to improve the robustness of the model by selectively considering the context of the entities. Firstly, we propose a new method for computing the distance between tokens that reduces the distance between evidence sentences and entities. Secondly, we add a distance-dependent bias network to each self-attention building block to exploit the distance information between tokens. Finally, we design an auxiliary loss for entities with higher attention to close tokens in the attention mechanism. Experimental results on three DocRE benchmark datasets show that our model not only outperforms existing models but also has strong robustness.	Hao Liang, Qifeng Zhou	Xiamen Univ, Dept Automat, Xiamen 361005, Peoples R China
93	Stat-Weight: Improving the Estimator of Interleaved Methods Outcomes with Statistical Hypothesis Testing	Interleaving is an online evaluation approach for information retrieval systems that compares the effectiveness of ranking functions in interpreting the users' implicit feedback. Previous work such as Hofmann et al. (2011) [11] has evaluated the most promising interleaved methods at the time, on uniform distributions of queries. In the real world, usually, there is an unbalanced distribution of repeated queries that follows a long-tailed users' search demand curve. This paper first aims to reproduce the Team Draft Interleaving accuracy evaluation on uniform query distributions [11] and then focuses on assessing how this method generalises to long-tailed real-world scenarios. The replicability work raised interesting considerations on how the winning ranking function for each query should impact the overall winner for the entire evaluation. Based on what was observed, we propose that not all the queries should contribute to the final decision in equal proportion. As a result of these insights, we designed two variations of the Delta(AB) score winner estimator that assign to each query a credit based on statistical hypothesis testing. To reproduce, replicate and extend the original work, we have developed from scratch a system that simulates a search engine and users' interactions from datasets from the industry. Our experiments confirm our intuition and show that our methods are promising in terms of accuracy, sensitivity, and robustness to noise.	Alessandro Benedetti, Mario A. Ruggero	Sease Ltd, London, England
94	PyGaggle: A Gaggle of Resources for Open-Domain Question Answering	Text retrieval using dense–sparse hybrids has been gaining popularity because of their effectiveness. Improvements to both sparse and dense models have also been noted, in the context of open-domain question answering. However, the increasing sophistication of proposed techniques places a growing strain on the reproducibility of results. Our work aims to tackle this challenge. In Generation Augmented Retrieval (GAR), a sequence-to-sequence model was used to generate candidate answer strings as well as titles of documents and actual sentences where the answer string might appear; this query expansion was applied before traditional sparse retrieval. Distilling Knowledge from Reader to Retriever (DKRR) used signals from downstream tasks to train a more effective Dense Passage Retrieval (DPR) model. In this work, we first replicate the results of GAR using a different codebase and leveraging a more powerful sequence-to-sequence model, T5. We provide tight integration with Pyserini, a popular IR toolkit, where we also add support for the DKRR-based DPR model: the combination demonstrates state-of-the-art effectiveness for retrieval in open-domain QA. To account for progress in generative readers that leverage evidence fusion for QA, so-called fusion-in-decoder (FiD), we incorporate these models into our PyGaggle toolkit. The result is a reproducible, easy-to-use, and powerful end-to-end question-answering system that forms a starting point for future work. Finally, we provide evaluation tools that better gauge whether models are generalizing or simply memorizing.	Haonan Chen, Jimmy Lin, Lingwei Gu, Manveer Singh Tamber, Ronak Pradeep	University of Waterloo
95	Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering	One of the contributions of the landmark Dense Passage Retriever (DPR) work is the curation of a corpus of passages generated from Wikipedia articles that have been segmented into non-overlapping passages of 100 words. This corpus has served as the standard source for question answering systems based on a retriever–reader pipeline and provides the basis for nearly all state-of-the-art results on popular open-domain question answering datasets. There are, however, multiple potential drawbacks to this corpus. First, the passages do not include tables, infoboxes, and lists. Second, the choice to split articles into non-overlapping passages results in fragmented sentences and disjoint passages that models might find hard to reason over. In this work, we experimented with multiple corpus variants from the same Wikipedia source, differing in passage size, overlapping passages, and the inclusion of linearized semi-structured data. The main contribution of our work is the replication of Dense Passage Retriever and Fusion-in-Decoder training on our corpus variants, allowing us to validate many of the findings in previous work and giving us new insights into the importance of corpus pre-processing for open-domain question answering. With better data preparation, we see improvements of over one point on both the Natural Questions dataset and the TriviaQA dataset in end-to-end effectiveness over previous work measured using the exact match score. Our results demonstrate the importance of careful corpus curation and provide the basis for future work.	Jimmy Lin, Manveer Singh Tamber, Ronak Pradeep	Univ Waterloo, David R Cheriton Sch Comp Sci, Waterloo, ON, Canada
96	Exploring Tabular Data Through Networks	Representing and visualizing data as networks is a widely spread approach to analyzing highly connected data in domains such as medicine, social sciences, and information retrieval. Investigating data as networks requires pre-processing, retrieval or filtering, conversion of data into networks, and application of various network analysis approaches. These processes are usually complex and hard to perform without some programming knowledge and resources. To the best of our knowledge, most solutions attempting to make these functionalities accessible to users focus on particular processes in isolation without exploring how these processes could be further abstracted or combined in a real-world application to assist users in their data exploration and knowledge extraction journey. Furthermore, most applications focusing on such approaches tend to be closed-source. This paper introduces a solution that combines the approaches above as part of Collaboration Spotting X (CSX), an open-source network-based visual analytics tool for retrieving, modeling, and exploring or analyzing data as networks. It abstracts the concepts above through the use of multiple interactive visualizations. In addition to being an easily accessible open-source platform for data exploration and analysis, CSX can also serve as a real-world evaluation platform for researchers in related computer science areas who wish to test their solutions and approaches to machine learning, visualizations, interactions, and more in a real-world system.	Aleksandar Bobic, Christian Gütl, JeanMarie Le Goff	CERN, IPT Dept, Geneva, Switzerland; Graz Univ Technol, CoDiS Lab ISDS, Graz, Austria
97	TweetStream2Story: Narrative Extraction from Tweets in Real Time	The rise of social media has brought a great transformation to the way news is discovered and shared. Unlike traditional news sources, social media allows anyone to cover a story. Therefore, sometimes an event is already discussed by people before a journalist turns it into a news article. Twitter is a particularly appealing social network for discussing events, since its posts are very compact and, therefore, contain colloquial language and abbreviations. However, its large volume of tweets also makes it impossible for a user to keep up with an event. In this work, we present TweetStream2Story, a web app for extracting narratives from tweets posted in real time about a topic of choice. This framework can be used to provide new information to journalists or be of interest to any user who wishes to stay up-to-date on a certain topic or ongoing event. As a contribution to the research community, we provide a live version of the demo, as well as its source code.	Alípio Jorge, Mafalda Castro, Ricardo Campos	INESC TEC
98	Automated Extraction of Fine-Grained Standardized Product Information from Unstructured Multilingual Web Data	Extracting structured information from unstructured data is one of the key challenges in modern information retrieval applications, including e-commerce. Here, we demonstrate how recent advances in machine learning, combined with a recently published multilingual data set with standardized fine-grained product category information, enable robust product attribute extraction in challenging transfer learning settings. Our models can reliably predict product attributes across online shops, languages, or both. Furthermore, we show that our models can be used to match product taxonomies between online retailers.	Alexander Flick, Felix Biessmann, Ivana Trajanovska, Sebastian Jäger	Berlin University of Applied Sciences and Technology
99	SOPalign: A Tool for Automatic Estimation of Compliance with Medical Guidelines	SOPalign is a tool designed for hospitals and other healthcare providers in the Netherlands to automatically estimate the compliance of internal standard operating procedures (SOPs) for employees with the national guidelines. In this tool, users can upload the SOPs of their hospital and the recommendations from the most recent guidelines. SOPalign will then link the individual recommendations from the guidelines to the relevant passages of text in the SOPs and determine whether these passages are compliant with the recommendations. To link the SOP passages to the recommendations from the guideline, we make use of a Semantic Textual Similarity (STS) model based on the siamese BERT-network architecture. For efficiency reasons, we only apply the STS model to sentences that exceed a threshold in n-gram cosine similarity. To estimate compliance of SOPs with guideline recommendations, we have fine-tuned pre-trained language models using two different Dutch Natural Language Inference (NLI) datasets.	Arjen P. de Vries, Heiman Wertheim, Luke van Leijenhorst, Thera Habben Jansen	Department of Infection Prevention and Control, Amphia Hospital; Department of Medical Microbiology, Radboudumc; Radboud University
100	Monitoring Online Discussions and Responses to Support the Identification of Misinformation	Misinformation prospers on online social networks and impacts society in various ways. It spreads rapidly online; therefore, it is crucial to keep track of any information that could potentially be false as early as possible. Many efforts have focused on detecting and eliminating misinformation using machine learning methods. Our proposed framework aims to leverage the strength of human roles engaging with a machine learning tool, providing a monitoring tool to identify the risk of misinformation on Twitter at an early stage. Specifically, this work is interested in a visualisation tool that prioritises popular Twitter topics and analyses the responses to the higher-risk topics through stance classification. Besides tackling the challenging task of stance classification, this work also aims to explore features within the information from Twitter that could provide further aspects of a response to a topic using sentiment analysis. The main objective is to provide an engaging tool for people who are also working on the issue of online misinformation, i.e., fact-checkers, to help them identify and manage the risk of a specific topic at an early stage by taking appropriate action before the consequences worsen.	Xin Yu Liew	Univ Nottingham, Sch Comp Sci, Jubilee Campus, Wollaton Rd, Nottingham NG8 1BB, England
101	The CLEF-2023 CheckThat! Lab: Checkworthiness, Subjectivity, Political Bias, Factuality, and Authority	The five editions of the CheckThat! lab so far have focused on the main tasks of the information verification pipeline: check-worthiness, evidence retrieval and pairing, and verification. The 2023 edition of the lab zooms into some of the problems and, for the first time, it offers five tasks in seven languages (Arabic, Dutch, English, German, Italian, Spanish, and Turkish): Task 1 asks to determine whether an item, text or a text plus an image, is check-worthy; Task 2 requires assessing whether a text snippet is subjective or not; Task 3 aims to estimate the political bias of a document or a news outlet; Task 4 requires determining the level of factuality of a document or a news outlet; and Task 5 is about identifying authorities that should be trusted to verify a contended claim.	Alberto Barrón-Cedeño, Andrea Galassi, Dilshod Azizov, Fatima Haouari, Federico Ruggeri, Firoj Alam, Giovanni Da San Martino, Gullal S. Cheema, Julia Maria Struß, Preslav Nakov, Rabindra Nath Nandi, Tamer Elsayed, Tommaso Caselli	BJIT Ltd, Dhaka, Bangladesh; HBKU, Qatar Comp Res Inst, Ar Rayyan, Qatar; Mohamed bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates; Qatar Univ, Doha, Qatar; TIB Leibniz Informat Ctr Sci & Technol, Hannover, Germany; Univ Appl Sci Potsdam, Potsdam, Germany; Univ Bologna, Bologna, Italy; Univ Groningen, Groningen, Netherlands; Univ Padua, Padua, Italy
102	Overview of PAN 2023: Authorship Verification, Multi-author Writing Style Analysis, Profiling Cryptocurrency Influencers, and Trigger Detection - Extended Abstract		Annina Heini, Benno Stein, Efstathios Stamatatos, Erik Körner, Eva Zangerle, Francisco Rangel, Janek Bevendorff, Krzysztof Kredens, Magdalena Wolska, Mara Chinea-Rios, Marc Franco-Salvador, Martin Potthast, Matti Wiegmann, Maximilian Mayerl, Paolo Rosso, Piotr Pezik
103	Fragmented Visual Attention in Web Browsing: Weibull Analysis of Item Visit Times	Users often browse the web in an exploratory way, inspecting what they find interesting without a specific goal. However, the temporal dynamics of visual attention during such sessions, emerging when users gaze from one item to another, are not well understood. In this paper, we examine how people distribute visual attention among content items when browsing news. Distribution of visual attention is studied in a controlled experiment, wherein eye-tracking data and web logs are collected for 18 participants exploring newsfeeds in a single- and multi-column layout. Behavior is modeled using Weibull analysis of item (article) visit times, which describes these visits via quantities like durations and frequencies of switching focused item. Bayesian inference is used to quantify uncertainty. The results suggest that visual attention in browsing is fragmented, and affected by the number, properties and composition of the items visible on the viewport. We connect these findings to previous work explaining information-seeking behavior through cost-benefit judgments.	Aini Putkonen, Antti Oulasvirta, Aurélien Nioche, Crista Kuuramo, Markku Laine	Aalto Univ, Dept Informat & Commun Engn, Espoo, Finland; Univ Glasgow, Sch Comp Sci, Glasgow, Scotland; Univ Helsinki, Dept Psychol & Logoped, Helsinki, Finland
104	Domain-Aligned Data Augmentation for Low-Resource and Imbalanced Text Classification	Data Augmentation approaches often use Language Models, pretrained on large quantities of unlabeled generic data, to conditionally generate examples. However, the generated data can be of subpar quality and struggle to maintain the same characteristics as the original dataset. To this end, we propose a Data Augmentation method for low-resource and imbalanced datasets, by aligning Language Models to in-domain data prior to generating synthetic examples. In particular, we propose the alignment of existing generic models in task-specific unlabeled data, in order to create better synthetic examples and boost performance in Text Classification tasks. We evaluate our approach on three diverse and well-known Language Models, four datasets, and two settings (i.e. imbalance and low-resource) in which Data Augmentation is usually deployed, and study the correlation between the amount of data required for alignment, model size, and its effects in downstream in-domain and out-of-domain tasks. Our results showcase that in-domain alignment helps create better examples and increase the performance in Text Classification. Furthermore, we find a positive connection between the number of training parameters in Language Models, the volume of fine-tuning data, and their effects in downstream tasks.	Despoina Chatzakou, Ioannis Kompatsiaris, Nikolaos Stylianou, Stefanos Vrochidis, Theodora Tsikrika	Inst Informat Technol, Ctr Res & Technol Hellas, Thessaloniki, Greece
105	Multimodal Geolocation Estimation of News Photos	The widespread growth of multimodal news requires sophisticated approaches to interpret content and relations of different modalities. Images are of utmost importance since they represent a visual gist of the whole news article. For example, it is essential to identify the locations of natural disasters for crisis management or to analyze political or social events across the world. In some cases, verifying the location(s) claimed in a news article might help human assessors or fact-checking efforts to detect misinformation, i.e., fake news. Existing methods for geolocation estimation typically consider only a single modality, e.g., images or text. However, news images can lack sufficient geographical cues to estimate their locations, and the text can refer to various possible locations. In this paper, we propose a novel multimodal approach to predict the geolocation of news photos. To enable this approach, we introduce a novel dataset called Multimodal Geolocation Estimation of News Photos (MMG-NewsPhoto). MMG-NewsPhoto is, so far, the largest dataset for the given task and contains more than half a million news texts with the corresponding image, out of which 3000 photos were manually labeled for the photo geolocation based on information from the image-text pairs. For a fair comparison, we optimize and assess state-of-the-art methods using the new benchmark dataset. Experimental results show the superiority of the multimodal models compared to the unimodal approaches.	Eric Müller-Budack, Golsa Tahmasebzadeh, Ralph Ewerth, Sherzod Hakimov	TIB Leibniz Informat Ctr Sci & Technol, Hannover, Germany; Univ Potsdam, Computat Linguist, Potsdam, Germany
106	Clustering of Bandit with Frequency-Dependent Information Sharing	In today’s business marketplace, there is a rapidly growing demand for intelligent interactive recommendation systems, which sequentially suggest suitable items to users by accurately predicting their preferences while receiving up-to-date feedback to improve overall performance. Multi-armed bandit, which has been widely applied to various online systems, is quite capable of delivering such efficient recommendation services. To further enhance online recommendations, many works have introduced clustering techniques to fully utilize users’ information. These works consider symmetric relations between users, i.e., users in one cluster share equal weights. However, in practice, users in one cluster usually have different interaction frequencies (i.e., activeness), and their collaborative relations are asymmetric. This brings a challenge for bandit clustering, since inactive users lack the capability of leveraging this interaction information to mitigate the cold-start problem, and further affect active users in the same cluster. In this work, we explore user activeness and propose a frequency-dependent clustering of bandit model to deal with the aforementioned challenge. The model learns a representation of each user’s cluster by sharing collaborative information weighted by user activeness, i.e., inactive users can utilize the collaborative information from active ones in the same cluster to optimize the cold-start process. Extensive experiments on both synthetic data and two real-world datasets indicate the efficiency and effectiveness of our proposed model.	Qifeng Zhou, Qing Wang, Shen Yang	IBM T J Watson Res Ctr, Intelligent IT Operat, New York, NY USA; Xiamen Univ, Dept Automat, Xiamen, Peoples R China
107	Improving Neural Topic Models with Wasserstein Knowledge Distillation	Topic modeling is a dominant method for exploring document collections on the web and in digital libraries. Recent approaches to topic modeling use pretrained contextualized language models and variational autoencoders. However, large neural topic models have a considerable memory footprint. In this paper, we propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality. In particular, the proposed distillation objective is to minimize the cross-entropy of the soft labels produced by the teacher and the student models, as well as to minimize the squared 2-Wasserstein distance between the latent distributions learned by the two models. Experiments on two publicly available datasets show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model, and even surpasses the teacher while containing far fewer parameters than the teacher. The distilled model also outperforms several other competitive topic models on topic coherence.	Debarshi Kumar Sanyal, Suman Adhya	Indian Association for the Cultivation of Science
108	Exploring Fake News Detection with Heterogeneous Social Media Context Graphs	Fake news detection has become a research area that goes way beyond a purely academic interest as it has direct implications on our society as a whole. Recent advances have primarily focused on text-based approaches. However, it has become clear that to be effective one needs to incorporate additional, contextual information such as spreading behaviour of news articles and user interaction patterns on social media. We propose to construct heterogeneous social context graphs around news articles and reformulate the problem as a graph classification task. Exploring the incorporation of different types of information (to get an idea as to what level of social context is most effective) and using different graph neural network architectures indicates that this approach is highly effective with robust results on a common benchmark dataset.	Gregor Donabauer, Udo Kruschwitz	University of Regensburg
109	Where a Little Change Makes a Big Difference: A Preliminary Exploration of Children's Queries	This paper contributes to the discussion initiated in a recent SIGIR paper describing a gap in the information retrieval (IR) literature on query understanding: where queries come from and whether they serve their purpose. Of particular concern is the connection between query variability and search engines with regard to consistent and equitable access for all users. We focus on a user group typically underserved: children. Using preliminary experiments (based on logs collected in the classroom context) and arguments grounded in children's IR literature, we emphasize the importance of dedicating research efforts to interpreting queries formulated by children and the information needs they elicit. We also outline open problems and possible research directions to advance knowledge in this area, not just for children but also for other often-overlooked user groups and contexts.	Emiliana Murgia, Maria Soledad Pera, Mohammad Aliannejadi, Monica Landoni, Theo Huibers	Delft Univ Technol, Web Informat Syst, Delft, Netherlands; Univ Amsterdam, Amsterdam, Netherlands; Univ Milan, Bicocca, Italy; Univ Svizzera Italiana, Lugano, Switzerland; Univ Twente, Enschede, Netherlands
110	Towards Detecting Interesting Ideas Expressed in Text	In recent years, product and project ideas are often sourced from public competitions, where anyone can enter their own solutions to an open-ended question. While copious ideas can be gathered in this way, it becomes difficult to find the most promising results among all entries. This paper explores the potential of automating the detection of interesting ideas and studies the effect of various features of ideas on the prediction task. A BERT-based model is built to rank ideas by their predicted interestingness, using text embeddings from idea descriptions and the concreteness, novelty as well as the uniqueness of ideas. The model is trained on a dataset of OpenIDEO idea competitions. The results show that language models can be used to speed up finding promising ideas, but care must be taken in choosing a suitable dataset.	Adam Jatowt, Bela Pfahl	Univ Innsbruck, Innsbruck, Austria
111	Trigger or not Trigger: Dynamic Thresholding for Few Shot Event Detection	Recent studies in few-shot event trigger detection from text address the task as a word sequence annotation task using prototypical networks. In this context, the classification of a word is based on the similarity of its representation to the prototypes built for each event type and for the “non-event” class (also named null class). However, the “non-event” prototype aggregates by definition a set of semantically heterogeneous words, which hurts the discrimination between trigger and non-trigger words. We address this issue by handling the detection of non-trigger words as an out-of-domain (OOD) detection problem and propose a method for dynamically setting a similarity threshold to perform this detection. Our approach increases f-score by about 10 points on average compared to the state-of-the-art methods on three datasets.	Aboubacar Tuo, Julien Tourille, Olivier Ferret, Romaric Besançon	Univ Paris Saclay, CEA, List, F-91120 Palaiseau, France
112	Utilising Twitter Metadata for Hate Classification	Social media has become an essential daily feature of people's lives. Social media platforms provide individuals wishing to cause harm with an open, anonymous, and far-reaching channel. As a result, society is experiencing a crisis concerning hate and abuse on social media. This paper aims to provide a better method of identifying these instances of hate via a custom BERT classifier which leverages readily available metadata from Twitter alongside traditional text data. With Accuracy, F1, Recall and Precision scores of 0.85, 0.75, 0.76, and 0.74, the new model presents a competitive performance compared to similar state-of-the-art models. The increased performance of models within this domain can only benefit society as they provide more effective means to combat hate on social media.	Jan Breitsohl, Joemon M. Jose, Oliver Warke	Univ Glasgow, Adam Smith Business Sch, Glasgow, Scotland; Univ Glasgow, Sch Comp Sci, Glasgow, Scotland
113	Quantifying Valence and Arousal in Text with Multilingual Pre-trained Transformers	The analysis of emotions expressed in text has numerous applications. In contrast to categorical analysis, focused on classifying emotions according to a pre-defined set of common classes, dimensional approaches can offer a more nuanced way to distinguish between different emotions. Still, dimensional methods have been less studied in the literature. Considering a valence-arousal dimensional space, this work assesses the use of pre-trained Transformers to predict these two dimensions on a continuous scale, with input texts from multiple languages and domains. We specifically combined multiple annotated datasets from previous studies, corresponding to either emotional lexica or short text documents, and evaluated models of multiple sizes and trained under different settings. Our results show that model size can have a significant impact on the quality of predictions, and that by fine-tuning a large model we can confidently predict valence and arousal in multiple languages. We make available the code, models, and supporting data.	Bruno Martins, Gonçalo Azevedo Mendes	Univ Lisbon, Inst Super Tecn, Lisbon, Portugal
114	Multilingual Detection of Check-Worthy Claims Using World Languages and Adapter Fusion	Check-worthiness detection is the task of identifying claims, worthy to be investigated by fact-checkers. Resource scarcity for non-world languages and model learning costs remain major challenges for the creation of models supporting multilingual check-worthiness detection. This paper proposes cross-training adapters on a subset of world languages, combined by adapter fusion, to detect claims emerging globally in multiple languages. (1) With a vast number of annotators available for world languages and the storage-efficient adapter models, this approach is more cost efficient. Models can be updated more frequently and thus stay up-to-date. (2) Adapter fusion provides insights and allows for interpretation regarding the influence of each adapter model on a particular language. The proposed solution often outperformed the top multilingual approaches in our benchmark tasks.	Ipek Baris Schlicht, Lucie Flek, Paolo Rosso	Univ Marburg, CAISA Lab, Frankfurt, Germany; Univ Politecn Valencia, PRHLT Res Ctr, Valencia, Spain
115	A Knowledge Infusion Based Multitasking System for Sarcasm Detection in Meme	In this paper, we hypothesize that sarcasm detection is closely associated with the emotion present in memes. Thereafter, we propose a deep multitask model to perform these two tasks in parallel, where sarcasm detection is treated as the primary task, and emotion recognition is considered an auxiliary task. We create a large-scale dataset consisting of 7416 memes in Hindi, one of the widely spoken languages. We collect the memes from various domains, such as politics, religion, racism, and sexism, and manually annotate each instance with three sarcasm categories, i.e., i) Not Sarcastic, ii) Mildly Sarcastic or iii) Highly Sarcastic, and 13 fine-grained emotion classes. Furthermore, we propose a novel Knowledge Infusion (KI) based module which captures sentiment-aware representation from a pre-trained model using the Memotion dataset. Detailed empirical evaluation shows that the multitasking model performs better than the single-task model. We also show that using this KI module on top of our model can boost the performance of sarcasm detection in both single-task and multi-task settings even further. Code and dataset are available at this link: https://www.iitp.ac.in/~ai-nlp-ml/resources.html#Sarcastic-Meme-Detection.	Arindam Chatterjee, Asif Ekbal, Dibyanayan Bandyopadhyay, Gitanjali Kumari, Santanu Pal, Vinutha BN	Indian Institute of Technology Patna; Wipro AI Labs
116	It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers	Depression detection from user-generated content on the internet has been a long-lasting topic of interest in the research community, providing valuable screening tools for psychologists. The ubiquitous use of social media platforms lays out the perfect avenue for exploring mental health manifestations in posts and interactions with other users. Current methods for depression detection from social media mainly focus on text processing, and only a few also utilize images posted by users. In this work, we propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts, using pretrained models for extracting image and text embeddings. Our model operates directly at the user-level, and we enrich it with the relative time between posts by using time2vec positional embeddings. Moreover, we propose another model variant, which can operate on randomly sampled and unordered sets of posts to be more robust to dataset noise. We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets, obtaining state-of-the-art results of 0.931 F1 score on a popular multimodal Twitter dataset, and 0.902 F1 score on the only multimodal Reddit dataset.	Adrian Cosma, Ana-Maria Bucur, Liviu P. Dinu, Paolo Rosso	Univ Bucharest, Fac Math & Comp Sci, Bucharest, Romania; Univ Bucharest, Interdisciplinary Sch Doctoral Studies, Bucharest, Romania; Univ Politecn Valencia, PRHLT Res Ctr, Valencia, Spain; Univ Politehn Bucuresti, Bucharest, Romania
117	CoLISA: Inner Interaction via Contrastive Learning for Multi-choice Reading Comprehension	Multi-choice reading comprehension (MC-RC) is the task of selecting the most appropriate answer from multiple candidate options by reading and comprehending a given passage and a question. Recent studies are dedicated to capturing the relationships within the triplet of passage, question, and option. Nevertheless, one limitation of current approaches is that confusing distractors are often mistakenly judged as correct, because models do not emphasize the differences between the answer alternatives. Motivated by the way humans deal with multi-choice questions by comparing given options, we propose CoLISA (Contrastive Learning and In-Sample Attention), a novel model to prudently exclude the confusing distractors. In particular, CoLISA acquires option-aware representations via contrastive learning on multiple options. Besides, in-sample attention mechanisms are applied across multiple options so that they can interact with each other. The experimental results on QuALITY and RACE demonstrate that our proposed CoLISA pays more attention to the relation between correct and distractive options, and recognizes the discrepancy between them. Meanwhile, CoLISA also reaches the state-of-the-art performance on QuALITY (our code is available at https://github.com/Walle1493/CoLISA).	Bowei Zou, Mengxing Dong, Yanling Li, Yu Hong	Inst Infocomm Res, Singapore, Singapore; Soochow Univ, Comp Sci & Technol, Suzhou, Peoples R China
118	Predicting the Listening Contexts of Music Playlists Using Knowledge Graphs	Playlists are a major way of interacting with music, as evidenced by the fact that streaming services currently host billions of playlists. In this content overload scenario, it is crucial to automatically characterise playlists, so that music can be effectively organised, accessed and retrieved. One way to characterise playlists is by their listening context. For example, one listening context is “workout”, which characterises playlists suited to be listened to by users while working out. Recent work attempts to predict the listening contexts of playlists, formulating the problem as multi-label classification. However, current classifiers for listening context prediction are limited in the input data modalities that they handle, and on how they leverage the inputs for classification. As a result, they achieve only modest performance. In this work, we propose to use knowledge graphs to handle multi-modal inputs, and to effectively leverage such inputs for classification. We formulate four novel classifiers which yield approximately 10% higher performance than the state-of-the-art. Our work is a step forward in predicting the listening contexts of playlists, which could power important real-world applications, such as context-aware music recommender systems and playlist retrieval systems.	Derek G. Bridge, Giovanni Gabbolini	Univ Coll Cork, Sch Comp Sci & IT, Insight Ctr Data Analyt, Cork, Ireland
119	A Mask-Based Logic Rules Dissemination Method for Sentiment Classifiers	Disseminating and incorporating logic rules inspired by domain knowledge in Deep Neural Networks (DNNs) is desirable to make their output causally interpretable, reduce data dependence, and provide some human supervision during training to prevent undesirable outputs. Several methods have been proposed for that purpose, but performing end-to-end training while keeping the DNNs informed about logical constraints remains a challenging task. In this paper, we propose a novel method to disseminate logic rules in DNNs for Sentence-level Binary Sentiment Classification. In particular, we couple a Rule-Mask Mechanism with a DNN model which, given an input sequence, predicts a vector containing binary values corresponding to each token that captures, if applicable, a linguistically motivated logic rule on the input sequence. We compare our method with a number of state-of-the-art baselines and demonstrate its effectiveness. We also release a new Twitter-based dataset specifically constructed to test logic rule dissemination methods and propose a new heuristic approach to provide automatic high-quality labels for the dataset.	Antonio Robles-Kelly, Mohamed Reda Bouadjenek, Shashank Gupta	Deakin Univ, Sch Informat Technol, Waurn Ponds Campus, Geelong, Vic 3216, Australia; Def Sci & Technol Grp, Edinburg, SA 5111, Australia
120	Injecting Temporal-Aware Knowledge in Historical Named Entity Recognition	In this paper, we address the detection of named entities in multilingual historical collections. We argue that, besides the multiple challenges that depend on the quality of digitization (e.g., misspellings and linguistic errors), historical documents can pose another challenge due to the fact that such collections are distributed over a long enough period of time to be affected by changes and evolution of natural language. Thus, we consider that detecting entities in historical collections is time-sensitive, and explore the inclusion of temporality in the named entity recognition (NER) task by exploiting temporal knowledge graphs. More precisely, we retrieve semantically-relevant additional contexts by exploring the time information provided by historical data collections and include them as mean-pooled representations in a Transformer-based NER model. We experiment with two recent multilingual historical collections in English, French, and German, consisting of historical newspapers (19C-20C) and classical commentaries (19C). The results are promising and show the effectiveness of injecting temporal-aware knowledge into the different datasets, languages, and diverse entity types.	Ahmed Hamdi, Antoine Doucet, Carlos-Emiliano González-Gallardo, Edward Giamphy, Emanuela Boros, José G. Moreno	University of La Rochelle, L3i
121	Temporal Natural Language Inference: Evidence-Based Evaluation of Temporal Text Validity	It is important to learn whether text information remains valid or not for various applications including story comprehension, information retrieval, and user state tracking on microblogs and via chatbot conversations. It is also beneficial to deeply understand the story. However, this kind of inference is still difficult for computers as it requires temporal commonsense. We propose a novel task, Temporal Natural Language Inference, inspired by traditional natural language reasoning to determine the temporal validity of text content. The task requires inferring and judging whether an action expressed in a sentence is still ongoing or already completed, and hence whether the sentence still remains valid, given its supplementary content. We first construct our own dataset for this task and train several machine learning models. Then we propose an effective method for learning information from an external knowledge base that gives hints on temporal commonsense knowledge. Using the prepared dataset, we introduce a new machine learning model that incorporates the information from the knowledge base and demonstrate that our model outperforms state-of-the-art approaches in the proposed task.	Adam Jatowt, Kazunari Sugiyama, Taishi Hosokawa	Kyoto University; Osaka Seikei University; University of Innsbruck
122	Theoretical Analysis on the Efficiency of Interleaved Comparisons	This study presents a theoretical analysis on the efficiency of interleaving, an efficient online evaluation method for rankings. Although interleaving has already been applied to production systems, the source of its high efficiency has not been clarified in the literature. Therefore, this study presents a theoretical analysis on the efficiency of interleaving methods. We begin by designing a simple interleaving method similar to ordinary interleaving methods. Then, we explore a condition under which the interleaving method is more efficient than A/B testing and find that this is the case when users leave the ranking depending on the item's relevance, a typical assumption made in click models. Finally, we perform experiments based on numerical analysis and user simulation, demonstrating that the theoretical results are consistent with the empirical results.	Hajime Morita, Kojiro Iizuka, Makoto P. Kato	Gunosy Inc, Shibuya, Japan; Univ Tsukuba, Tsukuba, Ibaraki, Japan
123	An Experimental Study on Pretraining Transformers from Scratch for IR	Finetuning Pretrained Language Models (PLM) for IR has been the de facto standard practice since their breakthrough effectiveness a few years ago. But is this approach well understood? In this paper, we study the impact of the pretraining collection on the final IR effectiveness. In particular, we challenge the current hypothesis that PLM shall be trained on a large enough generic collection and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage rankers and cross-encoders for reranking on the task of general passage retrieval on MSMARCO, Mr-Tydi for Arabic, Japanese and Russian, and TripClick for a specific domain. Contrary to popular belief, we show that, for finetuning first-stage rankers, models pretrained solely on their collection have equivalent or better effectiveness compared to more general models. However, there is a slight effectiveness drop for rerankers pretrained only on the target collection. Overall, our study sheds new light on the role of the pretraining collection and should make our community ponder on building specialized models by pretraining from scratch. Last but not least, doing so could enable better control of efficiency, data bias and replicability, which are key research questions for the IR community.	Carlos Lassance, Hervé Déjean, Stéphane Clinchant	Naver Labs Europe
124	A Transformer-Based Framework for POI-Level Social Post Geolocation	POI-level geo-information of social posts is critical to many location-based applications and services. However, the multi-modality, complexity, and diverse nature of social media data and their platforms limit the performance of inferring such fine-grained locations and their subsequent applications. To address this issue, we present a transformer-based general framework, which builds upon pre-trained language models and considers non-textual data, for social post geolocation at the POI level. To this end, inputs are categorized to handle different social data, and an optimal combination strategy is provided for feature representations. Moreover, a uniform representation of hierarchy is proposed to learn temporal information, and a concatenated version of encodings is employed to better capture feature-wise positions. Experimental results on various social media datasets demonstrate that the three variants of our proposed framework outperform multiple state-of-the-art baselines by a large margin in terms of accuracy and distance error metrics.	Junhua Liu, Kwan Hui Lim, Menglin Li, Teng Guo	Dalian Univ Technol, Dalian, Peoples R China; Singapore Univ Technol & Design, Singapore, Singapore
125	Multimodal Inverse Cloze Task for Knowledge-Based Visual Question Answering	We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities grounded in a visual context using a Knowledge Base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task, which leverages contextualized images in multimodal documents to generate visual pseudo-questions. Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension, respectively, over a no-pre-training baseline.	Camille Guinaudeau, Olivier Ferret, Paul Lerner	Université Paris-Saclay, CEA, List; Université Paris-Saclay, CNRS, LISN
126	Service Is Good, Very Good or Excellent? Towards Aspect Based Sentiment Intensity Analysis	Aspect-based sentiment analysis (ABSA) is a fast-growing research area in natural language processing (NLP) that provides more fine-grained information, considering the aspect as the fundamental item. The ABSA primarily measures sentiment towards a given aspect, but does not quantify the intensity of that sentiment. For example, the intensity of positive sentiment expressed for service in "service is good" is comparatively weaker than in "service is excellent". Thus, aspect sentiment intensity will assist the stakeholders in mining user preferences more precisely. Our current work introduces a novel task called aspect based sentiment intensity analysis (ABSIA) that facilitates research in this direction. An annotated review corpus for ABSIA is introduced by labelling the benchmark SemEval ABSA restaurant dataset with the seven (7) classes in a semi-supervised way. To demonstrate the effective usage of the corpus, we cast ABSIA as a natural language generation task, where a natural sentence is generated to represent the output in order to utilize the pre-trained language models effectively. Further, we propose an effective technique for joint learning where ABSA is used as a secondary task to assist the primary task, i.e. ABSIA. An improvement of 2 points is observed over the single task intensity model. To explain the actual decision process of the proposed framework, a model explainability technique is employed that extracts the important opinion terms responsible for generation. (Source code and the dataset have been made available at https://www.iitp.ac.in/~ai-nlp-ml/resources.html#ABSIA and https://github.com/20118/ABSIA )	Asif Ekbal, Mamta	IIT Patna
127	Effective Hierarchical Information Threading Using Network Community Detection	With the tremendous growth in the volume of information produced online every day (e.g. news articles), there is a need for automatic methods to identify related information about events as the events evolve over time (i.e., information threads). In this work, we propose a novel unsupervised approach, called HINT, which identifies coherent Hierarchical Information Threads. These threads can enable users to easily interpret a hierarchical association of diverse evolving information about an event or discussion. In particular, HINT deploys a scalable architecture based on network community detection to effectively identify hierarchical links between documents based on their chronological relatedness and answers to the 5W1H questions (i.e., who, what, where, when, why & how). On the NewSHead collection, we show that HINT markedly outperforms existing state-of-the-art approaches in terms of the quality of the identified threads. We also conducted a user study that shows that our proposed network-based hierarchical threads are significantly ($p < 0.05$) preferred by users compared to cluster-based sequential threads.	Graham McDonald, Hitarth Narvala, Iadh Ounis	Univ Glasgow, Glasgow, Scotland
128	The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer	Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. [8] proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a small parallel corpus to make embeddings of related words across languages similar to each other. They showed it to be effective in NLI for five European languages. In contrast we experiment with a topologically diverse set of languages (Spanish, Russian, Vietnamese, and Hindi) and extend their original implementations to new tasks (XSR, NER, and QA) and an additional training regime (continual learning). Our study reproduced gains in NLI for four languages, showed improved NER, XSR, and cross-lingual QA results in three languages (though some cross-lingual QA gains were not statistically significant), while mono-lingual QA performance never improved and sometimes degraded. Analysis of distances between contextualized embeddings of related and unrelated words (across languages) showed that fine-tuning leads to “forgetting” some of the cross-lingual alignment information. Based on this observation, we further improved NLI performance using continual learning. Our software is publicly available https://github.com/pefimov/cross-lingual-adjustment .	Elena Arslanova, Leonid Boytsov, Pavel Braslavski, Pavel Efimov	Bosch Center for Artificial Intelligence; ITMO University; Ural Federal University
129	InfEval: Application for Object Detection Analysis	Object Detection is one of the most fundamental and challenging areas in computer vision. A detailed analysis and evaluation is key to understanding the performance of custom Deep Learning models. In this contribution, we present an application which is able to run inference on custom data for models created in different machine learning frameworks (e.g. TensorFlow, PyTorch), visualize the output and evaluate it in detail. Both the Object Detection models and the data sets are uploaded and executed locally without leaving the application. Numerous filtering options, for instance filtering on mAP, on NMS or on IoU, are provided.	Kirill Bogomasov, Stefan Conrad, Tim Geuer	Heinrich Heine Univ, Univ Str 1, D-40225 Dusseldorf, Germany
130	SimpleRad: Patient-Friendly Dutch Radiology Reports	Patients increasingly have access to their electronic health records. However, much of the content therein is not specifically written for them; instead it captures communication about a patient’s situation between medical professionals. We present SimpleRad, a prototype application to explore patient-friendly explanations of radiology terminology. In this demonstration paper, we describe the various modules currently included in SimpleRad such as an entity linker, summarizer, search page, and observation frequency estimator.	Arjen P. de Vries, Bram van Ginneken, Koen Dercksen	Radboud Univ Nijmegen, Med Ctr, Nijmegen, Netherlands; Radboud Univ Nijmegen, Nijmegen, Netherlands
131	Continuous Integration for Reproducible Shared Tasks with TIRA.io	A major obstacle to the long-term impact of most shared tasks is their lack of reproducibility. Often only the test collections and the papers of the organizers and participants are published. Third parties who want to independently evaluate the state of the art for a task on other data must re-implement the participants' software. The tools developed to collect software from participants in shared tasks only partially verify its reliability at the time of submission, much less long-term, and do not enable third parties to reuse it later. We have overhauled the TIRA Integrated Research Architecture to address all of these issues. The new version simplifies task setup for organizers and software submission for participants, scales from a local computer to the cloud, supports on-demand resource allocation up to parallel CPU and GPU processing, and enables export for local reproduction with just a few lines of code. This is achieved by implementing the TIRA protocol with an industry-standard continuous integration and deployment (CI/CD) pipeline using Git, Docker, and Kubernetes.	Bastian Grahm, Benno Stein, Frank Loebe, Maik Fröbe, Martin Potthast, Matthias Hagen, Matti Wiegmann, Nikolay Kolyada, Theresa Elstner	Bauhaus Univ Weimar, Weimar, Germany; Friedrich Schiller Univ Jena, Jena, Germany; Univ Leipzig, Leipzig, Germany
132	Text2Storyline: Generating Enriched Storylines from Text	In recent years, the amount of information generated, consumed and stored has grown at an astonishing rate, making it difficult for those seeking information to extract knowledge in good time. This has become even more important, as the average reader is not as willing to spare more time out of their already busy schedule as in the past, thus prioritizing news in a summarized format, which are faster to digest. On top of that, people tend to increasingly rely on strong visual components to help them understand the focal point of news articles in a less tiresome manner. This growing demand, focused on exploring information through visual aspects, urges the need for the emergence of alternative approaches concerned with text understanding and narrative exploration. This motivated us to propose Text2Storyline, a platform for generating and exploring enriched storylines from an input text, a URL or a user query. The latter is to be issued on the Portuguese Web Archive (Arquivo.pt), therefore giving users the chance to expand their knowledge and build up on information collected from web sources of the past. To fulfill this objective, we propose a system that makes use of the Time-Matters algorithm to filter out non-relevant dates and organize relevant content by means of different displays: ‘Annotated Text’, ‘Entities’, ‘Storyline’, ‘Temporal Clustering’ and ‘Word Cloud’. To extend the users’ knowledge, we rely on entity linking to connect persons, events, locations and concepts found in the text to Wikipedia pages, a process also known as Wikification. Each of the entities is then illustrated by means of an image collected from the Arquivo.pt.	Alípio Jorge, Francisco Gonçalves, Ricardo Campos	LIAAD INESCTEC, Porto, Portugal; Univ Porto, FCUP, Porto, Portugal
133	Clustering Without Knowing How To: Application and Evaluation	Clustering plays a crucial role in data mining, allowing convenient exploration of datasets and new dataset bootstrapping. However, it requires knowing the distances between objects, which are not always obtainable due to the formalization complexity or criteria subjectivity. Such problems are more understandable to people, and therefore human judgements may be useful for this purpose. In this paper, we demonstrate a scalable crowdsourced system for image clustering, release its code at https://github.com/Toloka/crowdclustering under a permissive license, and also publish demo in an interactive Python notebook. Our experiments on two different image datasets, dresses from Zalando’s FEIDEGGER and shoes from the Toloka Shoes Dataset, confirm that one can yield meaningful clusters with no machine learning purely with crowdsourcing. In addition, these two cases show the usefulness of such an approach for domain-specific clustering process in fashion recommendation systems or e-commerce.	Daniil Fedulov, Daniil Likhobaba, Dmitry Ustalov	Toloka, Belgrade, Serbia
134	Enticing Local Governments to Produce FAIR Freedom of Information Act Dossiers	Government transparency is central in a democratic society, and increasingly governments at all levels are required to publish records and data either proactively, or upon so-called Freedom of Information (FIA) requests. However, public bodies who are required by law to publish many of their documents turn out to have great difficulty to do so. And what they publish often is in a format that still breaches the requirements of the law, stipulating principles comparable to the FAIR data principles. Hence, this demo is addressing a timely problem: the FAIR publication of FIA dossiers, which is obligatory in The Netherlands since May 1st 2022.	Filipp Perasedillo, Jaap Kamps, Maarten Marx, Maik Larooij	University of Amsterdam
135	Automatic Videography Generation from Audio Tracks	This paper describes a prototype of an automatic videography generation system. Given any YouTube video of a song, a set of images are retrieved corresponding to each line of the song which are automatically inserted and aligned into a video track.	Andrew Parker, Debasis Ganguly, Stergious Aji	University of Glasgow
136	Ablesbarkeitsmesser: A System for Assessing the Readability of German Text	While several approaches have been proposed for estimating the readability of English texts, there is much less work for other languages. In this paper, we present an online service, available at https://readability-check.org/ , that provides five well-established statistical methods and two machine learning models for measuring the readability of texts in German. For the machine learning methods, we train two BERT models. To bring all the measures together, we provide an interactive website that allows users to evaluate the readability of German texts at the sentence level. Our research can be useful for anyone who wants to know whether the text content at hand is easy or difficult and therefore can be used in certain situations or rather needs to be adapted and improved. In education, for example, it can help to assess the suitability of a particular teaching material for a particular grade.	Adam Jatowt, Florian Pickelmann, Michael Färber	Karlsruhe Inst Technol KIT, Karlsruhe, Germany; Univ Innsbruck, Innsbruck, Austria
137	FACADE: Fake Articles Classification and Decision Explanation	The daily use of social networks and the resulting dissemination of disinformation over those media have greatly contributed to the rise of the fake news phenomenon as a global problem. Several manual and automatic approaches are currently in place to try to tackle and defuse this issue, which is becoming nearly uncontrollable. In this paper, we propose Facade, a fake news detection system that aims to provide a complete solution for classifying news articles and explain the motivation behind every prediction. The system is designed with a cascading architecture composed of two classification pipelines dealing with either low-level or high-level descriptors, with the overall goal of achieving a consistent confidence score on each outcome. In addition, the system is equipped with an explainable user interface through which fact-checkers and content managers can visualise in detail the features leading to a certain prediction and have the possibility for manual cross-checking.	Erasmo Purificato, Ernesto William De Luca, Marcus Thiel, Saijal Shahania	Otto von Guericke University Magdeburg
138	PsyProf: A Platform for Assisted Screening of Depression in Social Media	Depression is one of the most prevalent mental disorders. For its effective treatment, patients need a quick and accurate diagnosis. Mental health professionals use self-report questionnaires to serve that purpose. These standardized questionnaires consider different depression symptoms in their evaluations. However, mental health stigmas heavily influence patients when filling out a questionnaire. In contrast, many people feel more at ease discussing their mental health issues on social media. This demo paper presents a platform for assisted examination and tracking of symptoms of depression for social media users. In order to bring a broader context, we have complemented our tool with user profiling. We show a platform that helps professionals with data labelling, relying on depression estimators and profiling models.	Anxo Pérez, Javier Parapar, Paloma PiotPerezAbadin, Álvaro Barreiro	Univ A Coruna, Informat Retrieval Lab, CITIC, Campus Elvina S-N, La Coruna 15071, Spain
139	Legal IR and NLP: The History, Challenges, and State-of-the-Art	Artificial Intelligence (AI), Machine Learning (ML), Information Retrieval (IR) and Natural Language Processing (NLP) are transforming the way legal professionals and law firms approach their work. The significant potential for the application of AI to Law, for instance, by creating computational solutions for legal tasks, has intrigued researchers for decades. This appeal has only been amplified with the advent of Deep Learning (DL). It is worth noting that working with legal text is far more challenging as compared to the other subdomains of IR/NLP, mainly due to the typical characteristics of legal text, such as considerably longer documents, complex language and lack of large-scale annotated datasets. In this tutorial, we introduce the audience to these characteristics of legal text, and with it, the challenges associated with processing the legal documents. We touch upon the history of AI and Law research, and how it has evolved over the years from relatively simpler approaches to more complex ones, such as those involving DL. We organize the tutorial as follows. First, we provide a brief introduction to state-of-the-art research in the general domain of IR and NLP. We then discuss in more detail IR/NLP tasks specific to the legal domain. We outline the methodologies (both from an academic and industry perspective), and the available tools and datasets to evaluate the methodologies. This is then followed by a hands-on coding/demo session.	Debasis Ganguly, Jack G. Conrad, Kripabandhu Ghosh, Paheli Bhattacharya, Pawan Goyal, Saptarshi Ghosh, Shounak Paul, Shubham Kumar Nigam	Indian Inst Sci Educ & Res Kolkata, Mohanpur, India; Indian Inst Technol Kanpur, Kanpur, Uttar Pradesh, India; Indian Inst Technol Kharagpur, Kharagpur, W Bengal, India; Thomson Reuters Labs, Minneapolis, MN USA; Univ Glasgow, Glasgow, Lanark, Scotland
140	Uncertainty Quantification for Text Classification	This full-day tutorial introduces modern techniques for practical uncertainty quantification specifically in the context of multi-class and multi-label text classification. First, we explain the usefulness of estimating aleatoric uncertainty and epistemic uncertainty for text classification models. Then, we describe several state-of-the-art approaches to uncertainty quantification and analyze their scalability to big text data: Virtual Ensemble in GBDT, Bayesian Deep Learning (including Deep Ensemble, Monte-Carlo Dropout, Bayes by Backprop, and their generalization Epistemic Neural Networks), Evidential Deep Learning (including Prior Networks and Posterior Networks), as well as Distance Awareness (including Spectral-normalized Neural Gaussian Process and Deep Deterministic Uncertainty). Next, we talk about the latest advances in uncertainty quantification for pre-trained language models (including asking language models to express their uncertainty, interpreting uncertainties of text classifiers built on large-scale language models, uncertainty estimation in text generation, calibration of language models, and calibration for in-context learning). After that, we discuss typical application scenarios of uncertainty quantification in text classification (including in-domain calibration, cross-domain robustness, and novel class detection). Finally, we list popular performance metrics for the evaluation of uncertainty quantification effectiveness in text classification. Practical hands-on examples/exercises are provided to the attendees for them to experiment with different uncertainty quantification methods on a few real-world text classification datasets such as CLINC150.	Bilyana TanevaPopova, Dell Zhang, Masoud Makrehchi, Murat Sensoy	Amazon Alexa AI, London, England; Kings Coll London, London, England; Thomson Reuters Labs, London, England; Thomson Reuters Labs, Toronto, ON, Canada; Thomson Reuters Labs, Zug, Switzerland
141	Geographic Information Extraction from Texts (GeoExT)	A large volume of unstructured texts, containing valuable geographic information, is available online. This information - provided implicitly or explicitly - is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although large progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss the recent advances, new ideas, and concepts but also identify research gaps in geographic information extraction.	Bernd Resch, Jens Kersten, Xuke Hu, Yingjie Hu	German Aerosp Ctr DLR, Inst Data Sci, Jena, Germany; Salzburg Univ, Dept Geoinformat, Salzburg, Austria; Univ Buffalo, Dept Geog, Buffalo, NY USA
142	Building Safe and Reliable AI Systems for Safety Critical Tasks with Vision-Language Processing	Although AI systems have been applied in various fields and achieved impressive performance, their safety and reliability are still a big concern. This is especially important for safety-critical tasks. One shared characteristic of these critical tasks is their risk sensitivity, where small mistakes can cause big consequences and even endanger life. There are several factors that could be guidelines for the successful deployment of AI systems in sensitive tasks: (i) failure detection and out-of-distribution (OOD) detection; (ii) overfitting identification; (iii) uncertainty quantification for predictions; (iv) robustness to data perturbations. These factors are also challenges of current AI systems, which are major blocks for building safe and reliable AI. Specifically, the current AI algorithms are unable to identify common causes for failure detection. Furthermore, additional techniques are required to quantify the quality of predictions. All these contribute to inaccurate uncertainty quantification, which lowers trust in predictions. Hence obtaining accurate model uncertainty quantification and its further improvement are challenging. To address these issues, many techniques have been proposed, such as regularization methods and learning strategies. As vision and language are the most typical data type and have many open source benchmark datasets, this thesis will focus on vision-language data processing for tasks like classification, image captioning, and vision question answering. In this thesis, we aim to build a safeguard by further developing current techniques to ensure the accurate model uncertainty for safety-critical tasks.	Shuang Ao	Open Univ, Walton Hall, Milton Keynes MK7 6AA, Bucks, England
143	Identifying and Representing Knowledge Delta in Scientific Literature	The process of continuously keeping up to date with the state-of-the-art on a specific research topic is a challenging task for researchers not least due to the rapid increase of published research. In this research proposal, we define the term Knowledge Delta (KD) between scientific articles which refers to the differences between pairs of research articles that are similar in some aspects. We propose a three-phase research methodology to identify and represent the KD between articles. We intend to explore the effect of applying different text representations on extracted facts from scientific articles on the downstream task of KD identification.	Alaa ElEbshihy	Res Studio Austria, Vienna, Austria
144	Disinformation Detection: Knowledge Infusion with Transfer Learning and Visualizations	The automatic detection of disinformation has gained an increased focus by the research community during the last years. The spread of false information can be an issue for political processes, opinion mining and journalism in general. In this dissertation, I propose a novel approach to gain new insights on the automatic detection of disinformation in textual content. Additionally, I will combine multiple research domains, such as fake news, hate speech, propaganda, and extremism. For this purpose, I will create two novel and annotated datasets in German - a large multi-label dataset for disinformation detection in news articles and a second dataset for hate speech detection in social media posts, which both can be used for training the models in the listed domains via transfer learning. With the usage of transfer learning, an extensive data analysis and classification of the presented domains will be conducted. The classification models will be enhanced during and after training using a knowledge graph, containing additional information (i.e. named entities, relationships, topics), to find explicit insights about the common traits or lines of disinformative arguments in an article. Lastly, methods of explainable artificial intelligence will be combined with visualization techniques to understand the models predictions and present the results in a user-friendly and interactive way.	Mina Schütz	Darmstadt University of Applied Sciences
145	A Comprehensive Overview of Consumer Conflicts on Social Media	The use of social media platforms is increasingly prevalent in society, providing brands with a multitude of opportunities to interact with consumers. However, literature has shown this increased usage has negative impacts for users who have experienced depression, anxiety, and stress and brands who see increasing volumes of hate within their communities such as bullying, conflicts, complaints, and harmful content. Existing research focuses on extreme forms of conflict, largely ignoring the lesser forms which still pose a significant threat to consumer and brand welfare. This research aims to capture the full spectrum of online conflict, providing a comprehensive overview of the problem from an interdisciplinary marketing and computer science perspective. I propose a further investigation into online hate, utilising big data analysis to establish an understanding of triggers, consequences and brand responses to online hate. Initially, I will conduct a systematic literature review exploring the definitions and methodology used within the hate research domain. Secondly, I will conduct an investigation into state-of-the-art models and classification systems, producing an analysis on the prevalence of hate and its various forms on social media. Finally, I plan to establish the features of social media data which constitute triggers for online conflicts. Then, through a combination of user studies, sentiment analysis, and emotion detection I will examine the consequences of these conflicts. This project represents a unique opportunity to combine cutting edge marketing theories with big data analysis, this collaborative approach will offer a considerable contribution to academic literature.	Oliver Warke	Univ Glasgow, Sch Comp Sci, Glasgow, Scotland
146	iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge	Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases characterized by progressive or alternate impairment of neurological functions (motor, sensory, visual, cognitive). Patients have to manage alternated periods in hospital with care at home, experiencing a constant uncertainty regarding the timing of the disease acute phases and facing a considerable psychological and economic burden that also involves their caregivers. Clinicians, on the other hand, need tools able to support them in all the phases of the patient treatment, suggest personalized therapeutic decisions, indicate urgently needed interventions.	Adriano Chiò, Arianna Dagliati, Barbara Di Camillo, Eleonora Tavazzi, Helena Aidos, Jose Manuel García Dominguez, Mamede de Carvalho, Nicola Ferro, Paola Cavalla, Piero Fariselli, Roberto Bergamaschi, Sara C. Madeira	Citta Salute & Sci, Turin, Italy; Gregorio Maranon Hosp Madrid, Madrid, Spain; IRCCS Fdn C Mondino Pavia, Pavia, Italy; Univ Lisbon, Lisbon, Portugal; Univ Padua, Padua, Italy; Univ Pavia, Pavia, Italy; Univ Turin, Turin, Italy
147	LongEval: Longitudinal Evaluation of Model Performance at CLEF 2023	In this paper, we describe the plans for the first LongEval CLEF 2023 shared task dedicated to evaluating the temporal persistence of Information Retrieval (IR) systems and Text Classifiers. The task is motivated by recent research showing that the performance of these models drops as the test data becomes more distant, with respect to time, from the training data. LongEval differs from traditional shared IR and classification tasks by giving special consideration to evaluating models aiming to mitigate performance drop over time. We envisage that this task will draw attention from the IR community and NLP researchers to the problem of temporal persistence of models, what enables or prevents it, potential solutions and their limitations.	Alaa ElEbshihy, Arkaitz Zubiaga, Christophe Servan, Daniel Loureiro, Elena Kochkina, Florina Piroi, Gabriela González Sáez, Harish Tayyar Madabushi, Hsubhas Borkakoty, Iman Munire Bilal, José CamachoCollados, Lorraine Goeuriot, Luis Espinosa Anke, Maria Liakata, Martin Popel, Petra Galuscáková, Philippe Mulhem, Rabab Alkhalifa, Romain Deveaud	Cardiff University; Charles University; Queen Mary University of London; Qwant; Research Studios Austria, Data Science Studio; Univ. Grenoble Alpes, CNRS, Grenoble INP, Institute of Engineering Univ. Grenoble Alpes., LIG; University of Bath; University of Warwick
148	Science for Fun: The CLEF 2023 JOKER Track on Automatic Wordplay Analysis	Understanding and translating humorous wordplay often requires recognition of implicit cultural references, knowledge of word formation processes, and discernment of double meanings - issues which pose challenges for humans and computers alike. This paper introduces the CLEF 2023 JOKER track, which takes an interdisciplinary approach to the creation of reusable test collections, evaluation metrics, and methods for the automatic processing of wordplay. We describe the track's interconnected shared tasks for the detection, location, interpretation, and translation of puns. We also describe associated data sets and evaluation methodologies, and invite contributions making further use of our data.	Adam Jatowt, AnneGwenn Bosser, Grigori Sidorov, Liana Ermakova, Tristan Miller, Victor Manuel PalmaPreciado	Austrian Res Inst Artificial Intelligence OFAI, Vienna, Austria; Ecole Natl Ingenieurs Brest, Lab STICC CNRS UMR 6285, Plouzane, France; Inst Politecn Nacl IPN, Ctr Invest Computac CIC, Mexico City, Mexico; Univ Bretagne Occidentale, HCTI, Brest, France; Univ Innsbruck, Innsbruck, Austria
149	LifeCLEF 2023 Teaser: Species Identification and Prediction Challenges	Building accurate knowledge of the identity, the geographic distribution and the evolution of species is essential for the sustainable development of humanity, as well as for biodiversity conservation. However, the difficulty of identifying plants, animals and fungi is hindering the aggregation of new data and knowledge. Identifying and naming living organisms is almost impossible for the general public and is often difficult, even for professionals and naturalists. Bridging this gap is a key step towards enabling effective biodiversity monitoring systems. The LifeCLEF campaign, presented in this paper, has been promoting and evaluating advances in this domain since 2011. The 2023 edition proposes five data-oriented challenges related to the identification and prediction of biodiversity: (i) PlantCLEF: very large-scale plant identification from images, (ii) BirdCLEF: bird species recognition in audio soundscapes, (iii) GeoLifeCLEF: remote sensing based prediction of species, (iv) SnakeCLEF: snake recognition in medically important scenarios, and (v) FungiCLEF: fungi recognition beyond 0-1 cost.	Alexis Joly, Andrew Durso, Benjamin Kellenberger, Christophe Botella, Diego Marcos, Elijah Cole, Henning Müller, Hervé Glotin, Hervé Goëau, Holger Klinck, Ivan Eggel, Lukás Picek, Marek Hrúz, Maximilien Servajean, Milan Sulc, Pierre Bonnet, Robert Planqué, Sara Si Moussi, Stefan Kahl, Titouan Lorieul, Tom Denton, WillemPier Vellinga	Aix Marseille Univ, Univ Toulon, CNRS, LIS,DYNI Team, Marseille, France; CIRAD, UMR AMAP, Montpellier, France; Caltech, Dept Comp & Math Sci, Pasadena, CA USA; Cornell Univ, Cornell Lab Ornithol, KLYCCB, Ithaca, NY USA; Florida Gulf Coast Univ, Dept Biol Sci, Ft Myers, FL USA; Google LLC, San Francisco, CA USA; HES SO, Sierre, Switzerland; Rossum Ai, Prague, Czech Republic; Univ Hawaii Hilo, Listening Observ Hawaiian Ecosyst, Hilo, HI USA; Univ Montpellier, CNRS, Inria, LIRMM, Montpellier, France; Univ Montpellier, LIRMM, AMI, Univ Paul Valery Montpellier,CNRS, Montpellier, France; Univ West Bohemia, Dept Cybernet, FAV, Plzen, Czech Republic; Xeno Canto Fdn, The Hague, Netherlands
150	eRisk 2023: Depression, Pathological Gambling, and Eating Disorder Challenges	In 2017, we launched eRisk as a CLEF Lab to encourage research on early risk detection on the Internet. Since then, thanks to the participants' work, we have developed detection models and datasets for depression, anorexia, pathological gambling and self-harm. In 2023, it will be the seventh edition of the lab, where we will present a new type of task on sentence ranking for depression symptoms. This paper outlines the work that we have done to date, discusses key lessons learned in previous editions, and presents our plans for eRisk 2023.	David E. Losada, Fabio Crestani, Javier Parapar, Patricia Martín-Rodilla	Univ A Coruna, Informat Retrieval Lab, Ctr Invest Tecnol Informac & Comunicac CITIC, La Coruna, Spain; Univ Santiago de Compostela, Ctr Singular Invest Tecnol Intelixentes CiTIUS, Santiago, Spain; Univ Svizzera Italiana USI, Fac Informat, Lugano, Switzerland
151	Overview of EXIST 2023: sEXism Identification in Social NeTworks	The paper describes the lab on Sexism identification in social networks (EXIST 2023) that will be hosted as a lab at the CLEF 2023 conference. The lab consists of three tasks, two of which are continuation of EXIST 2022 (sexism detection and sexism categorization) and a third and novel one on source intention identification. For this edition new test and training data will be provided and some novelties are introduced in order to tackle two central problems of Natural Language Processing (NLP): bias and fairness. Firstly, the sampling and data gathering process will take into account different sources of bias in data: seed, temporal and user bias. During the annotation process we will also consider some sources of "label bias" that come from the social and demographic characteristics of the annotators. Secondly, we will adopt the "learning with disagreements" paradigm by providing datasets containing also pre-aggregated annotations, so that systems can make use of this information to learn from different perspectives. The general goal of the EXIST shared tasks is to advance the state of the art in online sexism detection and categorization, as well as investigating to what extent bias can be characterized in data and whether systems may take fairness decisions when learning from multiple annotations.	Damiano Spina, Enrique Amigó, Jorge Carrillo-de-Albornoz, Julio Gonzalo, Laura Plaza, Paolo Rosso, Roser Morante	RMIT University; Universidad Nacional de Educación a Distancia (UNED); Universidad Politécnica de Valencia (UPV)
152	Extractive Summarization of Financial Earnings Call Transcripts - Or: When GREP Beat BERT	To date, automatic summarization methods have been mostly developed for (and applied to) general news articles, whereas other document types have been neglected. In this paper, we introduce the task of summarizing financial earnings call transcripts, and we present a method for summarizing this text type essential for the financial industry. Earnings calls are briefing events common for public companies in many countries, typically in the form of conference calls held between company executives and analysts that consist of a spoken monologue part followed by moderated questions and answers. We show that traditional methods work less well in this domain, we present a method suitable for summarizing earnings calls. Our large-scale evaluation on a new human-annotated corpus of summary-worthy sentences shows that this method outperforms a set of strong baselines, including a new one that we propose specifically for earnings calls. To the best of our knowledge, this is the first application of summarization to financial earnings calls transcripts, a primary source of information for financial professionals.	George Gkotsis, Jochen L. Leidner, Timothy Nugent	Coburg Univ Appl Sci, Friedrich Streib Str 2, D-96450 Coburg, Germany; GSR Markets, London, England; Kailua Labs, Patras, Greece
153	Multivariate Powered Dirichlet-Hawkes Process	The publication time of a document carries relevant information about its semantic content. The Dirichlet-Hawkes process has been proposed to jointly model textual information and publication dynamics. This approach has been used with success in several recent works, and extended to tackle specific challenging problems, typically for short texts or entangled publication dynamics. However, the prior in its current form does not allow for complex publication dynamics. In particular, inferred topics are independent from each other: a publication about finance is assumed to have no influence on publications about politics, for instance. In this work, we develop the Multivariate Powered Dirichlet-Hawkes Process (MPDHP), that alleviates this assumption. Publications about various topics can now influence each other. We detail and overcome the technical challenges that arise from considering interacting topics. We conduct a systematic evaluation of MPDHP on a range of synthetic datasets to define its application domain and limitations. Finally, we develop a use case of the MPDHP on Reddit data. At the end of this article, the interested reader will know how and when to use MPDHP, and when not to.	Gaël Poux-Médard, Julien Velcin, Sabine Loudcher	Université de Lyon, Lyon 2, ERIC UR 3083
154	DocILE 2023 Teaser: Document Information Localization and Extraction	The lack of data for information extraction (IE) from semi-structured business documents is a real problem for the IE community. Publications relying on large-scale datasets use only proprietary, unpublished data due to the sensitive nature of such documents. Publicly available datasets are mostly small and domain-specific. The absence of a large-scale public dataset or benchmark hinders the reproducibility and cross-evaluation of published methods. The DocILE 2023 competition, hosted as a lab at the CLEF 2023 conference and as an ICDAR 2023 competition, will run the first major benchmark for the tasks of Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from business documents. With thousands of annotated real documents from open sources, a hundred thousand of generated synthetic documents, and nearly a million unlabeled documents, the DocILE lab comes with the largest publicly available dataset for KILE and LIR. We are looking forward to contributions from the Computer Vision, Natural Language Processing, Information Retrieval, and other communities. The data, baselines, code and up-to-date information about the lab and competition are available at https://docile.rossum.ai/.	Ahmed Hamdi, Matyás Skalický, Milan Sulc, Stepán Simsa, Yash Patel	Czech Tech Univ, Visual Recognit Grp, Prague, Czech Republic; Rossum Ai, Prague, Czech Republic; Univ La Rochelle, La Rochelle, France
155	Knowing What and How: A Multi-modal Aspect-Based Framework for Complaint Detection	With technological advancements, the proliferation of e-commerce websites and social media platforms has created an avenue for customers to provide feedback to enterprises based on their overall experience. Customer feedback serves as an independent validation tool that could boost consumer trust in the brand. Whether it is a recommendation or review of a product, it provides insight allowing businesses to understand what they are doing right or wrong. By automatically analyzing customer complaints at the aspect-level enterprises can connect to their customers by customizing products and services according to their needs quickly and deftly. In this paper, we introduce the task of Aspect-Based Complaint Detection (ABCD). ABCD identifies the aspects in the given review about a product and also finds if the aspect mentioned in the review signifies a complaint or non-complaint. Specifically, a task solver must detect duplets (What, How) from the inputs that show WHAT the targeted features are and HOW they are complaints. To address this challenge, we propose a deep-learning-based multi-modal framework, where the first stage predicts what the targeted aspects are, and the second stage categorizes whether the targeted aspect is associated with a complaint or not. We annotate the aspect categories and associated complaint/non-complaint labels in the recently released multi-modal complaint dataset (CESAMARD), which spans five domains (books, electronics, edibles, fashion, and miscellaneous). Based on extensive evaluation our methodology established a benchmark performance in this novel aspect-based complaint detection task and also surpasses a few strong baselines developed from state-of-the-art related methods (Resources available at: https://github.com/appy1608/ECIR2023_Complaint-Detection ).	Apoorva Singh, Shubham Sharma, Sriparna Saha, Vivek Kumar Gangwar	Indian Institute of Technology Patna; Panjab University
156	What Is Your Cause for Concern? Towards Interpretable Complaint Cause Analysis	The abundance of information available on social media and the regularity with which complaints are posted online emphasizes the need for automated complaint analysis tools. Prior study has focused chiefly on complaint identification and complaint severity prediction: the former attempts to classify a piece of content as either complaint or non-complaint. The latter seeks to group complaints into various severity classes depending on the threat level that the complainant is prepared to accept. The complainant’s goal could be to express disapproval, seek compensation, or both. As a result, the complaint detection model should be interpretable or explainable. Recognizing the cause of a complaint in the text is a crucial yet untapped area of natural language processing research. We propose an interpretable complaint cause analysis model that is grounded on a dyadic attention mechanism. The model jointly learns complaint classification, emotion recognition, and polarity classification as the first sub-problem. Subsequently, the complaint cause extraction and the associated severity level prediction as the second sub-problem. We add the causal span annotation for the existing complaint classes in a publicly available complaint dataset to accomplish this. The results indicate that existing computational tools can be repurposed to tackle highly relevant novel tasks, thereby finding new research opportunities (Resources available at: https://bit.ly/Complaintcauseanalysis ).	Apoorva Singh, Prince Jha, Rohan Bhatia, Sriparna Saha	Indian Inst Technol Patna, Bihta, India
157	Towards Effective Paraphrasing for Information Disguise	Information Disguise (ID), a part of computational ethics in Natural Language Processing (NLP), is concerned with best practices of textual paraphrasing to prevent the non-consensual use of authors’ posts on the Internet. Research on ID becomes important when authors’ written online communication pertains to sensitive domains, e.g., mental health. Over time, researchers have utilized AI-based automated word spinners (e.g., SpinRewriter, WordAI) for paraphrasing content. However, these tools fail to satisfy the purpose of ID as their paraphrased content still leads to the source when queried on search engines. There is limited prior work on judging the effectiveness of paraphrasing methods for ID on search engines or their proxies, neural retriever (NeurIR) models. We propose a framework where, for a given sentence from an author’s post, we perform iterative perturbation on the sentence in the direction of paraphrasing with an attempt to confuse the search mechanism of a NeurIR system when the sentence is queried on it. Our experiments involve the subreddit “r/AmItheAsshole” as the source of public content and Dense Passage Retriever as a NeurIR system-based proxy for search engines. Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search. Our multi-phrase substitution scheme succeeds in disguising sentences 82% of the time and hence takes an essential step towards enabling researchers to disguise sensitive content effectively before making it public. We also release the code of our approach (https://github.com/idecir/idecir-Towards-Effective-Paraphrasing-for-Information-Disguise).	Anmol Agarwal, Joseph Reagle, Manas Gaur, Ponnurangam Kumaraguru, Shrey Gupta, Vamshi Bonagiri	Int Inst Informat Technol, Hyderabad, India; Northeastern Univ, Boston, MA USA; Univ Maryland, Baltimore, MD USA
158	Generating Topic Pages for Scientific Concepts Using Scientific Publications	In this paper, we describe Topic Pages, an inventory of scientific concepts and information around them extracted from a large collection of scientific books and journals. The main aim of Topic Pages is to provide all the necessary information to the readers to understand scientific concepts they come across while reading scholarly content in any scientific domain. Topic Pages are a collection of automatically generated information pages using NLP and ML, each corresponding to a scientific concept. Each page contains three pieces of information: a definition, related concepts, and the most relevant snippets, all extracted from scientific peer-reviewed publications. In this paper, we discuss the details of different components to extract each of these elements. The collection of pages in production contains over 360,000 Topic Pages across 20 different scientific domains with an average of 23 million unique visits per month, constituting it a popular source for scientific information.	George Tsatsaronis, Hosein Azarbonyad, Zubair Afzal	Elsevier
159	Topic Refinement in Multi-level Hate Speech Detection	Hate speech detection is quite a hot topic in NLP and various annotated datasets have been proposed, most of them using binary generic (hateful vs. non-hateful) or finer-grained specific (sexism/racism/etc.) annotations, to account for particular manifestations of hate. We explore in this paper how to transfer knowledge across both different manifestations, and different granularity or levels of hate speech annotations from existing datasets, relying for the first time on a multilevel learning approach which we can use to refine generically labelled instances with specific hate speech labels. We experiment with an easily extensible Text-to-Text approach, based on the T5 architecture, as well as a combination of transfer and multitask learning. Our results are encouraging and constitute a first step towards automatic annotation of hate speech datasets, for which only some or no fine-grained annotations are available.	Farah Benamara, Patricia Chiril, Tom Bourgeade, Véronique Moriceau	Univ Chicago, Chicago, IL USA; Univ Toulouse, CNRS, IRIT, UT3, Toulouse, France
160	Adversarial Adaptation for French Named Entity Recognition	Named Entity Recognition (NER) is the task of identifying and classifying named entities in large-scale texts into predefined classes. NER in French and other relatively limited-resource languages cannot always benefit from approaches proposed for languages like English due to a dearth of large, robust datasets. In this paper, we present our work that aims to mitigate the effects of this dearth of large, labeled datasets. We propose a Transformer-based NER approach for French, using adversarial adaptation to similar domain or general corpora to improve feature extraction and enable better generalization. Our approach allows learning better features using large-scale unlabeled corpora from the same domain or mixed domains to introduce more variations during training and reduce overfitting. Experimental results on three labeled datasets show that our adaptation framework outperforms the corresponding non-adaptive models for various combinations of Transformer models, source datasets, and target corpora. We also show that adversarial adaptation to large-scale unlabeled corpora can help mitigate the performance dip incurred on using Transformer models pre-trained on smaller corpora.	Aaryan Gupta, Arjun Choudhry, Dinesh Kumar Vishwakarma, Inder Khatri, Marie-Jean Meurs, Maxime Nicol, Pankaj Gupta	Delhi Technol Univ, Biometr Res Lab, New Delhi, India; Univ Quebec Montreal, IKB Lab, Montreal, PQ, Canada
161	Justifying Multi-label Text Classifications for Healthcare Applications	The healthcare domain is a very active area of research for Natural Language Processing (NLP). The classification of medical records according to codes from the International Classification of Diseases (ICD) is an essential task in healthcare. As a very sensitive application, the automatic classification of personal medical records cannot be immediately trusted without human approval. As such, it is desirable for classification models to provide reasons for each decision, such that the medical coder can validate model predictions without reading the entire document. AttentionXML is a multi-label classification model that has shown high applicability for this task and can provide attention distributions for each predicted label. In practice, we have found that these distributions do not always provide relevant spans of text. We propose a simple yet effective modification to AttentionXML for finding spans of text that can better aid the medical coders: splitting the BiLSTM of AttentionXML into a forward and a backward LSTM, creating two attention distributions that find the leftmost and rightmost limits of the text spans. We also propose a novel metric for the usefulness of our model’s suggestions by computing the drop in confidence from masking out the selected text spans. We show that our model has a similar classification performance to AttentionXML while surpassing it in obtaining relevant text spans.	Afonso Mendes, Gonçalo M. Correia, João Figueira, Michalina Strzyz	Priberam Labs, Lisbon, Portugal
162	Towards Quantifying the Privacy of Redacted Text	In this paper we propose use of a k-anonymity-like approach for evaluating the privacy of redacted text. Given a piece of redacted text we use a state of the art transformer-based deep learning network to reconstruct the original text. This generates multiple full texts that are consistent with the redacted text, i.e. which are grammatical, have the same non-redacted words etc., and represents each of these using an embedding vector that captures sentence similarity. In this way we can estimate the number, diversity and quality of full text consistent with the redacted text and so evaluate privacy.	Douglas J. Leith, Vaibhav Gusain	Trinity Coll Dublin, Dublin, Ireland
163	Detecting Stance of Authorities Towards Rumors in Arabic Tweets: A Preliminary Study	A myriad of studies addressed the problem of rumor verification in Twitter by either utilizing evidence from the propagation networks or external evidence from the Web. However, none of these studies exploited evidence from trusted authorities. In this paper, we define the task of detecting the stance of authorities towards rumors in tweets, i.e., whether a tweet from an authority agrees, disagrees, or is unrelated to the rumor. We believe the task is useful to augment the sources of evidence utilized by existing rumor verification systems. We construct and release the first Authority STance towards Rumors (AuSTR) dataset, where evidence is retrieved from authority timelines in Arabic Twitter. Due to the relatively limited size of our dataset, we study the usefulness of existing datasets for stance detection in our task. We show that existing datasets are somewhat useful for the task; however, they are clearly insufficient, which motivates the need to augment them with annotated data constituting stance of authorities from Twitter.	Fatima Haouari, Tamer Elsayed	Qatar University
164	Dirichlet-Survival Process: Scalable Inference of Topic-Dependent Diffusion Networks	Information spread on networks can be efficiently modeled by considering three features: documents’ content, time of publication relative to other publications, and position of the spreader in the network. Most previous works model up to two of those jointly, or rely on heavily parametric approaches. Building on recent Dirichlet-Point processes literature, we introduce the Houston (Hidden Online User-Topic Network) model, that jointly considers all those features in a non-parametric unsupervised framework. It infers dynamic topic-dependent underlying diffusion networks in a continuous-time setting along with said topics. It is unsupervised; it considers an unlabeled stream of triplets shaped as (time of publication, information’s content, spreading entity) as input data. Online inference is conducted using a sequential Monte-Carlo algorithm that scales linearly with the size of the dataset. Our approach yields consequent improvements over existing baselines on both cluster recovery and subnetworks inference tasks.	Gaël Poux-Médard, Julien Velcin, Sabine Loudcher	Univ Lyon, ERIC UR 3083, Lyon 2, 5 Ave Pierre Mendes France, F-69676 Bron, France
165	Consumer Health Question Answering Using Off-the-Shelf Components	In this paper, we address the task of open-domain health question answering (QA). The quality of existing QA systems heavily depends on the annotated data that is often difficult to obtain, especially in the medical domain. To tackle this issue, we opt for PubMed and Wikipedia as trustworthy document collections to retrieve evidence. The questions and retrieved passages are passed to off-the-shelf question answering models, whose predictions are then aggregated into a final score. Thus, our proposed approach is highly data-efficient. Evaluation on 113 health-related yes/no question and answer pairs demonstrates good performance achieving AUC of 0.82.	Alexander Bondarenko, Alexander Pugachev, Ekaterina Artemova, Pavel Braslavski	Friedrich-Schiller-Universität Jena; HSE University; LMU Munich
166	MOO-CMDS+NER: Named Entity Recognition-Based Extractive Comment-Oriented Multi-document Summarization	In this work, we propose an unsupervised extractive summarization framework for generating good quality summaries which are supplemented by the comments posted by the end-users. Using the evolutionary multi-objective optimization concept, different objective functions for assessing the quality of a summary, like diversity and the relevance of sentences in relation to comments, are optimized simultaneously. In the literature, named entity recognition (NER) has been shown to be useful in the summarization process. The current work is the first of its kind where we have introduced a new objective function that utilizes the concept of NER in news documents and user comments to score the news sentences. To test how well the new objective function works, different combinations of the NER-based objective function with already existing objective functions were tested on the English and French datasets using ROUGE 1, 2, and SU4 F1-scores. We have also investigated the abstractive and compressive summarization approaches for our comparative analysis. The code of the proposed work is available at the github repository https://github.com/vishalsinghroha/Unsupervised-Comment-based-Multi-document-Extractive-Summarization .	José G. Moreno, Naveen Saini, Sriparna Saha, Vishal Singh Roha	Indian Inst Informat Technol, Lucknow, Uttar Pradesh, India; Indian Inst Technol Patna, Patna, Bihar, India; Univ La Rochelle, L3i, F-17000 La Rochelle, France
167	Evaluating Humorous Response Generation to Playful Shopping Requests	AI assistants are gradually becoming embedded in our lives, utilized for everyday tasks like shopping or music. In addition to the everyday utilization of AI assistants, many users engage them with playful shopping requests, gauging their ability to understand - or simply seeking amusement. However, these requests are often not being responded to in the same playful manner, causing dissatisfaction and even trust issues. In this work, we focus on equipping AI assistants with the ability to respond in a playful manner to irrational shopping requests. We first evaluate several neural generation models, which lead to unsuitable results - showing that this task is non-trivial. We devise a simple, yet effective, solution, that utilizes a knowledge graph to generate template-based responses grounded with commonsense. While the commonsense-aware solution is slightly less diverse than the generative models, it provides better responses to playful requests. This emphasizes the gap in commonsense exhibited by neural language models.	Alex Libov, Chen Shani, Natalie Shapira, Oren Kalinsky, Sofia Tolmach	Amazon Sci, Tel Aviv, Israel; Bar Ilan Univ, Ramat Gan, Israel; Hebrew Univ Jerusalem, Jerusalem, Israel
168	Joint Span Segmentation and Rhetorical Role Labeling with Data Augmentation for Legal Documents	Segmentation and Rhetorical Role Labeling of legal judgements play a crucial role in retrieval and adjacent tasks, including case summarization, semantic search, argument mining etc. Previous approaches have formulated this task either as independent classification or sequence labeling of sentences. In this work, we reformulate the task at span level as identifying spans of multiple consecutive sentences that share the same rhetorical role label to be assigned via classification. We employ semi-Markov Conditional Random Fields (CRF) to jointly learn span segmentation and span label assignment. We further explore three data augmentation strategies to mitigate the data scarcity in the specialized domain of law where individual documents tend to be very long and annotation cost is high. Our experiments demonstrate improvement of span-level prediction metrics with a semi-Markov CRF model over a CRF baseline. This benefit is contingent on the presence of multi sentence spans in the document.	Matthias Grabmair, Philipp Bock, T. Y. S. S. Santosh	Technical University of Munich
169	Capturing Cross-Platform Interaction for Identifying Coordinated Accounts of Misinformation Campaigns	Disinformation campaigns on social media, involving coordinated activities from malicious accounts towards manipulating public opinion, have become increasingly prevalent. There has been growing evidence of social media abuse towards influencing politics and social issues in other countries, raising numerous concerns. The identification and prevention of coordinated campaigns has become critical to tackling disinformation at its source. Existing approaches to detect malicious campaigns make strict assumptions about coordinated behaviours, such as malicious accounts perform synchronized actions or share features assumed to be indicative of coordination. Others require part of the malicious accounts in the campaign to be revealed in order to detect the rest. Such assumptions significantly limit the effectiveness of existing approaches. In contrast, we propose AMDN (Attentive Mixture Density Network) to automatically uncover coordinated group behaviours from account activities and interactions between accounts, based on temporal point processes. Furthermore, we leverage the learned model to understand and explain the behaviours of coordinated accounts in disinformation campaigns. We find that the average influence between coordinated accounts is the highest, whereas these accounts are not much influenced by regular accounts. We evaluate the effectiveness of the proposed method on Twitter data related to Russian interference in US Elections. Additionally, we identify disinformation campaigns in COVID-19 data collected from Twitter, and provide the first evidence and analysis of existence of coordinated disinformation campaigns in the ongoing pandemic.	Karishma Sharma, Yan Liu, Yizhou Zhang	Amazon, Sunnyvale, CA 94089 USA; Univ Southern Calif, Los Angeles, CA 90007 USA