RECSYS2025
November 8, 2025
Conference paper list
This conference has 226 papers in total.
| No. | Title | Rating | Abstract | Authors | Link | Recommendation rationale | Organization |
|---|---|---|---|---|---|---|---|
| 1 | Scaling Retrieval for Web-Scale Recommenders: Lessons from Inverted Indexes to Embedding Search | 0 | Web-scale search and recommendation systems depend on efficient retrieval to manage massive datasets and user traffic. This paper chronicles our evolutionary path in building the retrieval layer at LinkedIn, progressing from a CPU-based inverted index system to a GPU-accelerated embedding-based retrieval system. Initially anchored by traditional term-based retrieval, we enhanced relevance and productivity through learning-to-retrieve approaches by generating mappings among inferred attributes. As these early efforts encountered limitations in inferring and matching attributes at scale, we transitioned to embedding-based retrieval for greater flexibility and performance, but found that existing infrastructure couldn’t support large-scale production needs. This led us to develop a GPU-based retrieval system designed for high performance, flexible modeling, and multi-objective business optimization. We present the infrastructure innovations, optimizations, and key lessons learned throughout this transition, offering practical insights for building scalable, flexible retrieval systems. | Caleb Johnson, Jianqiang Shen, Liangjie Hong, Luke Simon, Qianqi Shen, Shaobo Zhang, Wenjing Zhang, Yuchin Juan | |||
| 2 | Contrastive Conditional Embeddings for Item-based Recommendation at E-commerce Scale | 0 | Item-based recommendation is crucial in e-commerce for helping users navigate the myriad of options available to them. While embedding-based methods are standard, learning high-quality item representations from sparse co-occurrence data is challenging. Deployment at scale is even harder, with a lack of well-documented real-world successes. The two main obstacles are the model size, which scales linearly with the number of items, and the co-occurrence-based training data, which is massive and sparse, leading to significant memory, storage, and compute demands. In this work, we propose a conditional factor model combining item co-occurrences and textual information to generate effective embeddings through a contrastive loss with mixed negative sampling for e-commerce recommendations. Our production model exceeds 10 billion parameters–half trainable daily on over 2 billion item-item co-occurrence pairs. We detail key implementation choices that allowed us to overcome the above challenges and successfully deploy the model on Rakuten Group, Inc’s large-scale e-commerce platform in Japan. A/B tests show strong impact, with purchase rate gains of +16.38% and +4.01% across two major recommendation widgets. | Aghiles Salah, Akira Fukumoto, Alexandru Tatar, Lee Xiong, Sarthak Shrivastava, Vincent Michel, Yannick Schwartz | |||
| 3 | Beyond Immediate Click: Engagement-Aware and MoE-Enhanced Transformers for Sequential Movie Recommendation | 0 | Modern video streaming services heavily rely on recommender systems. Although there are many methods for content personalization and recommendation, sequential recommendation models stand out due to their ability to summarize user behavior over time. We propose a novel sequential recommendation framework to address the following key issues: suboptimal negative sampling strategies, fixed user-history context lengths, single-task optimization objectives, insufficient engagement-aware learning, and short-sighted prediction horizons, ultimately improving both immediate and multi-step next-title prediction for video streaming services. In this work, we propose a novel approach to capture patterns of interaction at different time scales. We also align long-term user happiness with instantaneous intent signals using multi-task learning with engagement-aware personalized loss. Finally, we extend traditional next-item prediction into a next-K forecasting task using a training strategy with soft positive labels. Extensive experiments on large-scale streaming data validate the effectiveness of our approach. Our best model outperforms the baseline in NDCG@1 by up to 3.52% under realistic ranking scenarios, showing the effectiveness of our engagement-aware and MoE-enhanced designs. Results also show that soft-label Multi-K training is a practical and scalable extension, and that a balanced personalized negative sampling strategy generalizes well. Our framework outperforms baselines across all ranking metrics, providing a robust solution for production-scale streaming recommendations. | Caren Chen, Haiyang Zhang, Haotian Jiang, Sibendu Paul | |||
| 4 | GenSAR: Unifying Balanced Search and Recommendation with Generative Retrieval | 0 | Many commercial platforms provide both search and recommendation (S&R) services to meet different user needs. This creates an opportunity for joint modeling of S&R. Although many joint S&R studies have demonstrated the advantages of integrating S&R, they have also identified a trade-off between the two tasks. That is, when recommendation performance improves, search performance may decline, or vice versa. This trade-off stems from the different information requirements: search prioritizes the semantic relevance between the queries and the items, while recommendation heavily relies on the collaborative relationship between users and items. To balance semantic and collaborative information and mitigate this trade-off, two main challenges arise: (1) How to incorporate both semantic and collaborative information in item representations. (2) How to train the model to understand the different information requirements of S&R. The recent rise of generative retrieval based on Large Language Models (LLMs) for S&R offers a potential solution. Generative retrieval represents each item as an identifier, allowing us to assign multiple identifiers to each item to capture both semantic and collaborative information. Additionally, generative retrieval formulates both S&R as sequence-to-sequence tasks, enabling us to unify different tasks through varied prompts, thereby helping the model better understand the requirements of each task. Based on this, we propose GenSAR, a method that unifies balanced S&R through generative retrieval. We design joint S&R identifiers and training tasks to address the above challenges, mitigate the trade-off between S&R, and further improve both tasks. Experimental results on a public dataset and a commercial dataset validate the effectiveness of GenSAR. | Enyun Yu, Jun Xu, Kai Zheng, Teng Shi, Xiao Zhang, Xiaoxue Zang, Yang Song | |||
| 5 | Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search | 0 | Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest task and domain-specific fine-tuning are needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can guarantee strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to a superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models. To ensure reproducibility, we provide our repository at https://split.to/gte4ps. | Alessandro De Bellis, Claudio Pomo, Dietmar Jannach, Eugenio Di Sciascio, Matteo Attimonelli, Tommaso Di Noia | |||
| 6 | The Future is Sparse: Embedding Compression for Scalable Retrieval in Recommender Systems | 0 | Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a lightweight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders. We release our code at https://github.com/recombee/CompresSAE. | Daniel Bohunek, Martin Spisák, Pavel Kordík, Petr Kasalický, Rodrigo Alves, Vojtech Vancura | |||
| 7 | Determinants of Users' Chance-Seeking Behavior in Search-Based Recommendation | 0 | Serendipity has emerged as a promising strategy to counter overspecialization in retrieval and recommendation systems. While prior studies focus on algorithmic approaches, few have examined users’ desire for chance. This study investigates psychological determinants of chance seeking through two experiments. Experiment 1 found that greater goal specificity suppresses chance seeking. Experiment 2 showed that extraversion, diversive curiosity, enjoyment of ambiguity, and maximization enhance chance seeking, whereas neuroticism and specific curiosity reduce it. These findings suggest that users actively regulate the degree of chance in response to their goal and individual characteristics. The results indicate the importance of considering users’ chance seeking when designing serendipitous recommendation systems. | Eiji Mitsuda, Kazuhisa Miwa, Koji Sato, Ryosuke Nakanishi, Tadashi Odashima, Yuichiro Sumi, Yuki Ninomiya, Yutaro Sone | |||
| 8 | Improve the Personalization of Large-Scale Ranking Systems by Integrating User Survey Feedback | 0 | Learning user interests is a crucial aspect of personalized recommendation, as it can create a more personal experience for users to drive their deep engagement, satisfaction, and loyalty. In this work, we focus on improving users’ interest relevance experience, making users truly feel "this app knows me!" and thus leading to long-term user retention. However, accurately capturing users’ interest remains a significant challenge. Traditional approaches using users’ historical engagements with interest clusters lack sensitivity and accuracy, because such heuristic rules on predefined clusters can easily fall into the ranking feedback loop and thus poorly align with users’ true interest preferences. In this paper, we built a User True Interest Survey (UTIS) model to directly train on user survey data and predict a user’s interest affinity on any given piece of content. The UTIS model is added to the main ranking system to reduce feedback bias and leads to better relevance towards users’ core interests. The UTIS model demonstrates high offline accuracy and high generalization capability in online experiments. On a commercial video platform serving billions of users, we observed significant metric wins, including tier 0 user retention and engagements, higher quality and more trustworthy content recommendations, and higher user satisfaction in surveys. Overall, this work demonstrates that improving the relevance of a ranking system by leveraging direct user survey feedback can be a promising solution to enhance personalization of large-scale ranking systems and lead to user satisfaction. | Cayman Simpson, Drew Hogg, Mengxi Lv, Min Li, Senthil Rajagopalan, Shashank Bassi, Thomas Grubb | |||
| 9 | User Long-Term Multi-Interest Retrieval Model for Recommendation | 0 | User behavior sequence modeling, which captures user interest from rich historical interactions, is pivotal for industrial recommendation systems. Despite breakthroughs in ranking-stage models capable of leveraging ultra-long behavior sequences with length scaling up to thousands, existing retrieval models remain constrained to sequences of hundreds of behaviors due to two main challenges. One is the strict latency budget imposed by real-time service over a large-scale candidate pool. The other is the absence of target-aware mechanisms and cross-interaction architectures, which prevents utilizing ranking-like techniques to simplify long sequence modeling. To address these limitations, we propose a new framework named User Long-term Multi-Interest Retrieval Model (ULIM), which enables thousand-scale behavior modeling in retrieval stages. ULIM includes two novel components: 1) Category-Aware Hierarchical Dual-Interest Learning partitions long behavior sequences into multiple category-aware subsequences representing multi-interest and jointly optimizes long-term and short-term interests within a specific interest cluster. 2) Pointer-Enhanced Cascaded Category-to-Item Retrieval introduces the Pointer-Generator Interest Network (PGIN) for next-category prediction, followed by next-item retrieval upon the top-K predicted categories. Comprehensive experiments on the Taobao dataset show that ULIM achieves substantial improvement over state-of-the-art methods, and brings 5.54 | Bo Zheng, Cheng Guo, Honghu Deng, Tong Liu, Xiaohui Hu, Yi Cao, Yue Meng | |||
| 10 | LLM-RecG: A Semantic Bias-Aware Framework for Zero-Shot Sequential Recommendation | 0 | Zero-shot cross-domain sequential recommendation (ZCDSR) enables predictions in unseen domains without additional training or fine-tuning, addressing the limitations of traditional models in sparse data environments. Recent advancements in large language models (LLMs) have significantly enhanced ZCDSR by facilitating cross-domain knowledge transfer through rich, pretrained representations. Despite this progress, domain semantic bias – arising from differences in vocabulary and content focus between domains – remains a persistent challenge, leading to misaligned item embeddings and reduced generalization across domains. To address this, we propose a novel semantic bias-aware framework that enhances LLM-based ZCDSR by improving cross-domain alignment at both the item and sequential levels. At the item level, we introduce a generalization loss that aligns the embeddings of items across domains (inter-domain compactness), while preserving the unique characteristics of each item within its own domain (intra-domain diversity). This ensures that item embeddings can be transferred effectively between domains without collapsing into overly generic or uniform representations. At the sequential level, we develop a method to transfer user behavioral patterns by clustering source domain user sequences and applying attention-based aggregation during target domain inference. We dynamically adapt user embeddings to unseen domains, enabling effective zero-shot recommendations without requiring target-domain interactions... | Hari Sundaram, Junting Wang, Yunzhe Li, Zhining Liu | |||
| 11 | Test-Time Alignment with State Space Model for Tracking User Interest Shifts in Sequential Recommendation | 0 | Sequential recommendation is essential in modern recommender systems, aiming to predict the next item a user may interact with based on their historical behaviors. However, real-world scenarios are often dynamic and subject to shifts in user interests. Conventional sequential recommendation models are typically trained on static historical data, limiting their ability to adapt to such shifts and resulting in significant performance degradation during testing. Recently, Test-Time Training (TTT) has emerged as a promising paradigm, enabling pre-trained models to dynamically adapt to test data by leveraging unlabeled examples during testing. However, applying TTT to effectively track and address user interest shifts in recommender systems remains an open and challenging problem. Key challenges include how to capture temporal information effectively and explicitly identifying shifts in user interests during the testing phase. To address these issues, we propose T2ARec, a novel model leveraging state space model for TTT by introducing two Test-Time Alignment modules tailored for sequential recommendation, effectively capturing the distribution shifts in user interest patterns over time. Specifically, T2ARec aligns absolute time intervals with model-adaptive learning intervals to capture temporal dynamics and introduce an interest state alignment mechanism to effectively and explicitly identify the user interest shifts with theoretical guarantees. These two alignment modules enable efficient and incremental updates to model parameters in a self-supervised manner during testing, enhancing predictions for online recommendation. Extensive evaluations on three benchmark datasets demonstrate that T2ARec achieves state-of-the-art performance and robustly mitigates the challenges posed by user interest shifts. | Changshuo Zhang, JiRong Wen, Jun Xu, Teng Shi, Xiao Zhang | |||
| 12 | Disentangling User and Item Sequence Patterns in Sequential Recommendation Data Sets | 0 | Sequential recommenders use the ordering of user-item interactions to perform next-item prediction. Several studies have attempted to estimate how much sequential information is available in data sets used for the offline evaluation of sequential recommenders by randomly shuffling users’ interaction histories and breaking the sequential dependencies between interactions. However, random shuffling fails to distinguish between sequential patterns from user behaviour (users consuming items based on previous interactions) and item availability (when items enter the system and become available for user consumption). In this article, we analyse several widely used data sets in sequential recommendation studies using two shuffling techniques: random shuffling and constrained shuffling. While random shuffling reorders interactions arbitrarily, constrained shuffling does not allow user-item interactions to occur prior to the item’s first appearance in the data set. Our experiments show that sequential information can either come exclusively from user behaviour patterns or item availability, or from a combination of the two. These findings have implications for understanding evaluation results in sequential recommendation and highlight why some data sets may be less appropriate for offline evaluation given how little sequential information comes from user behaviour. | Alan Medlar, Dorota Glowacka, Kaiyue Liu, Yang Liu | |||
| 13 | Let It Go? Not Quite: Addressing Item Cold Start in Sequential Recommendations with Content-Based Initialization | 0 | Many sequential recommender systems suffer from the cold start problem, where items with few or no interactions cannot be effectively used by the model due to the absence of a trained embedding. Content-based approaches, which leverage item metadata, are commonly used in such scenarios. One possible way is to use embeddings derived from content features such as textual descriptions as initialization for the model embeddings. However, directly using frozen content embeddings often results in suboptimal performance, as they may not fully adapt to the recommendation task. On the other hand, fine-tuning these embeddings can degrade performance for cold-start items, as item representations may drift far from their original structure after training. We propose a novel approach to address this limitation. Instead of entirely freezing the content embeddings or fine-tuning them extensively, we introduce a small trainable delta to frozen embeddings that enables the model to adapt item representations without letting them go too far from their original semantic structure. This approach demonstrates consistent improvements across multiple datasets and modalities, including e-commerce datasets with textual descriptions and a music dataset with audio-based representation. | Alexey Vasilev, Anton Klenitskiy, Anton Pembek, Artem Fatkulin | |||
| 14 | DistillRecDial: A Knowledge-Distilled Dataset Capturing User Diversity in Conversational Recommendation | 0 | Conversational Recommender Systems (CRSs) facilitate item discovery through multi-turn dialogues that elicit user preferences via natural language interaction. This field has gained significant attention following advancements in Natural Language Processing (NLP) enabled by Large Language Models (LLMs). However, current CRS research remains constrained by datasets with fundamental limitations. Human-generated datasets suffer from inconsistent dialogue quality, limited domain expertise, and insufficient scale for real-world application, while synthetic datasets created with proprietary LLMs ignore the diversity of real-world user behavior and present significant barriers to accessibility and reproducibility. The development of effective CRSs depends critically on addressing these deficiencies. To this end, we present DistillRecDial, a novel conversational recommendation dataset generated through a knowledge distillation pipeline that leverages smaller, more accessible open LLMs. Crucially, DistillRecDial simulates a range of user types with varying intentions, preference expression styles, and initiative levels, capturing behavioral diversity that is largely absent from prior work. Human evaluation demonstrates that our dataset significantly outperforms widely adopted CRS datasets in dialogue coherence and domain-specific expertise, indicating its potential to advance the development of more realistic and effective conversational recommender systems. | Alessandro Francesco Maria Martina, Alessandro Petruzzelli, Cataldo Musto, Giovanni Semeraro, Marco de Gemmis, Pasquale Lops | |||
| 15 | In-context Learning for Addressing User Cold-start in Sequential Movie Recommenders | 0 | The user cold-start problem remains a fundamental challenge for sequential recommender systems, particularly in large-scale video streaming services where a substantial portion of users have limited or no historical interaction data. In this work, we formulate an attempt at solving this issue by proposing a framework that leverages Large Language Models (LLMs) to enrich interaction histories using user metadata. Our approach generates a set of imaginary video items relevant to a given user’s demographic, represented through structured item key-value attributes. The generated items are inserted into users’ interaction sequences using early or late fusion strategies. We find that the generated user histories enable better initial user profiling for absolute cold users and enhanced preference modeling for nearly cold users. Experimental results on the public ML-1M dataset and an internal dataset from an Amazon streaming service demonstrate the effectiveness of our LLM-based augmentation method in mitigating cold-start challenges. | Julien Monteil, Paul Albert, Vu Nguyen, Vuong Le, Xurong Liang | |||
| 16 | Leveraging Explicit Negative Feedback in Large-Scale Recommendation Systems: A Case Study | 0 | What users dislike can be just as important as what they engage with, yet explicit negative user feedback remains underutilized in most recommendation systems. This paper presents practical approaches for capturing such feedback through lightweight, context-aware surveys and in-feed interactions. Referencing a case study on large-scale implementations at TikTok, we demonstrate how incorporating user feedback signals, once denoised and modeled, can improve feed quality, content relevance, and long-term user engagement. Our findings highlight that even small, well-designed feedback mechanisms can meaningfully improve user experience and trust. | Bingfeng Deng, Hongyu Xiong, Madhura Raju, Manisha Sharma, Meng Na | |||
| 17 | IP2: Entity-Guided Interest Probing for Personalized News Recommendation | 0 | News recommender systems aim to provide personalized news reading experiences for users based on their reading history. Behavioral science studies suggest that screen-based news reading contains three successive steps: scanning, title reading, and then clicking. Adhering to these steps, we find that intra-news entity interest dominates the scanning stage, while the inter-news entity interest guides title reading and influences click decisions. Unfortunately, current methods overlook the unique utility of entities in news recommendation. To this end, we propose a novel method called IP2 to probe entity-guided reading interest at both intra- and inter-news levels. At the intra-news level, a Transformer-based entity encoder is devised to aggregate mentioned entities in the news title into one signature entity. Then, a signature entity-title contrastive pre-training is adopted to initialize entities with proper meanings using the news story context, which in the meantime facilitates us to probe for intra-news entity interest. As for the inter-news level, a dual tower user encoder is presented to capture inter-news reading interest from both the title meaning and entity sides. In addition to highlighting the contribution of inter-news entity guidance, a cross-tower attention link is adopted to calibrate title reading interest using inter-news entity interest, thus further aligning with real-world behavior. Extensive experiments on two real-world datasets demonstrate that our IP2 achieves state-of-the-art performance in news recommendation. | Bo Xu, Haoxi Zhan, Hongfei Lin, Liang Yang, Xiaokun Zhang, Youlin Wu, Yuanyuan Sun | |||
| 18 | LEAF: Lightweight, Efficient, Adaptive and Flexible Embedding for Large-Scale Recommendation Models | 0 | Deep Learning Recommendation Models (DLRMs) are central to enhancing user engagement and experience with internet and e-commerce companies. DLRMs provide content and commercial suggestions by modeling user behavior. DLRMs rely on embedding tables to capture the user behavior, where users with similar interests may be represented closer in the embedding space. Embedding tables scale to tens of terabytes as the number of users and features grows, presenting challenges in training and storage. These models typically require substantial GPU memory, as embedding operations are not compute-intensive but occupy significant storage. While some solutions have explored offloading embedding tables to CPU, this approach still demands terabytes of memory and places a significant burden on CPU-GPU interconnect. We introduce LEAF, a multi-level hashing framework that compresses the large embedding tables based on real-time access frequency distribution. In particular, LEAF leverages a streaming algorithm to estimate access distributions on the fly without relying on model gradients or requiring a priori knowledge of access distribution. By using multiple hash functions, LEAF minimizes the collision rates of feature instances. Experiments show that LEAF outperforms state-of-the-art compression methods on Criteo Kaggle, Avazu, KDD12, and Criteo Terabyte datasets, with testing AUC improvements of (1.411%), (1.885%), (2.761%), and (1.243%), respectively. The source code of LEAF is available at github.com/chaoyij/LEAF. | Abdulla Alshabanah, Chaoyi Jiang, Murali Annavaram | |||
| 19 | Collaborative Interest Modeling in Recommender Systems | 0 | In this paper, we introduce Collaborative Interest Modeling (COIN), a novel approach to tackle interest entanglement and sparse interest representations within multi-interest learning for recommender systems. COIN leverages collaborative signals from behaviorally similar interests to refine interest embeddings and enhance recommendation quality, unlike existing methods that primarily focus on individual user-item interactions. The approach aligns collaborative neighbors with sparse interests, employs a structured routing mechanism to distinguish multiple interests, and avoids routing collapse. Experimental results on three real-world datasets demonstrate that COIN outperforms state-of-the-art models by 4.71% to 15.13% in key recommendation metrics, such as recall, NDCG, and hit ratio. | JyunYu Jiang, YuTing Cheng, YuYen Ho | |||
| 20 | Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items | 0 | In Conversational Recommendation Systems (CRS), a user provides feedback on recommended items at each turn, leading the CRS towards improved recommendations. Due to the need for a large amount of data, a user simulator is employed for both training and evaluation. Such user simulators critique the current retrieved item based on knowledge of a single target item. However, system evaluation in offline settings with simulators is limited by the focus on a single target item and their unlimited patience over a large number of turns. To overcome these limitations of existing simulators, we propose Fashion-AlterEval, a new dataset that contains human judgments for a selection of alternative items by adding new annotations in common fashion CRS datasets. Consequently, we propose two novel meta-user simulators that use the collected judgments and allow simulated users not only to express their preferences about alternative items to their original target, but also to change their mind and level of patience. In our experiments using the Shoes and Fashion IQ as the original datasets and three CRS models, we find that using the knowledge of alternatives by the simulator can have a considerable impact on the evaluation of existing CRS models, specifically that the existing single-target evaluation underestimates their effectiveness, and when simulated users are allowed to instead consider alternative relevant items, the system can respond more quickly to satisfy the user. | Maria Vlachou | |||
| 21 | Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders | 0 | Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios. Although the widely used leave-one-out split matches next-item prediction, it permits the overlap between training and test periods, which leads to temporal leakage and unrealistically long test horizon, limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its applications to sequential recommendations remain loosely defined, particularly in terms of selecting target interactions and constructing a validation subset that provides necessary consistency between validation and test metrics. In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that prevalent splits, such as leave-one-out, may be insufficiently aligned with more realistic evaluation strategies. Code: https://github.com/monkey0head/time-to-split | Alexey Vasilev, Anna Volodkevich, Anton Klenitskiy, Danil Gusak, Evgeny Frolov | |||
| 22 | "We Share Our Code Online": Why This Is Not Enough to Ensure Reproducibility and Progress in Recommender Systems Research | 0 | Issues with reproducibility have been identified as a major factor hampering progress in recommender systems research. In response, researchers increasingly share the code of their models. However, the provision of only the code of the proposed model is usually not sufficient to ensure reproducibility. In many works, the central claim is that a new model is advancing the state of the art. Thus, it is crucial that the entire experiment is reproducible, including the configuration and the results of the considered baselines. With this work, our goal is to gauge the level of reproducibility in algorithms research in recommender systems. We systematically analyzed the reproducibility level of 65 papers published at a top-ranked conference during the last three years. Our results are sobering. While the model code is shared in about two thirds of the papers, the code of the baselines is provided only in eight cases. The hyperparameters of the baselines are reported even less frequently, and how these were exactly determined is not explained in any paper. As a result, it is commonly not only impossible to reproduce the full result tables reported in the papers, it is also unclear if the claimed improvements over the state of the art were actually achieved. Overall, we conclude that the research community has not reached the required level of reproducibility yet. We therefore call for more rigorous reproducibility standards to ensure progress in this field. | Dietmar Jannach, Faisal Shehzad, Maria Maistro, Timo Breuer | |||
| 23 | Yambda-5B - A Large-Scale Multi-Modal Dataset for Ranking and Retrieval | 0 | We present Yambda-5B, a large-scale open dataset sourced from the Yandex.Music streaming platform. Yambda-5B contains 4.79 billion user-item interactions from 1 million users across 9.39 million tracks. The dataset includes two primary types of interactions: implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes and undislikes). In addition, we provide audio embeddings for most tracks, generated by a convolutional neural network trained on audio spectrograms. A key distinguishing feature of Yambda-5B is the inclusion of the is_organic flag, which separates organic user actions from recommendation-driven events. This distinction is critical for developing and evaluating machine learning algorithms, as Yandex.Music relies on recommender systems to personalize track selection for users. To support rigorous benchmarking, we introduce an evaluation protocol based on a Global Temporal Split, allowing recommendation algorithms to be assessed in conditions that closely mirror real-world use. We report benchmark results for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using a variety of evaluation metrics. By releasing Yambda-5B to the community, we aim to provide a readily accessible, industrial-scale resource to advance research, foster innovation, and promote reproducible results in recommender systems. | Alexander Ploshkin, Alexey Pismenny, Artem Permiakov, Daniil Burlakov, Eugene Krofto, Evgeny Taychinov, Vladimir Baikalov, Vladislav Tytskiy | |||
| 24 | An Analysis of Learned Product Embeddings in an E-Commerce Context | 0 | Recommender systems often represent products with learnable embeddings. Yet, we seldom examine the structure of the embedding space, and what implications it has for the recommendation task at hand. In contrast, embeddings in natural language processing are well-understood and offer intuitive properties through word analogies (e.g. "queen - king = woman - man"). In this work, we present a corresponding approach that reveals latent knowledge in the structure of product embeddings. We prove their relevance in evaluating several embeddings learned from different data modalities in a home-furnishing context. Our findings evince distinct embedding strengths: visual embeddings capture explicit attributes like colour and shape; textual embeddings encode abstract concepts like style and functionality; while behavioural embeddings offer versatile representations driven by user interactions. We also highlight trade-offs, and link our evaluations to practical considerations in embedding development within the e-commerce domain. | Eva Giannatou, Martin Tegner, Mate Hartstein | |||
| 25 | Decoupled Entity Representation Learning for Pinterest Ads Ranking | 0 | In this paper, we introduce a novel framework following an upstream-downstream paradigm to construct user and item (Pin) embeddings from diverse data sources, which are essential for Pinterest to deliver personalized Pins and ads effectively. Our upstream models are trained on extensive data sources featuring varied signals, utilizing complex architectures to capture intricate relationships between users and Pins on Pinterest. To ensure scalability of the upstream models, entity embeddings are learned, and regularly refreshed, rather than real-time computation, allowing for asynchronous interaction between the upstream and downstream models. These embeddings are then integrated as input features in numerous downstream tasks, including ad retrieval and ranking models for CTR and CVR predictions. We demonstrate that our framework achieves notable performance improvements in both offline and online settings across various downstream tasks. This framework has been deployed in Pinterest's production ad ranking systems, resulting in significant gains in online metrics. | Han Sun, Haoyang Li, Huasen Wu, Jiankai Sun, Jie Liu, Kungang Li, Ling Leng, Nan Li, Paulo Soares, Prathibha Deshikachar, Sihan Wang, Siping Ji, Siyuan Gao, Yinrui Li, Zhifang Liu | |||
| 26 | Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest | 0 | The ranking utility function in an ad recommender system, which linearly combines predictions of various business goals, plays a central role in balancing values across the platform, advertisers, and users. Traditional manual tuning, while offering simplicity and interpretability, often yields suboptimal results due to its unprincipled tuning objectives, the vast amount of parameter combinations, and its lack of personalization and adaptability to seasonality. In this work, we propose a general Deep Reinforcement Learning framework for Personalized Utility Tuning (DRL-PUT) to address the challenges of multi-objective optimization within ad recommender systems. Our key contributions include: 1) Formulating the problem as a reinforcement learning task: given the state of an ad request, we predict the optimal hyperparameters to maximize a pre-defined reward. 2) Developing an approach to directly learn an optimal policy model using online serving logs, avoiding the need to estimate a value function, which is inherently challenging due to the high variance and unbalanced distribution of immediate rewards. We evaluated DRL-PUT through an online A/B experiment in Pinterest's ad recommender system. Compared to the baseline manual utility tuning approach, DRL-PUT improved the click-through rate by 9.7 | Abe Engle, Charles Rosenberg, Fan Zhou, Jiajing Xu, Jinfeng Zhuang, Ling Leng, Longyu Zhao, Mehdi Ayed, Prathibha Deshikachar, Xiao Yang, Yuchen Shen | |||
| 27 | Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID | 0 | The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural ID life cycles (e.g., the birth of new IDs and retirement of old IDs). To address these issues, many systems rely on random hashing to handle the ID space and control the corresponding model parameters (i.e., the embedding table). However, this approach introduces data pollution from multiple IDs sharing the same embedding, leading to degraded model performance and embedding representation instability. This paper examines these challenges and introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail ID modeling, reduces overfitting, and mitigates representation shifts. We further highlight the advantages of Semantic ID prefix ngram in attention-based models that contextualize user histories, showing substantial performance improvements. We also report our experience of integrating Semantic ID into Meta's production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments. | Carolina Zheng, Dinesh Ramasamy, Dmitrii Pedchenko, Fan Xia, Gaby Nahum, Hangjun Xu, Jie Lei, Jiyan Yang, Kaushik Rangadurai, Lin Yang, Minhui Huang, Rong Jin, Shuang Yang, Siyu Wang, Tao Liu, Xiaohan Wei, Yang Yang, Yiping Han, Zutian Luo | |||
| 28 | RankGraph: Unified Heterogeneous Graph Learning for Cross-Domain Recommendation | 0 | Cross-domain recommendation systems face the challenge of integrating fine-grained user and item relationships across various product domains. To address this, we introduce RankGraph, a scalable graph learning framework designed to serve as a core component in recommendation foundation models (FMs). By constructing and leveraging graphs composed of heterogeneous nodes and edges across multiple products, RankGraph enables the integration of complex relationships between users, posts, ads, and other entities. Our framework employs a GPU-accelerated Graph Neural Network and contrastive learning, allowing for dynamic extraction of subgraphs such as item-item and user-user graphs to support similarity-based retrieval and real-time clustering. Furthermore, RankGraph integrates graph-based pretrained representations as contextual tokens into FM sequence models, enriching them with structured relational knowledge. RankGraph has demonstrated improvements in click (+0.92 | Hong Li, Hong Yan, Junjie Yang, Li Chen, Li Yu, Renzhi Wu | |||
| 29 | Unified Survey Modeling to Limit Negative User Experiences in Recommendation Systems | 0 | Reducing negative user experiences is crucial for the success of recommendation platforms. Exposure to inappropriate content can not only harm users’ psychological well-being but also drive them away, ultimately undermining the platform’s long-term growth. However, recommendation algorithms often prioritize positive feedback signals due to the relative scarcity of negative ones, which may lead to the oversight of valuable negative user feedback. In this paper, we propose a method that leverages in-feed surveys to collect user feedback, models this feedback, and integrates the predictions into the recommendation system. We enhance the personalized survey model based on the HoME framework. Our experiments demonstrate that the proposed method significantly outperforms the baseline model. We observed an average 0.52% AUC increase and 1.38% LogLoss decline across all heads. After deploying the model on the TikTok app, we observe increases of 0.82% and 0.67% in survey_like_rate and Like, and reductions of 4.08%, 2.51%, and 2.59% in survey_inappropriate_rate, Reports, and Dislikes, respectively, illustrating the improvement of the overall recommendation quality and decline in negative signals. | Bingfeng Deng, Chenghui Yu, Haoze Wu, Hongyu Xiong, Jian Ding | |||
| 30 | USD: A User-Intent-Driven Sampling and Dual-Debiasing Framework for Large-Scale Homepage Recommendations | 0 | Large-scale homepage recommendations face critical challenges from pseudo-negative samples caused by exposure bias, where non-clicks may indicate inattention rather than disinterest. Existing work lacks thorough analysis of invalid exposures and typically addresses isolated aspects (e.g., sampling strategies), overlooking the critical impact of pseudo-positive samples - such as homepage clicks merely to visit marketing portals. We propose a unified framework for large-scale homepage recommendation sampling and debiasing. Our framework consists of two key components: (1) a user intent-aware negative sampling module to filter invalid exposure samples, and (2) an intent-driven dual-debiasing module that jointly corrects exposure bias and click bias. Extensive online experiments on Taobao demonstrate the efficacy of our framework, achieving significant improvements in user click-through rates (UCTR) by 35.4% and 14.5% in two variants of the marketing block on the Taobao homepage, Baiyibutie and Taobaomiaosha. | Bo Zheng, Chaoqun Hou, Cheng Guo, Jiaqi Zheng, Tong Liu, Yi Cao | |||
| 31 | You Say Search, I Say Recs: A Scalable Agentic Approach to Query Understanding and Exploratory Search at Spotify | 0 | On online content platforms, users often aim to explore the catalog and discover new, personalized content through exploratory searches—such as “new releases for me.” Traditional search systems, which prioritize lexical and semantic matching over personalized retrieval, have historically struggled to support this type of intent. In contrast, recommendation services that leverage user-item and item-item signals tend to be more effective for addressing exploratory queries. Agentic technologies offer a promising opportunity to enhance exploratory search by harnessing large language models (LLMs) to interpret complex query intents and route them to the most suitable downstream services. However, deploying such agentic systems at scale remains a significant challenge. In this paper, we present a scalable agentic approach to query understanding and exploratory search at Spotify. Our system combines an LLM router, post-training adaptation techniques, search and recommendation APIs, and specialized sub-agents to interpret user intent and deliver personalized results at scale. We outline the high-level system design and share key experimental results. By addressing the limitations of conventional search, our approach yields substantial improvements across several exploratory use cases, including discovering similar artists (+115%), broad podcast searches (+15%), new music releases (+91%), and broad music searches (+25%). | Alexandre Tamborrino, Ali Vardasbi, Anders Nyman, Catalin Dincu, Christine Doig Cardet, Dani Doro, Enrico Palumbo, Hugues Bouchard, Lev Nikeshkin, Marcus Isaksson, Maria Movin, Mounia Lalmas, Oksana Gorobets, Paul N. Bennett, Poppy Newdick, Ziad Sultan | |||
| 32 | Benefiting from Negative yet Informative Feedback by Contrasting Opposing Sequential Patterns | 0 | We consider the task of learning from both positive and negative feedback in a sequential recommendation scenario, as both types of feedback are often present in user interactions. Meanwhile, conventional sequential learning models usually focus on considering and predicting positive interactions, ignoring that reducing items with negative feedback in recommendations improves user satisfaction with the service. Moreover, the negative feedback can potentially provide a useful signal for more accurate identification of true user interests. In this work, we propose to train two transformer encoders on separate positive and negative interaction sequences. We incorporate both types of feedback into the training objective of the sequential recommender using a composite loss function that includes positive and negative cross-entropy as well as a cleverly crafted contrastive term that helps better model opposing patterns. We demonstrate the effectiveness of this approach in terms of increasing true-positive metrics compared to state-of-the-art sequential recommendation methods while reducing the number of wrongly promoted negative items. | Alexey Vasilev, Evgeny Frolov, Veronika Ivanova | |||
| 33 | Beyond Clicks: Eye-Tracking Insights into User Responses to Different Recommendation Types | 0 | Modern recommender systems increasingly rely on implicit human feedback to enhance recommendation quality, personalization, and user engagement. In e-commerce, eye-tracking has emerged as a valuable tool for capturing attention and preference, yet little work has explored how users behave across different recommendation categories. In this study, we analyse eye-tracking data from users exposed to four recommendation types—Exact, Substitute, Complement, and Irrelevant—in a query-based setting. Our results reveal consistent patterns: users exhibit predictable, text-focused viewing for Exact and Substitute items, while Complement and Irrelevant items trigger more distributed, exploratory behaviour. Notably, Irrelevant items elicit higher emotional arousal associated with disengagement—a pattern not seen with Complement items, suggesting the latter may increase diversity without harming user experience. These findings highlight the importance of considering recommendation context in user modelling, and provide a foundation for future work on context-aware recommender systems and the use of eye-tracking data. | Georgios Koutroumpas, Ioannis Arapakis, Joemon M. Jose, Matteo Mazzini, Mireia Masias Bruns, Sebastian Idesis, Sergi Abadal | |||
| 34 | Semantic IDs for Joint Generative Search and Recommendation | 0 | Generative models powered by Large Language Models (LLMs) are emerging as a unified solution for powering both recommendation and search tasks. A key design choice in these models is how to represent items, traditionally through unique identifiers (IDs) and more recently with Semantic IDs composed of discrete codes, obtained from embeddings. While task-specific embedding models can improve performance for individual tasks, they may not generalize well in a joint setting. In this paper, we explore how to construct Semantic IDs that perform well both in search and recommendation when using a unified model. We compare a range of strategies to construct Semantic IDs, looking into task-specific and cross-tasks approaches, and also whether each task should have its own semantic ID tokens in a joint search and recommendation generative model. Our results show that using a bi-encoder model fine-tuned on both search and recommendation tasks to obtain item embeddings, followed by the construction of a unified Semantic ID space provides an effective trade-off, enabling strong performance in both tasks. We hope these findings spark follow-up work on generalisable, semantically grounded ID schemes and inform the next wave of unified generative recommender architectures. | Alexandre Tamborrino, Ali Vardasbi, Edoardo D'Amico, Enrico Palumbo, Francesco Fabbri, Gustavo Penha, Hugues Bouchard, Marco De Nadai, Max Lefarov, Shawn Lin, Timothy Christopher Heath | |||
| 35 | Unobserved Negative Items in Recommender Systems: Challenges and Solutions for Evaluation and Learning | 0 | Properly conducting offline evaluation is crucial for recommender systems. While sampling negative items has traditionally been employed for its efficiency in evaluation, recent studies have highlighted the limitations of this approach, prompting researchers to adopt a more cautious stance toward item-sampling evaluation. However, even in the absence of intentional sampling, negative items may still be missing. This issue arises because typical implicit feedback datasets contain only items that have been interacted with by at least one user in the dataset. Consequently, the included items may not encompass the entire catalog of items that serve as true candidate items during online deployment. In this paper, we investigate the impact of missing candidate items on both the evaluation and learning processes of recommender systems. Our findings demonstrate that missing candidate items lead to the overestimation of model performance and inconsistencies in identifying superior models. Moreover, their absence significantly impairs model training. To address this challenge, we propose evaluation and learning methods based on inverse probability weighting, complemented by a novel protocol for estimating the probabilities of missing items. We show that the proposed evaluation methods recover metrics that closely approximate their true values. Furthermore, the proposed learning method yields a more robust model, even when candidate items are missing from the training data. | Masahiro Sato | |||
| 36 | VisualReF: Interactive Image Search Prototype with Visual Relevance Feedback | 0 | In the absence of interaction history, image recommendations often depend on content-based approaches. Prompted by user queries in natural language, such systems rank items based on the similarity between textual and visual features. However, these approaches typically rely on static queries and do not offer alternative feedback mechanisms. In this paper, we present VisualReF: an interactive image retrieval prototype that introduces visual relevance feedback through fine-grained user annotations. Built on vision-language models (VLMs) for retrieval, our system allows users to label relevant and irrelevant regions in retrieved images. These regions are captioned using a generative vision-language model to refine the query vector. Our work bridges the gap between conventional static image retrieval and interactive, user-guided search by introducing visual relevance feedback. Finally, our prototype contributes to the field of visual recommendation by empowering researchers with practical tools for: (i) collecting region-level visual relevance signals from users, (ii) supporting integration of human feedback into interactive search pipelines, and (iii) explaining how the relevance feedback model perceives user input. | Bulat Khaertdinov, Mirela Popa, Nava Tintarev | |||
| 37 | Advancing User-Centric Evaluation and Design of Conversational Recommender Systems | 0 | Conversational Recommender Systems (CRS) are rapidly evolving with advancements in large language models (LLMs), enabling richer, more adaptive user interactions. However, existing evaluation practices remain largely system-centric, underestimating nuanced factors like conversational quality, empathy, and real-world user satisfaction. This doctoral research aims to bridge that gap by advancing holistic, user-centric evaluation frameworks for CRS. The work pursues four directions: (1) identifying key drivers of user satisfaction through targeted user studies and dataset analyses; (2) systematically investigating LLMs as annotators and user simulators to support scalable CRS assessment; (3) developing scalable, standardized evaluation protocols that balance objective accuracy with subjective conversational experience; and (4) deriving actionable design guidelines by comparing strategies for preference elicitation and context integration. Ultimately, this research seeks to provide reproducible methods, and evidence-based guidance to foster the development of CRS that genuinely center the user. | Michael Müller | |||
| 38 | A Multi-Factor Collaborative Prediction for Review-based Recommendation | 0 | For items, the higher the click-through rate, the higher the rating. Thus, existing recommendation methods implicitly model click behaviors by learning user preferences and achieving accurate predictions on rating prediction tasks. However, they ignore the help of the rating behaviors for the click-through rate prediction task (CTR). Although the rating behavior occurs after the click behavior, we can still get helpful information about clicks from ratings. In this paper, we propose a multi-factor collaborative prediction method (MFC), which mines the complex relationship between click and rating behaviors, achieving accurate prediction on CTR tasks. Specifically, we factorize the complex relationship into three simple relationships, i.e., linear, sharing, and cross-correlation relationships. Thus, MFC first extracts click factors, rating factors, and their sharing factor from user click and rating behaviors with user reviews, as review-based methods have achieved great results on rating predictions. Then, a rating factor regularization method is used to learn rating factors accurately, helping to model the true relationships between click and rating behavior. Finally, MFC combines those three factors to make predictions, while click and rating factors are used to model the linear and cross-correlation relationships, and the sharing factors correspond to the sharing relation. Experiments on five real-world datasets demonstrate that MFC outperforms the best baseline by 9.19%, 9.80%, 0.69%, and 7.95% in terms of Accuracy, Precision, Recall, and F1-score, respectively. MFC also reduces the MAE of the rating prediction task by 1.92%. The source code is available at https://github.com/dianziliu/MFC. | Junrui Liu, Mingliang Yu, Shiqiu Yang, Tong Li, Zhen Yang, Zifang Tang | |||
| 39 | Enhancing Sequential Recommender with Large Language Models for Joint Video and Comment Recommendation | 0 | Nowadays, reading or writing comments on captivating videos has emerged as a critical part of the viewing experience on online video platforms. However, existing recommender systems primarily focus on users’ interaction behaviors with videos, neglecting comment content and interaction in user preference modeling. In this paper, we propose a novel recommendation approach called LSVCR that utilizes user interaction histories with both videos and comments to jointly perform personalized video and comment recommendation. Specifically, our approach comprises two key components: a sequential recommendation (SR) model and a supplemental large language model (LLM) recommender. The SR model functions as the primary recommendation backbone (retained in deployment) of our method for efficient user preference modeling. Concurrently, we employ an LLM as the supplemental recommender (discarded in deployment) to better capture underlying user preferences derived from heterogeneous interaction behaviors. In order to integrate the strengths of the SR model and the supplemental LLM recommender, we introduce a two-stage training paradigm. The first stage, personalized preference alignment, aims to align the preference representations from both components, thereby enhancing the semantics of the SR model. The second stage, recommendation-oriented fine-tuning, involves fine-tuning the alignment-enhanced SR model according to specific objectives. Extensive experiments in both video and comment recommendation tasks demonstrate the effectiveness of LSVCR. Moreover, online A/B testing on the KuaiShou platform verifies the practical benefits of our approach. In particular, we attain a cumulative gain of 4.13% in comment watch time. | Bowen Zheng, Chen Yang, Cheng Ling, Enyang Bai, Enze Liu, Han Li, JiRong Wen, Wayne Xin Zhao, Zihan Lin | |||
| 40 | Mapping Stakeholder Needs to Multi-Sided Fairness in Candidate Recommendation for Algorithmic Hiring | 0 | Already before the enactment of the EU AI Act, candidate or job recommendation for algorithmic hiring—semi-automatically matching CVs to job postings—was used as an example of a high-risk application where unfair treatment could result in serious harms to job seekers. Recommending candidates to jobs or jobs to candidates, however, is also a fitting example of a multi-stakeholder recommendation problem. In such multi-stakeholder systems, the end user is not the only party whose interests should be considered when generating recommendations. In addition to job seekers, other stakeholders—such as recruiters, organizations behind the job postings, and the recruitment agency itself—are also stakeholders in this and deserve to have their perspectives included in the design of relevant fairness metrics. Nevertheless, past analyses of fairness in algorithmic hiring have been restricted to single-side fairness, ignoring the perspectives of the other stakeholders. In this paper, we address this gap and present a multi-stakeholder approach to fairness in a candidate recommender system that recommends relevant candidate CVs to human recruiters in a human-in-the-loop algorithmic hiring scenario. We conducted semi-structured interviews with 40 different stakeholders (job seekers, companies, recruiters, and other job portal employees). We used these interviews to explore their lived experiences of unfairness in hiring, co-design definitions of fairness as well as metrics that might capture these experiences. Finally, we attempt to reconcile and map these different (and sometimes conflicting) perspectives and definitions to existing (categories of) fairness metrics that are relevant for our candidate recommendation scenario. | Mesut Kaya, Toine Bogers | |||
| 41 | Modeling Long-term User Behaviors with Diffusion-driven Multi-interest Network for CTR Prediction | 0 | CTR (Click-Through Rate) prediction, crucial for recommender systems and online advertising, etc., has been confirmed to benefit from modeling long-term user behaviors. Nonetheless, the vast number of behaviors and complexity of noise interference pose challenges to prediction efficiency and effectiveness. Recent solutions have evolved from single-stage models to two-stage models. However, current two-stage models often filter out significant information, resulting in an inability to capture diverse user interests and build the complete latent space of user interests. Inspired by multi-interest and generative modeling, we propose DiffuMIN (Diffusion-driven Multi-Interest Network) to model long-term user behaviors and thoroughly explore the user interest space. Specifically, we propose a target-oriented multi-interest extraction method that begins by orthogonally decomposing the target to obtain interest channels. This is followed by modeling the relationships between interest channels and user behaviors to disentangle and extract multiple user interests. We then adopt a diffusion module guided by contextual interests and interest channels, which anchor users' personalized and target-oriented interest types, enabling the generation of augmented interests that align with the latent spaces of user interests, thereby further exploring restricted interest space. Finally, we leverage contrastive learning to ensure that the generated augmented interests align with users' genuine preferences. Extensive offline experiments are conducted on two public datasets and one industrial dataset, yielding results that demonstrate the superiority of DiffuMIN. Moreover, DiffuMIN increased CTR by 1.52 | Beihong Jin, Jian Dong, Jun Lei, Rui Zhao, Weijiang Lai, Xingxing Wang, Yapeng Zhang, Yiyuan Zheng | |||
| 42 | MoRE: A Mixture of Reflectors Framework for Large Language Model-Based Sequential Recommendation | 0 | Large language models (LLMs) have emerged as a cutting-edge approach in sequential recommendation, leveraging historical interactions to model dynamic user preferences. Current methods mainly focus on learning processed recommendation data in the form of sequence-to-sequence text. While effective, they exhibit three key limitations: 1) failing to decouple intra-user explicit features (e.g., product titles) from implicit behavioral patterns (e.g., brand loyalty) within interaction histories; 2) underutilizing cross-user collaborative filtering (CF) signals; and 3) relying on inefficient reflection update strategies. To address these limitations, we propose MoRE (Mixture of REflectors), which introduces three perspective-aware offline reflection processes. This decomposition directly resolves Challenges 1 (explicit/implicit ambiguity) and 2 (CF underutilization). Furthermore, MoRE's meta-reflector employs a self-improving strategy and a dynamic selection mechanism (Challenge 3) to adapt to evolving user preferences. First, two intra-user reflectors decouple explicit and implicit patterns from a user's interaction sequence, mimicking traditional recommender systems' ability to distinguish surface-level and latent preferences. A third cross-user reflector captures CF signals by analyzing user similarity patterns from multiple users' interactions. To optimize reflection quality, MoRE's meta-reflector employs an offline self-improving strategy that evaluates reflection impacts through comparisons of presence/absence and iterative refinement of old/new versions, with an online contextual bandit mechanism dynamically selecting the optimal perspective for recommendation for each user. Code: https://github.com/E-qin/MoRE-Rec. | Chenglei Shen, Jianping Fan, Jun Xu, Ming He, Weicong Qin, Weijie Yu, Xiao Zhang, Yi Xu | |||
| 43 | Non-parametric Graph Convolution for Re-ranking in Recommendation Systems | 0 | Graph knowledge has been proven effective in enhancing item rankings in recommender systems (RecSys), particularly during the retrieval stage. However, its application in the ranking stage, especially when richer contextual information in user-item interactions is available, remains underexplored. A major challenge lies in the substantial computational cost associated with repeatedly retrieving neighborhood information from billions of items stored in distributed systems. This resource-intensive requirement makes it difficult to scale graph-based methods in practical RecSys. To bridge this gap, we first demonstrate that incorporating graphs in the ranking stage improves ranking qualities. Notably, while the improvement is evident, we show that the substantial computational overheads entailed by graphs are prohibitively expensive for real-world recommendations. In light of this, we propose a non-parametric strategy that utilizes graph convolution for re-ranking only during test time. Our strategy circumvents the notorious computational overheads from graph convolution during training, and utilizes structural knowledge hidden in graphs on-the-fly during testing. It can be used as a plug-and-play module and easily employed to enhance the ranking ability of various ranking layers of a real-world RecSys with significantly reduced computational overhead. Through comprehensive experiments across four benchmark datasets with varying levels of sparsity, we demonstrate that our strategy yields noticeable improvements (i.e., 8.1 | Mingxuan Ju, Soroush Vosoughi, Yanfang Ye, Zhongyu Ouyang | |||
| 44 | Off-Policy Evaluation of Candidate Generators in Two-Stage Recommender Systems | 0 | We study offline evaluation of two-stage recommender systems, focusing on the first stage, candidate generation. Traditionally, candidate generators have been evaluated in terms of standard information retrieval metrics, using curated or heuristically labeled data, which does not always reflect their true impact on user experience or business metrics. We instead take a holistic view, measuring their effectiveness with respect to the downstream recommendation task, using data logged from past user interactions with the system. Using the contextual bandit formalism, we frame this evaluation task as off-policy evaluation (OPE) with a new action set induced by a new candidate generator. To the best of our knowledge, ours is the first study to examine evaluation of candidate generators through the lens of OPE. We propose two importance-weighting methods to measure the impact of a new candidate generator using data collected from a downstream task. We analyze the asymptotic properties of these methods and derive expressions for their respective biases and variances. This analysis illuminates a procedure to optimize the estimators so as to reduce bias. Finally, we present empirical results that demonstrate the estimators’ efficacy on synthetic and benchmark data. We find that our proposed methods achieve lower bias with comparable or reduced variance relative to baseline approaches that do not account for the new action set. | Amina Shabbeer, Ben London, Peiyao Wang, Zhan Shi | |||
| 45 | Paragon: Parameter Generation for Controllable Multi-Task Recommendation | 0 | Commercial recommender systems face the challenge that task requirements from platforms or users often change dynamically (e.g., varying preferences for accuracy or diversity). Ideally, the model should be re-trained after resetting a new objective function, adapting to these changes in task requirements. However, in practice, the high computational costs associated with retraining make this process impractical for models already deployed to online environments. This raises a new challenging problem: how to efficiently adapt the learned model to different task requirements by controlling the model parameters after deployment, without the need for retraining. To address this issue, we propose a novel controllable learning approach via parameter generation for controllable multi-task recommendation (Paragon), which allows the customization and adaptation of recommendation model parameters to new task requirements without retraining. Specifically, we first obtain the optimized model parameters through adapter tuning based on the feasible task requirements. Then, we utilize the generative model as a parameter generator, employing classifier-free guidance in conditional training to learn the distribution of optimized model parameters under various task requirements. Finally, the parameter generator is applied to effectively generate model parameters in a test-time adaptation manner given task requirements. Moreover, Paragon seamlessly integrates with various existing recommendation models to enhance their controllability. Extensive experiments on two public datasets and one commercial dataset demonstrate that Paragon can efficiently generate model parameters instead of retraining, reducing computational time by at least 94.6%. The code is released at https://anonymous.4open.science/r/Paragon-C726. | Chenglei Shen, Jiahao Zhao, Jianping Fan, Ming He, Weijie Yu, Xiao Zhang | |||
| 46 | USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model | 0 | Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, all existing LLM-based approaches are mainly centered around how to leverage the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. Firstly, we design an LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps the LLMs understand the strategies and methods in conversational recommendation. Secondly, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods. | Cilin Yan, Jianyu Wen, Jiayin Cai, Jingyun Wang, Xiaolong Jiang, Ying Zhang | |||
| 47 | VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings | 0 | Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6 | Evren Körpeoglu, Jianpeng Xu, Kai Zhao, Kannan Achan, Kehui Yao, Ramin Giahi, Sriram Kollipara, Topojoy Biswas, Vahid Mirjalili | |||
| 48 | "Beyond the past": Leveraging Audio and Human Memory for Sequential Music Recommendation | 0 | On music streaming services, listening sessions are often composed of a balance of familiar and new tracks. Recently, sequential recommender systems have adopted cognitive-informed approaches, such as Adaptive Control of Thought-Rational (ACT-R), to successfully improve the prediction of the most relevant tracks for the next user session. However, one limitation of using a model inspired by human memory (or the past), is that it struggles to recommend new tracks that users have not previously listened to. To bridge this gap, here we propose a model that leverages audio information to predict in advance the ACT-R-like activation of new tracks and incorporates them into the recommendation scoring process. We demonstrate the empirical effectiveness of the proposed model using proprietary data, which we publicly release along with the model's source code to foster future research in this field. | Bruno Sguerra, Gabriel MeseguerBrocal, Léa Briand, Manuel Moussallam, VietAnh Tran | |||
| 49 | Failure Prediction in Conversational Recommendation Systems | 0 | In a Conversational Image Recommendation task, users can provide natural language feedback on a recommended image item, which leads to an improved recommendation in the next turn. While typical instantiations of this task assume that the user's target item will (eventually) be returned, this might often not be true, for example, the item the user seeks is not within the item catalogue. Failing to return a user's desired item can lead to user frustration, as the user needs to interact with the system for an increased number of turns. To mitigate this issue, in this paper, we introduce the task of Supervised Conversational Performance Prediction, inspired by Query Performance Prediction (QPP) for predicting effectiveness in response to a search engine query. In this regard, we propose predictors for conversational performance that detect conversation failures using multi-turn semantic information contained in the embedded representations of retrieved image items. Specifically, our AutoEncoder-based predictor learns a compressed representation of top-retrieved items of the train turns and uses the classification labels to predict the evaluation turn. Our evaluation scenario addressed two recommendation scenarios, by differentiating between system failure, where the system is unable to find the target, and catalogue failure, where the target does not exist in the item catalogue. In our experiments using the Shoes and FashionIQ Dresses datasets, we measure the accuracy of predictors for both system and catalogue failures. Our results demonstrate the promise of our proposed predictors for predicting system failures (existing evaluation scenario), while we detect a considerable decrease in predictive performance in the case of catalogue failure prediction (when inducing a missing item scenario) compared to system failures. | Maria Vlachou | |||
| 50 | Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation | 0 | Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. While Large Language Models (LLMs) show promise in this direction, their scalability in recommender systems is limited by high costs and latency. Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment. In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. JAM models user-query-item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. We also introduce JAMSessions, a new dataset of over 100k user-query-item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks. | Alessandro B. Melchiorre, Anna Hausberger, Elena V. Epure, Gustavo Escobedo, Manuel Moussallam, Markus Schedl, Shahed Masoudian | |||
| 51 | Not Just What, But When: Integrating Irregular Intervals to LLM for Sequential Recommendation | 0 | Time intervals between purchasing items are a crucial factor in sequential recommendation tasks, whereas existing approaches focus on item sequences and often overlook them by assuming that the intervals between items are static. However, dynamic intervals serve as a dimension that describes user profiling on not only the history within a user but also different users with the same item history. In this work, we propose IntervalLLM, a novel framework that integrates interval information into LLM and incorporates the novel interval-infused attention to jointly consider information of items and intervals. Furthermore, unlike prior studies that address the cold-start scenario only from the perspectives of users and items, we introduce a new viewpoint: the interval perspective to serve as an additional metric for evaluating recommendation methods on the warm and cold scenarios. Extensive experiments on 3 benchmarks with both traditional- and LLM-based baselines demonstrate that our IntervalLLM achieves not only 4.4 | Kei Tateno, Takuma Udagawa, WeiWei Du | |||
| 52 | Personalized Persuasion-Aware Explanations in Recommender Systems | 0 | With the increasing accuracy of recommender systems (RSs) in providing recommendations based on user preferences and past behaviors, there is a growing need for generating appropriate explanations to facilitate effective decision-making. Motivated by the recent trend of integrating social science theories into explainable RSs, this paper addresses the challenge of generating and evaluating personalized persuasion-aware explanations. While prior work mainly explores how users with different characteristics respond to persuasion-aware explanations, we build on these insights to construct a persuasion profile for each user and generate personalized persuasive explanations for items recommended by various RS baselines. We then evaluate these explanations from an explainability perspective, including metrics such as model fidelity. Additionally, we incorporate the persuasiveness degrees of generated explanations to re-order the recommendation list and investigate its impact on recommendation utility. Our experimental results on a real-world movie recommendation dataset demonstrate that the proposed approach effectively generates persuasive explanations for recommended items, while enhancing recommendation utility. The code is available at https://github.com/halizadehn/PPE. | Behshid Behkamal, Fattane Zarrinkalam, Havva Alizadeh Noughabi, Mohsen Kahani | |||
| 53 | Towards Personality-Aware Explanations for Music Recommendations Using Generative AI | 0 | It is well established that the provision of explanations can positively impact the effectiveness of a recommender system. In many proposals in the literature, these explanations are personalized in that they refer to a user’s known individual preferences. Some recent works, however, also indicate that personalization should also happen at a higher level, where the system, in a first step, decides in which specific way an explanation should be provided, depending, for example, on the user’s expertise. In this research, we take the first steps towards personality-aware explanations by exploring how users perceive explanations tailored to reflect the Big Five personality traits. To this purpose, we leverage the capabilities of modern Generative AI tools to create personality-based explanations at scale in the context of a music recommendation scenario. A linguistic analysis of the generated explanations confirms that they properly reflect expected language patterns associated with individual personality traits. Furthermore, a user study shows that users tend to prefer certain linguistic framings over others, for example, explanations that reflect low-neuroticism patterns. In addition, we find that some explanation forms are more effective than others regarding persuasiveness and perceived overall quality. | Dietmar Jannach, Gabrielle Alves, Luan Soares de Souza, Marcelo Garcia Manzato | |||
| 54 | TreatRAG: A Framework for Personalized Treatment Recommendation | 0 | Medication recommendation is a critical function of clinical decision support systems, directly influencing patient safety and treatment efficacy. While large language models (LLMs) show promise in clinical tasks such as summarization and question answering, their ability to make accurate treatment predictions remains limited, in part, due to their lack of specialized medical knowledge and exposure to real-world patient data. We introduce TreatRAG, an interpretable, model-agnostic retrieval-augmented generation (RAG) framework aimed at early-stage development to enhance medication recommendation accuracy using publicly available clinical data; thus, TreatRAG forms a critical foundational step toward future clinical validation and domain expert involvement. TreatRAG retrieves similar patient cases, i.e., so called "digital twins", using interpretable N-gram Jaccard similarity and augments the input prompt to ground LLM predictions in real clinical scenarios. We evaluate our framework on the MIMIC-IV dataset using BioGPT, BioMistral, Phi3, and Flan-T5. TreatRAG-enhanced BioGPT improves its F1-score from 0.14 to 0.34, BioMistral from 0.22 to 0.54, Phi-3 from 0.09 to 0.16, and Flan-T5 from 0.23 to 0.30, while also lowering, often significantly, the hallucination rate. Our model-agnostic framework offers a flexible, effective, and interpretable solution to advance the reliability of LLMs in clinical decision support. | ChaoChin Liu, DerChen Chang, HaoRen Yao, Ophir Frieder | |||
| 55 | Model Meets Knowledge: Analyzing Knowledge Types for Conversational Recommender Systems | 0 | Conversational Recommender Systems (CRSs) often integrate external knowledge to enhance user preference modeling and item representation learning, addressing the challenge of sparse conversational contexts. Traditional methods primarily utilize structured knowledge graphs (KGs) to model entity relationships and capture deep, multi-hop relationships among items. More recent studies employing pre-trained language models (PLMs), however, leverage unstructured text (e.g., customer reviews) to enrich contextual understanding of users and items. Despite reported performance gains from both knowledge types, a question remains: What is the compatibility between specific CRS model architectures and types of external knowledge, and how do different knowledge sources complement each other? We present a reproducibility study evaluating 9 state-of-the-art CRSs, including KG-based and PLM-based paradigms, to systematically investigate model-knowledge compatibility and complementarity. Through a comprehensive evaluation on three datasets, we uncover three key findings: (1) Different model architectures have different compatibility with knowledge types: decoder-only models excel with structured knowledge, whereas encoder-decoder models better utilize unstructured knowledge. (2) Combining multiple knowledge sources isn’t always superior to using a single type, but merging similar knowledge types is generally more effective than mixing different ones. (3) Unstructured knowledge broadly benefits all scenario-specific conversations, particularly in genre-specific and descriptive scenarios, whereas structured knowledge demonstrates superior performance in comparative recommendation scenarios. Our study serves as an inspiration for future research on maximizing the benefits of external knowledge across different models in CRSs. | Jujia Zhao, Suzan Verberne, Yumeng Wang, Zhaochun Ren | |||
| 56 | Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation | 0 | Large language models (LLMs) can perform recommendation tasks by taking prompts written in natural language as input. Compared to traditional methods such as collaborative filtering, LLM-based recommendation offers advantages in handling cold-start, cross-domain, and zero-shot scenarios, as well as supporting flexible input formats and generating explanations of user behavior. In this paper, we focus on a single-user setting, where no information from other users is used. This setting is practical for privacy-sensitive or data-limited applications. In such cases, prompt engineering becomes especially important for controlling the output generated by the LLM. We conduct a large-scale comparison of 23 prompt types across 8 public datasets and 12 LLMs. We use statistical tests and linear mixed-effects models to evaluate both accuracy and inference cost. Our results show that for cost-efficient LLMs, three types of prompts are especially effective: those that rephrase instructions, consider background knowledge, and make the reasoning process easier to follow. For high-performance LLMs, simple prompts often outperform more complex ones while reducing cost. In contrast, commonly used prompting styles in natural language processing, such as step-by-step reasoning, or the use of reasoning models often lead to lower accuracy. Based on these findings, we provide practical suggestions for selecting prompts and LLMs depending on the required balance between accuracy and cost. | Genki Kusano, Kosuke Akimoto, Kunihiro Takeoka | |||
| 57 | A Media Content Recommendation Method for Playlist Curators using LLM-Based Query Expansion | 0 | Playlist curation is a key factor in media content discovery services, yet finding diverse, relevant content is challenging for curators due to time-consuming manual query crafting. We propose a method where a large language model (LLM) expands a playlist theme into multiple diverse queries. Vectors from these expanded queries, along with the original theme vector, retrieve candidates via vector search. Experiments on Japanese TV programs show our method significantly improves precision over a theme-vector baseline, boosting Precision@10 from 0.79 to 0.98 and increasing P@50 by 22 percentage points. This approach enhances curator efficiency and improves playlist quality by delivering more accurate and diverse recommendations. | Arisa Fujii, Chigusa Yamamura, Hiromu Ogawa, Hisayuki Ohmata, Yuta Hagio | |||
| 58 | Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models | 0 | On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM)—which has demonstrated strong performance in image recognition and image-text retrieval tasks—to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems. | Andre Rusli, Ryo Watanabe, Sho Akiyama, Yuki Yada, Yusuke Shido, Yuta Ueno | |||
| 59 | Industry Insights from Comparing Deep Learning and GBDT Models for E-Commerce Learning-to-Rank | 0 | In e-commerce recommender and search systems, tree-based models, such as LambdaMART, have set a strong baseline for Learning-to-Rank (LTR) tasks. Despite their effectiveness and widespread adoption in industry, the debate continues whether deep neural networks (DNNs) can outperform traditional tree-based models in this domain. To contribute to this discussion, we systematically benchmark DNNs against our production-grade LambdaMART model. We evaluate multiple DNN architectures and loss functions on a proprietary dataset from OTTO and validate our findings through an 8-week online A/B test. The results show that a simple DNN architecture outperforms a strong tree-based baseline in terms of total clicks and revenue, while achieving parity in total units sold. | Philipp Duwe, Timo Wilm, Yunus Lutz | |||
| 60 | Location Matters: Leveraging Multi-Resolution Geo-Embeddings for Housing Search | 0 | QuintoAndar Group is Latin America’s largest housing platform, revolutionizing property rentals and sales. Headquartered in Brazil, it simplifies the housing process by eliminating paperwork and enhancing accessibility for tenants, buyers, and landlords. With thousands of houses available for each city, users struggle to find the ideal home. In this context, location plays a pivotal role, as it significantly influences property value, access to amenities, and life quality. A great location can make even a modest home highly desirable. Therefore, incorporating location into recommendations is essential for their effectiveness. We propose a geo-aware embedding framework to address sparsity and spatial nuances in housing recommendations on digital rental platforms. Our approach integrates a hierarchical H3 [5] grid at multiple levels into a two-tower neural architecture. We compare our method with a traditional matrix factorization baseline and a single-resolution variant using interaction data from our platform. Embedding-specific evaluation reveals richer and more balanced embedding representations, while offline ranking simulations demonstrate a substantial uplift in recommendation quality. | Guilherme G. Bonaldo, Ivo Silva, Pedro F. Nogueira | |||
| 61 | Minimize Negative Experiences in Video Recommendation Systems with Multimodal Large Language Models | 0 | Detecting and limiting negative user experiences in recommendation systems with survey feedback modeling is difficult due to ultra-sparse, imbalanced, and noisy data. The proposed approach fine-tunes a multimodal Large Language Model (MLLM) on survey data enriched with contextual information, such as post engagement features and community data, and uses it as a teacher model to generate silver labels. A highly negative ranking model (HNRM) is then trained via knowledge distillation using both the original sparse survey labels and the generated silver labels. This approach significantly improves model generalization, decreases calibration error rate, increases engagement while reducing negative experiences measured by survey negative experience rates in online A/B tests, and allows the model to scale beyond the limitations imposed by the original sparse and noisy dataset. | Liang Liu, Suman Malani, Youwei Zhang | |||
| 62 | Orthogonal Low Rank Embedding Stabilization | 0 | The instability of embedding spaces across model retraining cycles presents significant challenges to downstream applications using user or item embeddings derived from recommendation systems as input features. This paper introduces a novel orthogonal low-rank transformation methodology designed to stabilize the user/item embedding space, ensuring consistent embedding dimensions across retraining sessions. Our approach leverages a combination of efficient low-rank singular value decomposition and orthogonal Procrustes transformation to map embeddings into a standardized space. This transformation is computationally efficient, lossless, and lightweight, preserving the dot product and inference quality while reducing operational burdens. Unlike existing methods that modify training objectives or embedding structures, our approach maintains the integrity of the primary model application and can be seamlessly integrated with other stabilization techniques. | Kevin Zielnicki, KoJen Hsiao | |||
| 63 | Practical Multi-Task Learning for Rare Conversions in Ad Tech | 0 | We present a Multi-Task Learning (MTL) approach for improving predictions for rare (e.g., <1 | Natalia Silberstein, Ophir Friedler, Yonatan Karni, Yulia Stolin, Yuval Dishi | |||
| 64 | Scaling Generative Recommendations with Context Parallelism on Hierarchical Sequential Transducers | 0 | Large-scale recommendation systems are pivotal to process an immense volume of daily user interactions, requiring the effective modeling of high cardinality and heterogeneous features to ensure accurate predictions. In prior work, we introduced Hierarchical Sequential Transducers (HSTU), an attention-based architecture for modeling high cardinality, non-stationary streaming recommendation data, providing good scaling law in the generative recommender framework (GR). Recent studies and experiments demonstrate that attending to longer user history sequences yields significant metric improvements. However, scaling sequence length is activation-heavy, necessitating parallelism solutions to effectively shard activation memory. In transformer-based LLMs, context parallelism (CP) is a commonly used technique that distributes computation along the sequence-length dimension across multiple GPUs, effectively reducing memory usage from attention activations. In contrast, production ranking models typically utilize jagged input tensors to represent user interaction features, introducing unique CP implementation challenges. In this work, we introduce context parallelism with jagged tensor support for HSTU attention, establishing foundational capabilities for scaling up sequence dimensions. Our approach enables a 5.3x increase in supported user interaction sequence length, while achieving a 1.55x scaling factor when combined with Distributed Data Parallelism (DDP). | Chuanhao Zhuge, Han Li, Nikhil Patel, Shen Li, Xiaodong Wang, Xing Liu, Yue Dong | |||
| 65 | SEMORec: A Scalarized Efficient Multi-Objective Recommendation Framework | 0 | Recommendation systems in multi-stakeholder environments often require optimizing for multiple objectives simultaneously to meet supplier and consumer demands. Serving recommendations in these settings relies on efficiently combining the objectives to address each stakeholder’s expectations, often through a scalarization function with pre-determined and fixed weights. In practice, selecting these weights becomes a consequent problem. Recent work has developed algorithms that adapt these weights based on application-specific needs by using RL to train a model [6]. While this solves for automatic weight computation, such approaches are not efficient for frequent weight adaptation. They also do not allow for human intervention oftentimes determined by business needs. To bridge this gap, we propose a novel multi-objective recommendation framework that is efficient for a small number of objectives. It also enables business decision makers to easily tune the optimization by assigning different importance to multiple objectives. We demonstrate the efficacy and efficiency of our framework through improvements in online business metrics. | Humeyra Topcu Altintas, Siyong Ma, Sofia Maria Nikolakaki, Srivas Chennu | |||
| 66 | Suggest, Complement, Inspire: Story of Two-Tower Recommendations at Allegro.com | 0 | Aleksandra Maria OsowskaKurczab, Eliska Kremenová, Klaudia Nazarko, Lidia Wojciechowska, Mateusz Marzec | ||||
| 67 | A Dual-Key Attention Framework for Sequential Recommendation with Side Information | 0 | Sequential recommendation (SR) aims to predict users’ future interactions based on their historical behavior. Recently, deep learning-based SR models leveraging side information have gained considerable attention. Within these systems, items can be viewed from relation-based and attribute-based perspectives. The relation-based perspective characterizes items based on implicit relationships and contextual dependencies derived from user interactions. The attribute-based perspective defines items using inherent properties, such as category or genre. However, these perspectives are inherently entangled, making separate learning challenging. To address this issue, we propose a dual-key attention framework for sequential recommendation (DK-SR), which effectively learns both relation-based and attribute-based representations. DK-SR employs an attention mechanism with dual keys: one for item-level attention, facilitating relation-based representation learning, and another for attribute-level attention, enhancing attribute-based representation. Extensive experiments on four real-world datasets demonstrate that our model outperforms six state-of-the-art SR models leveraging side information. Additionally, an ablation study validates the contribution of the dual-key mechanism. | Chie Hoon Song, GunWoo Kim, Minje Kim, SangMin Choi, Suwon Lee, Wooseung Kang | |||
| 68 | Don't Get Ahead of Yourself: A Critical Study on Data Leakage in Offline Evaluation of Sequential Recommenders | 0 | While previous studies have investigated data leakage in recommendation, their findings have had little impact on research practice. These studies show that data leakage exists, it can inflate evaluation metrics, and may cause pathological outcomes, such as models predicting items from the future. However, temporal leave-one-out, the data splitting strategy most widely used to evaluate sequential recommenders, remains prevalent even though it is known to suffer from data leakage. We found ourselves asking the question: if so many researchers appear unconcerned with data leakage, maybe it’s not such a big deal? In this article, we investigate data leakage in offline evaluation of sequential recommenders. We compare temporal leave-one-out with split-by-timepoint leave-one-out, a comparable data splitting strategy that prevents data leakage. Across four data sets, we show that sampled nDCG@10 drops by 21.7–73.4% with split-by-timepoint leave-one-out. This performance drop is primarily due to the absence of data leakage as controlling for training set size between data splitting strategies yields similar results. Our work highlights the severity of data leakage in sequential recommendation studies and suggests a need to reconsider current research practices and to question the veracity of prior studies. | Alan Medlar, Dorota Glowacka, Huy Hoang Le, Yang Liu | |||
| 69 | End-to-End Time Interval-wise Segmentation for Sequential Recommendation | 0 | Sequential recommendation aims to predict a user’s next interaction based on their historical behavior. While recent models have achieved remarkable success, they often overlook time intervals between interactions or rely on fixed thresholds for session segmentation, which can lead to suboptimal results. To address these limitations, several approaches incorporate time intervals via relative positional embeddings or session segmentation based on fixed thresholds. However, these methods are highly sensitive to threshold selection and are prone to inaccurate segmentation. Inspired by these challenges, we propose TiSRec, a Time Interval-wise Segmentation framework that dynamically divides user sequences into Local Preference Blocks (LPBs) by selecting significant time intervals. TiSRec captures evolving user preferences through intra-block and inter-block encoders. Experiments on four real-world datasets demonstrate that TiSRec consistently outperforms state-of-the-art methods, and ablation studies confirm the effectiveness of LPB-based modeling. | Chie Hoon Song, GunWoo Kim, Minje Kim, SangMin Choi, Suwon Lee, Wooseung Kang | |||
| 70 | Parameter-Efficient Single Collaborative Branch for Recommendation | 0 | Recommender Systems (RS) often rely on representations of users and items in a joint embedding space and on a similarity metric to compute relevance scores. In modern RS, the modules to obtain user and item representations consist of two distinct and separate neural networks (NN). In multimodal representation learning, weight sharing has been proven effective in reducing the distance between multiple modalities of the same item. Inspired by these approaches, we propose a novel RS that leverages weight sharing between the user and item NN modules used to obtain the latent representations in the shared embedding space. The proposed framework consists of a single Collaborative Branch for Recommendation (CoBraR). We evaluate CoBraR by means of quantitative experiments on e-commerce and movie recommendation. Our experiments show that by reducing the number of parameters and improving beyond-accuracy aspects without compromising accuracy, CoBraR has the potential to be applied and extended for real-world scenarios. | Markus Schedl, Marta Moscati, Shah Nawaz | |||
| 71 | Rethinking Subjective Features in Recommender Systems: Personal Views Over Aggregated Values | 0 | Subjective features of content items, such as emotional resonance and aesthetic quality, have become increasingly important in recommender systems (RecSys), as the field moves beyond objective content and behavioral signals. Traditionally, such features were treated as fixed item-level properties, aggregated across users. However, emerging evidence suggests that subjective features are inherently user-dependent, shaped by individual interpretations and personal perspectives. This paper presents the first direct comparison between fixed (aggregated) and user-specific (subjective) item representations for modeling subjective features in RecSys. Using three datasets spanning movies, videos, and images, with subjective features, such as eudaimonia, hedonia, emotion, and aesthetics, we evaluate the impact of the representation strategy (i.e. fixed vs. user-specific) on recommendation performance across multiple algorithms. Our findings show that user-specific representations consistently outperform aggregate ones, often with statistically significant improvements. These results underscore the importance of modeling subjectivity at the user level, offering concrete guidance for more personalized and effective recommendation systems. | Arsen Matej Golubovikj, Marko Tkalcic | |||
| 72 | SAGEA: Sparse Autoencoder-based Group Embeddings Aggregation for Fairness-Preserving Group Recommendations | 0 | Group recommender systems (GRS) deliver suggestions to users who plan to engage in activities together, rather than individually. To be effective, they must reflect shared group interests while maintaining fairness by accounting for the preferences of individual members. Traditional approaches address fairness through post-processing, aggregating the recommendations after they are generated for each group member. However, this strategy adds significant complexity and offers only limited impact due to its late position in the GRS pipeline. In contrast, we propose an efficient in-processing method combining (1) monosemantic sparse user representations generated via a sparse autoencoder (SAE) bridge module, and (2) fairness-preserving group profile aggregation strategies. By leveraging disentangled representations, our Sparse Autoencoder-based Group Embeddings Aggregation (SAGEA) approach enables transparent, fairness-preserving profile aggregation within the GRS process. Experiments show that SAGEA improves both recommendation accuracy and fairness over profile and results aggregation baselines, while being more efficient than post-processing techniques. | Ladislav Peska, Martin Spisák, Vit Kostejn | |||
| 73 | A Tutorial on Recent Advances in Generative Conversational Recommender Systems | 0 | Conversational recommender systems (CRSs) are increasingly vital for delivering multi-turn, context-aware recommendations. This tutorial provides a concise yet comprehensive exploration of modern generative CRSs, highlighting recent advances in generative AI, such as breakthroughs in large language models and neural generation pipelines, that enhance dialogue management, user modeling, and response generation. In addition, the tutorial addresses core challenges, including data acquisition, multi-turn personalization, and evaluation issues, such as controlling hallucinations, accounting for social factors, and managing ethical considerations, while also discussing emerging risks and novel solutions. Ultimately, participants will be equipped with actionable insights and practical tools for building new conversational recommender systems powered by generative models. | Ahmadou Wagne, Ashmi Banerjee, Fatemeh Nazary, Julia Neidhardt, Thomas Elmar Kolb, Tommaso Di Noia, Yashar Deldjoo | |||
| 74 | concept2code: Sequential Recommendation with Large Language Models | 0 | Large Language Models (LLMs) have demonstrated remarkable abilities in understanding and generating language, inspiring their adoption in various domains beyond NLP. In this tutorial, we explore how LLMs can be leveraged to enhance sequential recommendation systems. We begin by introducing key LLM concepts relevant to this domain, followed by a structured presentation of increasingly advanced approaches, including semantic embeddings, text-rich modeling, latent relations, long-textual behavior modeling, hierarchical item-user architectures, model distillation, cross-domain generalization, and user-centric personalization. We also include live code walkthroughs that illustrate core ideas and real-world implementations. | Omprakash Sonie | |||
| 75 | Data Access for Recommender Systems Research: leveraging the EU's Digital Services Act | 0 | The European Union (EU) Digital Services Act (DSA) has introduced a novel set of rules for online platforms and search engines, with significant implications for the Recommender Systems community. Through its data access mechanisms, the DSA invites researchers to request both publicly available and private data from Very Large Online Platforms (VLOPs) and Very Large Search Engines (VLOSEs) – those with more than 45 million active recipients in the EU – to investigate systemic risks associated with the dissemination of illegal content, risks to the exercise of fundamental rights, and negative effects on electoral processes, public health, and gender-based violence. This tutorial is aimed at researchers who are interested in submitting such data access requests. It introduces the relevant definitions and provisions of the DSA, addresses the most important procedural steps to obtain data access, and gives attendees a comprehensive understanding of the DSA’s data access implications for RecSys research. The tutorial targets researchers, practitioners, and students in understanding current developments in online platform regulation in Europe and their impact on RecSys research. | Emilia Gómez, Erasmo Purificato, João Vinagre, Lorenzo Porcaro, Silvia Merisio | |||
| 76 | Multi-Agentic Recommender Systems: Foundations, Design Patterns, and E-Commerce Applications - An Industrial Tutorial | 0 | The goal of this tutorial is to provide our perspective on the most recent advances in LLM-powered agents for recommender systems. Building on our extensive experience deploying agentic tools in large-scale environments, this tutorial hopes to deepen the understanding of participants with diverse backgrounds on the alphabets that underpin multi-agentic frameworks. Organized by the founders of leading agentic tools, the tutorial will highlight how these frameworks are being applied to create next-generation recommender systems in diverse applications. The examples include context-aware recommendation, dynamic multi-step orchestration, and personalized recommendation systems. To provide a solid foundation, we begin with a brief background on the evolution of recommender systems and how recent breakthroughs in large language models (LLMs) have shifted the paradigm toward more interactive, adaptive, and autonomous systems. The hands-on session will allow participants to directly engage with state-of-the-art techniques, bridging the gap between theoretical concepts and practical implementations. | Chi Wang, Derek Cheng, Jason Cho, Reza Yousefi Maragheh, Yashar Deldjoo | |||
| 77 | Personalized Image Generation for Recommendations Beyond Catalogs | 0 | Retrieval-based recommender systems are constrained by fixed catalogs, limiting their ability to serve diverse and evolving user preferences. We propose REBECA (REcommendations BEyond CAtalogs), a new class of preference-aware generative models for recommendation that synthesizes images tailored to individual tastes rather than retrieving items. REBECA conditions a diffusion model on users’ feedback (e.g., ratings) to generate personalized image embeddings in CLIP space, which are decoded into images via an adapter-on-adapter architecture that bypasses the need for image captions during training. By leveraging an expressive pre-trained image decoder and a lightweight probabilistic adapter, REBECA enables general-purpose image generation aligned with users’ visual preferences across diverse domains without expensive fine-tuning. We also introduce a new benchmark for personalized generation based on a curated version of the FLICKR-AES dataset, along with two novel personalization metrics tailored to the generative setting. Empirical results show that REBECA produces high-quality, diverse, and preference-aligned outputs, outperforming prompt-based personalization baselines on key personalization and quality metrics. By augmenting traditional retrieval with generative modeling, REBECA opens new opportunities for applications such as content design, personalization-first creative platforms, and preference-aware synthetic media. | Gabriel Alfonso Patron | |||
| 78 | On Inherited Popularity Bias in Cold-Start Item Recommendation | 0 | Collaborative filtering (CF) recommender systems struggle with making predictions on unseen, or ‘cold’, items. Systems designed to address this challenge are often trained with supervision from warm CF models in order to leverage collaborative and content information from the available interaction data. However, since they learn to replicate the behavior of CF methods, cold-start models may therefore also learn to imitate their predictive biases. In this paper, we show that cold-start systems can inherit popularity bias, a common cause of recommender system unfairness arising when CF models overfit to more popular items, thereby maximizing user-oriented accuracy but neglecting rarer items. We demonstrate that cold-start recommenders not only mirror the popularity biases of warm models, but are in fact affected more severely: because they cannot infer popularity from interaction data, they instead attempt to estimate it based solely on content features. This leads to significant over-prediction of certain cold items with similar content to popular warm items, even if their ground truth popularity is very low. Through experiments on three multimedia datasets, we analyze the impact of this behavior on three generative cold-start methods. We then describe a simple post-processing bias mitigation method that, by using embedding magnitude as a proxy for predicted popularity, can produce more balanced recommendations with limited harm to user-oriented cold-start accuracy. | Gregor Meehan, Johan Pauwels | |||
| 79 | Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation | 0 | Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90 | Gregório F. Azevedo, Pedro R. Pires, Pietro L. Campos, Rafael T. Sereicikas, Tiago A. Almeida | |||
| 80 | Generalized User Representations for Large-Scale Recommendations and Downstream Tasks | 0 | Accurately capturing diverse user preferences at scale is a core challenge for large-scale recommender systems like Spotify’s, given the complexity and variability of user behavior. To address this, we propose a two-stage framework that combines representation learning and transfer learning to produce generalized user embeddings. In the first stage, an autoencoder compresses rich user features into a compact latent space. In the second, task-specific models consume these embeddings via transfer learning, removing the need for manual feature engineering. This approach enhances flexibility by allowing dynamic updates to input features, enabling near-real-time responsiveness. The framework has been deployed in production at Spotify with an efficient infrastructure that allows downstream models to operate independently. Extensive online experiments in a live setting show significant improvements in metrics such as consumption share, content discovery, and search success. Additionally, our method achieves these gains while substantially reducing infrastructure costs. | Claire Keum, Ghazal Fazelnia, Guillermo Carrasco Hernández, Ian Anderson, Mark Koh, Maya Hristakeva, Mounia Lalmas, Nandini Singh, Petter Pehrson Skidén, Sanket Gupta, Stephen Xie, Timothy Christopher Heath | |||
| 81 | Personalized Interest Graphs for Theme-Driven User Behavior | 0 | Many eBay users turn to our platform to pursue theme-centric interests that span diverse product categories; for example, a Star Wars fan might search for related video games, toys, memorabilia, and artwork. Existing recommendation systems, typically optimized for short-term engagement, often fail to surface cross-category items aligned with these deeper interests. We present an end-to-end recommendation framework built around a user-interest graph generated by an LLM chain. The graph captures user preferences at multiple levels of granularity, enabling a balance between relevance-driven and serendipity-driven recommendations. The system has been deployed at scale, serving millions of users across billions of items. An online A/B test on the eBay homepage showed a significant improvement in engagement with previously unseen categories, alongside gains in purchases and buyer count. | Guy Feigenblat, Leandro Fiaschetti, Nazmul Chowdhury, Oded Zinman, Yotam Eshel, Yuri M. Brovman | |||
| 82 | SlateLLM: Distilling LLM Semantics into Session-Aware Slate Recommendation without Inference Overhead | 0 | Session-based slate recommendation systems curate ranked sets of items in real-time, adapting to evolving user interactions. Balancing relevance, diversity, and novelty remains challenging for reinforcement learning (RL) methods. Recent advances in large language models (LLMs) offer a new possibility to leverage their semantic reasoning capabilities to refine slate composition. In this work, we examine the impact of LLM-driven reasoning on slate generation by integrating LLMs with an RL-based slate recommender and evaluating in terms of accuracy, similarity, diversity, and novelty. We extend the RecSim framework with real-world interaction data and introduce a session-aware evaluation protocol that captures long-term engagement. Our analysis reveals that LLM reasoning enhances subcategory-level diversity while maintaining relevance, leading to increased user engagement. By visualizing category-level shifts in slate composition we uncover systematic patterns in how LLMs refine recommendation diversity. Although direct LLM use during inference may be hampered by computational demands and latency concerns, our experimental results demonstrate that integrating LLM modifications during training enables the model to internalize the nuanced characteristics of LLM reasoning without incurring inference overhead, thereby improving recommendation performance, serving time efficiency, and deployability. | Aayush Singha Roy, Aonghus Lawlor, Elias Z. Tragos, Neil Hurley | |||
| 83 | Auditing Recommender Systems for User Empowerment in Very Large Online Platforms under the Digital Services Act | 0 | The governance of recommender systems (RSs) in very large online platforms (VLOPs) is expected to change significantly under the Digital Services Act (DSA), which imposes new obligations on transparency and user control. However, beyond legal compliance, a critical question remains: How can recommender systems be redesigned to genuinely empower users and foster meaningful personalization? This paper addresses this question by analyzing how three major short-video platforms—Instagram, TikTok, and YouTube—have implemented the DSA requirements for RSs. By reviewing their audit reports, systemic risk assessments, and compliance strategies, we evaluate the extent to which current approaches enhance user autonomy and control over content exposure. Building on this analysis, we outline a perspective for the future of VLOPs’ RSs grounded in speculative design. We argue that meaningful personalization should integrate algorithmic choice, balancing proportionality and granularity in RS customization, and content curation, ensuring diversity and authoritativeness to mitigate systemic risks. By bridging legal analysis, platform governance, and user-centered design, this paper outlines actionable pathways for aligning technical developments with regulatory objectives. Our findings contribute to interdisciplinary research on RSs by highlighting how platforms can move beyond minimal compliance toward a model that prioritizes user empowerment and content pluralism. | Ludovico Boratto, Matteo Fabbri | |||
| 84 | Breaking Knowledge Boundaries: Cognitive Distillation-enhanced Cross-Behavior Course Recommendation Model | 0 | Online Course Recommendation (CR) stands as a promising educational strategy within online education platforms, with the goal of providing personalized learning experiences for learners and enhancing their learning efficiency. Existing CR methods focus on modeling learners’ learning needs from their historical course interactions by adopting general recommendation techniques, but fail to consider the shifts in course preferences caused by cognitive states. While Cognitive Diagnosis (CD) techniques are adept at tracking cognitive states’ evolution via mining learner-exercise interactions and benefit the CR task, it is non-trivial to integrate CD and CR properly due to several challenges, including accurate diagnosis, divergent task objectives, and inconsistent data magnitude. To address these challenges, we propose a Cognitive Distillation-enhanced Cross-Behavior Course Recommendation model (C3Rec), which aims to transfer the knowledge of learners’ cognitive states to enhance the CR task. Specifically, for accurate diagnosis, we introduce a dual-granularity cognitive diagnosis module to capture learner representations at both coarse and fine granularities, thereby achieving a comprehensive construction of learners’ cognitive states. For divergent task objectives, we design a cross-behavior course recommendation module to jointly profile the dynamic course preferences from two temporally interleaved learning behaviors, achieving seamless semantic alignment between these two tasks. For inconsistent data magnitude, we introduce a triple-stage distillation mechanism to exploit cognitive state features as prior knowledge, enhancing the CR task by further profiling learners’ course preferences. Experimental comparisons with multiple state-of-the-art methods on two real-world educational datasets demonstrate the effectiveness of our model. | Chenzhang Li, Hua Chu, Jianan Li, Ruoyu Li, Yangtao Zhou, Yuhan Bian | |||
| 85 | Enhancing Online Video Recommendation via a Coarse-to-fine Dynamic Uplift Modeling Framework | 0 | The popularity of short video applications has brought new opportunities and challenges to video recommendation. In addition to the traditional ranking-based pipeline, industrial solutions usually introduce additional distribution management components to guarantee a diverse and content-rich user experience. However, existing solutions are either non-personalized or fail to generalize well to the ever-changing user preferences. Inspired by the success of uplift modeling in online marketing, we attempt to implement uplift modeling in the video recommendation scenario to mitigate the problems. However, we face two main challenges when migrating the technique: 1) the complex-response causal relation in the distribution management problem, and 2) the modeling of long-term and real-time user preferences. To address these challenges, we map each treatment to a specific adjustment of the distribution over video types, then propose a Coarse-to-fine Dynamic Uplift Modeling (CDUM) framework for real-time video recommendation scenarios. Specifically, CDUM consists of two modules, a coarse-grained module that utilizes the offline features of users to model their long-term preferences, and a fine-grained module that leverages online real-time contextual features and request-level candidates to model users’ real-time interests. These two modules collaboratively and dynamically identify and target specific user groups, and then apply treatments effectively. We conduct comprehensive experiments on two offline public datasets, an industrial offline dataset, and an online A/B test, demonstrating the superiority and effectiveness of CDUM. The proposed method is fully deployed on the Kuaishou platform, serving hundreds of millions of users every day. Our code and datasets are available at https://github.com/UpliftVideo/CDUM. | Chang Meng, Chenhao Zhai, Han Li, Kun Gai, Lantao Hu, Shuchang Liu, Xiaoqiang Feng, Xiu Li, Xueliang Wang | |||
| 86 | Exploring Scaling Laws of CTR Model for Online Performance Improvement | 0 | CTR models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of LLMs, we propose a new paradigm for improving CTR predictions: first, constructing a CTR model with accuracy scalable to the model grade and data size, and then distilling the knowledge implied in this model into its lightweight model that can serve online users. To put it into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the UAB as a behavior sequence encoder. A single UAB unifies the modeling of the sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. In order to benefit from the high performance of the high-grade SUAN and avoid the disadvantage of its long inference time, we modify the SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train the low-grade LightSUAN, taking a high-grade SUAN as a teacher. The distilled LightSUAN has superior performance but the same inference time as the LightSUAN, making it well-suited for online deployment. Experimental results show that SUAN performs exceptionally well and holds the scaling laws spanning three orders of magnitude in model grade and data size, and the distilled LightSUAN outperforms the SUAN configured with one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing the CTR by 2.81 | Beihong Jin, Jia Cheng, Jian Dong, Jiongyan Zhang, Jun Lei, Weijiang Lai, Xingxing Wang, Yiyuan Zheng | |||
| 87 | Leave No One Behind: Fairness-Aware Cross-Domain Recommender Systems for Non-Overlapping Users | 0 | Cross-domain recommendation (CDR) methods predominantly leverage overlapping users to transfer knowledge from a source domain to a target domain. However, through empirical studies, we uncover a critical bias inherent in these approaches: while overlapping users experience significant enhancements in recommendation quality, non-overlapping users benefit minimally and even face performance degradation. This unfairness may erode user trust, and, consequently, negatively impact business engagement and revenue. To address this issue, we propose a novel solution that generates virtual source-domain users for non-overlapping target-domain users. Our method utilizes a dual attention mechanism to discern similarities between overlapping and non-overlapping users, thereby synthesizing realistic virtual user embeddings. We further introduce a limiter component that ensures the generated virtual users align with real-data distributions while preserving each user's unique characteristics. Notably, our method is model-agnostic and can be seamlessly integrated into any CDR model. Comprehensive experiments conducted on three public datasets with five CDR baselines demonstrate that our method effectively mitigates the CDR non-overlapping user bias, without loss of overall accuracy. Our code is publicly available at https://github.com/WeixinChen98/VUG. | Li Chen, Weike Pan, Weixin Chen, Yuhan Zhao | |||
| 88 | Lasso: Large Language Model-based User Simulator for Cross-Domain Recommendation | 0 | Cross-Domain Recommendation (CDR) aims to mitigate the cold-start problem in target domains by leveraging user interactions from source domains. However, existing CDR methods often suffer from low data efficiency, as they require a substantial number of historical interactions from overlapping users for training, which is impractical in real-world scenarios. To address this challenge, we propose Lasso, a novel framework that leverages a large language model (LLM) as a user simulator to capture cross-domain user preferences based on the remarkable internal knowledge of the LLM. Specifically, we introduce a cross-domain training paradigm to fine-tune the LLM-based simulator, enabling it to simulate user behaviors in the target domain using historical interactions from the source domain. Furthermore, to enhance the efficiency and accuracy of Lasso, we propose two effective modules: Personalized Candidate Pool (PCP) and Confidence-Guided Inference (CGI). The PCP module employs cross-domain collaborative filtering to construct a tailored set of candidate items for simulating interactions of each cold-start user in the target domain, thereby improving the inference efficiency of the LLM. The CGI module utilizes confidence scores from the LLM to reduce noise in the simulated data, ensuring more accurate estimations. During the application phase, the simulated interactions serve as additional inputs for downstream recommendation models, effectively alleviating cold-start problems for users. Extensive experiments on public benchmark datasets and a real-world industrial dataset demonstrate that Lasso achieves superior accuracy while requiring fewer historical interactions from overlapping users. | Chao Wang, Chenyi Lei, Han Li, Mingyue Cheng, Susen Yang, Tong Zhang, Yue Chen | |||
| 89 | Correcting the LogQ Correction: Revisiting Sampled Softmax for Large-Scale Retrieval | 0 | Two-tower neural networks are a popular architecture for the retrieval stage in recommender systems. These models are typically trained with a softmax loss over the item catalog. However, in web-scale settings, the item catalog is often prohibitively large, making full softmax infeasible. A common solution is sampled softmax, which approximates the full softmax using a small number of sampled negatives. One practical and widely adopted approach is to use in-batch negatives, where negatives are drawn from items in the current mini-batch. However, this introduces a bias: items that appear more frequently in the batch (i.e., popular items) are penalized more heavily. To mitigate this issue, a popular industry technique known as logQ correction adjusts the logits during training by subtracting the log-probability of an item appearing in the batch. This correction is derived by analyzing the bias in the gradient and applying importance sampling, effectively twice, using the in-batch distribution as a proposal distribution. While this approach improves model quality, it does not fully eliminate the bias. In this work, we revisit the derivation of logQ correction and show that it overlooks a subtle but important detail: the positive item in the denominator is not Monte Carlo-sampled - it is always present with probability 1. We propose a refined correction formula that accounts for this. Notably, our loss introduces an interpretable sample weight that reflects the model's uncertainty - the probability of misclassification under the current parameters. We evaluate our method on both public and proprietary datasets, demonstrating consistent improvements over the standard logQ correction. | Artem Matveev, Kirill Khrylchenko, Sergei Liamaev, Sergei S. Makeev, Vladimir Baikalov | |||
| 90 | Exploring the Effect of Context-Awareness and Popularity Calibration on Popularity Bias in POI Recommendations | 0 | Point-of-interest (POI) recommender systems help users discover relevant locations, but their effectiveness is often compromised by popularity bias, which disadvantages less popular, yet potentially meaningful places. This paper addresses this challenge by evaluating the effectiveness of context-aware models and calibrated popularity techniques as strategies for mitigating popularity bias. Using four real-world POI datasets (Brightkite, Foursquare, Gowalla, and Yelp), we analyze the individual and combined effects of these approaches on recommendation accuracy and popularity bias. Our results reveal that context-aware models cannot be considered a uniform solution, as the models studied exhibit divergent impacts on accuracy and bias. In contrast, calibration techniques can effectively align recommendation popularity with user preferences, provided there is a careful balance between accuracy and bias mitigation. Notably, the combination of calibration and context-awareness yields recommendations that balance accuracy and close alignment with the users' popularity profiles, i.e., popularity calibration. | Andrea Forster, Denis Helic, Dominik Kowald, Simone Kopeinik, Stefan Thalmann | |||
| 91 | Large Scale E-Commerce Model for Learning and Analyzing Long-Term User Preferences | 0 | Understanding long-term user preferences is critical for delivering consistent and personalized recommendations that go beyond short-term behavioral cues in large-scale e-commerce platforms. We present NILUS (Neural Inference for Long-Term User Signals), a content-based transformer model trained to predict user behavior over a K-day future window using up to one year of historical interaction data. NILUS learns user embeddings end-to-end via contrastive learning, using item representations from a fine-tuned sentence encoder. We introduce a novel evaluation framework to assess the model’s ability to capture enduring user interests, and demonstrate that NILUS delivers higher accuracy than strong baselines on a large-scale offline dataset spanning millions of users and diverse product verticals. When combined with short-term signals, NILUS further improves recommendation accuracy and diversity. Finally, a large-scale online A/B test on a multinational e-commerce platform confirms statistically significant gains in user engagement. | Bracha Shapira, Guy Feigenblat, Michelle Hwang, Tal Franji, Yonatan Hadar, Yotam Eshel | |||
| 92 | Mitigating Latent User Biases in Pre-trained VAE Recommendation Models via On-demand Input Space Transformation | 0 | Recommender systems can unintentionally encode protected attributes (e.g., gender, country, or age) in their learned latent user representations. Current in-processing debiasing approaches, notably adversarial training, effectively reduce the encoded information on private user attributes. These approaches modify the model parameters during training. Thus, to alternate between biased and debiased model, two separate models have to be trained. In contrast, we propose a novel method to debias recommendation models post-training, which allows switching between biased and debiased model at inference time. Focusing on state-of-the-art variational autoencoder (VAE) architectures, our method aims to reduce bias at input level (user–item interactions) by learning a transformation from input space to a debiased subspace. As the output of this transformation lies in the same space as the original input vector, we can use transformed (debiased) input vectors without the need to fine-tune the pre-trained model. We evaluate the effectiveness of our method on three datasets, MovieLens-1M, LFM2b-DemoBias, and EB-NeRD, from the movie, music, and news domains, respectively. Our experiments show that the proposed method achieves task performance (in terms of NDCG) and debiasing strength (in terms of balanced accuracy of an attacker network) that are comparable to applying adversarial training during the initial training procedure, while providing the added functionality of alternating between biased and debiased model at inference time. | David Penz, Gustavo Junior Escobedo Ticona, Markus Schedl | |||
| 93 | Impacts of Mainstream-Driven Algorithms on Recommendations for Children Across Domains: A Reproducibility Study | 0 | Children are often exposed to items curated by recommendation algorithms. Yet, research seldom considers children as a user group, and when it does, it is anchored on datasets where children are underrepresented, risking overlooking their interests, favoring those of the majority, i.e., mainstream users. Recently, Ungruh et al. demonstrated that children's consumption patterns and preferences differ from those of mainstream users, resulting in inconsistent recommendation algorithm performance and behavior for this user group. These findings, however, are based on two datasets with a limited child user sample. We reproduce and replicate this study on a wider range of datasets in the movie, music, and book domains, uncovering interaction patterns and aspects of child-recommender interactions consistent across domains, as well as those specific to some user samples in the data. We also extend insights from the original study with popularity bias metrics, given the interpretation of results from the original study. With this reproduction and extension, we uncover consumption patterns and differences between age groups stemming from intrinsic differences between children and others, and those unique to specific datasets or domains. | Alejandro Bellogín, Dominik Kowald, Maria Soledad Pera, Robin Ungruh | |||
| 94 | TIM-Rec: Explicit Sparse Feedback on Multi-Item Upselling Recommendations in an Industrial Dataset of Telco Calls | 0 | Upselling recommendations play a critical role in improving customer engagement and maximizing revenue in the telecommunications industry. However, real-world data on such interactions often presents unique challenges, including multiple recommendations per call and sparse customer feedback, which complicates the evaluation of recommender systems. Our review of the existing literature reveals a critical gap in publicly available datasets that reflect these challenges, limiting progress in developing and evaluating upselling strategies. This work introduces a novel dataset that captures these complexities, offering valuable insights into customer behavior and recommendation effectiveness. The dataset, derived from real-world interactions between customers and service providers, contains multiple recommendations provided in individual calls and sparse feedback, reflecting typical user behavior where interest may be low or unrecorded. To aid in the development of more effective recommendation systems, we provide detailed statistics on recommendation distributions, user engagement, and feedback patterns. Furthermore, we benchmark various recommendation models, from classical approaches to state-of-the-art neural networks, allowing for a comprehensive assessment of their recommendation accuracy in this challenging setting. | Alessandro Sbandi, Fabrizio Silvestri, Federico Siciliano | |||
| 95 | Debiasing Implicit Feedback Recommenders via Sliced Wasserstein Distance-based Regularization | 0 | Recommendation models often encode users’ sensitive attributes (e.g., gender or age) in their learned representations during training, leading to biased (e.g., stereotypical) recommendations and potential privacy risks. To address this, previous research has predominantly focused on adversarial training to make user representations invariant to sensitive attributes. However, adversarial methods can be unstable and computationally expensive due to additional network parameters. An alternative approach is the use of regularization losses that minimize distributional discrepancies between different demographic groups during training. In particular, the Sliced Wasserstein Distance (SWD) provides a computationally efficient and stable solution for mitigating bias by directly aligning the distributions of user representations across groups. We follow this alternative strategy and propose an in-processing approach to mitigate encoded biases in user representations of implicit feedback-based recommender systems by using SWD-based regularization. We perform extensive experiments targeting the debiasing of the users’ gender on three datasets ML-1M, LFM2b-DB, and EB-NeRD from the movie, music, and news domains, respectively. Our results indicate that SWD-based regularization is an effective approach for mitigating encoded biases in user representations while keeping competitive recommendation accuracy. | David Penz, Gustavo Escobedo, Markus Schedl | |||
| 96 | From Previous Plays to Long-Term Tastes: Exploring the Long-term Reliability of Recommender Systems Simulations for Children | 0 | Studying the interplay of children and recommender systems (RS) is ethically and practically challenging, making simulation a promising alternative for exploration. However, recent simulation approaches that aim to model natural user-RS interactions typically rely on behavioral data and assume that user preferences remain consistent over time—an assumption that may not hold for children who undergo continuous developmental changes. With that in mind, we explore the extent to which simulations based on historical data can meaningfully reflect children’s long-term consumption patterns. We do this via a simulation study using real-world data in which user behavior is modeled from observed listening preferences. Specifically, we probe whether simulation mirrors user preferences over time by comparing with organic (i.e., real) consumption patterns. Our findings offer a critical reflection on the reliability of simulation-based RS research for children and question the reliability of using behavioral assumptions to model users. | Alejandro Bellogín, Maria Soledad Pera, Robin Ungruh | |||
| 97 | Affect-aware Cross-Domain Recommendation for Art Therapy via Music Preference Elicitation | 0 | Art Therapy (AT) is an established practice that facilitates emotional processing and recovery through creative expression. Recently, Visual Art Recommender Systems (VA RecSys) have emerged to support AT, demonstrating their potential by personalizing therapeutic artwork recommendations. Nonetheless, current VA RecSys rely on visual stimuli for user modeling, limiting their ability to capture the full spectrum of emotional responses during preference elicitation. Previous studies have shown that music stimuli elicit unique affective reflections, presenting an opportunity for cross-domain recommendation (CDR) to enhance personalization in AT. Since CDR has not yet been explored in this context, we propose a family of CDR methods for AT based on music-driven preference elicitation. A large-scale study with 200 users demonstrates the efficacy of music-driven preference elicitation, outperforming the classic visual-only elicitation approach. Our source code, data, and models are available at https://github.com/ArtAICare/Affect-aware-CDR | Bereket Abera Yilma, Luis A. Leiva | |||
| 98 | A Non-Parametric Choice Model That Learns How Users Choose Between Recommended Options | 0 | Choice models predict which items users choose from presented options. In recommendation settings, they can infer user preferences while countering exposure bias. In contrast with traditional univariate recommendation models, choice models consider which competitors appeared with the chosen item. This ability allows them to distinguish whether a user chose an item due to preference, i.e., they liked it; or competition, i.e., it was the best available option. Each choice model assumes specific user behavior, e.g., the multinomial logit model. However, it is currently unclear how accurately these assumptions capture actual user behavior, how wrong assumptions impact inference, and whether better models exist. In this work, we propose the learned choice model for recommendation (LCM4Rec), a non-parametric method for estimating the choice model. By applying kernel density estimation, LCM4Rec infers the most likely error distribution that describes the effect of inter-item cannibalization and thereby characterizes the users' choice model. Thus, it simultaneously infers what users prefer and how they make choices. Our experimental results indicate that our method (i) can accurately recover the choice model underlying a dataset; (ii) provides robust user preference inference, in contrast with existing choice models that are only effective when their assumptions match user behavior; and (iii) is more resistant against exposure bias than existing choice models. Thereby, we show that learning choice models, instead of assuming them, can produce more robust predictions. We believe this work provides an important step towards better understanding users' choice behavior. | Harrie Oosterhuis, Thorsten Krause | |||
| 99 | Enhancing Transferability and Consistency in Cross-Domain Recommendations via Supervised Disentanglement | 0 | Cross-domain recommendation (CDR) aims to alleviate the data sparsity by transferring knowledge across domains. Disentangled representation learning provides an effective solution to model complex user preferences by separating intra-domain features (domain-shared and domain-specific features), thereby enhancing robustness and interpretability. However, disentanglement-based CDR methods employing generative modeling or GNNs with contrastive objectives face two key challenges: (i) pre-separation strategies decouple features before extracting collaborative signals, disrupting intra-domain interactions and introducing noise; (ii) unsupervised disentanglement objectives lack explicit task-specific guidance, resulting in limited consistency and suboptimal alignment. To address these challenges, we propose DGCDR, a GNN-enhanced encoder-decoder framework. To handle challenge (i), DGCDR first applies GNN to extract high-order collaborative signals, providing enriched representations as a robust foundation for disentanglement. The encoder then dynamically disentangles features into domain-shared and -specific spaces, preserving collaborative information during the separation process. To handle challenge (ii), the decoder introduces an anchor-based supervision that leverages hierarchical feature relationships to enhance intra-domain consistency and cross-domain alignment. Extensive experiments on real-world datasets demonstrate that DGCDR achieves state-of-the-art performance, with improvements of up to 11.59 | Lin Li, Mengzi Tang, Qing Xie, Yongjian Liu, Yuhan Wang, Zhifeng Bao | |||
| 100 | Heterogeneous User Modeling for LLM-based Recommendation | 0 | Leveraging Large Language Models (LLMs) for recommendation has demonstrated notable success in various domains, showcasing their potential for open-domain recommendation. A key challenge to advancing open-domain recommendation lies in effectively modeling user preferences from users' heterogeneous behaviors across multiple domains. Existing approaches, including ID-based and semantic-based modeling, struggle with poor generalization, an inability to compress noisy interactions effectively, and the domain seesaw phenomenon. To address these challenges, we propose a Heterogeneous User Modeling (HUM) method, which incorporates a compression enhancer and a robustness enhancer for LLM-based recommendation. The compression enhancer uses a customized prompt to compress heterogeneous behaviors into a tailored token, while a masking mechanism enhances cross-domain knowledge extraction and understanding. The robustness enhancer introduces a domain importance score to mitigate the domain seesaw phenomenon by guiding domain optimization. Extensive experiments on heterogeneous datasets validate that HUM effectively models user heterogeneity by achieving both high efficacy and robustness, leading to superior performance in open-domain recommendation. | Fengbin Zhu, Fuli Feng, Honghui Bao, TatSeng Chua, Teng Sun, Wenjie Wang, Xinyu Lin | |||
| 101 | Hierarchical Graph Information Bottleneck for Multi-Behavior Recommendation | 0 | In real-world recommendation scenarios, users typically engage with platforms through multiple types of behavioral interactions. Multi-behavior recommendation algorithms aim to leverage various auxiliary user behaviors to enhance prediction for target behaviors of primary interest (e.g., buy), thereby overcoming performance limitations caused by data sparsity in target behavior records. Current state-of-the-art approaches typically employ hierarchical design following either cascading (e.g., view→cart→buy) or parallel (unified→behavior→specific components) paradigms, to capture behavioral relationships. However, these methods still face two critical challenges: (1) severe distribution disparities across behaviors, and (2) negative transfer effects caused by noise in auxiliary behaviors. In this paper, we propose a novel model-agnostic Hierarchical Graph Information Bottleneck (HGIB) framework for multi-behavior recommendation to effectively address these challenges. Following information bottleneck principles, our framework optimizes the learning of compact yet sufficient representations that preserve essential information for target behavior prediction while eliminating task-irrelevant redundancies. To further mitigate interaction noise, we introduce a Graph Refinement Encoder (GRE) that dynamically prunes redundant edges through learnable edge dropout mechanisms. We conduct comprehensive experiments on three real-world public datasets, which demonstrate the superior effectiveness of our framework. Beyond these widely used datasets in the academic community, we further expand our evaluation on several real industrial scenarios and conduct an online A/B testing, showing again a significant improvement in multi-behavior recommendations. The source code of our proposed HGIB is available at https://github.com/zhy99426/HGIB. | Chunxu Shen, Hengyu Zhang, Hong Cheng, Jie Tan, Lingling Yi, Xiangguo Sun, Yanchao Tan, Yu Rong | |||
| 102 | How Do Users Perceive Recommender Systems' Objectives? | 0 | Multi-objective recommender systems (MORS) aim to optimize multiple criteria while generating recommendations, such as relevance, novelty, diversity, or exploration. These algorithms are based on the assumption that an operationalization of these criteria (i.e., translating abstract goals into measurable metrics), will reflect how users perceive them. Nevertheless, such beliefs are rarely rigorously evaluated, which can lead to a mismatch between algorithmic goals and user satisfaction. Moreover, if users are allowed to control the RS via their propensities towards such objectives, the misconceptions may further impact users’ trust and engagement. To characterize this problem, we conduct a large user study focusing on recommender systems in two domains: books and movies. Part of the study is focused on how users perceive different recommendation objectives, which we compared with well-established metrics aiming at the same objectives. We found that despite such metrics correlating to some extent with users’ perceptions, the mapping is far from perfect. Moreover, we also report on conceptual-level differences in users’ understanding of RS objectives and how this affects the results. Study data are available from https://osf.io/2n9mf/. | Ladislav Peska, Ludovico Boratto, Patrik Dokoupil | |||
| 103 | LANCE: Exploration and Reflection for LLM-based Textual Attacks on News Recommender Systems | 0 | News recommender systems rely on rich textual information from news articles to generate user-specific recommendations. This reliance may expose these systems to potential vulnerabilities through textual attacks. To explore this vulnerability, we propose LANCE, a LArge language model-based News Content rEwriting framework, designed to influence news rankings and highlight the unintended promotion of manipulated news. LANCE consists of two key components: an explorer and a reflector. The explorer first generates rewritten news using diverse prompts, incorporating different writing styles, sentiments, and personas. We then collect these rewrites, evaluate their ranking impact within news recommender systems, and apply a filtering mechanism to retain effective rewrites. Next, the reflector fine-tunes an open-source LLM using the successful rewrites, enhancing its ability to generate more effective textual attacks. Experimental results demonstrate the effectiveness of LANCE in manipulating rankings within news recommender systems. Unlike attacks in other recommendation domains, negative and neutral rewrites consistently outperform positive ones, revealing a unique vulnerability specific to news recommendation. Once trained, LANCE successfully attacks unseen news recommender systems (i.e., those for which LANCE received no information during training), highlighting its generalization ability and exposing shared vulnerabilities across different systems. Our work underscores the urgent need for research on textual attacks and paves the way for future studies on defense strategies. | Jiancan Wu, Jin Huang, Maarten de Rijke, Shuchang Liu, Xiang Wang, Yuyue Zhao | |||
| 104 | MDSBR: Multimodal Denoising for Session-based Recommendation | 0 | Multimodal session-based recommendation (SBR) has emerged as a promising direction for capturing user intent using visual and textual item content. However, existing methods often overlook a fundamental issue: the modality features extracted from pre-trained models (e.g., BERT, CLIP) are inherently noisy and misaligned with user-specific preferences. This noise arises from label errors, task mismatch, and over-inclusion of irrelevant content, ultimately degrading recommendation quality. In this work, we propose a diffusion-based denoising framework that explicitly refines noisy pre-trained representations without full fine-tuning. By a structured denoising process, our Multimodal Denoising Diffusion Layer progressively eliminates the noise introduced by pre-trained models. Furthermore, we introduce two auxiliary modules: an Interest-Guided Denoising Layer that filters modality features using user interest, and a Multimodal Alignment Layer that enforces cross-modal coherence. Extensive experiments on real-world datasets demonstrate that our model significantly outperforms state-of-the-art methods while maintaining practical training efficiency. The code is available in https://github.com/YutongLi2024/MDSBR. | Xinyi Zhang, Yutong Li | |||
| 105 | On the Reliability of Sampling Strategies in Offline Recommender Evaluation | 0 | Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these are typically assessed on fixed logged datasets rather than for their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling strategies along four dimensions: sampling resolution (recommender model separability), fidelity (agreement with full evaluation), robustness (stability under exposure bias), and predictive power (alignment with ground truth). Our findings highlight when and how sampling distorts evaluation outcomes and offer practical guidance for selecting strategies that yield faithful and robust offline comparisons. | Alan Said, Bruno L. Pereira, Rodrygo L. T. Santos | |||
| 106 | Privacy-Preserving Social Recommendation: Privacy Leakage and Countermeasure | 0 | Social recommendation systems generally utilize two types of data, user-item interaction matrices (R) from rating platform (P0), and user-user social graphs (S) from social platform (P1). Considering user privacy that neither R nor S can be directly shared, Chen et al. introduced the Secure Social Recommendation (SeSoRec) framework with the Secret Sharing-based Matrix Multiplication (SSMM) protocol. However, we find that the leakage of intermediate information introduced by SSMM will eventually lead to the leakage of S to P0, which challenges the privacy guarantees of SeSoRec. This work firstly identifies that the claimed "innocuous" leakage in SeSoRec originates from reusing the same One-Time Pad key during two randomization phases in SSMM, with formal proof that SSMM violates semi-honest security. Secondly, this work proposes the Two-Time Pad Attack with two reconstruction algorithms to evaluate the severity of the leakage. The Two-Time Pad Attack can extract the column-wise sum of matrices $\mathbf{A}_{c\text{-even}}$ and $\mathbf{A}_{c\text{-odd}}$, and the row-wise difference of matrices $\mathbf{B}_{r\text{-even}}$ and $\mathbf{B}_{r\text{-odd}}$, where such matrices are closely related to R or S. The Sparse Matrix Reconstruction (SMR) algorithm can achieve 99.35%, 83.83%, and 77.14% reconstruction rates for non-zero entries in S on FilmTrust, Epinions, and Douban datasets, respectively. The Grayscale Image Reconstruction (GIR) algorithm can successfully recover MNIST image contours. Thirdly, when the number of columns/rows of the input matrix A/B in SSMM is odd (requiring zero-padding to an even dimension), this work proposes the Zero-Padding Attack which can directly expose the last column/row of A/B. Finally, this work proposes the Privacy-Preserving Matrix Multiplication (PPMM) protocol with experimental demonstration as a replacement for SSMM, which eliminates such leakage while maintaining efficiency. | Chuanyi Liu, Junbin Fang, Peng Yang, Wenhao Wu, Xuan Wang, Yuyue Chen, Zoe Lin Jiang | |||
| 107 | Tag-augmented Dual-target Cross-domain Recommendation | 0 | Cross-domain recommendation (CDR) has been proposed to alleviate the data sparsity issue in recommendation systems and has garnered substantial research interest. In recent years, dual-target CDR has been an increasingly prevalent research topic that emphasizes simultaneous enhancement in both the source and target domains. Many existing approaches rely on overlapping users as bridges between domains, yet in real-world scenarios, the number of such users is often severely limited, restricting their practical applicability. To overcome this limitation, alternative methods for cross-domain connections are needed, and item tags serve as a promising solution. However, real-world tags suffer from severe deficiencies in terms of both quantity and diversity, and existing studies have not fully exploited their potential. In this paper, we introduce Tag-Augmented Dual-Target Cross Domain Recommendation (TA-DTCDR), which is the first to apply LLM-distilled tag information to CDR. TA-DTCDR utilizes item tags distilled by large language models (LLMs) as an additional channel to facilitate information transfer, thereby mitigating performance decline caused by the lack of overlapping users. Furthermore, to fully leverage the natural language information carried by the distilled tags, we design a series of training tasks to align tag semantics across domains while preserving their semantic independence. The proposed method is validated on multiple tasks using public datasets, showing significant improvements over existing state-of-the-art approaches. | Enhong Chen, Gang Zhou, Jianhui Ma, Mingfan Pan, Mingyue Cheng, Qingyang Mao, Xu An | |||
| 108 | Biases in LLM-Generated Musical Taste Profiles for Recommendation | 0 | One particularly promising use case of Large Language Models (LLMs) for recommendation is the automatic generation of Natural Language (NL) user taste profiles from consumption data. These profiles offer interpretable and editable alternatives to opaque collaborative filtering representations, enabling greater transparency and user control. However, it remains unclear whether users consider these profiles to be an accurate representation of their taste, which is crucial for trust and usability. Moreover, because LLMs inherit societal and data-driven biases, profile quality may systematically vary across user and item characteristics. In this paper, we study this issue in the context of music streaming, where personalization is challenged by a large and culturally diverse catalog. We conduct a user study in which participants rate NL profiles generated from their own listening histories. We analyze whether identification with the profiles is biased by user attributes (e.g., mainstreamness, taste diversity) and item features (e.g., genre, country of origin). We also compare these patterns to those observed when using the profiles in a downstream recommendation task. Our findings highlight both the potential and limitations of scrutable, LLM-based profiling in personalized systems. | Bruno Sguerra, Elena V. Epure, Harin Lee, Manuel Moussallam | |||
| 109 | Estimating Quantum Execution Requirements for Feature Selection in Recommender Systems Using Extreme Value Theory | 0 | Recent advances in quantum computing have significantly accelerated research into quantum-assisted information retrieval and recommender systems, particularly in solving feature selection problems by formulating them as Quadratic Unconstrained Binary Optimization (QUBO) problems executable on quantum hardware. However, while existing work primarily focuses on effectiveness and efficiency, it often overlooks the probabilistic and noisy nature of real-world quantum hardware. In this paper, we propose a solution based on Extreme Value Theory (EVT) to quantitatively assess the usability of quantum solutions. Specifically, given a fixed problem size, the proposed method estimates the number of executions (shots) required on a quantum computer to reliably obtain a high-quality solution, which is comparable to or better than that of classical baselines on conventional computers. Experiments conducted across multiple quantum platforms (including two simulators and two physical quantum processors) demonstrate that our method effectively estimates the number of required runs to obtain satisfactory solutions on two widely used benchmark datasets. | Jiayang Niu, Jie Li, Ke Deng, Mark Sanderson, Qihan Zou, Yongli Ren | |||
| 110 | Not One News Recommender To Fit Them All: How Different Recommender Strategies Serve Various User Segments | 0 | Many news recommender systems (NRS) adopt a one-recommender-for-all approach, overlooking that users engage with news in fundamentally different ways. In this work, we identify user clusters based on various engagement metrics that go beyond clicks by employing cluster analysis on two real-world datasets: EB-NeRD and Adressa. Next to that, we evaluate the performance of common recommender strategies: popularity, collaborative filtering (EASE and ItemKNN), and a content-based model across these user clusters, which exhibit varying reading behaviors and information needs. Our findings show that different recommender strategies are effective to varying degrees depending on the user cluster. This study contributes to NRS research by providing a grounded clustering of users derived from real-world datasets and emphasizes the importance of user-centered evaluations for understanding how NRS strategies serve audiences with varying levels of news engagement. | Annelien Smets, Hanne Vandenbroucke, Lien Michiels, Ulysse Maes | |||
| 111 | Popularity‑Bias Vulnerability: Semi‑Supervised Label Inference Attack on Federated Recommender Systems | 0 | Organizations are increasingly applying Vertical Federated Learning (VFL) to enhance recommender systems without sharing raw data among themselves. However, partial outputs in VFL still introduce significant privacy risks. In this study, we propose a novel label inference attack specifically tailored for VFL-based recommender systems, leveraging two common characteristics: (1) item popularity often follows a power-law distribution, and (2) random negative sampling is commonly used for implicit feedback as a substitute for nonexistent true labels. By combining partial local information from VFL with this prior knowledge, a malicious party can construct a semi-supervised learning pipeline. Experimental results on three real-world datasets demonstrate that our approach achieves higher label inference performance than existing attacks. These findings demonstrate the need for more robust privacy-preserving mechanisms in federated recommender systems. | Kenji Shinoda, Shintaro Fukushima, Takeyuki Sasai | |||
| 112 | A Reproducibility Study of Product-side Fairness in Bundle Recommendation | 0 | Recommender systems are known to exhibit fairness issues, particularly on the product side, where products and their associated suppliers receive unequal exposure in recommended results. While this problem has been widely studied in traditional recommendation settings, its implications for bundle recommendation (BR) remain largely unexplored. This emerging task introduces additional complexity: recommendations are generated at the bundle level, yet user satisfaction and product (or supplier) exposure depend on both the bundle and the individual items it contains. Existing fairness frameworks and metrics designed for traditional recommender systems may not directly translate to this multi-layered setting. In this paper, we conduct a comprehensive reproducibility study of product-side fairness in BR across three real-world datasets using four state-of-the-art BR methods. We analyze exposure disparities at both the bundle and item levels using multiple fairness metrics, uncovering important patterns. Our results show that exposure patterns differ notably between bundles and items, revealing the need for fairness interventions that go beyond bundle-level assumptions. We also find that fairness assessments vary considerably depending on the metric used, reinforcing the need for multi-faceted evaluation. Furthermore, user behavior plays a critical role: when users interact more frequently with bundles than with individual items, BR systems tend to yield fairer exposure distributions across both levels. Overall, our findings offer actionable insights for building fairer bundle recommender systems and establish a vital foundation for future research in this emerging domain. | Alan Hanjalic, HuySon Nguyen, Maarten de Rijke, Masoud Mansoury, Mohammad Aliannejadi, Yuanna Liu | |||
| 113 | Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems | 0 | Serendipity plays a pivotal role in enhancing user satisfaction within recommender systems, yet its evaluation poses significant challenges due to its inherently subjective nature and conceptual ambiguity. Current algorithmic approaches predominantly rely on proxy metrics for indirect assessment, often failing to align with real user perceptions, thus creating a gap. With large language models (LLMs) increasingly revolutionizing evaluation methodologies across various human annotation tasks, we are inspired to explore a core research proposition: Can LLMs effectively simulate human users for serendipity evaluation? To address this question, we conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains, focusing on three key aspects: the accuracy of LLMs compared to conventional proxy metrics, the influence of auxiliary data on LLM comprehension, and the efficacy of recently popular multi-LLM techniques. Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, the performance of conventional metrics. Furthermore, multi-LLM techniques and the incorporation of auxiliary data further enhance alignment with human perspectives. Based on our findings, the optimal evaluation by LLMs yields a Pearson correlation coefficient of 21.5% when compared to the results of the user study. This research implies that LLMs may serve as potentially accurate and cost-effective evaluators, introducing a new paradigm for serendipity evaluation in recommender systems. | Li Chen, Li Kang, Yuhan Zhao | |||
| 114 | Informfully Recommenders - Reproducibility Framework for Diversity-aware Intra-session Recommendations | 0 | Norm-aware recommender systems have gained increased attention, especially for diversity optimization. The recommender systems community has well-established experimentation pipelines that support reproducible evaluations by facilitating models' benchmarking and comparisons against state-of-the-art methods. However, to the best of our knowledge, there is currently no reproducibility framework to support thorough norm-driven experimentation at the pre-processing, in-processing, post-processing, and evaluation stages of the recommender pipeline. To address this gap, we present Informfully Recommenders, a first step towards a normative reproducibility framework that focuses on diversity-aware design built on Cornac. Our extension provides an end-to-end solution for implementing and experimenting with normative and general-purpose diverse recommender systems that cover 1) dataset pre-processing, 2) diversity-optimized models, 3) dedicated intrasession item re-ranking, and 4) an extensive set of diversity metrics. We demonstrate the capabilities of our extension through an extensive offline experiment in the news domain. | Abraham Bernstein, Lucien Heitz, Oana Inel, Runze Li | |||
| 115 | Revisiting the Performance of Graph Neural Networks for Session-based Recommendation | 0 | Graph Neural Networks (GNNs) have shown impressive performance in various domains. Motivated by this success, several GNN-based session-based recommender systems (SBRS) have been proposed over the past few years. The literature suggests that these algorithms can achieve strong performance and outperform well-established baseline neural models. However, some recent reproducibility studies suggest that the performance achieved by more complex GNN-based models may sometimes be overstated and that these models may not be as impactful as expected. Moreover, an inconsistent choice of datasets, preprocessing steps, and evaluation protocols across published works makes it difficult to reliably assess progress in the field. In this present study, we reassess the performance of three well-established baseline models—GRU4Rec, NARM, and STAMP—and compare them to six more recent GNN-based SBRS within a standardized evaluation framework. Experiments on commonly used datasets for SBRS reveal that in particular the GRU4Rec model, if properly tuned, is still highly competitive and leads to the best results on two out of three datasets. Furthermore, we find that the performance of the GNN-based models varies largely across datasets. Interestingly, only the quite early SR-GNN model turns out to be superior in terms of accuracy metrics on one of the datasets. We speculate that the reasons for our surprising result may lie in insufficient hyperparameter tuning processes for the baselines in the original papers. | Dietmar Jannach, Faisal Shehzad | |||
| 116 | Cross-Batch Aggregation for Streaming Learning from Label Proportions in Industrial-Scale Recommendation Systems | 0 | Recent controls over user data have diluted user signals essential to train industrial recommendation systems, replacing traditional event-level labels with aggregated item-level labels. Fitting these noisy aggregates into the event-level paradigm used by industrial recommendation systems causes models to be biased and miscalibrated, hurting critical business metrics. Learning from Label Proportions (LLP), a framework where instance-level prediction models are trained from aggregated signals, offers a principled solution to this problem, as long as all samples from an aggregate are present within the same training batch. Unfortunately, industry-scale recommender systems impose infrastructure constraints that fail this critical assumption because (1) they are trained in a sequential streaming framework that spreads aggregates across batches, (2) aggregates often exceed the size of a single batch, and (3) label noise makes it difficult to identify the time boundaries that correspond to the aggregated label. To address these issues, we propose a novel technique called Cross-Batch Aggregate (XBA) Loss to adapt LLP to the streaming setting. We design the loss to have a gradient that mimics the true aggregated loss gradient, approximating the distribution of the aggregate by using cumulative statistics across each aggregate. This enables (1) optimizing for model calibration and (2) learning a conversion model from the aggregate signals. We have deployed this technique to a Google Ads system impacted by conversion signal loss due to privacy constraints, delivering significant improvements on model calibration (48.8% reduction in online bias), advertiser value, and business metrics. Our key contribution is the extension of LLP to the streaming setting, providing a practical solution that bridges the gap between LLP research and industrial applications. | Adam Kraft, Andrew Evdokimov, Derek Zhiyuan Cheng, Ed H. Chi, Jerry Zhang, Jonathan Valverde, Ruoxi Wang, Samuel Ieong, Tiansheng Yao, Xiang Li, Yin Zhang, Yuan Gao | |||
| 117 | Enhancing Online Ranking Systems via Multi-Surface Co-Training for Content Understanding | 0 | Content understanding is an important part of real-world recommendation systems. This paper introduces a Multi-surface Co-training (MulCo) system, designed to enhance online ranking systems by improving content understanding. The model is trained through a task-aligned co-training approach, leveraging objectives and data from multiple video discovery feeding surfaces and various pre-trained embeddings. It separates video content understanding into an offline model, enabling scalability and efficient resource use. Experiments demonstrate that MulCo significantly outperforms non-task-aligned pre-trained embeddings and achieves substantial gains in online user value, e.g., satisfied engagement and freshness metrics. This system presents a practical solution to improve content understanding in multi-surface large-scale recommender systems. | Aniruddh Nath, Dapo Omidiran, Fabio Soldo, Gwendolyn Zhao, Li Wei, Lichan Hong, Lukasz Heldt, Mei Chen, Nikhil Khani, Qian Sun, Raghu Keshavan, Rein Zhang, Weilong Yang, Xinyang Yi, Yilin Zheng | |||
| 118 | Identifying Offline Metrics that Predict Online Impact: A Pragmatic Strategy for Real-World Recommender Systems | 0 | A critical challenge in recommender systems is to establish reliable relationships between offline and online metrics that predict real-world performance. Motivated by recent advances in Pareto front approximation, we introduce a pragmatic strategy for identifying offline metrics that align with online impact. A key advantage of this approach is its ability to simultaneously serve multiple test groups, each with distinct offline performance metrics, in an online experiment controlled by a single model. The method is model-agnostic for systems with a neural network backbone, enabling broad applicability across architectures and domains. We validate the strategy through a large-scale online experiment in the field of session-based recommender systems on the OTTO e-commerce platform. The online experiment identifies significant alignments between offline metrics and real-world click-through rate, post-click conversion rate, and units sold. Our strategy provides industry practitioners with a valuable tool for understanding offline-to-online metric relationships and making informed, data-driven decisions. | Philipp Normann, Timo Wilm | |||
| 119 | Semantic IDs for Music Recommendation | 0 | Training recommender systems for next-item recommendation often requires unique embeddings to be learned for each item, which may take up most of the trainable parameters for a model. Shared embeddings, such as using content information, can reduce the number of distinct embeddings to be stored in memory. This allows for a more lightweight model; correspondingly, model complexity can be increased due to having fewer embeddings to store in memory. We show the benefit of using shared content-based features ('semantic IDs') in improving recommendation accuracy and diversity, while reducing model size, for two music recommendation datasets, including an online A/B test on a music streaming service. | Andreas F. Ehmann, Florian Henkel, M. Jeffrey Mei, Oliver Bembom, Samuel E. Sandberg | |||
| 120 | SocRipple: A Two-Stage Framework for Cold-Start Video Recommendations | 0 | Most industry-scale recommender systems face critical cold-start challenges: new items lack interaction history, making it difficult to distribute them in a personalized manner. Standard collaborative filtering models underperform due to sparse engagement signals, while content-only approaches lack user-specific relevance. We propose SocRipple, a novel two-stage retrieval framework tailored for cold-start item distribution in social-graph-based platforms. Stage 1 leverages the creator's social connections for targeted initial exposure. Stage 2 builds on early engagement signals and stable user embeddings learned from historical interactions to "ripple" outwards via K-Nearest-Neighbor (KNN) search. Large-scale experiments on a major video platform show that SocRipple boosts cold-start item distribution by +36 | Ajantha Ramineni, Amit Jaspal, Kapil Dalwani | |||
| 121 | Stream Normalization for CTR Prediction | 0 | Deep learning models often encounter significant challenges when dealing with non-i.i.d. and non-stationary data, particularly in incremental learning tasks such as CTR prediction. Traditional normalization techniques, including Batch Normalization and Layer Normalization, are limited in their ability to maintain stability and adaptability under rapidly evolving data distributions, leading to degraded model performance. To address these limitations, we propose Stream Normalization (SN), a dynamic and adaptive normalization framework designed to continuously align normalization statistics with shifting data distributions in real-time. SN leverages specialized normalization modules, each optimized to capture distinct statistical patterns inherent in streaming data. Such design enhances model robustness and mitigates the risk of catastrophic forgetting by continuously adapting its normalization strategy. The SN layer is a versatile plugin that enhances model robustness across various normalization settings. Extensive experiments demonstrate that SN achieves state-of-the-art performance on offline datasets, representing a significant advancement in incremental learning for streaming data. | Changping Peng, Ching Law, Congcong Liu, Jingping Shao, Xue Jiang, Yizhou Sang, Yuying Chen, Zhangang Lin, Zhiwei Fang | |||
| 122 | Addressing Multiple Hypothesis Bias in CTR Prediction for Ad Selection | 0 | Predicting click-through rates (CTR) for candidate advertisements is central to many online recommendation and ad-serving systems. However, selecting top-ranked ads based on predicted CTR (pCTR) inherently introduces a systematic bias: since each pCTR contains random estimation error, ads ranked highest tend to exhibit positive error, leading to overestimation of true CTR and miscalibration. Furthermore, as the number of candidates grows, the extreme order statistics amplify this so-called Multiple Hypothesis Bias. Proper calibration of pCTR ensures that estimated probabilities match observed click frequencies, which is essential for setting accurate bids and maximizing revenue in ad auctions. Without reliable calibration, high-accuracy models can still misprice impressions, resulting in both lost revenue and inefficient budget allocation. In this paper, we (1) formally define the bias arising from ranking by noisy estimates and (2) derive an estimator to correct pCTR by subtracting the expected error under mild distributional assumptions. Experiments on large-scale ad data show significant improvements in calibration metrics across multiple ad settings. | Neil Daftary, Oren Sar Shalom | |||
| 123 | Opening the Black Box: Interpretable Remedies for Popularity Bias in Recommender Systems | 0 | Popularity bias is a well-known challenge in recommender systems, where a small number of popular items receive disproportionate attention, while the majority of less popular items are largely overlooked. This imbalance often results in reduced recommendation quality and unfair exposure of items. Although existing mitigation techniques address this bias to some extent, they typically lack transparency in how they operate. In this paper, we propose a post-hoc method using a Sparse Autoencoder (SAE) to interpret and mitigate popularity bias in deep recommendation models. The SAE is trained to replicate a pre-trained model's behavior while enabling neuron-level interpretability. By introducing synthetic users with clear preferences for either popular or unpopular items, we identify neurons encoding popularity signals based on their activation patterns. We then adjust the activations of the most biased neurons to steer recommendations toward fairer exposure. Experiments on two public datasets using a sequential recommendation model show that our method significantly improves fairness with minimal impact on accuracy. Moreover, it offers interpretability and fine-grained control over the fairness-accuracy trade-off. | Masoud Mansoury, Parviz Ahmadov | |||
| 124 | ArtEx: A User-Controllable Web Interface for Visual Art Recommendations | 0 | We introduce a web-based interface for visual art recommendations, empowering users to adjust popularity and diversity through intuitive sliders. Built on the SemArt dataset and leveraging multimodal BLIP features, ArtEx allows users to fine-tune recommendations across dimensions like genre, time period, and artist. This demo paper presents ArtEx’s interactive interface, showcasing its ability to enhance user engagement and satisfaction through transparent, user-driven personalization. | Bereket Abera Yilma, Luis A. Leiva, Peter Brusilovsky, Rully Agus Hendrawan | |||
| 125 | Flights Pricelock Fee Recommendation on Online Travel Agent Platform | 0 | In this study, we present a neural network (NN) based recommender system with a novel custom loss function, developed to recommend the fee for the platform's pricelock product. This is a popular add-on product that allows users to lock a flight price and book later at the same locked price, even if the price has increased by the time of booking. The core challenge in enabling this product lies in predicting the magnitude of future price changes over several time horizons. We formulate this problem as a multi-task learning (MTL) setup, where price change magnitudes are modeled as ordinal categories across several time intervals, each modeled as a separate head. Crucially, we address the ordinal nature of price change buckets by introducing a novel loss function called Learnable Soft Ordinal Regression (L-SORD). Our demo showcases how this system improves both predictive accuracy and revenue performance, enabling more effective price recommendations in a high-stakes, real-world environment. This work highlights the potential of combining MTL architectures with custom loss functions in production-grade pricing recommender systems. | Akash Khetan, Anmol Porwal, Deepak Yadav, Narasimha Medeme | |||
| 126 | RecViz: Intuitive Graph-based Visual Analytics for Dataset Exploration and Recommender System Evaluation | 0 | We present RecViz, a novel web application designed to support qualitative analysis of recommender system performance on large datasets. RecViz offers real-time, interactive graph visualisation of recommendation data, enabling side-by-side comparisons of models through dual graph views. Leveraging GPU acceleration via CUDA and WebGL, it delivers fast, responsive force-directed layouts, even at scale. Unlike prior tools limited to small datasets, RecViz shows the potential to handle large datasets efficiently. For example, it maintains an average of 28 FPS while visualising the full MovieLens-1M dataset, with all 1 million interactions. RecViz is open-source and available on GitHub under the Apache-2.0 licence [7]. | Iadh Ounis, Jackson Dam, Zixuan Yi | |||
| 127 | RecSys Challenge 2025: Universal Behavioral Profiles for Recommender Systems | 0 | The RecSys Challenge 2025 promotes a unified approach to behavior modeling by introducing Universal Behavioral Profiles. These user representations encode essential aspects of past interactions and are designed for universal applicability across different downstream tasks, thereby promoting generalization across applications and addressing the need for portable and efficient recommender systems. The participants' task was to create universal user embeddings from detailed e-commerce activity logs. These embeddings were then fed into a small neural network to predict customer behavior in subsequent timeframes. The provided challenge dataset was large and sparse, requiring innovative methods to leverage the available interaction data effectively. Overall, the challenge was highly attractive, with 400 teams participating in the competition. | Abhishek Srivastava, Claudio Pomo, Dietmar Jannach, Francesco Barile, Gergely Stomfai, Jacek Dabrowski, Lukasz Sienkiewicz, Marco Polignano, Maria Janicka | |||
| 128 | Recommender Systems for Digital Humanities and Archives: Multistakeholder Evaluation, Scholarly Information Needs, and Multimodal Similarity | 0 | Recommender systems (RecSys) in digital humanities (DH) and archives face unique challenges, including balancing competing stakeholder values, serving complex scholarly information needs, and modeling multimodal historical artifacts. This paper reports on ongoing research that tackles these issues through three interconnected strands: (1) the development of co-designed multistakeholder evaluation frameworks that move beyond simple engagement metrics to capture diverse priorities among archivists, researchers, and platform owners; (2) a systematic examination of the information behaviors of humanities scholars to inform user models adapted to exploratory, non-linear research; and (3) the creation of multimodal similarity metrics that exploit scholarly markup, material characteristics, and specialized domain knowledge. Validated through Monasterium.net, the world’s largest charter archive, this research contributes novel approaches to value-driven evaluation, scholarly user modeling, and historical document similarity. It provides methodological frameworks to bridge the computer science and DH communities, and to advance multistakeholder RecSys for complex, non-traditional domains. | Florian Atzenhofer-Baumgartner | |||
| 129 | An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization | 0 | We study the problem of personalizing the output of a large language model (LLM) by training on logged bandit feedback (e.g., personalizing movie descriptions based on likes). While one may naively treat this as a standard off-policy contextual bandit problem, the large action space and the large parameter space make naive applications of off-policy learning (OPL) infeasible. We overcome this challenge by learning a prompt policy for a frozen LLM that has only a modest number of parameters. The proposed Direct Sentence Off-policy gradient (DSO) effectively propagates the gradient to the prompt policy space by leveraging the smoothness and overlap in the sentence space. Consequently, DSO substantially reduces variance while also suppressing bias. Empirical results on our newly established suite of benchmarks, called OfflinePrompts, demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts and reward noise are large. | Daniel Yiming Cao, Haruka Kiyohara, Thorsten Joachims, Yuta Saito | |||
| 130 | A Language Model-Based Playlist Generation Recommender System | 0 | The title of a playlist often reflects an intended mood or theme, allowing creators to easily locate their content and enabling other users to discover music that matches specific situations and needs. This work presents a novel approach to playlist generation using language models to leverage the thematic coherence between a playlist title and its tracks. Our method consists of creating semantic clusters from text embeddings, followed by fine-tuning a transformer model on these thematic clusters. Playlists are then generated by considering the cosine similarity scores between known and unknown titles and applying a voting mechanism. Performance evaluation, combining quantitative and qualitative metrics, demonstrates that using the playlist title as a seed provides useful recommendations, even in a zero-shot scenario. | Eléa Vellard, Enzo Charolois-Pasqua, Pasquale Lisena, Raphaël Troncy, Youssra Rebboud | |||
| 131 | GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization | 0 | Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) over semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving up to +106.9 | Aashika Padmanabhan, Abhishek Kulkarni, Anjana Ganesh, Ashish Ranjan, Evren Körpeoglu, Jason H. D. Cho, Jianpeng Xu, Kai Zhao, Kamiya Motwani, Kannan Achan, Kaushiki Nag, Lalitesh Morishetti, Luyi Ma, Malay Patel, Praveenkumar Kanumala, Sumit Dutta, Sushant Kumar, Wanjia Zhang | |||
| 132 | Integrating Individual and Group Fairness for Recommender Systems through Social Choice | 0 | Fairness in recommender systems is a complex concept, involving multiple definitions, different parties for whom fairness is sought, and various scopes over which fairness might be measured. Researchers seeking fairness-aware systems have derived a variety of solutions, usually highly tailored to specific choices along each of these dimensions, and typically aimed at tackling a single fairness concern, i.e., a single definition for a specific stakeholder group and measurement scope. However, in practical contexts, there is a multiplicity of fairness concerns within a given recommendation application, and solutions limited to a single dimension are therefore less useful. We explore a general solution to recommender system fairness using social choice methods to integrate multiple heterogeneous definitions. In this paper, we extend group-fairness results from prior research to provider-side individual fairness, demonstrating in multiple datasets that both individual and group fairness objectives can be integrated and optimized jointly. We identify both synergies and tensions among different objectives, with individual fairness correlated with group fairness for some groups and anti-correlated with others. | Amanda Aird, Anas Buhayh, Cassidy All, Elena Stefancova, Martin Homola, Nicholas Mattei, Robin Burke | |||
| 133 | LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders | 0 | Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incurring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER incorporates (i) a global token mechanism for stabilizing attention over long contexts, (ii) a token merge module with lightweight InnerTransformers and hybrid attention strategy to reduce quadratic complexity, and (iii) a series of engineering optimizations, including training with mixed-precision and activation recomputation, KV cache serving, and the fully synchronous model training and serving framework for unified GPU-based dense and sparse parameter updates. LONGER consistently outperforms strong baselines in both offline metrics and online A/B testing in both advertising and e-commerce services at ByteDance, validating its consistent effectiveness and industrial-level scaling laws. Currently, LONGER has been fully deployed at more than 10 influential scenarios at ByteDance, serving billion users. | Bo Han, Di Chen, Di Wu, Hui Lu, Huizhi Yang, Lele Yu, Peng Xu, Qin Ren, Shiru Ren, Sijun Zhang, Wenlin Zhao, Xiang Sun, Xijun Xiao, Xionghang Xie, Yaocheng Tan, Yuchao Zheng, Zheng Chai | |||
| 134 | Measuring Interaction-Level Unlearning Difficulty for Collaborative Filtering | 0 | The growing emphasis on data privacy and user controllability mandates that recommendation models support the removal of specified data, known as recommendation unlearning (RU). Although retraining is often regarded as the gold standard for machine unlearning, it is inadequate to attain complete unlearning in collaborative filtering recommendation due to the interdependency between user-item interactions. To this end, we introduce the concept of interaction-level unlearning difficulty, which serves as a foresighted indicator of the unlearning incompleteness or actual unlearning effectiveness after forgetting each interaction. Through extensive experiments with retraining and model-agnostic unlearning methods, we identify two interpretable data characteristics that can serve as useful unlearning difficulty indicators: Embedding Entanglement Index (EEI) and Subgraph Average Degree (AD). They have a strong correlation with existing membership inference metrics focusing on data removal as well as our proposed unlearning effectiveness metrics from the recommendation perspective—Score Shift, UnlearnMRR, and UnlearnRecall. In addition, we investigate the efficacy of an unlearning enhancement technique named Extra Deletion in handling unlearning requests of different difficulty levels. The results show that more related interactions need to be extra deleted to achieve acceptable unlearning effectiveness for difficult unlearning requests, while fewer or no extra deletions are needed for easier-to-forget requests. This study provides a novel perspective for advancing the development of more tailored RU methods. Our code is available at https://gitlab.com/hcdou/cf-unlearn-difficulty. | Haocheng Dou, Tao Lian, Xin Xin | |||
| 135 | NLGCL: Naturally Existing Neighbor Layers Graph Contrastive Learning for Recommendation | 0 | Graph Neural Networks (GNNs) are widely used in collaborative filtering to capture high-order user-item relationships. To address the data sparsity problem in recommendation systems, Graph Contrastive Learning (GCL) has emerged as a promising paradigm that maximizes mutual information between contrastive views. However, existing GCL methods rely on augmentation techniques that introduce semantically irrelevant noise and incur significant computational and storage costs, limiting effectiveness and efficiency. To overcome these challenges, we propose NLGCL, a novel contrastive learning framework that leverages naturally contrastive views between neighbor layers within GNNs. By treating each node and its neighbors in the next layer as positive pairs, and other nodes as negatives, NLGCL avoids augmentation-based noise while preserving semantic relevance. This paradigm eliminates costly view construction and storage, making it computationally efficient and practical for real-world scenarios. Extensive experiments on four public datasets demonstrate that NLGCL outperforms state-of-the-art baselines in effectiveness and efficiency. | Edith C. H. Ngai, Hewei Wang, Jinfeng Xu, Jinze Li, Shuo Yang, Wei Wang, Xiping Hu, Zheyu Chen | |||
| 136 | Off-Policy Evaluation and Learning for Matching Markets | 0 | Matching users based on mutual preferences is a fundamental aspect of services driven by reciprocal recommendations, such as job search and dating applications. Although A/B tests remain the gold standard for evaluating new policies in recommender systems for matching markets, they are costly and impractical for frequent policy updates. Off-Policy Evaluation (OPE) thus plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data naturally collected on the platform. However, unlike conventional recommendation settings, the large scale and bidirectional nature of user interactions in matching platforms introduce variance issues and exacerbate reward sparsity, making standard OPE methods unreliable. To address these challenges and facilitate effective offline evaluation, we propose novel OPE estimators, DiPS and DPR, specifically designed for matching markets. Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators while incorporating intermediate labels, such as initial engagement signals, to achieve better bias-variance control in matching markets. Theoretically, we derive the bias and variance of the proposed estimators and demonstrate their advantages over conventional methods. Furthermore, we show that these estimators can be seamlessly extended to offline policy learning methods for improving recommendation policies for making more matches. We empirically evaluate our methods through experiments on both synthetic data and A/B testing logs from a real job-matching platform. The empirical results highlight the superiority of our approach over existing methods in off-policy evaluation and learning tasks for a variety of configurations. | Shuhei Goda, Yudai Hayashi, Yuta Saito | |||
| 137 | R4ec: A Reasoning, Reflection, and Refinement Framework for Recommendation Systems | 0 | Harnessing Large Language Models (LLMs) for recommendation systems has emerged as a prominent avenue, drawing substantial research interest. However, existing approaches primarily involve basic prompt techniques for knowledge acquisition, which resemble System-1 thinking. This makes these methods highly sensitive to errors in the reasoning path, where even a small mistake can lead to an incorrect inference. To this end, in this paper, we propose R^4ec, a reasoning, reflection and refinement framework that evolves the recommendation system into a weak System-2 model. Specifically, we introduce two models: an actor model that engages in reasoning, and a reflection model that judges these responses and provides valuable feedback. Then the actor model will refine its response based on the feedback, ultimately leading to improved responses. We employ an iterative reflection and refinement process, enabling LLMs to facilitate slow and deliberate System-2-like thinking. Ultimately, the final refined knowledge will be incorporated into a recommendation backbone for prediction. We conduct extensive experiments on Amazon-Book and MovieLens-1M datasets to demonstrate the superiority of R^4ec. We also deploy R^4ec on a large-scale online advertising platform, showing a 2.2% increase in revenue. Furthermore, we investigate the scaling properties of the actor model and reflection model. | Chi Lu, Hao Gu, Kun Gai, Peng Jiang, Rui Zhong, Wei Yang, Yu Xia | |||
| 138 | Scalable Data Debugging for Neighborhood-based Recommendation with Data Shapley Values | 0 | Machine learning-powered recommendation systems help users find items they like. Issues in the interaction data processed by these systems frequently lead to problems, e.g., to the accidental recommendation of low-quality products or dangerous items. Such data issues are hard to anticipate upfront, and are typically detected post-deployment after they have already impacted the user experience. We argue that a principled data debugging process is required during which human experts identify potentially hurtful data issues and preemptively mitigate them. Recent notions of “data importance,” such as the Data Shapley value (DSV), represent a promising direction to identify training data points likely to cause issues. However, the scale of real-world interaction datasets makes it infeasible to apply existing techniques to compute the DSV in recommendation scenarios. We tackle this problem by introducing the KMC-Shapley algorithm for the scalable estimation of Data Shapley values in neighborhood-based recommendation on sparse interaction data. We conduct an experimental evaluation of the efficiency and scalability of our algorithm on both public and proprietary datasets with millions of interactions, and showcase that the DSV identifies impactful data points for two recommendation tasks in e-commerce. Furthermore, we discuss applications of the DSV on real-world click and purchase data in e-commerce, such as identifying dangerous products or improving the ecological sustainability of product recommendations. | Barrie Kersbergen, Bojan Karlas, Maarten de Rijke, Olivier Sprangers, Sebastian Schelter | |||
| 139 | RecPS: Privacy Risk Scoring for Recommender Systems | 0 | Recommender systems (RecSys) have become an essential component of many web applications. The core of the system is a recommendation model trained on highly sensitive user-item interaction data. While privacy-enhancing techniques are actively studied in the research community, real-world model development still depends on minimal privacy protection, e.g., via controlled access. Users of such systems should have the right to choose not to share highly sensitive interactions. However, there is no method allowing the user to know which interactions are more sensitive than others. Thus, quantifying the privacy risk of RecSys training data is a critical step to enabling privacy-aware RecSys model development and deployment. We propose a membership-inference attack (MIA)-based privacy scoring method, RecPS, to measure privacy risks at both the interaction and user levels. The RecPS interaction-level score definition is motivated and derived from differential privacy, which is then extended to the user-level scoring method. A critical component is the interaction-level MIA method RecLiRA, which gives high-quality membership estimation. We have conducted extensive experiments on well-known benchmark datasets and RecSys models to show the unique features and benefits of RecPS scoring in risk assessment and RecSys model unlearning. | Jiajie He, Keke Chen, Yuechun Gu | |||
| 140 | Recommendation and Temptation | 0 | Traditional recommender systems based on revealed preferences often fail to capture the fundamental duality in user behavior, where consumption choices are driven by both inherent value (enrichment) and instant appeal (temptation). Consequently, these systems may generate recommendations that prioritize short-term engagement over long-lasting user satisfaction. We propose a novel recommender design that explicitly models the tension between enrichment and temptation. We introduce a behavioral model that accounts for how both enrichment and temptation influence user choices, while incorporating the reality of off-platform alternatives. Building on this model, we formulate a novel recommendation objective aligned with maximizing consumed enrichment and prove the optimality of a locally greedy recommendation strategy. Finally, we present an estimation framework that leverages the distinction between explicit user feedback and implicit choice data while making minimal assumptions about off-platform options. Through comprehensive evaluation using both synthetic simulations and real-world data from the MovieLens dataset, we demonstrate that our approach consistently outperforms competitive baselines that ignore temptation dynamics either by assuming revealed preferences or recommending solely based on enrichment. Our work represents a paradigm shift toward more nuanced and user-centric recommender design, with significant implications for developing responsible AI systems that genuinely serve users' long-term interests rather than merely maximizing engagement. | Grant Schoenebeck, Md Sanzeed Anwar, Paramveer S. Dhillon | |||
| 141 | You Don't Bring Me Flowers: Mitigating Unwanted Recommendations Through Conformal Risk Control | 0 | Recommenders are significantly shaping online information consumption. While effective at personalizing content, these systems increasingly face criticism for propagating irrelevant, unwanted, and even harmful recommendations. Such content degrades user satisfaction and contributes to significant societal issues, including misinformation, radicalization, and erosion of user trust. Although platforms offer mechanisms to mitigate exposure to undesired content, these mechanisms are often insufficiently effective and slow to adapt to users' feedback. This paper introduces an intuitive, model-agnostic, and distribution-free method that uses conformal risk control to provably bound unwanted content in personalized recommendations by leveraging simple binary feedback on items. We also address a limitation of traditional conformal risk control approaches, i.e., the fact that the recommender can provide a smaller set of recommended items, by leveraging implicit feedback on consumed items to expand the recommendation set while ensuring robust risk mitigation. Our experimental evaluation on data coming from a popular online video-sharing platform demonstrates that our approach ensures an effective and controllable reduction of unwanted recommendations with minimal effort. The source code is available here: https://github.com/geektoni/mitigating-harm-recsys. | Andrea Passerini, Bruno Lepri, Cristian Consonni, Emilia Gómez, Erasmo Purificato, Giovanni De Toni | |||
| 142 | A Multistakeholder Approach to Value-Driven Co-Design of Recommender Systems Evaluation Metrics in Digital Archives | 0 | This paper presents the first multistakeholder approach for translating diverse stakeholder values into an evaluation metric setup for Recommender Systems (RecSys) in digital archives. While commercial platforms mainly rely on engagement metrics, cultural heritage domains require frameworks that balance competing priorities among archivists, platform owners, researchers, and other stakeholders. To address this challenge, we conducted high-profile focus groups (5 groups x 5 persons) with upstream, provider, system, consumer, and downstream stakeholders, identifying value priorities across critical dimensions: visibility/representation, expertise adaptation, and transparency/trust. Our analysis shows that stakeholder concerns naturally align with four sequential research funnel stages: discovery, interaction, integration, and impact. The resulting framework addresses domain-specific challenges including collection representation imbalances, non-linear research patterns, and tensions between specialized expertise and broader accessibility. We propose tailored metrics for each stage in this research journey, such as research path quality for discovery, contextual appropriateness for interaction, metadata-weighted relevance for integration, and cross-stakeholder value alignment for impact assessment. Our contributions extend beyond digital archives to the broader RecSys community, offering transferable evaluation approaches for domains where value emerges through sustained engagement rather than immediate consumption. | Dominik Kowald, Florian Atzenhofer-Baumgartner, Georg Vogeler | |||
| 143 | Beyond Visit Trajectories: Enhancing POI Recommendation via LLM-Augmented Text and Image Representations | 0 | Recommender systems often rely on user visit trajectories, but the integration and representation of diverse side information remains a key challenge. Recent advances in large language models (LLMs) have enabled new strategies for enhancing this process. This study investigates how different types of side information support next Point-of-Interest (POI) recommendation, using a business-level dataset derived from Yelp. An LLM-based summarization pipeline is introduced to convert unstructured reviews and visual content into structured text via instruction-tuned models. These summaries, together with other business features, are each encoded into fixed-length embeddings. Based on these embeddings, four input configurations are constructed for BERT4Rec: trajectory-only, single feature categories, pairwise category combinations, and full combination. Our results show that side information consistently improves performance over the trajectory-only baseline, and their combinations exhibit useful synergies. These findings highlight the importance of modality-aware design and point toward adaptive fusion and selective use of side information. To support further research, we publicly release a multimodal POI recommendation dataset based on the Yelp Open Dataset. | Dietmar Jannach, Wolfram Höpken, Zehui Wang | |||
| 144 | Beyond Top-1: Addressing Inconsistencies in Evaluating Counterfactual Explanations for Recommender Systems | 0 | Explainability in recommender systems (RS) remains a pivotal yet challenging research frontier. Among state-of-the-art techniques, counterfactual explanations stand out for their effectiveness, as they show how small changes to input data can alter recommendations, providing actionable insights that build user trust and enhance transparency. Despite their growing prominence, the evaluation of counterfactual explanations in RS is far from standardized. Specifically, existing metrics show inconsistency since they are affected by variations in the performance of the underlying recommenders. Hence, we critically examine the evaluation of counterfactual explainers through consistency as the key principle of effective evaluation. Through extensive experiments, we assess how going beyond top-1 recommendation and incorporating top-k recommendations impacts the consistency of existing evaluation metrics. Our findings reveal factors that impact the consistency of existing evaluation metrics and offer a step toward effectively mitigating the inconsistency problem in counterfactual explanation evaluation. | Amir Reza Mohammadi, Andreas Peintner, Eva Zangerle, Michael Müller | |||
| 145 | Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations | 0 | Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS. | Cedric Waterschoot, Francesco Barile, Nava Tintarev | |||
| 146 | D-RDW: Diversity-Driven Random Walks for News Recommender Systems | 0 | This paper introduces Diversity-Driven Random Walks (D-RDW), a lightweight algorithm and re-ranking technique that generates diverse news recommendations. D-RDW is a societal recommender, which combines the diversification capabilities of traditional random walk algorithms with customizable target distributions of news article properties. In doing so, our model provides a transparent approach for editors to incorporate norms and values into the recommendation process. D-RDW shows enhanced performance across key diversity metrics that consider the articles' sentiment and political party mentions when compared to state-of-the-art neural models. Furthermore, D-RDW proves to be more computationally efficient than existing approaches. | Abraham Bernstein, Lucien Heitz, Oana Inel, Runze Li | |||
| 147 | Emotion Vector-Based Fine-Tuning of Large Language Models for Age-Aware Teenage Book Recommendations | 0 | Reading is a vital skill for teenagers. As described by the National Institute of Child Health and Human Development, “Reading is the single most important skill necessary for a happy, productive, and successful life.” Yet, teens and their parents often struggle to find engaging books amid an overwhelming number of options. Moreover, existing book recommender systems rely heavily on user data such as profiles, reviews, or browsing behavior, information that is often restricted for minors due to privacy laws. To address this, we propose a privacy-conscious, teenage book recommender system that analyzes the emotional content of books using the NRC Emotion Intensity Lexicon (NRC-EIL). By extracting emotion vectors from book descriptions, we capture each book’s emotional tone and intensity. Our system then uses patterns in emotional preferences across age groups to recommend books that align with teen readers’ developmental and emotional needs. While LLMs can make content-based book recommendations for teenagers as well, they still face challenges like training bias, limited sensitivity to age-specific nuances, and lack of transparency. By integrating our emotion vector approach, we fine-tune LLMs to better detect age-relevant emotional cues, enhancing their ability to suggest meaningful and appropriate content for teen audiences. Experimental results confirm that fine-tuning LLMs with our emotional vector approach significantly enhances their ability to generate accurate, age-appropriate book recommendations for teenagers. | Joey Sherrill, Kate Hill, Yiu-Kai Ng | |||
| 148 | HiDePCC: A Novel Dual-Pronged Untargeted Attack on Federated Recommendation via Gradient Perturbation and Cluster Crafting | 0 | Federated recommender systems offer privacy benefits by decentralizing user data and preventing direct data sharing among clients. Although this architecture limits the effectiveness of traditional attack strategies, it remains susceptible to subtle adversarial attacks that can significantly degrade the accuracy of recommendations. To expose these vulnerabilities, we propose a novel untargeted attack (HiDePCC) that degrades overall system performance through a dual-pronged strategy combining adaptive gradient perturbation and hierarchical cluster-based embedding manipulation. We apply adaptive perturbations to item gradients during training and employ hierarchical clustering using several linkage methods to form coherent item clusters. Within these clusters, we converge item embeddings and manipulate boundary points to induce item misclassification. This causes the system to assign similar scores to clustered items and misrank them. We evaluated our attack on two benchmark datasets, MovieLens (with 0.5% and 1% malicious users) and Gowalla (1%), using Matrix Factorization as the base recommendation model and assessing the impact under various robust aggregation techniques. We also examined several permutations of configurations using hierarchical clustering, adaptive gradient perturbation and boundary points misclassification. Our results show that the complete setup outperforms existing state-of-the-art untargeted attacks, with performance drops for HR@5 ranging from 13.93% to 68.02% on MovieLens and from 40.02% to 99.76% on the Gowalla dataset. These findings reveal important vulnerabilities in federated recommendation systems. | Krishna Tewari, Sukomal Pal, Yamini Jha | |||
| 149 | SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation | 0 | Recommender systems (RecSys) are essential for online platforms, providing personalized suggestions to users within a vast sea of information. Self-supervised graph learning seeks to harness high-order collaborative filtering signals through unsupervised augmentation on the user-item bipartite graph, primarily leveraging a multi-task learning framework that includes both supervised recommendation loss and self-supervised contrastive loss. However, this separate design introduces additional graph convolution processes and creates inconsistencies in gradient directions due to disparate losses, resulting in prolonged training times and sub-optimal performance. In this study, we introduce a unified framework of Supervised Graph Contrastive Learning for recommendation (SGCL) to address these issues. SGCL uniquely combines the training of recommendation and unsupervised contrastive losses into a cohesive supervised contrastive learning loss, aligning both tasks within a single optimization direction for exceptionally fast training. Extensive experiments on three real-world datasets show that SGCL outperforms state-of-the-art methods, achieving superior accuracy and efficiency. | Henry Peng Zou, Ke Xu, Liangwei Yang, Philip S. Yu, Weizhi Zhang, Yuanjie Zhu, Zihe Song | |||
| 150 | Are We Really Making Recommendations Robust? Revisiting Model Evaluation for Denoising Recommendation | 0 | Implicit feedback data has emerged as a fundamental component of modern recommender systems due to its scalability and availability. However, the presence of noisy interactions, such as accidental clicks and position bias, can potentially degrade recommendation performance. Recently, denoising recommendation has emerged as a popular research topic, aiming to identify and mitigate the impact of noisy samples to train robust recommendation models in the presence of noisy interactions. Although denoising recommendation methods have become a promising solution, our systematic evaluation reveals critical reproducibility issues in this growing research area. We observe inconsistent performance across different experimental settings and a concerning misalignment between validation metrics and test performance caused by distribution shifts. Through extensive experiments testing 6 representative denoising methods across 4 recommender models and 3 datasets, we find that no single denoising approach consistently outperforms others, and simple improvements to evaluation strategies can sometimes match or exceed state-of-the-art denoising methods. Our analysis further reveals concerns about denoising recommendation in high-noise scenarios. We identify key factors contributing to reproducibility defects and propose pathways toward more reliable denoising recommendation research. This work serves as both a cautionary examination of current practices and a constructive guide for the development of more reliable evaluation methodologies in denoising recommendation. | Guangquan Zhang, Guohang Zeng, Jie Lu | |||
| 151 | Context Trails: A Dataset to Study Contextual and Route Recommendation | 0 | Recommender systems in the tourism domain are gaining increasing attention, yet the development of diverse recommendation tasks remains limited, largely due to the scarcity of public datasets. This paper introduces Context Trails, a novel dataset addressing this gap. Context Trails distinguishes itself by including not only user interactions with touristic venues, but also the itineraries (trails or routes) followed by users. Furthermore, it enriches existing item features (e.g., category, coordinates) with contextual attributes related to the interaction moment (e.g., weather) and the venue itself (e.g., opening hours). Beyond a detailed description of the dataset’s characteristics, we evaluate the performance of several baseline algorithms across three distinct recommendation tasks: classical recommendation, route recommendation, and contextual recommendation. We believe this dataset will foster further research and development of advanced recommender systems within the tourism domain. The dataset is available at https://zenodo.org/records/15855966; code is available at https://github.com/pablosanchezp/ContextTrailsExperiments. | Alejandro Bellogín, Jose L. Jorro-Aragoneses, Pablo Sánchez | |||
| 152 | GreenFoodLens: Sustainability Labels for Food Recommendation | 0 | Most food recommender systems aim to boost user engagement by analyzing recipe ingredients and users’ past choices. Even though consumers are paying more attention to sustainability, such as carbon and water footprints, there remains a notable lack of public corpora that combine detailed user–recipe interactions with reliable environmental impact data. This gap makes it hard to build recommendation tools that both match people’s tastes and help reduce ecological damage. To this end, we present GreenFoodLens, a resource that enriches HUMMUS, one of the largest corpora for food recommendation, with environmental impact estimates derived from the hierarchical taxonomy of the SU-EATABLE-LIFE project. We achieved this result through a multi-step process involving human annotations, iterative labeling assessments, knowledge refinement, and constrained generation techniques with large language models. Finally, we evaluate recommendation baselines on HUMMUS augmented with GreenFoodLens labels and find that models are driven by popularity signals, which may exacerbate the environmental impact of users’ recipe choices. These experiments demonstrate the practical benefit of GreenFoodLens for benchmarking and advancing sustainability-aware recommendation research. The resource is available at https://github.com/tail-unica/GreenFoodLens. | Giacomo Balloccu, Giacomo Medda, Gianni Fenu, Giovanni Murgia, Ludovico Boratto, Mirko Marras | |||
| 153 | How Powerful are LLMs to Support Multimodal Recommendation? A Reproducibility Study of LLMRec | 0 | Large language models (LLMs) have been exploited as standalone recommender systems (RSs) and, more recently, as support tools for already existing RSs. A notable example of the latter is LLMRec [28], which prompts an LLM with the user-item data, the items’ metadata, and the candidate items generated by other multimodal RSs to obtain an augmented version of the original dataset on which a final RS is trained. While a few recent studies have proposed reproducing and rigorously evaluating LLM-based RSs as standalone approaches (first research line), little to no attention has been devoted to exploring the use of LLMs as supportive components within existing RSs, particularly in the context of multimodal recommendation (second research line). To this end, in this work, we propose the first reproducibility study of an LLM-based RS belonging to the second research line, LLMRec, in the multimodal recommendation domain. First, we try to replicate the results of LLMRec with the authors’ provided data and our own reconstructed data, outlining critical issues in the measured recommendation performance. Then, we benchmark LLMRec: (i) with unimodal and multimodal LLMs, showing how the latter may be more beneficial in a multimodal scenario; (ii) against other competitive multimodal RSs, LLM-based solutions, and an additional dataset, demonstrating inconsistencies with the trends emerging in the original paper. Finally, in an attempt to disentangle the observed performance trends, we evaluate (for the first time in the literature) the topological differences between the original user-item graph and LLMRec’s augmented one. | Alessia Preziosa, Claudio Pomo, Daniele Malitesta, Fedelucio Narducci, Maria Lucia Fioretti, Nicola Laterza, Tommaso Di Noia | |||
| 154 | Rethinking the Privacy of Text Embeddings: A Reproducibility Study of "Text Embeddings Reveal (Almost) As Much As Text" | 0 | Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution. Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems. | Dominykas Seputis, Karsten Langerak, Serghei Mihailov, Yongkang Li | |||
| 155 | Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective | 0 | The large language model (LLM) powered recommendation paradigm has been proposed to address the limitations of traditional recommender systems, which often struggle to handle cold start users or items with new IDs. Despite its effectiveness, this study uncovers that LLM empowered recommender systems are vulnerable to reconstruction attacks that can expose both system and user privacy. To examine this threat, we present the first systematic study on inversion attacks targeting LLM empowered recommender systems, where adversaries attempt to reconstruct original prompts that contain personal preferences, interaction histories, and demographic attributes by exploiting the output logits of recommendation models. We reproduce the vec2text framework and optimize it using our proposed method called Similarity Guided Refinement, enabling more accurate reconstruction of textual prompts from model generated logits. Extensive experiments across two domains (movies and books) and two representative LLM based recommendation models demonstrate that our method achieves high fidelity reconstructions. Specifically, we can recover nearly 65 percent of the user interacted items and correctly infer age and gender in 87 percent of the cases. The experiments also reveal that privacy leakage is largely insensitive to the victim model's performance but highly dependent on domain consistency and prompt complexity. These findings expose critical privacy vulnerabilities in LLM empowered recommender systems. | Min Tang, Nuo Shen, Shujie Cui, Weiqing Wang, Yubo Wang | |||
| 156 | Agentic Personalisation of Cross-Channel Marketing Experiences | 0 | Consumer applications provide ample opportunities to surface and communicate various forms of content to users. From promotional campaigns for new features or subscriptions, to evergreen nudges for engagement, or personalised recommendations; across e-mails, push notifications, and in-app surfaces. The conventional approach to orchestration for communication relies heavily on labour-intensive manual marketer work, and inhibits effective personalisation of content, timing, frequency, and copy-writing. We formulate this task under a sequential decision-making framework, where we aim to optimise a modular decision-making policy that maximises incremental engagement for any funnel event. Our approach leverages a Difference-in-Differences design for Individual Treatment Effect estimation, and Thompson sampling to balance the explore-exploit trade-off. We present results from a multi-service application, where our methodology has resulted in significant increases to a variety of goal events across several product features, and is currently deployed across 150 million users. | Eleanor Hanna, Olivier Jeunen, Sami Abboud, Schaun Wheeler, Vineesha Raheja | |||
| 157 | Balanced Public Service Media Recommendation Trade-offs with a Light Carbon Footprint | 0 | Public service media (PSM) providers commonly face the challenge of balancing user engagement metrics and public value. In this case study, we report on the insights obtained at ARD, Germany’s largest PSM provider, when investigating the effectiveness of different collaborative filtering techniques on their video-on-demand platform ARD Mediathek. While an offline evaluation indicated that a modern model based on a denoising auto-encoder might lead to the best prediction accuracy, A/B testing revealed that an item-based nearest-neighbor technique excelled both in terms of engagement and public value metrics. Our findings thus suggest that traditional, light-weight models should not be easily dismissed, given also their comparably limited resource requirements and light carbon footprint. To enable future research on this topic, we provide a real-world dataset with usage data from our platform. | David Wittenberg, Dietmar Jannach, Juri Diels, Marcel Hauck, Michael Huber | |||
| 158 | Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates | 0 | Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: While initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions like cost, agility, and knowledge incorporation. We propose a hybrid update strategy that leverages the long-term knowledge adaptation of periodic fine-tuning with the agility of low-cost RAG. We demonstrate through live A/B experiments on a billion-user platform that this hybrid approach yields statistically significant improvements in user satisfaction, offering a practical and cost-effective framework for maintaining high-quality LLM-powered recommender systems. | Changping Meng, Dapeng Hong, Ed H. Chi, Haokai Lu, Hongyi Ling, Jianling Wang, Lichan Hong, Mingyan Gao, Ningren Han, Onkar Dalal, Shuzhou Zhang, Yifan Liu | |||
| 159 | LADDER: LLM-Annotated Data for Dogfooded Evaluation of Rankings | 0 | In this paper we showcase the implementation of LADDER: A method utilizing Large Language Model to annotate thousands of consumer reviews to train a point-wise learning to rank algorithm. By applying LADDER, we significantly improved the relevance of the top 4 reviews presented to users, demonstrably reducing the need to access the full review collection by 5%. This outcome highlights LADDER’s ability to enhance user experience by providing sufficient information within the initial review set, thereby streamlining the decision-making process. We discuss the efficiency gains in large-scale data labeling, the positive impact on trust and relevance in review presentation without sacrificing usability, and key insights into effectively integrating domain expertise into LLM annotation for high-quality results. | Mattia Ottoborgo | |||
| 160 | LLM-Powered Nuanced Video Attribute Annotation for Enhanced Recommendations | 0 | This paper presents a case study of deploying Large Language Models (LLMs) as an advanced "annotation" mechanism to achieve nuanced content understanding (e.g., discerning content "vibe") at scale within an industrial short-form video recommendation system. Traditional machine learning classifiers for content understanding face protracted development cycles and a lack of deep, nuanced comprehension. The "LLM-as-annotators" approach addresses these by significantly shortening development times and enabling the annotation of subtle attributes. This work details an end-to-end workflow encompassing: (1) iterative definition and robust evaluation of target attributes, refined by offline metrics and online A/B testing; (2) scalable offline bulk annotation of video corpora using LLMs with multimodal features, optimized inference, and knowledge distillation for broad application; and (3) integration of these rich annotations into the online recommendation serving system, for example, through personalized restricted retrieval. Experimental results demonstrate the efficacy of this approach, with LLMs outperforming human raters in offline annotation quality for nuanced attributes and yielding significant improvements in user participation and satisfied consumption in online A/B tests. The study provides insights into designing and scaling production-level LLM pipelines for content annotation, highlighting the adaptability and benefits of LLM-based multimodal content understanding for enhancing video discovery, user satisfaction, and the overall effectiveness of modern recommendation systems. | Boyuan Long, Changping Meng, Dapeng Hong, Hiloni Mehta, Mick Zomnir, Mingyan Gao, Ningren Han, Omkar Pathak, Onkar Dalal, Ruolin Jia, Xia Wu, Yajun Peng, Yueqi Wang | |||
| 161 | Not All Impressions Are Created Equal: Psychology-Informed Retention Optimization for Short-Form Video Recommendation | 0 | Recommender systems that are optimized only for short-term engagement can lead to undesirable outcomes and hurt long-term consumer experience. In response, researchers and practitioners have proposed to incorporate retention signals into recommender systems. Existing retention models are built on item-level interactions where every impression is weighted equally. However, on short-form video platforms where content is presented sequentially and passively consumed, users are unlikely to engage equally with every video, and it is hard to establish any meaningful relationships between a short video watch and long-term retention behaviors. In this work, we propose a psychology-informed retention modeling approach grounded in the peak–end rule, which suggests that people evaluate past experiences largely based on the most intense moment (“peak”) and the final moment (“end”). Specifically, we train a retention model that predicts user return based on the peak and end moments of each session, which is then incorporated into a multi-stage recommender system. We implemented our approach on Facebook Reels, one of the world’s largest short-form video recommendation platforms. In a long-term A/B test against the production system, our model delivered significant improvements in Daily Active Users and total sessions, suggesting an improved long-term user experience. | Chuanqi Wei, Jing Zhong, Yanchen Wang, Yuxin Cui, Yuyan Wang, Zellux Wang, Zhaohui Guo | |||
| 162 | Operational Twin-Driven AI Recommender for Strategic Service Planning | 0 | Traditional service management relies heavily on manual processes due to data complexity and human involvement, limiting the impact of AI in strategic planning. We present an AI recommender system that leverages an operational twin of service operations to optimize long-term KPIs using Monte Carlo Search and Mixed-Integer programming. Focusing on personnel allocation for large healthcare equipment, the system accounts for domain-specific constraints like specialization and continuity. We deployed the system at Siemens Healthineers to support over 300,000 equipment across the U.S. and report productivity gains from over a year of real-world use and key lessons for adoption at scale. | Ankur Kapoor, Chetan Srinidhi, Codruta Ene, Dorin Comaniciu, Neil Biehn, Santosh Pai, Sarith Mohan, Ullaskrishnan Poikavila, Vivek Singh | |||
| 163 | RADAR: Recall Augmentation through Deferred Asynchronous Retrieval | 0 | Modern large-scale recommender systems employ a multi-stage ranking funnel (Retrieval, Pre-ranking, Ranking) to balance engagement and computational constraints (latency, CPU). However, the initial retrieval stage, often relying on efficient but less precise methods like K-Nearest Neighbors (KNN), struggles to effectively surface the most engaging items from billion-scale catalogs, particularly distinguishing highly relevant and engaging candidates from merely relevant ones. We introduce Recall Augmentation through Deferred Asynchronous Retrieval (RADAR), a novel framework that leverages asynchronous, offline computation to pre-rank a significantly larger candidate set for users using the full complexity ranking model. These top-ranked items are stored and utilized as a high-quality retrieval source during online inference, bypassing online retrieval and pre-ranking stages for these candidates. We demonstrate through offline experiments that RADAR significantly boosts recall (2X Recall@200 vs DNN retrieval baseline) by effectively combining a larger retrieved candidate set with a more powerful ranking model. Online A/B tests confirm a +0.8% lift in topline engagement metrics, validating RADAR as a practical and effective method to improve recommendation quality under strict online serving constraints. | Ajantha Ramineni, Amit Jaspal, Qian Dang | |||
| 164 | Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations | 0 | Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero parody with slapstick fights and orchestral stabs"), bridging the gap between raw content and user intent. We use MLLM output with a state-of-the-art text encoder and feed it into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to build more intent-aware video recommenders. | Andreas Damianou, Marco De Nadai, Mounia Lalmas | |||
| 165 | eSASRec: Enhancing Transformer-based Recommendations in a Modular Fashion | 0 | Since their introduction, Transformer-based models, such as SASRec and BERT4Rec, have become common baselines for sequential recommendations, surpassing earlier neural and non-neural methods. A number of following publications have shown that the effectiveness of these models can be improved by, for example, slightly updating the architecture of the Transformer layers, using better training objectives, and employing improved loss functions. However, the additivity of these modular improvements has not been systematically benchmarked - this is the gap we aim to close in this paper. Through our experiments, we identify a very strong model that uses SASRec's training objective, LiGR Transformer layers, and Sampled Softmax Loss. We call this combination eSASRec (Enhanced SASRec). While we primarily focus on realistic, production-like evaluation, in our preliminary study we find that common academic benchmarks show eSASRec to be 23 | Aleksandr V. Petrov, Andrei Semenov, Andrey V. Savchenko, Daria Tikhonovich, Mayya Spirina, Nikita Zelinskiy, Sergei Kuliev | |||
| 166 | Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge | 0 | Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context, enabling the LLM to reason more effectively about alignment between a user's interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems. | Alice Wang, Andreas Damianou, Edoardo D'Amico, Francesco Fabbri, Gustavo Penha, Jackie Doremus, Marco De Nadai, Mounia Lalmas, Oskar Stål, Paul Gigioli | |||
| 167 | Fine-tuning for Inference-efficient Calibrated Recommendations | 0 | Calibration is the degree to which a recommender system is able to match the distribution of a certain item attribute among the items consumed by a user with their respective recommendations. Recent work suggests that many recommenders tend to provide miscalibrated recommendations. Furthermore, most approaches aimed at improving calibration adopt the post-processing paradigm, making them computationally costly at inference time. This work proposes CaliTune, a fine-tuning approach applied to collaborative filtering based recommenders to allow them to generate better calibrated recommendations without relying on costly post-processing. We compare CaliTune to an established post-processing approach on two backbone models and datasets from movie and music domains, focusing on popularity calibration. Our results suggest that CaliTune can offer a competitive accuracy–calibration trade-off in several settings, particularly when the backbone model exhibits high miscalibration and accuracy remains important, making it a promising inference-efficient alternative in such cases. | Adrian Bajko, Antonela Tommasel, Markus Schedl, Matthias Wenzel, Max Walder, Oleg Lesota | |||
| 168 | How Fair is Your Diffusion Recommender Model? | 0 | Diffusion-based learning has settled as a rising paradigm in generative recommendation, outperforming traditional approaches built upon variational autoencoders and generative adversarial networks. Despite their effectiveness, concerns have been raised that diffusion models - widely adopted in other machine-learning domains - could potentially lead to unfair outcomes, since they are trained to recover data distributions that often encode inherent biases. Motivated by the related literature, and acknowledging the extensive discussion around bias and fairness aspects in recommendation, we propose, to the best of our knowledge, the first empirical study of fairness for DiffRec, chronologically the pioneer technique in diffusion-based recommendation. Our empirical study involves DiffRec and its variant L-DiffRec, tested against nine recommender systems on two benchmarking datasets to assess recommendation utility and fairness from both consumer and provider perspectives. Specifically, we first evaluate the utility and fairness dimensions separately and, then, within a multi-criteria setting to investigate whether, and to what extent, these approaches can achieve a trade-off between the two. While showing worrying trends in alignment with the more general machine-learning literature on diffusion models, our results also indicate promising directions to address the unfairness issue in future work. The source code is available at https://github.com/danielemalitesta/FairDiffRec. | Daniele Malitesta, Erasmo Purificato, Fragkiskos D. Malliaros, Giacomo Medda, Ludovico Boratto, Mirko Marras | |||
| 169 | Investigating Carbon Footprint of Recommender Systems Beyond Training Time | 0 | The environmental footprint of recommender systems has received growing attention in the research community. While recent work has examined the trade-off between model accuracy and the estimated carbon emissions during training, we argue that a comprehensive evaluation should also account for the emissions produced during inference time, especially in applications where models are deployed for extended periods with frequent inference cycles. In this study, we extend previous carbon footprint analyses from the literature by incorporating the inference phase into the carbon footprint assessment and exploring how variations in training configurations affect emissions. Our findings reveal that models with higher training emissions can, in some cases, offer lower environmental costs at inference time. Moreover, we show that minimizing the number of validation metrics computed during training can lead to significant reductions in overall carbon footprint, highlighting the importance of thoughtful experimental design in sustainable machine learning. | Antonela Tommasel, Josef Schodl, Markus Schedl, Oleg Lesota | |||
| 170 | Learning geometry-aware recommender systems with manifold regularization | 0 | Recent work shows that hyperbolic geometry may be a better option for recommendation systems in some cases due to the natural hierarchy present in user demands. However, the choice of geometry often determines the model architecture by fixing the type of embedding. This paper discusses the manifold regularization problem statement, which allows for preserving the original architecture and standard embeddings while imposing a non-strict geometry constraint. We demonstrate using hyperbolic geometry for neural collaborative filtering in two distinct recommendation tasks based on multilayer perceptron (MLP) networks: top-k recommendation and explicit rating prediction. For a more comprehensive architecture, we also test SASRec. All tasks are evaluated on the Amazon Reviews and MovieLens1M datasets. Experiments show that manifold regularization achieves performance comparable to hyperbolic embeddings on datasets with hierarchical structure without requiring changes to the model architecture and thus leaves initial model inference unaffected. | Alexander Hvatov, Julia Borisova, Zaira Zainulabidova | |||
| 171 | Leveraging Geometric Insights in Hyperbolic Triplet Loss for Improved Recommendations | 0 | Recent studies have demonstrated the potential of hyperbolic geometry for capturing complex patterns from interaction data in recommender systems. In this work, we introduce a novel hyperbolic recommendation model that uses geometrical insights to improve representation learning and increase computational stability at the same time. We reformulate the notion of hyperbolic distances to unlock additional representation capacity over conventional Euclidean space and learn more expressive user and item representations. To better capture user-items interactions, we construct a triplet loss that models ternary relations between users and their corresponding preferred and nonpreferred choices through a mix of pairwise interaction terms driven by the geometry of data. Our hyperbolic approach not only outperforms existing Euclidean and hyperbolic models but also reduces popularity bias, leading to more diverse and personalized recommendations. | Evgeny Frolov, Maxim Rakhuba, Viacheslav Yusupov | |||
| 172 | Lift It Up Right: A Recommender System for Safer Lifting Postures | 0 | Work-related musculoskeletal disorders, often caused by poor lifting posture and unsafe manual handling, continue to pose a significant threat to worker health and safety. This paper presents a health recommender system designed to prevent injury by assessing and correcting posture for lifting techniques. Leveraging monocular video input, our method estimates key ergonomic parameters to compute the Lifting Index based on the Revised NIOSH Lifting Equation. When the computed Lifting Index exceeds a predefined safety threshold, the system automatically generates graphical and textual recommendations to guide the worker towards safer postural strategies. This safety-aware recommender system provides interpretable and actionable feedback without requiring wearable sensors or multi-camera setups, making it suitable for deployment in real-world workplace environments. By integrating ergonomics with recommender system design, we contribute to a new class of context-aware, safety-oriented recommendation technologies tailored for occupational health. | Gaetano Dibenedetto, Helma Torkamaan, Marco Polignano, Pasquale Lops | |||
| 173 | Normative Alignment of Recommender Systems via Internal Label Shift | 0 | Recommender systems optimized solely for user engagement often fail to meet broader normative objectives such as fairness, diversity, or editorial values. We introduce NAILS (Normative Alignment of recommender systems via Internal Label Shift), a simple and scalable method for aligning recommendation outputs with target distributions over item-level attributes (e.g., categories). NAILS modifies the user-conditional item distribution to induce a specified marginal distribution over attributes, leveraging existing user–item preferences without retraining the model. To achieve this, we recast the problem as a form of label shift applied internally within a hierarchical classification framework. Adopting a stakeholder-centric perspective, NAILS enables alignment with global normative goals. Empirically, we show that NAILS consistently improves attribute-level alignment with minimal impact on user engagement, providing a practical mechanism for value-driven recommendation. Our code is available at https://github.com/johanneskruse/nails. | Jes Frellsen, Johannes Kruse, Julian J. McAuley, Kasper Lindskow, Michael Riis Andersen, Pierre-Alexandre Mattei, Ryotaro Shimizu | |||
| 174 | Recommendation Is a Dish Better Served Warm | 0 | In modern recommender systems, experimental settings typically include filtering out cold users and items based on a minimum interaction threshold. However, these thresholds are often chosen arbitrarily and vary widely across studies, leading to inconsistencies that can significantly affect the comparability and reliability of evaluation results. In this paper, we systematically explore the cold-start boundary by examining the criteria used to determine whether a user or an item should be considered cold. Our experiments incrementally vary the number of interactions for different items during training, and gradually update the length of user interaction histories during inference. We investigate the thresholds across several widely used datasets, commonly represented in recent papers from top-tier conferences, and on multiple established recommender baselines. Our findings show that inconsistent selection of cold-start thresholds can either result in the unnecessary removal of valuable data or lead to the misclassification of cold instances as warm, introducing more noise into the system. | Danil Gusak, Evgeny Frolov, Nikita Sukhorukov | |||
| 175 | Recurrent Autoregressive Linear Model for Next-Basket Recommendation | 0 | Next-basket recommendation aims to predict the (sets of) items that a user is most likely to purchase during their next visit, capturing both short-term sequential patterns and long-term user preferences. However, effectively modeling these dynamics remains a challenge for traditional methods, which often struggle with interpretability and computational efficiency, particularly when dealing with intricate temporal dependencies and inter-item relationships. In this paper, we propose ReALM, a Recurrent Autoregressive Linear Model that explicitly captures temporal item-to-item dependencies across multiple time steps. By leveraging a recurrent loss function and a closed-form optimization solution, our approach offers both interpretability and scalability while maintaining competitive accuracy. Experimental results on real-world datasets demonstrate that ReALM outperforms several state-of-the-art baselines in both recommendation quality and efficiency, offering a robust and interpretable solution for modern personalization systems. | Antoine Ledent, Martin Spisák, Pavel Kordík, Rodrigo Alves, Tereza Zmeskalová | |||
| 176 | RicciFlowRec: A Geometric Root Cause Recommender Using Ricci Curvature on Financial Graphs | 0 | We propose RicciFlowRec, a geometric recommendation framework that performs root cause attribution via Ricci curvature and flow on dynamic financial graphs. By modelling evolving interactions among stocks, macroeconomic indicators, and news, we quantify local stress using discrete Ricci curvature and trace shock propagation via Ricci flow. Curvature gradients reveal causal substructures, informing a structural risk-aware ranking function. Preliminary results on S&P 500 data with FinBERT-based sentiment show improved robustness and interpretability under synthetic perturbations. This ongoing work supports curvature-based attribution and early-stage risk-aware ranking, with plans for portfolio optimization and return forecasting. To our knowledge, RicciFlowRec is the first recommender to apply geometric flow-based reasoning in financial decision support. | Anoushka Harit, Zhongtian Sun | |||
| 177 | The Hidden Cost of Defaults in Recommender System Evaluation | 0 | Hyperparameter optimization is critical for improving the performance of recommender systems, yet its implementation is often treated as a neutral or secondary concern. In this work, we shift focus from model benchmarking to auditing the behavior of RecBole, a widely used recommendation framework. We show that RecBole's internal defaults, particularly an undocumented early-stopping policy, can prematurely terminate Random Search and Bayesian Optimization. This limits search coverage in ways that are not visible to users. Using six models and two datasets, we compare search strategies and quantify both performance variance and search path instability. Our findings reveal that hidden framework logic can introduce variability comparable to the differences between search strategies. These results highlight the importance of treating frameworks as active components of experimental design and call for more transparent, reproducibility-aware tooling in recommender systems research. We provide actionable recommendations for researchers and developers to mitigate hidden configuration behaviors and improve the transparency of hyperparameter tuning workflows. | Alan Said, Hannah Berling, Robin Svahn | |||
| 178 | ArtAICare: An End-to-End Platform for Personalized Art Therapy | 0 | We introduce a platform powered by Visual Art recommender systems (VA RecSys) to support art therapy for patients with Post-Intensive Care Syndrome (PICS) or experiencing psychiatric sequelae symptoms such as anxiety, depression, and Post Traumatic Stress Disorder (PTSD). The contribution is threefold: (1) integration of unimodal, multimodal, and cross-domain VA RecSys engines as plug-and-play external APIs for therapeutic art recommendations; (2) development of an end-to-end platform with desktop/mobile/tablet and immersive VR interfaces to connect therapists and patients; and (3) a therapist dashboard providing post-session analytics, including objective and subjective measures, to inform future recommendations. A pilot test with licensed art therapists and patients with PICS demonstrated that the platform enables therapist-supervised personalized therapy, reducing preparation time by 50% and improving affective states by 70.5%. | Bereket Abera Yilma, Luis A. Leiva, Saravanakumar Duraisamy, Stefan Penchev, Tudor Pristav | |||
| 179 | Blooming Beats: An Interactive Music Recommender System Grounded in TRACE Principles and Data Humanism | 0 | Music streaming platforms reduce rich listening experiences to algorithmic black boxes, overlooking personal narratives that make music meaningful. We present Blooming Beats, an explainable recommender system that transforms Spotify listening data into visual narratives using Data Humanism principles. The system embodies TRACE principles: Transparency through visual explanations, Context-awareness by integrating personal context, and Empathy by matching listening stories rather than user profiles. A user study with 8 participants exploring a decade of listening data shows that narrative-driven visualization suggests potential for enhancing transparency and engagement. | Carlos Kirchdorfer, Daniel Lutziger, Ibrahim Al Hazwani, Jürgen Bernard, Luca Huber, Ludovico Boratto, Oliver Robin Aschwanden | |||
| 180 | Large Language Model-based Recommendation System Agents | 0 | A Large Language Model-based agent is an AI assistant that makes use of advanced Tool Calling (TC) and Retrieval Augmented Generation (RAG) techniques to access external tools (e.g., Python code, databases). This allows the agent to consult additional sources of information that are complementary to its pre-trained knowledge. By doing so, re-training or fine-tuning of the LLM each time new knowledge becomes available can be avoided, as the assistant can access this information thanks to the available tools. In this demo, we investigate this idea in the Recommendation Systems (RSs) scenario. In particular, we design an AI assistant for recommendation that can access (i) a pre-trained recommender system, (ii) a database, and (iii) a vector store. The demo shows how the assistant is able to interact with these tools to reply to complex recommendation and explanation queries that require reasoning on the tool’s results. To the best of our knowledge, this is the first attempt at designing LLM-based recommendation system agents. The code for this demo paper is available at this URL. | Brijraj Singh, Niranjan Pedanekar, Tommaso Carraro | |||
| 181 | PRISM: From Individual Preferences to Group Consensus through Conversational AI-Mediated and Visual Explanations | 0 | Group accommodation booking forces travelers to coordinate externally through messaging apps and informal voting, missing opportunities for transparent preference alignment. We present PRISM, an interactive group recommender system that transforms opaque recommendation processes into transparent collaborative visual experiences. PRISM employs a two-phase interaction paradigm: individual preference elicitation through conversational AI, followed by collaborative decision-making via bivariate map preference visualization. A controlled user study with 6 pairs shows PRISM enhances transparency (+1.83 on 5-point scale), consensus building (+2.0), and reduces conformity pressure compared to traditional approaches and interfaces. | Ibrahim Al Hazwani, Jürgen Bernard, Ludovico Boratto, Oana Inel, Oliver Robin Aschwanden | |||
| 182 | Travel Together, Play Together: Gamifying a Group Recommender System for Tourism | 0 | Gamification is increasingly being used in a variety of domains, such as in education to motivate students’ learning, in healthcare contexts to help patients follow medical indications or improve healthy habits, or even in tourism to enrich the tourists’ experience. Recommender Systems (RS) are an application example, where gamification has been added to motivate and challenge tourists while visiting a destination, but only a few use gamification to motivate using the RS itself, and, to the best of our knowledge, there are no Group RS (GRS) that use gamification. Psychological aspects, such as personality, are also being studied to enhance recommendations, since they have been shown to produce better results than generic approaches, but acquiring personality without the social desirability bias associated with questionnaires or a great amount of user interactions is a challenge. In previous studies, we showed serious games can be the leverage needed to implicitly acquire the tourists’ personality and improve recommendations without the observer’s bias. In this demo, we show how we gamified a GRS for tourism prototype by using rewards, a virtual pet, and the serious games. | Goreti Marreiros, Joana Neto, Jorge Lima, José Silva, Luís Conceição, Patrícia Alves | |||
| 183 | Beyond Algorithms: Reclaiming the Interdisciplinary Roots of Recommender Systems (BEYOND 2025) | 0 | This workshop challenges the machine learning-centric focus of modern recommender systems research by reconnecting the field with its interdisciplinary origins and exploring the non-algorithmic dimensions that are crucial to effective recommendation. It fosters a space for reflective, critical, and creative discussions on recommender systems that embrace human values, user experiences, and societal impact. The workshop emphasizes methodological diversity and invites contributions from psychology, human-computer interaction, ethics, design, and other disciplines. | Alan Said, Christine Bauer, Eva Zangerle | |||
| 184 | A Tutorial on Agentic LLM for Recommender Systems | 0 | Recent breakthroughs in multimodal large language models (MLLMs) have paved the way for developing agentic recommender systems that go beyond traditional recommendation methods. This tutorial introduces the key concepts and architectures of agentic recommender systems. We discuss how agentic capabilities such as planning, memory, and multimodal reasoning enable proactive recommendation strategies while addressing challenges in explainability, safety, and lifelong personalization. The session is designed for both academic researchers and industry practitioners interested in the next generation of intelligent recommender systems. | Chengkai Huang, Julian J. McAuley, Junda Wu, Lina Yao, Tong Yu | |||
| 185 | A Hands-on Dive Into Quantum Computing for Recommender Systems | 0 | Quantum Computing (QC) has gained attention for its promise to substantially accelerate the solution of many computationally intensive tasks. Recommender Systems (RS), which must operate over large-scale, heterogeneous data using complex algorithms, are one of many application domains where QC may offer computational advantages. This tutorial aims to provide an accessible introduction to QC, with a focus on the Quantum Annealing (QA) paradigm. The tutorial is designed for an audience without prior expertise in quantum technologies and presents a practical overview of how RS problems such as community detection can be modeled using the Quadratic Unconstrained Binary Optimization (QUBO) formulation and solved using QA. Participants will be guided through the theoretical foundations and will gain hands-on experience in formulating and running RS-related tasks for quantum annealers. | Maurizio Ferrari Dacrema, Paolo Cremonesi | |||
| 186 | Standard Practices for Data Processing and Multimodal Feature Extraction in Recommendation with DataRec and Ducho (D&D4Rec) | 0 | Recommendation pipelines involve several stages that can critically affect performance and reproducibility. However, early pipeline stages remain under-standardized, limiting comparability and interoperability across studies. This tutorial addresses this gap by providing both theoretical insights and hands-on experience with tools and practices for standardized data processing in recommender systems. In the first part, we introduce DataRec, a Python library for reproducible and interoperable data management, and discuss data filtering, splitting, and topological analysis techniques. In the second part, we explore multimodal feature extraction in domains such as fashion, music, and movies, focusing on the challenges of meaningful multimodal integration. We introduce Ducho, a unified framework for extracting audio, visual, and textual features using modern backends, and demonstrate its integration with the evaluation framework Elliot. The tutorial targets researchers and practitioners with an interest in recommender systems, data preprocessing, and multimodal modeling. All materials, including slides, code, datasets, and recordings, will be openly available on a dedicated tutorial website: https://sites.google.com/view/dd4rec-tutorial/. | Alberto Carlo Maria Mancino, Angela Di Fazio, Daniele Malitesta, Matteo Attimonelli, Tommaso Di Noia | |||
| 187 | Adding Value to Low-Resource Industrial Recommender Systems | 0 | This research proposes a modular, resource-aware framework for industrial recommender systems that enables the integration and evaluation of stakeholder values at each stage of the recommendation pipeline. Motivated by the practical constraints of data availability and computational capacity, the framework supports stage-wise optimisation and selective retraining, making it suitable for low-resource environments. Ongoing experiments on open-source and real-world datasets aim to validate the framework’s adaptability, offering a contribution to the design of value-aware and operationally viable recommender systems. | Cornelia M. Kloppers | |||
| 188 | Addressing Multi-stakeholder Fairness Concerns in Recommender Systems Through Social Choice | 0 | Fairness in recommender systems has been discussed on the group and individual level with concerns for both providers and consumers. But many current solutions to improving fairness in recommender systems can only address one fairness concern or have limited definitions of fairness. My research revolves around improving fairness in recommender systems with an approach that addresses multiple and complex fairness concerns. I use SCRUF-D (Social Choice for Recommendation Under Fairness - Dynamic), a multi-agent social choice-based architecture, for reranking recommendations to improve fairness across multiple dimensions. My completed research has evaluated trade-offs between accuracy and fairness when reranking for multiple fairness definitions on the provider side. This includes exploring how different social choice rules and agent allocation mechanisms impact this trade-off. Currently, I am focused on expanding these studies to include individual and consumer-side fairness metrics. My ongoing research aims to evaluate the trade-offs between accuracy and fairness, incorporating consumer-side fairness metrics. Research to handle tensions between different types of fairness and human research to demonstrate the value of SCRUF is being planned. | Amanda Aird | |||
| 189 | Are Recommender Systems Serving Children? Toward Child-Aware Design and Evaluation | 0 | Recommender Systems research continuously improves recommendation strategies to meet the needs of a wide range of users and other stakeholders. However, much of this research centers on the traditional, adult user, often overlooking underrepresented demographics. One such group is children, frequent users of platforms driven by recommender systems. Children differ from adults in preferences and can be particularly vulnerable to certain content, raising questions about the harm recommender systems may pose. This PhD project advocates for child-aware recommender systems: systems that explicitly account for children as part of their users, recognizing their distinct needs, vulnerabilities, and rights. In pursuit of this goal, we investigate how well current recommender systems serve children, auditing algorithmic strategies from two complementary perspectives: The ‘traditional’ perspective focuses on whether recommendations align with children’s preferences. The perspective of ‘non-maleficence’ assesses suitability of content recommended, evaluating whether it respects children’s vulnerabilities to potentially harmful material. To do so, we audit current recommender systems according to both perspectives—not only in the short term, but also in the long term through simulation studies. Beyond auditing, we explore strategies and design directions for making recommender systems more responsible. Outcomes from this work should inform both academic and practitioner communities about the gaps in current systems and lay the groundwork for more equitable, safe, and meaningful recommendations for children. | Robin Ungruh | |||
| 190 | Bayesian Perspectives on Offline Evaluation for Recommender Systems | 0 | Offline evaluation is a fundamental component in the deployment and development of better recommender systems. In recent years, the contextual bandit framework has emerged as a valuable approach for offline and counterfactual evaluation, leading to the increasing interest in estimators based on inverse propensity scoring (IPS), direct methods (DM), and doubly robust (DR) techniques. However, nearly all existing methods rely on frequentist statistics, limiting their ability to capture model uncertainty and reflect it in evaluation outcomes. This work explores the novel research direction of Bayesian statistics for Off-Policy Evaluation (OPE) in recommendation tasks, motivated by the need for reliable estimators that are more robust to distribution shift, data sparsity, and model misspecification. Three underexplored research directions are identified in this work: (i) using posterior uncertainty from Bayesian reward models to design adaptive hybrid estimators, (ii) explicitly modeling all components of the OPE problem (contexts, actions, and rewards) using a joint probabilistic framework, and (iii) quantifying epistemic uncertainty over policy value estimates via posterior inference. By leveraging the Bayesian framework, the aim is to improve the reliability, interpretability, and safety of offline evaluation protocols, offering a new perspective on one of the most persistent challenges in recommender systems research. This perspective is especially relevant in data-scarce or high-stakes settings, where understanding uncertainty is essential for trustworthy decision-making. | Michael Benigni | |||
| 191 | Beyond Persuasion: Adaptive Warnings and Balanced Explanations for Informed Decision-Making in Recommender Systems | 0 | As recommender systems become deeply embedded in digital platforms, designing explanations that are ethical, effective, and user-centered is increasingly important. Traditional strategies often prioritize persuasiveness or transparency but neglect user agency and cognitive differences. This research explores alternative explanation formats, warnings that highlight potential drawbacks and balanced pros-and-cons summaries, to support more informed and autonomous decision-making. In the first year, we published a paper discussing ethical considerations in explanation design for recommender systems. We then conducted a systematic review of user perceptions, a study of warning messages in mobile app interfaces, and a controlled e-commerce experiment comparing baseline, warning, and pros-and-cons explanations. Results indicate that layered explanations improve decision satisfaction, reduce cognitive load, and better align with individual traits like decision style and need for cognition. Building on these findings, we propose a multi-level explanation approach that combines upfront warnings with on-demand balanced details, adaptable across domains. Future work will explore personalization strategies, real-time adaptivity, and generalizability to domains such as media, news, and job recommendations. This research aims to inform the design of transparent, fair, and trustworthy explanation interfaces in recommender systems. | Elaheh Jafari | |||
| 192 | Challenges in Perfume Recommender Systems: Navigating Subjectivity, Context and Sensory Data | 0 | Compared to other recommender system domains, perfume recommendation is highly personalized and more challenging because of its very subjective factors and the complex mixture of senses involved. Individual perfume preferences are influenced by subtle elements such as emotional associations, personal memories, and unique biochemistry, making it difficult for users to clearly express their olfactory preferences. This paper provides insight into the significant challenges in perfume recommendation that I plan to address in my ongoing PhD project. By exploring these areas, I aim to make a meaningful contribution to the ongoing development of perfume recommender systems. | Elena-Ruxandra Lutan | |||
| 193 | Fair and Transparent Recommender Systems for Advertisements | 0 | Recommender systems are central to digital platforms, powering content personalization, user engagement, and revenue generation. In advertising, they operate within a multi-stakeholder environment, bringing together viewers, advertisers, and platform providers with often competing objectives. While such systems enhance targeting precision, their opacity raises concerns around fairness, transparency, and trust. This research, conducted in collaboration with RTL Netherlands, focuses on building fair and transparent recommender systems for advertisements, with particular emphasis on Video-on-Demand (VoD) platforms. I investigate algorithmic interventions and explainability techniques aimed at aligning system behavior with stakeholders’ expectations. By addressing tensions between stakeholders’ objectives and challenges of the ad delivery process, this work contributes to the design of ethically responsible advertising systems that balance commercial goals with accountability and user trust. | Dina Zilbershtein | |||
| 194 | Full-Page Recommender: A Modular Framework for Multi-Carousel Recommendations | 0 | Full-page layouts with multiple carousels are widely used in video streaming platforms, yet understudied in recommender systems research. This paper introduces a structured approach to generating such pages by recommending coherent item collections and optimizing their arrangement. We break the problem into subcomponents and propose methods that balance user relevance, diversity, and coherence. We also present an evaluation framework tailored to this setting. We argue that this approach can improve recommendation quality beyond traditional ranked lists. | Jan Kislinger | |||
| 195 | Narrative-Driven Itinerary Recommendation: LLM Integration for Immersive Urban Walking | 0 | Sedentary behavior, dubbed the disease of the 21st century, is a ubiquitous force driving chronic illness. Yet, traditional itinerary and Point-of-Interest (POI) Recommender Systems (RSs) lack engaging elements that motivate routine urban walking. This research proposes a novel framework combining narrative-driven storytelling with location-based RSs to promote physical activity and immersive urban exploration. This approach introduces a bidirectional alignment between POI and itinerary recommendations and LLM-generated narratives, transforming routine urban walks into dynamic journeys where contextually relevant stories unfold across city locations. Unlike sequential POI recommendations, this framework embeds location suggestions within contextually relevant narratives of various genres, simultaneously promoting health benefits and deeper city exploration. The research addresses three research questions using a method that builds a structured knowledge base by extracting entities (e.g., POIs and characters) and semantic links from narrative corpora, enabling semantic alignment between recommended physical locations and story elements. The core aspects of this work are: (i) context-aware itinerary recommendations and personalized story generation, (ii) bidirectional mapping between RSs and story generation, and (iii) systems design bridging users’ needs to promote urban walking as a health activity. Evaluation employs comparative user studies measuring quality and engagement, route-narrative semantic alignment, and narrative analysis to validate the proposed integrated approach. | Fabio Ferrero | |||
| 196 | Item-centric Exploration for Cold Start Problem | 0 | Recommender systems face a critical challenge in the item cold-start problem, which limits content diversity and exacerbates popularity bias by struggling to recommend new items. While existing solutions often rely on auxiliary data, this paper illuminates a distinct, yet equally pressing, issue stemming from the inherent user-centricity of many recommender systems. We argue that in environments with large and rapidly expanding item inventories, the traditional focus on finding the "best item for a user" can inadvertently obscure the ideal audience for nascent content. To counter this, we introduce the concept of item-centric recommendations, shifting the paradigm to identify the optimal users for new items. Our initial realization of this vision involves an item-centric control integrated into an exploration system. This control employs a Bayesian model with Beta distributions to assess candidate items based on a predicted balance between user satisfaction and the item's inherent quality. Empirical online evaluations reveal that this straightforward control markedly improves cold-start targeting efficacy, enhances user satisfaction with newly explored content, and significantly increases overall exploration efficiency. | Arnab Bhadury, Dong Wang, Junyi Jiao, Mingyan Gao, Onkar Dalal, Yaping Zhang | |||
| 197 | Cold Starting a New Content Type: A Case Study with Netflix Live | 0 | Industrial recommender systems often face challenges when personalizing content under an ever-changing, heterogeneous item catalog. With Netflix, for example, members can watch TV shows and movies on demand, play the latest games, or tune in to thrilling live events. The difficulty of recommending new items with limited historical interaction data is often referred to as "the cold start problem." This problem becomes exacerbated when an entirely new type of content is introduced into a recommender system, requiring the cold start of a new content type. The purpose of this work is to review an algorithmic approach we implemented at Netflix to efficiently cold-start live events. We validated this approach through a series of online experiments that resulted in increased live engagement (+20%) across Netflix’s global member base without negatively impacting core business metrics. | Anne Cocos, Christoph Kofler, Kriti Kohli, Mario García-Armas, Mark Thornburg, Rob Saltiel, Vito Ostuni, Yunan Hu | |||
| 198 | Zero-shot Cross-domain Knowledge Distillation: A Case study on YouTube Music | 0 | Knowledge Distillation (KD) has been widely used to improve the quality of latency sensitive models serving live traffic. However, applying KD in production recommender systems with low traffic is challenging: the limited amount of data restricts the teacher model size, and the cost of training a large dedicated teacher may not be justified. Cross-domain KD offers a cost-effective alternative by leveraging a teacher from a data-rich source domain, but introduces unique technical difficulties, as the features, user interfaces, and prediction tasks can significantly differ. We present a case study of using zero-shot cross-domain KD for multi-task ranking models, transferring knowledge from a (100x) large-scale video recommendation platform (YouTube) to a music recommendation application with significantly lower traffic. We share offline and live experiment results and present findings evaluating different KD techniques in this setting across two ranking models on the music app. Our results demonstrate that zero-shot cross-domain KD is a practical and effective approach to improve the performance of ranking models on low traffic surfaces. | Aniruddh Nath, Bernardo Cunha, Chieh Lo, Gergo Varady, Jochen Klingenhoefer, Li Wei, Nikhil Khani, Shawn Andrews, Srivaths Ranganathan, Tim Steele, Yanwei Song | |||
| 199 | PAIRSAT: Integrating Preference-Based Signals for User Satisfaction Estimation in Dialogue Systems | 0 | User satisfaction estimation in dialogue systems is a fundamental measure for assessing and improving conversational-AI quality and user experience. Current approaches rely on users’ satisfaction annotations, referred to as supervised labels. Yet these labels are scarce, costly to collect, and often domain-specific. Another form of feedback arises when a user selects one of two offered responses in a conversation, usually called a preference signal. In this work, we propose PAIRSAT, a new model for user-satisfaction estimation that integrates both satisfaction labels and preference signals. We reformulate satisfaction prediction as a bounded regression task on a continuous scale, enabling fine-grained modeling of satisfaction levels. To exploit the preference data, we incorporate a pairwise ranking loss that encourages higher predicted satisfaction for accepted conversation responses over rejected ones. PAIRSAT jointly optimizes regression on labeled data and ranking on preference pairs using a Transformer-based encoder. Experiments show that our model outperforms baselines that rely solely on supervised satisfaction labels, underscoring the value of leveraging preference signals as an additional source of information for satisfaction estimation in dialogue systems. | Adir Solomon, Eran Fainman, Osnat Mokryn | |||
| 200 | PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform | 0 | User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than half a billion users across various applications. | Charles Rosenberg, Hanyu Li, Haomiao Li, Jaewon Yang, Jiajing Xu, Kousik Rajesh, Matthew Lawhon, Pong Eksombatchai, Saurabh Vishwas Joshi, Xiangyi Chen, Yi-Ping Hsu, Zelun Wang | |||
| 201 | Counterfactual Inference under Thompson Sampling | 0 | Olivier Jeunen | ||||
| 202 | Feedback-Driven Gradual Discovery for Expanding Musical Preferences | 0 | Alec Nonnemaker, Cynthia Liem, Ralvi Isufaj, Zoltán Szlávik | ||||
| 203 | The XITE Million Sessions Dataset | 0 | Ralvi Isufaj, Ruslan Tsygankov, Zoltán Szlávik | ||||
| 204 | Closing the Online-Offline Gap: A Scalable Framework for Composed Model Evaluation | 0 | Briac Marcatte, Brooke Bian, Chen Chen, Ellie Wen, Enriko Aryanto, Mahanth Kumar Beeraka, Mohamed A. Radwan, Mohsen Malmir, Tianshan Cui, Weikun Lyu, Wenjing Lu, Yang Li, Yining Lu | ||||
| 205 | SASRec in Action: Real-World Adaptations for ZDF Streaming Service | 0 | Andreas Grün, Sebastian Loth, Venkata Harshit Koneru, Xenija Neufeld | ||||
| 206 | Scaling Image Variant Optimization Through Customer Bucketing and Response Caching: A Large-Scale Implementation at Amazon Prime Video | 0 | Bobby Patel, Haiyun Jin | ||||
| 207 | Streaming Trends: A Low-Latency Platform for Dynamic Video Grouping and Trending Corpora Building | 0 | Ashkan Fard, CJ Carey, Caroline Zhou, Li Zhang, Mingyan Gao, Nikos Parotsidis, Qiao Zhang, Scott Wang, Sourabh Bansod, Yang Gu, Yaping Zhang, Yongzhe Wang | ||||
| 208 | Balancing Accuracy and Novelty with Sub-Item Popularity | 0 | Alberto Carlo Maria Mancino, Aleksandr V. Petrov, Chiara Mallamaci, Craig Macdonald, Tommaso Di Noia, Vito Walter Anelli | ||||
| 209 | Meta Off-Policy Estimation | 0 | Olivier Jeunen | ||||
| 210 | Mitigating Popularity Bias in Counterfactual Explanations using Large Language Models | 0 | Arjan Hasami, Masoud Mansoury | ||||
| 211 | Probabilistic Modeling, Learnability and Uncertainty Estimation for Interaction Prediction in Movie Rating Datasets | 0 | Antoine Ledent, Jennifer Poernomo, Nicole Gabrielle Lee Tan, Rodrigo Alves | ||||
| 212 | t-Testing the Waters: Empirically Validating Assumptions for Reliable A/B-Testing | 0 | Olivier Jeunen | ||||
| 213 | APS Explorer: Navigating Algorithm Performance Spaces for Informed Dataset Selection | 0 | Abdullah Abbas, Bart Goethals, Joeran Beel, Michael Heep, Theodor Sperle, Tobias Vente | ||||
| 214 | Multi-Armed Bandits in the Wild | 0 | Kim Falk | ||||
| 215 | Multi-Granularity Distribution Modeling for Video Watch Time Prediction via Exponential-Gaussian Mixture Network | 0 | Jiaqi Chen, Ping Yang, Ruibo Ma, Weiqi Zhao, Xu Zhao, Yao Hu | ||||
| 216 | Prompt-to-Slate: Diffusion Models for Prompt-Conditioned Slate Generation | 0 | Elias Kalomiris, Federico Tomasi, Francesco Fabbri, Justin Carter, Mounia Lalmas, Zhenwen Dai | ||||
| 217 | Rethinking Overconfidence in VAEs: Can Label Smoothing Help? | 0 | Woo-Seong Yun, Yeojun Choi, Yoon-Sik Cho | ||||
| 218 | Stairway to Fairness: Connecting Group and Individual Fairness | 0 | Christina Lioma, Falk Scholer, Maria Maistro, Theresia Veronika Rampisela, Tuukka Ruotsalo | ||||
| 219 | See the Movie, Hear the Song, Read the Book: Extending MovieLens-1M, Last.fm-2K, and DBbook with Multimodal Data | 0 | Cataldo Musto, Elio Musacchio, Giovanni Semeraro, Giuseppe Spillo, Marco de Gemmis, Pasquale Lops | ||||
| 220 | Efficient Off-Policy Evaluation of Content Blending in Station-Based Music Experiences | 0 | Arvind Balasubramanian, Ben London, Chelsea Weaver, Juan Borgnino | ||||
| 221 | Kamae: Bridging Spark and Keras for Seamless ML Preprocessing | 0 | Daniele Donghi, George Barrowclough, James Shinner, Marian Andrecki | ||||
| 222 | Metadata Generation and Evaluation using LLMs - Case Study on Canonical Titles | 0 | Darren Edmonds, Sanja Simonovikj, Sinan Zhu, Yang Sun | ||||
| 223 | Never Miss an Episode: How LLMs are Powering Serial Content Discovery on YouTube | 0 | Aditee Kumthekar, Aditya Mahajan, Andrea Bettale, Li Wei, Mahesh Sathiamoorthy, Zrinka Puljiz | ||||
| 224 | Pareto-Optimal Solution: Optimizing Engagement and Revenue | 0 | Ankit Maheshwari, Maria Peifer, Neeraj Sharma, Sardar Hamidian, Shaghayegh Agah, Shaun Schaeffer | ||||
| 225 | Simulating Discoverability for Upcoming Content in TV Entertainment Platforms | 0 | Adeep Hande, Kishorekumar Sundararajan, Sardar Hamidian, Yidnekachew Endale | ||||
| 226 | Interactive Playlist Generation from Titles | 0 | Eléa Vellard, Enzo Charolois-Pasqua, Pasquale Lisena, Raphaël Troncy, Youssra Rebboud |