Datasets

June 4, 2026

For easy experimentation, Cornac offers access to a number of popular recommendation benchmark datasets. These are listed below along with their basic characteristics, followed by a usage example. In addition to preference feedback, some of these datasets come with item and/or user auxiliary information, which is grouped into three main categories:

Text refers to textual information associated with items or users. The usual format of this data is (item_id, text), or (user_id, text). Concrete examples of such information are item textual descriptions, product reviews, movie plots, and user reviews, just to name a few.
Graph, for items, corresponds to a network where nodes (or vertices) are items, and links (or edges) represent relations among items. This information is typically represented by an adjacency matrix in the sparse triplet format: (item_id, item_id, weight), or simply (item_id, item_id) in the case of unweighted edges. Relations between users (e.g., social network) are represented similarly.
Image consists of visual information paired with either users or items. The common format for this type of auxiliary data is (object_id, ndarray), where object_id is either a user_id or an item_id, and the ndarray contains either the raw images (pixel intensities) or visual feature vectors extracted from the images, e.g., using deep neural nets. For instance, the Amazon clothing dataset includes product CNN visual features.
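
To make the text and graph formats concrete, here is a minimal sketch using plain Python tuples. The item IDs, the texts, and the `to_adjacency` helper are made up for illustration and are not part of Cornac. The sketch assumes that relations are undirected and that unweighted edges get a weight of 1.0. Image data is left out to keep the example dependency-free.

```python
from collections import defaultdict

# Text: (item_id, text) pairs.
item_texts = [
    ("i1", "A lightweight rain jacket."),
    ("i2", "Waterproof hiking boots."),
]

# Graph: sparse triplets (item_id, item_id, weight); the weight may be omitted.
item_edges = [
    ("i1", "i2", 0.5),
    ("i2", "i3"),  # unweighted edge
]


def to_adjacency(edges, default_weight=1.0):
    """Expand sparse triplets into a symmetric adjacency mapping."""
    adjacency = defaultdict(dict)
    for edge in edges:
        src, dst = edge[0], edge[1]
        weight = edge[2] if len(edge) == 3 else default_weight
        adjacency[src][dst] = weight
        adjacency[dst][src] = weight  # assumption: relations are undirected
    return dict(adjacency)


adjacency = to_adjacency(item_edges)
print(adjacency["i2"])  # neighbours of item "i2" with their edge weights
```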

How to cite. If you are using one of the datasets listed below in your research, please follow the citation guidelines by the authors (the "source" link below) of each respective dataset.

Dataset	#Users	#Items	#Interactions	Type	Item Text	Item Graph	Item Image	User Graph
Amazon Clothing (source)	5,377	3,393	13,689	INT [1,5]	✔	✔	✔
Amazon Digital Music (source)	5,541	3,568	64,706	INT [1,5]	✔
Amazon Office (source)	3,703	6,523	53,282	INT [1,5]		✔
Amazon Toy (source)	19,412	11,924	167,597	INT [1,5]
Citeulike (source)	5,551	16,980	210,537	BIN {0,1}	✔
Epinions (source)	40,163	139,738	664,824	INT [1,5]				✔
FilmTrust (source)	1,508	2,071	35,497	REAL [0.5,4]				✔
MovieLens 100k (source)	943	1,682	100,000	INT [1,5]	✔
MovieLens 1M (source)	6,040	3,706	1,000,209	INT [1,5]	✔
MovieLens 10M (source)	69,878	10,677	10,000,054	INT [1,5]	✔
MovieLens 20M (source)	138,493	26,744	20,000,263	INT [1,5]	✔
Netflix Small (source)	10,000	5,000	607,803	INT [1,5]
Netflix Original (source)	480,189	17,770	100,480,507	INT [1,5]
Tradesy (source)	19,243	165,906	394,421	BIN {0,1}			✔

Usage example

Assume that we are interested in the FilmTrust dataset, which comes with both user-item ratings and user-user trust information. We can load these two pieces of information as follows:

from cornac.datasets import filmtrust

ratings = filmtrust.load_feedback()
trust = filmtrust.load_trust()

The rating values are in the range [0.5,4], and the trust network is undirected. Here are samples from our dataset:

Samples from ratings: [('1', '1', 2.0), ('1', '2', 4.0), ('1', '3', 3.5)]
Samples from trust: [('2', '966', 1.0), ('2', '104', 1.0), ('5', '1509', 1.0)]
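
Since the samples are plain (user_id, item_id, rating) tuples, basic statistics of the kind listed in the table above can be recomputed directly from them. The following is a minimal sketch that runs on the three sample ratings shown above rather than on the full dataset; `summarize` is a hypothetical helper written for this example, not a Cornac function.

```python
def summarize(feedback):
    """Count users, items, and interactions, and report the rating range."""
    users = {user for user, _, _ in feedback}
    items = {item for _, item, _ in feedback}
    ratings = [rating for _, _, rating in feedback]
    return {
        "n_users": len(users),
        "n_items": len(items),
        "n_interactions": len(feedback),
        "min_rating": min(ratings),
        "max_rating": max(ratings),
    }


# The three rating samples shown above.
sample_ratings = [("1", "1", 2.0), ("1", "2", 4.0), ("1", "3", 3.5)]
stats = summarize(sample_ratings)
print(stats)
```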

Our dataset is now ready to use for model training and evaluation. A concrete example is sorec_filmtrust, which illustrates how to perform an experiment with the SoRec model on FilmTrust. More details regarding the other datasets are available in the documentation.

Next-Basket Datasets

Dataset	#Users	#Items	#Baskets	#Interactions	Extra Info.
Ta Feng (source)	28,297	22,542	86,403	817,741	price, quantity

Next-Item Datasets

Dataset	#Users	#Items	#Sessions	#Interactions	Extra Info.
Gowalla (source)	107,092	1,280,969	2,710,119	6,442,892	Check-ins location (longitude, latitude)
YooChoose (buy) (source)	N/A	19,949	509,696	1,150,753	N/A
YooChoose (click)	N/A	52,739	9,249,729	33,003,944	N/A
YooChoose (test)	N/A	42,155	2,312,432	8,251,791	N/A

Session-aware Datasets

Session-aware recommendation extends next-item (session-based) recommendation by associating sessions with identified users. While next-item datasets rely on session-level sequences (e.g., SIT format), session-aware datasets incorporate user identities (e.g., USIT format), allowing models to capture both long-term user preferences across multiple sessions and short-term session-level dynamics.
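
As a rough illustration of the difference between the two formats, the sketch below assumes that SIT rows are (session, item, timestamp) tuples and that USIT rows are (user, session, item, timestamp) tuples. The IDs, the timestamps, and the `group_sessions_by_user` helper are invented for the example.

```python
from collections import defaultdict

# USIT rows: (user_id, session_id, item_id, timestamp)
usit_rows = [
    ("u1", "s1", "i3", 10),
    ("u1", "s1", "i7", 12),
    ("u1", "s2", "i7", 50),
    ("u2", "s3", "i1", 20),
]

# Dropping the user column yields SIT rows: sessions are no longer tied to a user.
sit_rows = [(session, item, ts) for _, session, item, ts in usit_rows]


def group_sessions_by_user(rows):
    """Map each user to their sessions, each an item list in time order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for user, session, item, _ in sorted(rows, key=lambda row: row[3]):
        grouped[user][session].append(item)
    return {user: dict(sessions) for user, sessions in grouped.items()}


histories = group_sessions_by_user(usit_rows)
print(histories["u1"])  # both sessions of user "u1"
```

With the user column available, a model can look across `s1` and `s2` for long-term preferences, while each item list still captures the short-term order within a session.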

Dataset	#Users	#Items	#Sessions	#Interactions	#Sessions per User	#Interactions per Item	#Interactions per Session	Density
Diginetica	571	6,008	2,670	12,146	4.68	2.02	4.55	0.354%
RetailRocket	4,249	36,658	24,732	230,817	5.82	6.30	9.33	0.148%
Cosmetics	17,268	42,367	172,242	2,533,262	9.97	59.79	14.71	0.346%
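
The derived columns in this table are consistent with simple ratios of the four counts, with density taken as #Interactions / (#Users × #Items). A quick check in Python, where `derived_stats` is a helper written for this example:

```python
def derived_stats(n_users, n_items, n_sessions, n_interactions):
    """Recompute the ratio columns of the table from the four raw counts."""
    return {
        "sessions_per_user": round(n_sessions / n_users, 2),
        "interactions_per_item": round(n_interactions / n_items, 2),
        "interactions_per_session": round(n_interactions / n_sessions, 2),
        # Density as a percentage of all possible user-item pairs.
        "density_pct": round(100 * n_interactions / (n_users * n_items), 3),
    }


# Counts taken from the Diginetica row.
diginetica_stats = derived_stats(
    n_users=571, n_items=6008, n_sessions=2670, n_interactions=12146
)
print(diginetica_stats)
```

For Diginetica this gives 4.68 sessions per user, 2.02 interactions per item, 4.55 interactions per session, and a density of 0.354%, matching the row above.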

For session-based (next-item) evaluation, Diginetica's load_val() and load_test() default to mode="session-based" and return each user's single held-out session (val_sbr/test_sbr), with no training transitions repeated. This is the clean evaluation set used by session-based models such as FPMC and GRU4Rec. Pass mode="session-aware" to load the cumulative files (val/test) instead, in which each user's prior sessions precede the held-out one, as needed by cross-session models.
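
The difference between the two modes can be pictured with a small toy model. This is only a conceptual sketch, not how Cornac builds or stores these files: `build_eval_sessions` is a hypothetical helper, and it assumes that each user's sessions are given in chronological order and that the last one is the held-out session.

```python
def build_eval_sessions(sessions_by_user, mode="session-based"):
    """Select the sessions each user contributes to an evaluation set.

    sessions_by_user maps a user ID to a chronologically ordered list of
    sessions, each session being a list of item IDs. The last session is
    treated as the held-out one (an assumption made for this sketch).
    """
    if mode not in ("session-based", "session-aware"):
        raise ValueError(f"unknown mode: {mode!r}")
    selected = {}
    for user, sessions in sessions_by_user.items():
        if mode == "session-based":
            # Only the held-out session; earlier sessions are not repeated.
            selected[user] = sessions[-1:]
        else:
            # Cumulative view: prior sessions precede the held-out one.
            selected[user] = list(sessions)
    return selected


toy_sessions = {"u1": [["i1", "i2"], ["i3"], ["i4", "i5"]]}
session_based = build_eval_sessions(toy_sessions)
session_aware = build_eval_sessions(toy_sessions, mode="session-aware")
print(session_based["u1"])
print(session_aware["u1"])
```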