Datasets
June 4, 2026
For easy experimentation, Cornac offers access to a number of popular recommendation benchmark datasets. These are listed below along with their basic characteristics, followed by a usage example. In addition to preference feedback, some of these datasets come with item and/or user auxiliary information, which is grouped into three main categories:
- Text refers to textual information associated with items or users. The usual format of this data is `(item_id, text)` or `(user_id, text)`. Concrete examples of such information are item textual descriptions, product reviews, movie plots, and user reviews, just to name a few.
- Graph, for items, corresponds to a network where nodes (or vertices) are items, and links (or edges) represent relations among items. This information is typically represented by an adjacency matrix in the sparse triplet format `(item_id, item_id, weight)`, or simply `(item_id, item_id)` in the case of unweighted edges. Relations between users (e.g., social network) are represented similarly.
- Image consists of visual information paired with either users or items. The common format for this type of auxiliary data is `(object_id, ndarray)`, where `object_id` could be one of `user_id` or `item_id`, and the `ndarray` may contain the raw images (pixel intensities) or some visual feature vectors extracted from the images, e.g., using deep neural nets. For instance, the Amazon clothing dataset includes product CNN visual features.
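
The three formats above can be pictured with plain Python structures. The snippet below is a minimal sketch on made-up IDs and values (not taken from any Cornac dataset). The helper `to_adjacency` is hypothetical, a plain list stands in for the `ndarray`, and treating every edge as undirected is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data, for illustration only.
item_text = [("i1", "red cotton dress"), ("i2", "leather boots")]  # (item_id, text)
item_graph = [("i1", "i2", 1.0), ("i2", "i3", 0.5)]                 # (item_id, item_id, weight)
item_image = [("i1", [0.1, 0.7, 0.2]), ("i2", [0.9, 0.0, 0.4])]     # (item_id, feature vector)


def to_adjacency(triplets):
    """Turn (src, dst, weight) triplets into a nested dict, adding both directions."""
    adj = defaultdict(dict)
    for src, dst, weight in triplets:
        adj[src][dst] = weight
        adj[dst][src] = weight
    return dict(adj)


adjacency = to_adjacency(item_graph)
print(adjacency["i2"])  # neighbours of item i2 with their edge weights
```

The triplet form stores only the edges that exist, which is why it suits the sparse networks typical of recommendation data.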
How to cite. If you are using one of the datasets listed below in your research, please follow the citation guidelines by the authors (the "source" link below) of each respective dataset.
| Dataset | #Users | #Items | #Interactions | Type | Item Text | Item Graph | Item Image | User Graph |
|---|---|---|---|---|---|---|---|---|
| Amazon Clothing (source) | 5,377 | 3,393 | 13,689 | INT [1,5] | ✔ | ✔ | ✔ | |
| Amazon Digital Music (source) | 5,541 | 3,568 | 64,706 | INT [1,5] | ✔ | | | |
| Amazon Office (source) | 3,703 | 6,523 | 53,282 | INT [1,5] | | ✔ | | |
| Amazon Toy (source) | 19,412 | 11,924 | 167,597 | INT [1,5] | | | | |
| Citeulike (source) | 5,551 | 16,980 | 210,537 | BIN {0,1} | ✔ | | | |
| Epinions (source) | 40,163 | 139,738 | 664,824 | INT [1,5] | | | | ✔ |
| FilmTrust (source) | 1,508 | 2,071 | 35,497 | REAL [0.5,4] | | | | ✔ |
| MovieLens 100k (source) | 943 | 1,682 | 100,000 | INT [1,5] | ✔ | | | |
| MovieLens 1M (source) | 6,040 | 3,706 | 1,000,209 | INT [1,5] | ✔ | | | |
| MovieLens 10M (source) | 69,878 | 10,677 | 10,000,054 | INT [1,5] | ✔ | | | |
| MovieLens 20M (source) | 138,493 | 26,744 | 20,000,263 | INT [1,5] | ✔ | | | |
| Netflix Small (source) | 10,000 | 5,000 | 607,803 | INT [1,5] | | | | |
| Netflix Original (source) | 480,189 | 17,770 | 100,480,507 | INT [1,5] | | | | |
| Tradesy (source) | 19,243 | 165,906 | 394,421 | BIN {0,1} | | | ✔ | |
Usage example
Assume that we are interested in the FilmTrust dataset, which comes with both user-item ratings and user-user trust information. We can load these two pieces of information as follows:
```python
from cornac.datasets import filmtrust
ratings = filmtrust.load_feedback()
trust = filmtrust.load_trust()
```
The rating values are in the range [0.5, 4], and the trust network is undirected. Here are samples from our dataset:
Samples from ratings: [('1', '1', 2.0), ('1', '2', 4.0), ('1', '3', 3.5)]
Samples from trust: [('2', '966', 1.0), ('2', '104', 1.0), ('5', '1509', 1.0)]
Our dataset is now ready to use for model training and evaluation. A concrete example is sorec_filmtrust, which illustrates how to perform an experiment with the SoRec model on FilmTrust. More details regarding the other datasets are available in the documentation.
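
Because the samples above are plain `(user_id, item_id, rating)` tuples, basic statistics like those in the dataset table can be computed directly. The following is a minimal sketch that uses the three rating samples printed above rather than the full dataset; `summarize` is a hypothetical helper, not part of Cornac.

```python
def summarize(ratings):
    """Count distinct users, distinct items, and interactions in (user, item, rating) triplets."""
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    values = [r for _, _, r in ratings]
    return {
        "n_users": len(users),
        "n_items": len(items),
        "n_interactions": len(ratings),
        "min_rating": min(values),
        "max_rating": max(values),
    }


# The rating samples shown above, standing in for the full list of ratings.
sample_ratings = [("1", "1", 2.0), ("1", "2", 4.0), ("1", "3", 3.5)]
stats = summarize(sample_ratings)
print(stats)
```

Running the same helper on the full list returned by the loader should give figures comparable to the FilmTrust row of the table, although this sketch does not verify that.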
Next-Basket Datasets
| Dataset | #Users | #Items | #Baskets | #Interactions | Extra Info. |
|---|---|---|---|---|---|
| Ta Feng (source) | 28,297 | 22,542 | 86,403 | 817,741 | price, quantity |
Next-Item Datasets
| Dataset | #Users | #Items | #Sessions | #Interactions | Extra Info. |
|---|---|---|---|---|---|
| Gowalla (source) | 107,092 | 1,280,969 | 2,710,119 | 6,442,892 | Check-ins location (longitude, latitude) |
| YooChoose (buy) (source) | N/A | 19,949 | 509,696 | 1,150,753 | N/A |
| YooChoose (click) | N/A | 52,739 | 9,249,729 | 33,003,944 | N/A |
| YooChoose (test) | N/A | 42,155 | 2,312,432 | 8,251,791 | N/A |
Session-aware Datasets
Session-aware recommendation extends next-item (session-based) recommendation by associating sessions with identified users. While next-item datasets rely on session-level sequences (e.g., SIT format), session-aware datasets incorporate user identities (e.g., USIT format), allowing models to capture both long-term user preferences across multiple sessions and short-term session-level dynamics.
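
To make the distinction concrete, the sketch below groups toy records into per-user, per-session item sequences. It assumes that USIT stands for `(user, session, item, timestamp)` tuples and SIT for the same tuples without the user column; the IDs and timestamps are invented, and `group_sessions` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical USIT records: (user_id, session_id, item_id, timestamp).
usit = [
    ("u1", "s1", "a", 10),
    ("u1", "s1", "b", 12),
    ("u1", "s2", "c", 50),
    ("u2", "s3", "a", 20),
    ("u1", "s2", "a", 55),
]


def group_sessions(records):
    """Return {user: {session: [items ordered by timestamp]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for user, session, item, _ts in sorted(records, key=lambda r: r[3]):
        grouped[user][session].append(item)
    return {user: dict(sessions) for user, sessions in grouped.items()}


by_user = group_sessions(usit)
print(by_user["u1"])  # both sessions of user u1, each as an ordered item list

# Dropping the user column gives the session-level (SIT) view.
sit = [(s, i, t) for _, s, i, t in usit]
```

With the user column present, a model can link `s1` and `s2` to the same person; in the SIT view the two sessions are unrelated sequences.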
| Dataset | #Users | #Items | #Sessions | #Interactions | #Sessions per User | #Interactions per Item | #Interactions per Session | Density |
|---|---|---|---|---|---|---|---|---|
| Diginetica | 571 | 6,008 | 2,670 | 12,146 | 4.68 | 2.02 | 4.55 | 0.354% |
| RetailRocket | 4,249 | 36,658 | 24,732 | 230,817 | 5.82 | 6.30 | 9.33 | 0.148% |
| Cosmetics | 17,268 | 42,367 | 172,242 | 2,533,262 | 9.97 | 59.79 | 14.71 | 0.346% |
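
The derived columns follow from the four raw counts. The sketch below recomputes them for the Diginetica row, assuming density is defined as interactions divided by (users × items); that definition is inferred from the numbers in the table and is not stated explicitly here. `derived_stats` is a hypothetical helper.

```python
def derived_stats(n_users, n_items, n_sessions, n_interactions):
    """Compute the per-user, per-item, per-session ratios and the density (in %)."""
    return {
        "sessions_per_user": round(n_sessions / n_users, 2),
        "interactions_per_item": round(n_interactions / n_items, 2),
        "interactions_per_session": round(n_interactions / n_sessions, 2),
        "density_pct": round(100 * n_interactions / (n_users * n_items), 3),
    }


# Raw counts for Diginetica, taken from the table above.
diginetica = derived_stats(571, 6008, 2670, 12146)
print(diginetica)
```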
For session-based (next-item) evaluation, Diginetica's `load_val()` and `load_test()` default to `mode="session-based"`. In this mode they return each user's single held-out session (`val_sbr`/`test_sbr`) with no training transitions repeated, which is the clean evaluation set used by session-based models such as FPMC and GRU4Rec. Pass `mode="session-aware"` to load the cumulative files (`val`/`test`) instead, where each user's prior sessions precede the held-out one, as needed by cross-session models.
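
The difference between the two modes can be pictured with a toy example. This is an illustrative sketch only: `evaluation_view` is a hypothetical helper that does not reproduce Cornac's loaders or file layout. It mimics the behaviour described above on invented data, treating each user's last session as the held-out one.

```python
def evaluation_view(sessions_by_user, mode="session-based"):
    """Build a toy evaluation set from each user's chronologically ordered sessions."""
    if mode not in ("session-based", "session-aware"):
        raise ValueError(f"unknown mode: {mode}")
    view = {}
    for user, sessions in sessions_by_user.items():
        if mode == "session-based":
            view[user] = sessions[-1:]    # only the held-out session
        else:
            view[user] = list(sessions)   # prior sessions, then the held-out one
    return view


# Hypothetical sessions (lists of item IDs), oldest first.
toy = {"u1": [["a", "b"], ["c", "a"], ["d", "e"]], "u2": [["a"], ["f", "g"]]}

print(evaluation_view(toy)["u1"])
print(evaluation_view(toy, mode="session-aware")["u1"])
```

In the session-based view each user contributes exactly one session, so no earlier transition is scored twice; the session-aware view keeps the history a cross-session model conditions on.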