Graph Neural Networks (GNNs) have shown impressive performance in various domains. Motivated by this success, several GNN-based session-based recommender systems (SBRS) have been proposed over the past few years. The literature suggests that these algorithms can achieve strong performance and outperform well-established neural baseline models. However, recent reproducibility studies indicate that the performance of more complex GNN-based models may sometimes be overstated and that these models may not be as impactful as expected. Moreover, the inconsistent choice of datasets, preprocessing steps, and evaluation protocols across published works makes it difficult to reliably assess progress in the field. In this study, we reassess the performance of three well-established baseline models (GRU4Rec, NARM, and STAMP) and compare them to six more recent GNN-based SBRS within a standardized evaluation framework. Experiments on commonly used datasets for SBRS reveal that the GRU4Rec model in particular, when properly tuned, remains highly competitive and achieves the best results on two of the three datasets. Furthermore, we find that the performance of the GNN-based models varies considerably across datasets. Interestingly, only the comparably early SR-GNN model turns out to be superior in terms of accuracy metrics on one of the datasets. Overall, we speculate that these surprising results may stem from insufficient hyperparameter tuning of the baselines in the original papers.
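Evaluation in SBRS typically follows a next-item prediction protocol, where the model ranks all items at each test event and accuracy is reported with metrics such as HitRate@K and MRR@K. The sketch below illustrates such a metric computation under these common assumptions; the function name, the score-matrix input, and the cutoff K=20 are illustrative and not necessarily the exact setup used in this study.

```python
import numpy as np

def evaluate_next_item(scores, targets, k=20):
    """Minimal sketch of next-item evaluation (HitRate@k, MRR@k).

    scores:  (n_events, n_items) array of model scores per test event
             (hypothetical input; any SBRS model producing item scores fits).
    targets: (n_events,) array with the index of the true next item.
    """
    hits, reciprocal_ranks = 0.0, 0.0
    for event_scores, target in zip(scores, targets):
        # Rank of the target item among all items (1 = best),
        # counting only items with strictly higher scores.
        rank = 1 + np.sum(event_scores > event_scores[target])
        if rank <= k:
            hits += 1.0
            reciprocal_ranks += 1.0 / rank
    n_events = len(targets)
    return hits / n_events, reciprocal_ranks / n_events

if __name__ == "__main__":
    # Toy usage example with random scores for 1000 events over 5000 items.
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(1000, 5000))
    targets = rng.integers(0, 5000, size=1000)
    hr, mrr = evaluate_next_item(scores, targets, k=20)
    print(f"HitRate@20={hr:.4f}  MRR@20={mrr:.4f}")
```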

GitHub: Source code and datasets