


Impact leads: Theresia Veronika Rampisela, Maria Maistro, and Christina Lioma, University of Copenhagen
Subproject: Machine learning and bias
Introduction to the case by Assistant Professor Maria Maistro
Context
Recommender systems are applications that match items, such as services or products, to individuals. The goal is for users to interact (click, view, purchase, etc.) with the items recommended by the systems. They are not only part of common everyday applications, such as e-commerce, entertainment websites, and social media, but also of high-risk applications, including job recommendation and healthcare. For example, job seekers on a job application platform may miss out on employment opportunities because they are frequently placed at the bottom of the rankings shown to recruiters. It is thus fundamental to ensure that these systems provide fair treatment, avoiding discrimination against specific groups or individuals. Fairness of AI systems, such as recommender systems, has become increasingly important and even compulsory for high-risk applications, according to recent legislation, e.g., the EU AI Act.
To accurately quantify the fairness of these systems, we need robust evaluation measures that are applicable in a wide range of cases and easy to interpret. This subproject at the Department of Computer Science at the University of Copenhagen (DIKU) focuses on analysing and developing evaluation measures to assess the level of fairness of recommender systems. This is necessary for practitioners, both from research and industry, to detect cases of unfair treatment and develop mitigation strategies. Specifically, this subproject involved researchers from DIKU and the Royal Melbourne Institute of Technology in Australia, and used data from well-known providers of recommendation systems, such as Amazon and Yelp.
To investigate fairness in recommender systems, the project addressed the following questions:
- What are the strengths and limitations of the current evaluation measures of fairness in recommender systems?
- How do we overcome the limitations of the current evaluation measures of fairness?
- How does increasing fairness affect effectiveness (e.g., accuracy) in recommender systems?
- Which evaluation measures should we use to evaluate fairness in recommender systems?
The study
To evaluate fairness in recommender systems, the research team used real-world datasets provided by tech companies such as Amazon, including ratings given by users to products or content, as well as click logs. These data were used as input to AI models, which generated a ranked list of personalised recommendations based on each user's interaction history. Then, the research team: a) evaluated the theoretical and empirical properties of existing fairness measures, for example, by analysing their mathematical formulations, computing the agreement among measures, and performing stress tests to assess their robustness; and b) created new, improved evaluation measures of fairness. This work received the prestigious best journal paper award from Women in RecSys and was presented at the 18th ACM Conference on Recommender Systems (RecSys 2024).
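To give a flavour of what computing the agreement among measures can look like, the following is a minimal sketch, not the team's actual evaluation code: it applies Kendall's tau rank correlation (via scipy) to hypothetical fairness scores to check whether two measures rank the same systems in the same order.

```python
# Minimal sketch: do two fairness measures agree on how they rank systems?
# The scores below are hypothetical, invented for illustration only.
from scipy.stats import kendalltau

# Fairness scores assigned to the same five recommender systems
# by two different evaluation measures.
measure_a = [0.91, 0.85, 0.78, 0.88, 0.60]
measure_b = [0.40, 0.55, 0.20, 0.35, 0.10]

tau, p_value = kendalltau(measure_a, measure_b)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
# tau near 1 means the two measures rank systems similarly;
# a low or negative tau signals they disagree about which systems are fairer.
```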

Some of the key findings and contributions to the field of recommender systems are as follows:
- Several existing evaluation measures of fairness in recommender systems are unreliable because they are mathematically flawed and often conflict with each other in what they deem fair. This is problematic because, for instance, the measures cannot even be computed for some data, or they can return misleading results, e.g., a system being assessed as fair when, in practice, it is not.
- The research team corrected the above flaws in existing evaluation measures. This made the evaluation measures more reliable, more usable, and easier for humans to interpret. The research team’s reformulated evaluation measures of fairness can now be used across a wider range of evaluation scenarios than the original evaluation measures.
- Increasing fairness sometimes comes at the cost of reduced effectiveness, e.g., less precise recommendations. This is a real-life problem, because practitioners need technology that is both accurate and fair. To address this problem, the research team created a tool that quantifies the trade-off between recommendation fairness and effectiveness, so that fairness and effectiveness can be jointly measured (a simplified sketch of the underlying idea follows this list). This allows practitioners to calibrate their recommendation technologies without compromising accuracy or fairness, while also ensuring that users of this technology receive recommendations that balance accuracy and fairness.
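The sketch below conveys the intuition behind such a trade-off tool: each system becomes a (fairness, effectiveness) point, the best achievable trade-offs form a Pareto frontier, and a system is scored by its distance to that frontier. The numbers, the Euclidean distance, and the construction of the frontier from candidate systems are all simplifying assumptions made for illustration; the team's actual Distance to Pareto Frontier (DPFR) implementation is in their public repository.

```python
# Sketch: score each system by its distance to the Pareto frontier of
# (fairness, effectiveness) points, where higher is better on both axes.
# Toy numbers; a simplified stand-in for the team's actual DPFR tool.
import math

# Hypothetical (fairness, effectiveness) scores, both in [0, 1].
systems = {"sys_a": (0.90, 0.55), "sys_b": (0.70, 0.80), "sys_c": (0.60, 0.60)}

def pareto_frontier(points):
    """Keep the points that no other point dominates on both axes."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

frontier = pareto_frontier(list(systems.values()))

def distance_to_frontier(point):
    """Smaller distance = closer to an optimal fairness-effectiveness trade-off."""
    return min(math.dist(point, f) for f in frontier)

for name, point in systems.items():
    print(f"{name}: distance to frontier = {distance_to_frontier(point):.3f}")
# sys_a and sys_b lie on the frontier (distance 0); sys_c is dominated by
# sys_b, and its distance quantifies how far it is from a good trade-off.
```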
Collaboration with the Royal Melbourne Institute of Technology
As part of her work on the project on fairness in recommender systems, ADD Ph.D. student Theresia Veronika Rampisela visited the Royal Melbourne Institute of Technology in Australia and collaborated with Professor Falk Scholer at the ARC Centre of Excellence for Automated Decision-Making and Society. Prof. Scholer provided expertise on evaluation measures, statistical testing, and human-computer interaction. This collaboration resulted in a new approach to evaluate recommendation fairness not only for individuals but also for groups of users, an area that had not received much attention until then, making this contribution particularly impactful.
Pathways to impact
The research has generated impact mainly through “participation by product” in the form of scientific outputs, such as journal articles and conference papers at top international venues in the recommender systems field. In addition, the research has been communicated to a wider audience through contributions to print and social media, such as opinion pieces, podcasts, and presentations, both in Denmark and internationally.
Collaboration with internal and external peers
The researchers have participated in person in the world’s leading scientific events, where state-of-the-art research by industry and academia is presented every year. The work has been disseminated through presentations at top international conferences, as well as through invited talks and seminars with internal and external peers, for instance at universities in Denmark, Sweden, and the UK.
The research team has publicly released all programming code for the tools and evaluation measures developed in this project. This allows other scientists and technologists not only to verify the robustness of these tools, but also to reuse and adapt them to their needs. All the knowledge produced in this project has therefore been communicated back to the broad public, not only in scientific publications, but also as ready-to-use tools, free for anyone, accompanied by guidelines for practitioners.
Guidelines for selecting fairness evaluation measures
Many fairness evaluation measures exist for recommender systems, making it difficult to choose the right measure for a specific purpose. Therefore, one outcome of this project was a set of guidelines to help researchers and practitioners determine which evaluation measures of fairness to use and how to use them. For instance, these guidelines specify:
- Which evaluation measures should be used when assessing the fairness of an individual recommender system, and which should be used when comparing the fairness of several recommender systems to each other. Ignoring this guideline can lead to gross misestimation of fairness by practitioners.
- How humans should interpret the scores of fairness evaluation measures. Different evaluation measures can be more or less generous, i.e., some give mostly high scores, while others give mostly low scores. This should be taken into account when a human tries to understand what, for example, a score of 90% fairness means for a system. Such a score does not mean that the system’s recommendations are almost perfectly fair if the measure that produced it tends to give scores above 85% for most recommendations.
- When interested in assessing both the fairness and the accuracy of a recommender system, the tool developed by the research team, Distance to Pareto Frontier (DPFR), should be used, because it considers the fairness-effectiveness trade-off, while most existing evaluation measures tend to favour either relevance or fairness.
- Both group and individual fairness should be evaluated, because they do not always agree. A recommender system can be fair to a group, but at the same time highly unfair to individuals within that group. For example, a recommender system might be fair across different genders, but when we consider a specific group, for example women, the model may still discriminate against them based on ethnicity. A toy illustration of this mismatch follows this list.
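The following toy example, with invented exposure numbers, shows how group-level averages can look identical while individuals within one group are treated very differently.

```python
# Toy illustration (invented numbers): group fairness can hold while
# individual fairness fails within a group.
exposure = {
    "group_1": {"user_1": 0.98, "user_2": 0.02},  # very unequal within group
    "group_2": {"user_3": 0.50, "user_4": 0.50},  # equal within group
}

# Group-level view: average exposure per group looks identical.
group_means = {g: sum(users.values()) / len(users) for g, users in exposure.items()}
print("group means:", group_means)  # both 0.50 -> groups appear equally treated

# Individual-level view: the exposure gap within each group tells another story.
for g, users in exposure.items():
    gap = max(users.values()) - min(users.values())
    print(f"{g}: within-group exposure gap = {gap:.2f}")
# group_1's gap of 0.96 shows individuals in that group are treated very
# differently, even though the group-level averages match exactly.
```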
By providing stakeholders with guidelines for selecting fairness evaluation measures, the research team has laid the foundation for future impact among a diverse set of key stakeholders. The guidelines can be applied by consumers and providers of recommender systems, by researchers within the field, by system developers and the companies or institutions that use these systems, and by regulators, ethics experts, and legislators who must understand this technology. These stakeholders can all benefit from the research by following the best practices in fairness evaluation and using the tools produced in this project. For instance, policymakers can use the project’s results to set measurable fairness criteria for high-risk applications, based on easily interpretable fairness measures.
Communication of findings and recommendations in different fora
An important part of the research team’s work on communicating findings and recommendations has been presentations at conferences and guest lectures around the world, from Australia to Europe, Asia, and the United States, underlining the importance and relevance of the project in the field of recommender systems. In addition to the impact achieved by presenting the work at top research conferences and publishing journal articles, the research team has also communicated broadly in print and social media. For instance, Theresia Rampisela contributed to a podcast series and published an online article on the ADD project’s website discussing fairness in recommender systems, where she defines what fairness in recommender systems means: “What do I mean by fairness? (…) Unfortunately, there is no single definition for this, but it is about ensuring users are treated without discrimination”. Results of this project have also been communicated in a keynote talk by Christina Lioma at Infosecurity Expo in Copenhagen, as well as in opinion pieces and interviews in Ingeniøren, Version2, videnskab.dk, tjekdet.dk, and Kristeligt Dagblad, among others.
Conclusion
This project contributes to the area of measuring the fairness of automated recommendations, which is vital to ensuring that technology serves all segments of society equitably. As these systems increasingly influence decisions in areas like hiring, lending, healthcare, and criminal justice, establishing clear fairness metrics is essential not only for ethical accountability, but also to inform and support emerging legislation that protects against systemic bias and discrimination. This work therefore contributes to shaping fairer digital ecosystems and guiding policy that safeguards civil rights in an algorithm-driven world.
Before this project, it was simply not possible to reliably measure how fair an automated recommendation is. A few evaluation measures of recommendation fairness existed, but no one knew how well they worked or how to interpret their scores. This project uncovered critical, previously unrecognized limitations that call into question the validity of widely used evaluation approaches and addressed them by introducing corrections and building new robust fairness evaluation tools that have set a new standard for the field. These tools are state of the art, and they are freely available to all at https://github.com/orgs/diku-irlab/repositories?q=fairness.
References
Rampisela, T. V., Maistro, M., Ruotsalo, T., & Lioma, C. (In press). Measuring Individual User Fairness with User Similarity and Effectiveness Disparity. In 48th European Conference on Information Retrieval (ECIR 2026), Delft, Netherlands.
Rampisela, T. V., Maistro, M., Ruotsalo, T., Scholer, F., & Lioma, C. (2025). Stairway to Fairness: Connecting Group and Individual Fairness. In RecSys ’25: Proceedings of the Nineteenth ACM Conference on Recommender Systems (pp. 677-683). Association for Computing Machinery, Inc. https://doi.org/10.1145/3705328.3748031
Rampisela, T. V., Maistro, M., Ruotsalo, T., Scholer, F., & Lioma, C. (2025). Relevance-aware Individual Item Fairness Measures for Recommender Systems: Limitations and Usage Guidelines. ACM Transactions on Recommender Systems. https://doi.org/10.1145/3765624
Rampisela, T. V., Ruotsalo, T., Maistro, M., & Lioma, C. (2025). Joint Evaluation of Fairness and Relevance in Recommender Systems with Pareto Frontier. In WWW ’25: Proceedings of the ACM on Web Conference 2025 (pp. 1548-1566). Association for Computing Machinery, Inc. https://doi.org/10.1145/3696410.3714589
Rampisela, T. V., Ruotsalo, T., Maistro, M., & Lioma, C. (2024). Can We Trust Recommender System Fairness Evaluation? The Role of Fairness and Relevance. In SIGIR 2024 – Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 271-281). Association for Computing Machinery, Inc. https://doi.org/10.1145/3626772.3657832
Rampisela, T. V., Maistro, M., Ruotsalo, T., & Lioma, C. (2024). Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study. ACM Transactions on Recommender Systems, 3(2), Article 18. https://doi.org/10.1145/3631943
Rampisela, T. V., & Marjanovic, S. V. (2024). Can you trust algorithms to be fair? New research says ‘it depends’. https://algorithms.dk/2024/11/18/can-you-trust-algorithms-to-be-fair-new-research-says-it-depends/
Presentations
Rampisela, T. V. (2026, Mar 2). Connecting Group Fairness and Individual Fairness in Recommender Systems. Online talk at University of Glasgow, Scotland, United Kingdom. https://samoa.dcs.gla.ac.uk/events/viewtalk.jsp?id=20555
Maistro, M. (2024, May 27). Evaluation beyond relevance: accounting for credible, correct and fair information. Talk at the Search Engines course, A.Y. 2023/2024, University of Padua, Padua, Italy.
Rampisela, T. V. (2024, July 14–18). Can we trust recommender system fairness evaluation? The role of fairness and relevance. ACM SIGIR 2024, Washington D.C., USA.
Read more about the subproject
- Septiandri, A. A. (Host). (2024, October 27). S04E03 Keadilan dalam Sistem Rekomendasi (Fairness in Recommender Systems) [Audio podcast].
- Rampisela, T. V., Maistro, M., & Lioma, C. (2025, Oct 1). Ansvarlig AI kræver pålidelige metrikker: Tre mangler går igen (Responsible AI requires reliable metrics: Three recurring flaws) [Magazine article]. Ingeniøren. https://pro.ing.dk/datatech/holdning/ansvarlig-ai-kraever-paalidelige-metrikker-tre-mangler-gaar-igen
- ADD Blogpost (2026). The fault in AI fairness evaluation metrics – worrying flaws and steps forward
