Algorithmic Contract Design for Crowdsourced Ranking: An Introduction to Get You Started

1 May 2024

This paper is available on Arxiv under CC 4.0 license.


(1) Kiriaki Frangias;

(2) Andrew Lin;

(3) Ellen Vitercik;

(4) Manolis Zampetakis.

Abstract and Introduction

Warm-up: Agents with known equal disutility

Agents with unknown disutilites


Conclusions and future directions, References

A Summary of notation

B Omitted proofs from Section 2

C Omitted proofs from Section 3

D Additional Information about Experiments


Ranking is fundamental to many areas, such as search engine and recommender system optimization, as well as peer grading. Crowdsourcing, which is often used for these tasks, requires proper incentivization to ensure accurate inputs. In this work, we draw on the field of contract theory from Economics to propose a novel mechanism that enables a principal to accurately rank a set of items by incentivizing agents to provide pairwise comparisons of the items.

Our mechanism implements these incentives by verifying a subset of each agent’s comparisons with a ground-truth ordering, a task we assume to be costly. The agent is compensated (for example, monetarily or with class credit) based on the accuracy of these comparisons. Our mechanism achieves the following guarantees: (1) it only requires the principal to verify O(log s) comparisons, where s is the total number of agents, and (2) it provably achieves higher total utility for principal compared to ranking the items herself with no crowdsourcing.

1 Introduction

Ranking is a central task across a wide range of applications including search engine optimization, recommender systems, and peer grading: search engines rank pages by relevance, recommender systems rank content by entertainment value, and peer graders rank submissions by quality. Platforms often use crowdsourcing to determine rankings [7, 9, 15], especially when ranking many alternatives. In many such cases, it is often more feasible and accurate for agents to provide pairwise comparisons than cardinal evaluations [26, 27].

However, rational agents will only invest the effort necessary to provide accurate comparisons when properly incentivized. In many real world scenarios, such as peer grading, these incentives involve a principal (the ranking’s end-user) verifying a selection of an agent’s comparisons against a ground-truth sorting. The agent’s payment depends on these verified comparisons’ accuracy.

We propose a novel mechanism that enables a principal to rank a set of items from pairwise comparisons while minimizing the requisite payments and number of verifications, which are costly for the principal. To model the principal and agents, we draw on contract theory, a classical field in economics [16] that has become increasingly relevant in computer science [e.g., 6, 10, 14, 32]. The principal assigns comparisons to each agent and verifies a subset of these comparisons against a ground-truth ordering. The agent does not know the subset’s identity. The principal will compensate each agent for correctly evaluating all verified pairs.

Given the principal’s offered payment, the agent decides whether they will exert effort. As in the standard model of contract theory, if the agent exerts effort, they will be “good” at the task with some probability and “bad” at the task with some smaller probability. If they decide not to exert effort, they will be “bad” with certainty. A “good” agent will return the correct comparisons, but a “bad” agent will not. When modeling the agents, we introduce heterogeneity through differing values of the per-comparison disutility that they experience. We therefore assume that every agent can potentially be “good”, but might require more or less compensation to be incentivized to exert effort. As we show, this also allows the principal to incentivize and pay only agents with small disutilities, even though these values are not a priori known to the principal.

Key challenges. The primary obstacle we face in designing a mechanism for this problem lies in determining how to distribute the comparisons among the agents while satisfying several critical requirements:

(a) The number of items assigned to each agent (i.e., the subset of items that make up their pairwise comparisons) should be small, minimizing the agent’s workload.

(b) Every comparison must be evaluated by at least one “good” agent with high probability.

In other words, we want each agent’s comparisons to cover as few items as possible to satisfy (a). However, we need the union of all agents’ comparisons to have sufficiently high redundancy to satisfy (b). These conditions are fundamentally at odds, so satisfying both requires care.

1.1 Our contributions

We summarize our four primary contributions below.

2. Aggregating the noisy comparisons. Next, we present an algorithm for identifying “bad” agents and aggregating comparisons from “good” agents. Our algorithm returns the correct ranking with high probability.

3. Determining payments. We identify the payments necessary to ensure enough agents are “good” to recover the ground-truth sorting. This depends on the disutility it costs agents to make comparisons. We study a range of settings, from a warm-up where the principal knows the disutilities (as is typical in contract theory), to a setting where the disutilities are drawn from a distribution about which the principal has limited information.

4. Principal’s utility. We show that it suffices for the principal to verify O(log s) comparisons to ensure that the ground-truth sorting is recovered. We also identify natural conditions under which, by implementing our proposed contract, the principal achieves higher total utility than she would if she performed all pairwise comparisons herself.

5. Experiments. In experiments, we stress-test our results by relaxing our model along several axes. We demonstrate that even when the assumed model is inaccurate, with a sufficient increase in the payment the principal can recover the ground-truth ranking.

Rank Aggregation. Ranking via pairwise comparisons can be considered as a subfield of rank aggregation [2, 19, 22, 28]. Recent work on rank aggregation with pairwise comparisons has built upon and extended these techniques [4, 5, 8, 18, 30]. Lin [17] provides an extensive introduction. Rank aggregation differs from our work, which models the data sources as agents with differing reliability.

Crowdsourcing. Crowdsourcing offers a cost-effective and efficient way to consolidate feedback from a large number of non-experts. With the rise of crowdsourcing platforms, such as Amazon Mechanical Turk, crowdsourcing has been commonly used in a variety of applications, including sorting [1, 4, 25]. A central task in crowdsourcing is modeling the reliability of annotators [4, 25]. Chen et al. [4], for example, use the Bradley-Terry model for ranking aggregation, assuming that annotators are either perfect or spammers (comparisons are correct with probability 0.5). In contrast, we study a dynamic setting in which annotators can be incentivized to exert effort and therefore increase the accuracy of their work.

Ordinal peer grading. This area is most related to our work. With the rise of massive online open courses (MOOCs), it has become increasingly important to decentralize the grading process in an efficient manner that comes at a low cost to teaching staff [21]. Peer grading can be a solution to this technical hurdle. Ordinal peer grading (in which students are asked to rank other students’ work) is an inexpensive, accurate, and robust way to assess performance [27].

However, many works on ordinal peer grading draw on probabilistic methods to estimate grader reliability and lack perspective on strategic interactions [20, 23, 24]. In contrast, works relating to aggregating feedback from strategic data sources focus mostly on regression or cardinal evaluations of items [3, 11–13, 31]. Our work aims to bridge this gap using a contract between the principal and the agents. The contract structures their interaction and leverages their strategic behavior by incentivizing agents to perform accurate comparisons. It also comes at a low cost to the principal while provably retrieving the correct ranking of items.

This paper is available on Arxiv under CC 4.0 license.