Probabilistic Machine Learning for Annotations

Crowdsourced Data Labeling: Enhancing Machine Learning with Annotator Expertise

This project proposes a probabilistic model for learning from crowdsourced data that accounts for variability in annotator expertise. It aims to improve predictive accuracy in noisy labeling environments by modeling each annotator's performance as a function of the input data and the true label.

The project uses the Expectation-Maximization (EM) algorithm with the following approaches:

- Base algorithm: Uses logistic regression fitted with L-BFGS optimization to estimate the true labels while modeling annotator accuracy probabilistically.
- Weighted logistic regression: Introduces LASSO (L1) regularization to account for annotator expertise and prevent overfitting.
- Decision trees: Considered as an alternative but not fully implemented, because logistic regression was a better fit for the dataset.
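
To make the EM idea concrete, here is a minimal sketch under stated assumptions; it is not the project's implementation. It simplifies the model in two ways: each annotator gets a constant sensitivity and specificity instead of an accuracy that depends on the input, and the logistic regression step uses plain gradient ascent instead of L-BFGS, with no LASSO penalty. The names em_crowd and make_synthetic, the synthetic data, and the annotator accuracies are all hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    # Clamp the score so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    # w holds one weight per feature plus a trailing bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def em_crowd(X, Y, n_iter=20, gd_steps=50, lr=0.5, eps=1e-6):
    """X[i]: feature list for item i; Y[i][t]: 0/1 label from annotator t."""
    n, d, T = len(X), len(X[0]), len(Y[0])
    w = [0.0] * (d + 1)
    mu = [sum(row) / T for row in Y]  # initial posterior: fraction of 1-votes
    for _ in range(n_iter):
        # M-step, annotators: soft-count sensitivity (alpha) and specificity (beta).
        pos = sum(mu)
        neg = n - pos
        alpha = [(sum(mu[i] * Y[i][t] for i in range(n)) + eps) / (pos + 2 * eps)
                 for t in range(T)]
        beta = [(sum((1 - mu[i]) * (1 - Y[i][t]) for i in range(n)) + eps)
                / (neg + 2 * eps) for t in range(T)]
        # M-step, classifier: gradient ascent on the soft-label log-likelihood.
        for _ in range(gd_steps):
            grad = [0.0] * (d + 1)
            for x, m in zip(X, mu):
                err = m - predict(w, x)
                for j in range(d):
                    grad[j] += err * x[j]
                grad[-1] += err
            w = [wj + lr * g / n for wj, g in zip(w, grad)]
        # E-step: posterior of the hidden true label, computed in log space.
        for i in range(n):
            p = predict(w, X[i])
            log_a, log_b = math.log(p), math.log(1 - p)
            for t in range(T):
                if Y[i][t] == 1:
                    log_a += math.log(alpha[t])
                    log_b += math.log(1 - beta[t])
                else:
                    log_a += math.log(1 - alpha[t])
                    log_b += math.log(beta[t])
            mu[i] = sigmoid(log_a - log_b)
    return w, alpha, beta, mu


def make_synthetic(n=200, accuracies=(0.9, 0.8, 0.7, 0.6, 0.55), seed=0):
    # Hypothetical data: two features, a linear true label, noisy annotators.
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    z = [1 if x[0] + x[1] > 0 else 0 for x in X]
    Y = [[zi if rng.random() < acc else 1 - zi for acc in accuracies] for zi in z]
    return X, z, Y


X, z, Y = make_synthetic()
w, alpha, beta, mu = em_crowd(X, Y)
accuracy = sum((m > 0.5) == bool(zi) for m, zi in zip(mu, z)) / len(z)
print(f"posterior accuracy vs. hidden labels: {accuracy:.3f}")
print("estimated sensitivities:", [round(a, 2) for a in alpha])
```

The estimated sensitivities and specificities act as per-annotator reliability weights in the E-step, which is the mechanism that lets more reliable annotators count for more than a simple majority vote would allow.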

The dataset consists of radar data from Johns Hopkins University, labeled as good (g) or bad (b) by multiple annotators.

Findings

- Weighted logistic regression with LASSO effectively balances bias and variance, leading to better generalization.
- Increasing the number of annotators improves model performance.
- Logistic regression is more computationally efficient than decision trees for this dataset.
- AUC scores improved with more annotators, confirming the benefit of aggregating diverse inputs:
  - Base algorithm: AUC of 0.881 with 100 annotators.
  - Weighted logistic regression with LASSO: AUC of 0.895 with 100 annotators.
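
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name auc and the toy scores are hypothetical, and it assumes the good (g) class is encoded as 1 and bad (b) as 0. It is an illustration of the metric, not the evaluation code used in the project.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count positive/negative pairs that are ranked correctly.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for a dataset of this size; a rank-based computation would be preferable for large evaluation sets.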

The project presents a robust framework for handling noisy and inconsistent annotations in crowdsourced data, making it valuable for applications such as medical image classification, NLP sentiment analysis, and social science surveys.

It demonstrates how regularization and probabilistic modeling can effectively manage variability in annotator expertise.

Link to the Report