Addressing long-tailed distribution in judicial text for criminal motive classification: a balanced contrastive learning approach


Bibliographic Details
Main Authors: Ting Li, Lewen Mi, Xiangyu Meng, Yongju Jia, Lin Zhao, Qi Zhao, Zihao Wei, Guandong Gao, Xiangxian Li
Format: Article
Language: English
Published: SpringerOpen 2025-02-01
Series: EPJ Data Science
Subjects:
Online Access: https://doi.org/10.1140/epjds/s13688-025-00533-1
Description
Summary: Understanding criminal motives is crucial for analyzing criminal psychology and predicting judicial outcomes. Traditional methods for crime motive analysis are heavily based on statistical techniques, requiring specialized knowledge and substantial human resources. With the increasing availability of judicial data, such as legal documents, machine learning approaches hold great potential in this domain. However, a significant challenge is the lack of comprehensive datasets to train these models, and the distribution of crime motive categories in publicly available legal texts often exhibits a long-tailed imbalance. This imbalance can lead to model bias, where the model tends to predict more common criminal motives. To address these challenges, we collected 11,589 legal documents from China Judgements Online (2019–2024) to create a crime motive text dataset. To mitigate the long-tailed issue, we propose a Category-Aware Balanced Contrastive Learning (CA-BCL) method, which effectively enhances the model’s representation of long-tailed data. Specifically, CA-BCL first balances the sampling process to alleviate the class imbalance during prototype construction and then applies balanced contrastive learning to improve the model’s ability to generalize to long-tailed categories, leading to better overall classification performance. Our experimental results demonstrate that CA-BCL significantly outperforms existing text classification models in crime motive classification, while also showing strong generalization capabilities on standard text classification benchmarks.
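The two ingredients the summary names, class-balanced sampling for prototype construction and a balanced contrastive loss, can be sketched roughly as follows. This is a minimal illustrative numpy sketch, not the paper's actual code: the function names, the equal-per-class sampling scheme, and the class-averaged softmax denominator (one common way to balance a supervised contrastive loss) are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def class_balanced_sample(labels, per_class, rng):
    """Draw the same number of examples from every class (sampling with
    replacement when a tail class has too few members), so prototype
    construction sees a balanced batch instead of the long-tailed raw
    distribution."""
    labels = np.asarray(labels)
    idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        idx.extend(rng.choice(members, size=per_class,
                              replace=len(members) < per_class))
    return np.array(idx)

def balanced_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss with a class-averaged denominator:
    exp-similarities are averaged within each class before being summed,
    so head classes with many samples cannot dominate the normaliser."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T / temperature
    labels = np.asarray(labels)
    n = len(labels)
    losses = []
    for i in range(n):
        not_i = np.arange(n) != i
        positives = np.flatnonzero((labels == labels[i]) & not_i)
        if positives.size == 0:
            continue  # no positive pair for this anchor
        denom = 0.0
        for c in np.unique(labels):
            members = np.flatnonzero((labels == c) & not_i)
            if members.size:
                denom += np.exp(sims[i, members]).mean()
        # average negative log-likelihood over this anchor's positives
        losses.append(-np.log(np.exp(sims[i, positives]) / denom).mean())
    return float(np.mean(losses))
```

With a long-tailed toy batch (eight head-class and two tail-class examples), `class_balanced_sample(labels, 3, rng)` returns three indices per class, and the loss stays finite and positive because every class contributes equally to the denominator regardless of its frequency.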
ISSN:2193-1127