Prediction of Monetary Penalties for Data Protection Cases in Multiple Languages
Aaron Ceross and Tingting Zhu
As the use of personal data becomes further entrenched in societal interaction, the regulation of such data continues to grow in importance as an area of law. Data protection authorities, however, have limited resources to address an increasing number of investigations. Appropriate data-driven models, coupled with the automation of decision making, have the potential to help in such circumstances. In this paper, we evaluate machine learning models from the literature (Support Vector Machine (SVM), Random Forest, and Multinomial Naive Bayes (MNB) classifiers) for natural language processing, in order to predict from a description of case facts whether a monetary penalty was levied. We tested these models on a novel data set collected from the data protection authority of Macao in three languages (Chinese, English, and Portuguese). Our experimental results show that the machine learning models provide the predictability necessary to automate the evaluation of data protection cases. In particular, SVM performs consistently across the three languages, achieving an AUROC of 0.725, 0.762, and 0.748 for Chinese, English, and Portuguese, respectively. We further evaluated the interpretability of the results independently for each language and found that the salient texts identified are shared across the three languages.
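To make the task concrete, the following is a minimal sketch of one of the model families named above, a Multinomial Naive Bayes classifier over bag-of-words counts, scored with AUROC. It is written in plain Python and is not the authors' implementation. The function names (`train_mnb`, `penalty_score`, `auroc`), the whitespace tokenisation, the Laplace smoothing value, and the toy case descriptions are all assumptions made for illustration; none of the text comes from the Macao data set.

```python
import math
from collections import Counter


def train_mnb(docs, labels, alpha=1.0):
    """Fit a two-class multinomial Naive Bayes model on whitespace tokens.

    alpha is an assumed Laplace smoothing constant.
    """
    vocab = {tok for doc in docs for tok in doc.split()}
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(doc.split())
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for y in (0, 1):
        total = sum(counts[y].values()) + alpha * len(vocab)
        model["log_prior"][y] = math.log(n_docs[y] / len(labels))
        model["log_lik"][y] = {
            t: math.log((counts[y][t] + alpha) / total) for t in vocab
        }
    return model


def penalty_score(model, doc):
    """Log-odds that a case description belongs to class 1 (penalty levied)."""
    score = model["log_prior"][1] - model["log_prior"][0]
    for tok in doc.split():
        if tok in model["vocab"]:  # tokens unseen in training are ignored
            score += model["log_lik"][1][tok] - model["log_lik"][0][tok]
    return score


def auroc(scores, labels):
    """AUROC as the chance a positive case outranks a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy case facts; 1 = monetary penalty levied, 0 = no penalty.
train_docs = [
    "company disclosed customer data without consent",
    "operator sent marketing messages without consent",
    "complaint withdrawn after clarification",
    "camera installed lawfully with notice",
]
train_labels = [1, 1, 0, 0]
test_docs = ["data shared without consent", "camera notice displayed lawfully"]
test_labels = [1, 0]

model = train_mnb(train_docs, train_labels)
scores = [penalty_score(model, d) for d in test_docs]
print("AUROC on toy test set:", auroc(scores, test_labels))
```

The per-token log-likelihood differences in `penalty_score` also give a simple view of which words push a case towards a penalty, which is one plausible way to inspect salient text separately for each language. Chinese text would need a proper word segmenter in place of `str.split`.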