Comparison of data augmentation methods for legal document classification

Gergely Csányi; Tamás Orosz

doi:10.14513/actatechjaur.00628

Authors

Gergely Csányi, MONTANA Knowledge Management Ltd, Hársalja Str. 32., H-1029, Budapest, Hungary; Budapest University of Technology and Economics, Department of Electric Power Engineering, J. Egry Str. 18., H-1111, Budapest, Hungary. ORCID: https://orcid.org/0000-0001-8475-5969
Tamás Orosz, MONTANA Knowledge Management Ltd, Hársalja Str. 32., H-1029, Budapest, Hungary; University of West Bohemia, Univerzitni 26, 306 14 Pilsen, Czech Republic. ORCID: https://orcid.org/0000-0002-8743-3989

Keywords:

text augmentation, augmenting legal cases, legal document classification, data augmentation

Abstract

Sorting legal documents by subject matter is an essential but time-consuming task because of the large volume of data. Many machine learning-based text categorization methods can address this problem. However, these algorithms cannot perform well unless they have enough training data for every category. Text augmentation can mitigate this shortage. Data augmentation is a widely used technique in machine learning applications, especially in computer vision. Textual data has different characteristics than images, so different solutions must be applied when data augmentation is needed. Moreover, the type and characteristics of the textual data, or the task itself, may limit the methods applicable in a given scenario. This paper focuses on text augmentation methods that can be applied to legal documents when classifying them by subject matter.
