• Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics
  • Frey, Jennifer Carmen <1988>

Subject

  • L-LIN/14 Lingua e traduzione - Lingua tedesca

Description

  • A growing number of studies report insights gained from existing data resources. Among them are analyses of textual data, which gives reason to consider such methods for linguistics as well. However, corpus linguistics usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions. This thesis sheds light on the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating whether existing German language corpora can be repurposed for linguistic inquiry using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora with computational methods. The thesis first introduces strategies and methods that have already been used on language data, discusses how they can assist corpus linguistic analysis, and refers to available toolkits and software as well as to state-of-the-art research and further references. The introduced methodological toolset is then applied in two differently shaped corpus studies that use readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings of student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions based on previous research in the field.
While both studies give linguistic insights that integrate into the current understanding of the investigated phenomena in the German language, they also systematically test the methodological toolset introduced beforehand, allowing a detailed discussion of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics at the end of the thesis.

Date

  • 2020-04-03

Type

  • Doctoral Thesis
  • PeerReviewed

Format

  • application/pdf

Identifier

  • urn:nbn:it:unibo-26018
  • Frey, Jennifer Carmen (2020) Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. Dottorato di ricerca in Traduzione, interpretazione e interculturalità, 32 Ciclo. DOI 10.6092/unibo/amsdottorato/9300.

Relations