MA Thesis for Library and Information Science

An overview of my MA(LIS) thesis, tracking its progress

(This document was originally written in German.)

General Information

Title: Publication Practices for Research Data in University Theses
Subtitle: An Examination of Publication Formats and Methods

University: Humboldt University of Berlin
Faculty: Faculty of Philosophy
Institute: Institute of Library and Information Science

Reviewer 1: Dr. Sarah Dellmann
Reviewer 2: Prof. Dr. Robert Jäschke

Exposé

Introduction

There are three publication forms for research data (RD) in academic theses (AT) (Reilly et al., 2011, pp. 5 f.):

Data fully integrated into the AT (e.g., tables and graphics embedded in the PDF file of the AT),
Data attached to the AT (e.g., files uploaded to the university’s publication server along with the PDF file of the thesis),
Data uploaded to a separate repository and referenced within the AT.

In the academic context, prescriptive articles from the DFG-funded project eDissPlus (Weisbrod et al., 2017; Kleineberg & Kaden, 2018; Weisbrod, 2018) as well as the Policy for Dissertation-related Research Data of the German National Library (Deutsche Nationalbibliothek [DNB], 2017) increasingly provide guidelines for handling RD in AT. However, there are no comprehensive studies on how effective these guidelines are or how they are conveyed to and enforced among students (e.g., through corresponding examination regulations and consultations offered by university libraries). So far, there are at most highly specialized, discipline-specific studies.

This master’s thesis intends to provide a more general investigation in this regard.

Research Question

Main Research Question

How were RD from AT published in the institutional repository of Leibniz University Hannover (LUH Repository) up to December 2023?

This can be divided into the following subordinate research questions:

What proportion of AT had RD published as part of the PDF file?
What proportion of AT had RD published as a separate file in the form of a supplement?
What proportion of AT had RD published in a separate repository?
How are RD in AT distinguished from and linked to the text of the AT?
How do the metadata of an AT indicate that associated RD exist?

Subsidiary Research Question

To what extent have recommendations regarding RD in AT already been incorporated into examination regulations and other guiding documents at German universities?

Methodology

To answer these research questions, the work on the master’s thesis is divided into four modules:

The analysis of German doctoral regulations and overarching guidelines regarding RD.
The manual classification of AT in the LUH Repository regarding RD.
The evaluation of the results from the first two modules focusing on potential recommendations regarding RD.
The training of a model for automatic classification of AT regarding RD based on the results of the preceding manual classification work.

Module 1: Doctoral Regulations

Here, the doctoral regulations and other relevant guiding documents are examined for a simple random sample (n=173) of all universities in Germany that are entitled to award doctorates (N=313). The sample size was calculated for a confidence level of 95 % and a margin of error of 5 %.
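The stated sample size can be reproduced with a short calculation. The following is a minimal sketch, assuming Cochran's formula with a finite population correction and the most conservative expected proportion p = 0.5; the proposal does not say which formula was actually used, and the function name `sample_size` is purely illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction (assumed method)."""
    # Two-sided z-score for the chosen confidence level (about 1.96 for 95 %).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Required sample size for an infinitely large population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink for the finite population and round up to whole institutions.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(sample_size(313))  # 173
```

With a population of 313 institutions, the uncorrected value of about 384 shrinks to 173, which matches the sample size given above.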

Module 2: Manual AT Classification

Here, a stratified sample of AT in the LUH Repository is manually classified according to whether the AT

have no RD,
have RD as part of the PDF file,
have RD as attached file(s), or
have RD in an external repository.
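The four categories and the proportions asked for in the research questions could be represented as follows. This is only a sketch: the names `RDStatus` and `proportions` are hypothetical, the example labels are invented, and the code assumes that each AT receives exactly one category.

```python
from collections import Counter
from enum import Enum


class RDStatus(Enum):
    """The four classification outcomes for an academic thesis (AT)."""
    NONE = "no RD"
    IN_PDF = "RD as part of the PDF file"
    ATTACHED = "RD as attached file(s)"
    EXTERNAL = "RD in an external repository"


def proportions(labels: list) -> dict:
    """Return the share of each RDStatus among the classified theses."""
    if not labels:
        raise ValueError("no classified theses given")
    counts = Counter(labels)
    # Iterate over the enum so that categories with zero hits are reported too.
    return {status: counts[status] / len(labels) for status in RDStatus}


# Invented example: four manually classified theses.
example = [RDStatus.NONE, RDStatus.IN_PDF, RDStatus.NONE, RDStatus.ATTACHED]
for status, share in proportions(example).items():
    print(f"{status.value}: {share:.0%}")
```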

The sample is stratified by the faculties of LUH and by four 3-year periods. For this module, administrative access to the LUH Repository will be obtained; the exact sample size can only be calculated once this access is available. The classification itself considers the content of the PDF file as well as the associated metadata in the LUH Repository.
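Once the number of AT per faculty and period is known, the overall sample could be distributed across the strata. The following sketch assumes proportional allocation with largest-remainder rounding, which is one common choice and not necessarily the one used in the thesis; the stratum names and counts are invented placeholders.

```python
def allocate(strata_sizes: dict, sample_size: int) -> dict:
    """Distribute sample_size across strata in proportion to their size."""
    total = sum(strata_sizes.values())
    # Exact (fractional) share of the sample for every stratum.
    exact = {key: sample_size * size / total for key, size in strata_sizes.items()}
    # Start with the rounded-down shares.
    allocation = {key: int(share) for key, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda key: exact[key] - allocation[key], reverse=True)
    for key in by_remainder[:leftover]:
        allocation[key] += 1
    return allocation


# Invented counts per (faculty, period) stratum.
strata = {
    ("Faculty A", "period 1"): 10,
    ("Faculty A", "period 2"): 20,
    ("Faculty B", "period 1"): 30,
}
print(allocate(strata, 7))
```

Largest-remainder rounding guarantees that the per-stratum counts add up exactly to the intended sample size.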

Module 3: Evaluation & Recommendations

Here, the results of the first two modules are evaluated. Based on the data obtained, concepts are developed for improving the handling of RD in AT and for identifying the target groups that these efforts should primarily address.

Module 4: Training of the Classification Model

Here, the results of the previous classification work are used to train a model that can classify the remaining AT in the LUH Repository according to RD status. The training and construction of the model are expected to follow Younes and Scherp’s work on identifying and extracting datasets in scientific articles (missing reference).

Depending on whether LUH has the resources to review the results, one of two procedures will be used: a one-step procedure (direct identification and extraction via a pre-trained language model such as DeBERTa in question-answering mode) or a two-step procedure (filtering via a multilayer perceptron (MLP), followed by extraction via a pre-trained language model such as RoBERTa). According to current expectations, the one-step procedure has higher precision and therefore requires less post-processing, but lower recall; the two-step procedure has higher recall but lower precision.
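The trade-off between the two procedures can be made concrete by computing precision and recall against the manually classified sample. The sketch below uses invented thesis IDs and invented model outputs purely to illustrate the two measures; it does not reflect actual results, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted: set, relevant: set) -> tuple:
    """Return (precision, recall) for a set of predicted positives."""
    true_positives = len(predicted & relevant)
    # Precision: share of the predicted theses that really contain RD.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the theses with RD that the model actually found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented ground truth: IDs of theses that contain RD according to manual classification.
with_rd = {1, 2, 3, 4}

# Invented outputs: a cautious one-step model and a more inclusive two-step model.
one_step = {1, 2}
two_step = {1, 2, 3, 5, 6}

print(precision_recall(one_step, with_rd))  # (1.0, 0.5)
print(precision_recall(two_step, with_rd))  # (0.6, 0.75)
```

In this invented case, the one-step output contains no false positives but misses half of the theses with RD, while the two-step output finds more of them at the cost of two false positives that would need manual review.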

Schedule

gantt
    title Schedule for the Master's Thesis
    dateFormat YYYY-MM-DD
    tickInterval 1month
    weekday monday
    todayMarker on
    Preparation       :v1, 2024-02-15, 14d
    Module 1          :m1, 2024-02-29, 14d
    Module 2          :m2, 2024-03-14, 30d
    Module 3          :m3, 2024-04-06, 14d
    Module 4          :m4, 2024-04-11, 40d
    Writing Phase     :s1, 2024-05-11, 34d

Figure: A provisional schedule for the completion of the master's thesis as a Gantt chart.

PDF Version

A German PDF version of this proposal (without the Gantt chart) can be downloaded here.

Current Status

References

2018

B-FDM

Zur Veröffentlichung dissertationsbezogener Forschungsdaten: Perspektiven und Kompetenzen von Promovierenden an Berliner Universitäten

Michael Kleineberg, and Ben Kaden

Bausteine Forschungsdatenmanagement, Oct 2018

O-BIB

Pflichtablieferung von Dissertationen mit Forschungsdaten an die DNB – Anlagerungsformen und Datenmodell

Dirk Weisbrod

o-bib. Das offene Bibliotheksjournal, Jul 2018


2017

HUB

eDissPlus – Optionen für die Langzeitarchivierung dissertationsbezogener Forschungsdaten aus Sicht von Bibliotheken und Forschenden

Dirk Weisbrod, Ben Kaden, and Michael Kleineberg

In E-Science-Tage: Forschungsdaten managen, Jul 2017

DNB

Policy der Deutschen Nationalbibliothek für dissertationsbezogene Forschungsdaten

Deutsche Nationalbibliothek [DNB]

Jul 2017


2011

OfDE

Opportunities of Data Exchange: Report on Integration of Data and Publications

Susan Reilly, Wouter Schallier, Sabine Schrimpf, and 2 more authors

Jul 2011
