MA Thesis for Library and Information Science
An overview of my MA(LIS) thesis, tracking its progress
(This document was originally written in German.)
General Information
Title: Publication Practices for Research Data in University Theses
Subtitle: An Examination of Publication Formats and Methods
University: Humboldt University of Berlin
Faculty: Faculty of Philosophy
Institute: Institute of Library and Information Science
Reviewer 1: Dr. Sarah Dellmann
Reviewer 2: Prof. Dr. Robert Jäschke
Exposé
Introduction
There are three publication forms for research data (RD) in academic theses (AT) (Reilly et al., 2011, pp. 5 f.):
- Data fully integrated into the AT (e.g., tables and graphics embedded in the PDF file of the AT),
- Data attached to the AT (e.g., files uploaded to the university’s publication server along with the PDF file of the thesis), and
- Data uploaded to a separate repository and referenced within the AT.
In the academic context, prescriptive articles from the DFG-funded project eDissPlus (Weisbrod et al., 2017; Kleineberg & Kaden, 2018; Weisbrod, 2018) as well as the Policy for Dissertation-related Research Data of the German National Library (Deutsche Nationalbibliothek [DNB], 2017) increasingly provide guidelines for handling RD in AT. However, comprehensive studies are lacking on how effective these guidelines are and how they are enforced among students (e.g., through corresponding examination regulations or through consultations offered by university libraries). So far, there are at most highly specialized, discipline-specific studies.
This master’s thesis intends to provide a more general investigation in this regard.
Research Question
Main Research Question
How were RD from AT published in the institutional repository of Leibniz University Hannover (LUH Repository) up to December 2023?
This can be divided into the following subordinate research questions:
- What proportion of AT had RD published as part of the PDF file?
- What proportion of AT had RD published as a separate file in the form of a supplement?
- What proportion of AT had RD published in a separate repository?
- How are RD in AT distinguished from and linked to the text of the AT?
- How is it made visible in the metadata of AT that there are associated research data?
Subsidiary Research Question
To what extent have recommendations regarding RD in AT already been incorporated into examination regulations and other guiding documents at German universities?
Methodology
To answer these research questions, the work for the master’s thesis is divided into four modules:
- The analysis of German doctoral regulations and overarching guidelines regarding RD.
- The manual classification of AT in the LUH Repository regarding RD.
- The evaluation of the results from the first two modules focusing on potential recommendations regarding RD.
- The training of a model for automatic classification of AT regarding RD based on the results of the preceding manual classification work.
Module 1: Doctoral Regulations
Here, the doctoral regulations and other relevant guiding documents are examined for a simple random sample (n = 173) of all German universities entitled to award doctorates (N = 313). The sample size was calculated for a confidence level of 95 % and a margin of error of 5 %.
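The exposé does not say which formula was used for the sample size. As a plausibility check, the following sketch assumes Cochran's formula with a finite population correction and the most conservative proportion p = 0.5; under these assumptions it reproduces the stated figure of 173. The function name and its defaults are illustrative only.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.05, proportion: float = 0.5) -> int:
    """Cochran's sample size with finite population correction (assumed method)."""
    # Two-sided z-score for the requested confidence level (about 1.96 for 95 %).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Correct for the finite population and round up to whole units.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(sample_size(313))  # 173
```

The finite population correction matters here: without it, the same settings would call for 385 universities, which is more than the 313 that exist.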
Module 2: Manual AT Classification
Here, a stratified sample of AT in the LUH Repository is manually classified according to whether the AT
- have no RD,
- have RD as part of the PDF file,
- have RD as attached file(s), or
- have RD in an external repository.
The sample is stratified by LUH faculty and by four three-year periods. For this module, administrative access to the LUH Repository will be obtained; the exact sample size can only be calculated once this access is available. The classification itself considers the content of the PDF file as well as the associated metadata in the LUH Repository.
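A stratified draw of this kind could be sketched as follows. The sketch assumes proportional allocation (the same fraction from every stratum, at least one item each) and a fixed seed for reproducibility; the faculty names, the year range 2012 to 2023, the sampling fraction, and all function names are hypothetical and not taken from the LUH Repository.

```python
import random
from collections import defaultdict


def period(year: int, start: int = 2012, width: int = 3) -> int:
    """Map a year to the index of its multi-year period (hypothetical bounds)."""
    return (year - start) // width


def stratified_sample(records, key, fraction, seed=42):
    """Draw the same fraction from every stratum, at least one item each."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for stratum in sorted(strata):  # sorted so the draw order is deterministic
        members = strata[stratum]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Invented metadata: (identifier, faculty, year), five theses per faculty and year.
combos = [(f, y) for f in ("Law", "Physics") for y in range(2012, 2024) for _ in range(5)]
theses = [(f"t{i}", faculty, year) for i, (faculty, year) in enumerate(combos)]

picked = stratified_sample(theses, key=lambda t: (t[1], period(t[2])), fraction=0.2)
print(len(picked))  # 24: three theses from each of the eight strata
```

Changing `width` in `period` is enough to try a different grouping of years without touching the sampling logic.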
Module 3: Evaluation & Recommendations
Here, the results of the first two modules are evaluated. Based on the data obtained, concepts are developed for improving the handling of RD in AT and for identifying the target groups these efforts should primarily address.
Module 4: Training of the Classification Model
Here, the results of the previous classification work are used to train a model that can classify the remaining AT in the LUH Repository according to RD status. The training and construction of the model are expected to follow Younes and Scherp’s work on identifying and extracting datasets in scientific articles (missing reference).
Depending on whether LUH has the resources to review the results, either a one-step procedure (direct identification and extraction via a pre-trained language model such as DeBERTa in question-answering mode) or a two-step procedure (filtering via a multilayer perceptron (MLP), followed by extraction via a pre-trained language model such as RoBERTa) will be used. According to current expectations, the former has higher precision and therefore requires less post-processing, but lower recall; the latter has higher recall but lower precision.
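The trade-off between the two procedures can be made concrete with a small calculation. The paragraph identifiers below are invented purely for illustration and are not results from the thesis; `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for a set of predicted items."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: ten paragraphs really mention research data.
relevant = set(range(1, 11))

# A cautious one-step model: few hits, all of them correct.
one_step = {1, 2, 3, 4, 5, 6}
# A permissive two-step pipeline: more hits, including three false alarms.
two_step = set(range(1, 10)) | {11, 12, 13}

print(precision_recall(one_step, relevant))  # (1.0, 0.6)
print(precision_recall(two_step, relevant))  # (0.75, 0.9)
```

In this toy case the two-step output would need a manual pass to discard the false alarms, which is why the availability of review resources decides between the two procedures.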
Schedule
gantt
title Schedule for the Master's Thesis
dateFormat YYYY-MM-DD
tickInterval 1month
weekday monday
todayMarker on
Prep. :v1, 2024-02-15, 14d
M. 1 :m1, 2024-02-29, 14d
Module 2 :m2, 2024-03-14, 30d
M. 3 :m3, 2024-04-06, 14d
Module 4 :m4, 2024-04-11, 40d
Writing Phase :s1, 2024-05-11, 34d
PDF Version
A German PDF version of this proposal (without the Gantt chart) can be downloaded here.
Current Status
- Preparation Phase
- Create (Lua)LaTeX template (available on GitHub)
- Obtain access to TIB Confluence
- Obtain access to TIB Remote Desktop
- Optional: Get a Linux system up and running
- Obtain administrative access to the LUH Repository
- Processing Phase
- Module 1
- List of all German universities
- Filter list by eligibility for doctoral studies
- Create script for seed-based random selection from university list (Result: downloadable here)
- Take a simple random sample
- Collect doctoral regulations & other relevant documents of the sample
- Evaluate doctoral regulations of the sample
- Module 2
- Download metadata of all LUH Repository dissertations
- Find a way to automatically download all relevant files
- Check if DSpace 5 provides internal function (Result: not available)
- Create script that downloads all PDF files and accompanying files
- Create script to stratify dissertations into Year+Faculty groupings
- Take stratified random sample
- Reevaluate stratification based on output (Result: switch to three groupings of four years each instead of four groupings of three years each)
- Download all relevant files
- Decide on metadata scheme to classify research data for subsequent upload of classification into DSpace
- Evaluate all dissertations
- Check for internal research data
- Check for accompanying research data
- Check for external research data
- Module 3
- Module 4
- Sort PDF files
- Install Grobid
- Convert PDF files to TEI-XML files
- Sort TEI-XML files by language
- Check TEI-XML data quality
- Create CSV dataset (by paragraph)
- Classify paragraphs of dataset
- Write model training script
- Evaluate performance
- Writing Phase
- Introduction
- First draft
- Final version
- Module 1
- First draft
- Final version
- Module 2
- First draft
- Final version
- Module 3
- First draft
- Final version
- Module 4
- Conclusion
- Submission
- Upload to Zenodo with embargo
- MA Thesis (DOI: 10.5281/zenodo.11506621)
- Dataset (DOI: 10.5281/zenodo.11401021)
- Send to printers
- Send by post
- Send by e-mail
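The Module 4 steps "Convert PDF files to TEI-XML files" and "Create CSV dataset (by paragraph)" could be connected roughly as in the following sketch. It assumes Grobid-style TEI output in the standard TEI namespace and uses only the Python standard library; the column names, the document identifier, and the sample XML are hypothetical, and the thesis scripts may be organized differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace


def tei_paragraphs(tei_xml: str) -> list[str]:
    """Return the non-empty body paragraphs of a TEI document as plain text."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", {"tei": TEI})
    if body is None:
        return []
    paragraphs = []
    for p in body.iter(f"{{{TEI}}}p"):
        # Join nested inline elements and normalize whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


def write_csv(doc_id: str, paragraphs: list[str], handle) -> None:
    """Write one CSV row per paragraph (column names are illustrative)."""
    writer = csv.writer(handle)
    writer.writerow(["doc_id", "paragraph_no", "text"])
    for number, text in enumerate(paragraphs, start=1):
        writer.writerow([doc_id, number, text])


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader/>
  <text><body>
    <div><head>Data</head>
      <p>The survey data are available at <ref>Zenodo</ref>.</p>
      <p>  </p>
      <p>All scripts are attached.</p>
    </div>
  </body></text>
</TEI>"""

buffer = io.StringIO()
write_csv("thesis-001", tei_paragraphs(SAMPLE), buffer)
print(buffer.getvalue())
```

For files on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`, and the `StringIO` buffer would be replaced by a file opened with `newline=""`.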
References
2018
- B-FDM. Zur Veröffentlichung dissertationsbezogener Forschungsdaten: Perspektiven und Kompetenzen von Promovierenden an Berliner Universitäten. Bausteine Forschungsdatenmanagement, Oct 2018
- O-BIB. Pflichtablieferung von Dissertationen mit Forschungsdaten an die DNB – Anlagerungsformen und Datenmodell. o-bib. Das offene Bibliotheksjournal, Jul 2018
2017
- HUB. eDissPlus – Optionen für die Langzeitarchivierung dissertationsbezogener Forschungsdaten aus Sicht von Bibliotheken und Forschenden. In E-Science-Tage: Forschungsdaten managen, Jul 2017
- DNB
2011
- OfDE