Medical Informatics Center - Dept. Medical Data Science

Thesis Topics

The Medical Data Science group currently offers the thesis topics listed below. If you have questions, please contact the person named under each topic.

Synthetic Data Generation

Synthetic Data Generation in the Medical Domain using Generative Adversarial Networks

Synthetic data generation has become important in the medical field for two reasons. First, medical data are not widely available to researchers due to patient privacy protections. Second, in some cases such as rare diseases, only limited data are available, making diagnosis or treatment difficult even for experts. Synthetic medical data generation can address these issues by providing artificial medical data that resemble real data without being associated with real patients.

The goal of this thesis is to generate synthetic tabular data relevant to the medical field using generative adversarial networks (GANs) and to investigate how synthetic data can improve data analysis (machine learning) performance compared to cases where only real data are considered.

Contact person: Dr. Sina Sadeghi <sina.sadeghi@medizin.uni-leipzig.de>

Discriminatory Power of Record Attributes for Record Linkage

Methods for estimating the discriminatory power of data record attributes in privacy-preserving record linkage

Record linkage is the challenge of identifying duplicate data records across data sources. Not every attribute is equally useful for telling two records apart, e.g. gender vs. birth date for personal records. A few algorithms exist for computing this “discriminatory power” of attributes given two datasets and a set of known true matches. In the context of privacy-preserving record linkage (PPRL), however, this is not possible: access to data is tightly regulated and there is no “gold standard” for true matches. The goal of the thesis is to research methods for computing the discriminatory power of record attributes, given a single dataset in the absence of a standard for true matches, and to evaluate them on an in-house PPRL framework.
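
One candidate measure that needs only a single dataset and no known matches is the Shannon entropy of an attribute's value distribution: the more values an attribute takes and the more evenly records spread over them, the better it separates records. The sketch below illustrates that idea only. It is not a method taken from the in-house PPRL framework, and the function name `attribute_entropy` and the sample records are hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(records, attribute):
    """Shannon entropy (in bits) of the values of one attribute.

    Higher entropy means the attribute spreads records over more,
    more evenly used values, i.e. it tells records apart better.
    """
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample records, for illustration only.
records = [
    {"gender": "f", "birth_date": "1980-01-15"},
    {"gender": "m", "birth_date": "1975-06-30"},
    {"gender": "f", "birth_date": "1992-11-02"},
    {"gender": "m", "birth_date": "1968-03-21"},
]

gender_bits = attribute_entropy(records, "gender")
birth_date_bits = attribute_entropy(records, "birth_date")
```

On this sample, birth date scores higher than gender, which matches the intuition stated above. Entropy ignores typing errors and correlations between attributes, so it is a starting point rather than a complete answer.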

Contact person: Maximilian Jugl <Maximilian.Jugl@medizin.uni-leipzig.de>

Blocking for Record Linkage on Distributed Datasets

Evaluation of blocking methods for privacy-preserving record linkage with distributed datasets

Record linkage is the challenge of identifying duplicate data records across data sources. Comparing all records pairwise is infeasible for large datasets, since the number of comparisons grows with the product of the dataset sizes. Within the field of record linkage, there exist several techniques for grouping similar records together at their respective sources, thereby reducing the runtime of the subsequent linkage process. This grouping step is called “blocking”. There also exist some methods that can be applied to privacy-preserving record linkage (PPRL) in particular, but none of them has been applied to an actual test case with datasets that are distributed across several locations. The goal of the thesis is to implement, evaluate and demonstrate blocking methods on an in-house PPRL framework in an environment with distributed datasets.
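
The effect of blocking can be sketched with a small example: each source groups its own records by a blocking key, and only records whose keys match are compared. The key used here (first letter of the surname plus birth year) and all names and sample records are assumptions for illustration. The sketch works on plaintext values, whereas a PPRL setting would have to derive keys without exposing the underlying data.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical blocking key: first letter of the surname plus birth year."""
    return record["surname"][0].upper() + record["birth_date"][:4]


def build_blocks(records):
    """Group the records of one source by their blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def candidate_pairs(source_a, source_b):
    """Return only the cross-source pairs whose blocking keys match."""
    blocks_a = build_blocks(source_a)
    blocks_b = build_blocks(source_b)
    pairs = []
    for key in blocks_a.keys() & blocks_b.keys():
        for rec_a in blocks_a[key]:
            for rec_b in blocks_b[key]:
                pairs.append((rec_a, rec_b))
    return pairs


# Hypothetical sample data, for illustration only.
source_a = [
    {"surname": "Meyer", "birth_date": "1980-01-15"},
    {"surname": "Schmidt", "birth_date": "1975-06-30"},
    {"surname": "Becker", "birth_date": "1992-11-02"},
]
source_b = [
    {"surname": "Meier", "birth_date": "1980-01-15"},
    {"surname": "Schmitt", "birth_date": "1975-06-30"},
    {"surname": "Wagner", "birth_date": "1968-03-21"},
]

all_pairs = len(source_a) * len(source_b)    # comparisons without blocking
pairs = candidate_pairs(source_a, source_b)  # comparisons with blocking
```

Here the full comparison would need 3 × 3 = 9 pairs, while blocking leaves 2 candidate pairs. The trade-off is that a true match whose key differs between sources (for example because of a typo in the first letter) is never compared.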

Contact person: Maximilian Jugl <Maximilian.Jugl@medizin.uni-leipzig.de>

XNAT Plugin for REDCap Import

XNAT is a system for storing and working with image data in the medical domain, but it can also hold other data. Administrators can install and run plugins that execute code within the XNAT system, for example to perform calculations.

REDCap is a widely used tool for creating and entering structured data in the medical field. Both systems have an API and can push and pull data. For doctors and data scientists working with clinical image data, it would be helpful to get a detailed overview of the data held in XNAT and to look up the related entries in REDCap.

Contact person: Lars Hempel <lars.hempel@medizin.uni-leipzig.de>

NLP for Opinion Mining in Medical Evaluation Blogs

NLP approaches for extracting information about patient care evaluations from public blog systems

Patient evaluations and reviews of the care process in a hospital are very important to the hospital, both for improving internal processes and for detecting hate speech. Such evaluations and reviews are publicly available in blog systems.
The goal of this thesis is to use modern methods from Natural Language Processing (NLP) to extract meaningful information from such patient reviews in a form that can be used for further processing such as opinion mining and hate speech detection.

Contact person: Macedo Maia <macedo.sousamaia@medizin.uni-leipzig.de>

Python Data Quality Assessment

Data Quality Assessment Library for Structured Patient Data

Assessing data quality is an important step before the analysis can start. The data quality of a dataset can be assessed along different dimensions, including integrity, consistency, completeness, and accuracy. Each of these dimensions comprises several domains, which in turn contain indicators that act as measurable items for data quality. The goal of this thesis is to design a Python library that implements most of the already defined data quality indicators for structured data.
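
To make the notion of an indicator concrete, the following minimal sketch computes one indicator from the completeness dimension: the share of records with a non-missing value in a given column. The function name `completeness`, the set of missing-value markers, and the sample records are assumptions for illustration and do not describe the planned library's interface.

```python
MISSING = {None, "", "NA"}  # assumed markers for missing values


def completeness(rows, column):
    """Share of rows (0.0 to 1.0) with a non-missing value in `column`."""
    if not rows:
        raise ValueError("cannot assess an empty dataset")
    # A column that is absent from a row counts as missing (row.get -> None).
    present = sum(1 for row in rows if row.get(column) not in MISSING)
    return present / len(rows)


# Hypothetical patient records, for illustration only.
rows = [
    {"patient_id": "p1", "birth_date": "1980-01-15", "weight_kg": 72.5},
    {"patient_id": "p2", "birth_date": "", "weight_kg": 81.0},
    {"patient_id": "p3", "birth_date": "1992-11-02", "weight_kg": None},
    {"patient_id": "p4", "birth_date": "1968-03-21"},
]

report = {
    col: completeness(rows, col)
    for col in ("patient_id", "birth_date", "weight_kg")
}
```

A library along these lines could expose each indicator as a small function with the same shape (dataset in, number out), so that results from different dimensions can be collected into one report.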

Contact person: Dr. Sina Sadeghi <sina.sadeghi@medizin.uni-leipzig.de>

Stephanstraße 9c, Haus MDS

04103 Leipzig

Phone: 0341 - 97 10283

E-mail: toralf.kirsten@medizin.uni-leipzig.de
