Bielefeld Center for Data Science

Lecture Series Data Science

In each term BiCDaS offers a lecture series on data science. It takes place on different days throughout the term.

We are proud to once again have convinced a highly skilled set of internal and external data science professionals to join us as speakers.

The talks during the summer term 2025

Prof. Dr. Nils Reiter

More Open Science is not always the same as better science

Date: May 20th 2025

Time: 10:15 a.m. (s.t.)

Location: X-E1-201

Open Science, the move(ment) to make the processes and results of scientific research accessible to everyone, has gained popularity in recent years, and the goal of practising “open science” can now be found in many proposals, statements and web pages. The talk takes a critical perspective, highlighting areas in which open science may actually have a negative impact. This will be discussed for two aspects of contemporary research: peer review and the use of large language models. For (open) peer review, different forms of openness can be discussed, such as the anonymity of reviewers and authors, or the publication of review texts. Training large language models is not possible without a huge amount of text, and open data makes this substantially easier. For both cases, the talk will highlight some of the negative “side effects” and try to give some recommendations for avoiding them.

The talks during the summer term 2024

Prof. Dr. Andreas Witt

Leibniz-Institut für Deutsche Sprache

Digital Literacy mit ChatGPT: Kompetenzen zum Umgang mit digitalen Texten (Digital Literacy with ChatGPT: Skills for Working with Digital Texts)

Date: Monday, June 17th 2024
Time: 12:15 p.m. (s.t.)
Location: Cafe Nordlicht

Research data are ubiquitous in the humanities, and researchers must be able to handle them with confidence. This requires continuous engagement with digital competencies, an area to which the Digital Humanities have been devoted for many years. With the increasing spread of generative language models, the acquisition of knowledge has become not only faster but also more individualized. This talk addresses exactly this topic and presents ChatGPT as a powerful instrument for promoting digital literacy.

The talk begins with a conceptual introduction to digital literacy and highlights the essential role of digital texts in scholarly work. The Text+ project, a consortium within the German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur, NFDI), is also briefly introduced.

A closer look is then taken at various aspects of the representation of digital texts, from character encoding to the annotation of analysis results. In particular, the XML-based annotation framework of the Text Encoding Initiative (TEI) is examined in more detail.

Besides the representation of digital texts, the analysis and further processing of digital research resources is also addressed. Programming skills are indispensable for this and therefore have to be acquired as well. In particular, the talk covers the programming language Python and the open-source library spaCy, which serves as a powerful tool for natural language processing.

This talk is based on a course for students of German studies at the University of Mannheim that took place in the spring semester of 2024. The experiences and insights from this course, "Digital Literacy mit ChatGPT", are not only applicable to future courses but can also be transferred to similar courses in other disciplines, in order to promote a broader understanding of digital competence and to ease the integration of digital tools into academic research.

The talks during the winter term 2019/20

Prof. Dr. Dietmar Bauer

Econometrics

Bielefeld University

Are we there yet? Predicting travel times in urban street networks.

Date: Friday, December 13th 2019
Time: 11:00 a.m. (s.t.)
Location: C2-136

Knowing the travel time for a given route is important for logistics applications, but also for individual travel, as documented by the predictions provided, for example, in Google Maps. In both applications we want to know not only the expected travel time but also the associated uncertainty, as being late might typically incur a larger penalty than being too early.

For predicting travel times, a large range of time series methods is applied to a very diverse landscape of data sources, each with its own strengths and problems. The methods include many time series analysis techniques for univariate and multivariate data sets, using linear and non-linear models.

In this talk I will describe how some of the underlying problems are solved using insights from domain knowledge and statistical data analysis methods. The main underlying theme is that brute-force, purely data-driven modelling does not work. Purely theory-driven modelling is typically not sufficient either. It is the combination of these two approaches that leads to success.

I will also hint at some of the current challenges, both technological and institutional. This will lead to my answers to the question: are we there yet, can we provide reliable travel time predictions?

Prof. Dr. Andreas Henrich

Media Informatics

University of Bamberg

Research data federation in the arts and humanities: background, concepts and services in DARIAH-DE

Date: Friday, January 10th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

A fundamental function of cultural heritage projects and institutions is the collection, cataloguing and persistent storage of material of human culture. Challenges emerge with the growing need to digitize such material and to provide it to research communities or the general public. A large number of cultural heritage collections are stored in depots and might never reach a digitally published state, mainly due to the necessary selectivity of digitization projects and the amount and diversity of artifacts that are being acquired and collected. In this respect, the openness, interoperability and sustainability of repositories are of particular importance, as these requirements form a foundation for the long-term preservation of digital cultural heritage along with the scholarly discourse.

In this talk, we discuss the value and benefits that adopting the FAIR principles [Footnote: findable, accessible, interoperable, reusable; see: https://www.go-fair.org/fair-principles] brings to research data management in the digital humanities. After a brief look at the current perspectives and requirements of funding organizations with regard to the FAIR principles and their adoption, we present results of the Digital Research Infrastructure for the Arts and Humanities (DARIAH). The BMBF-funded initiative has developed strategies and services that facilitate the introduction of the FAIR principles to collections of cultural heritage. In our talk, we particularly focus on methods that allow the explication of contextual knowledge about existing collections, in order to make their content interpretable in integrative contexts and hence help such collections to become FAIR.

Prof. Dr. Tobias Schäfers

Marketing

Copenhagen Business School

Stefan Brinkhoff

Marketing

TU Dortmund

Mobile In-Store Analytics: Examining Shopper Behavior and Reactions to Location-Based Mobile Promotions

Date: Friday, January 17th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

Sensors embedded in mobile phones make it possible to analyze contextual factors, such as consumers' current outdoor location, and to use them for targeting purposes, which has been shown to improve customer responses to mobile promotions. However, many customers also use their phones inside stores, which provides opportunities for better understanding in-store behavior and for location-based targeting during the actual shopping process. We present the results of a unique large-scale field test, which we conducted in cooperation with a fashion retailer. Specifically, we developed and employed a mobile application that is capable of tracking customers in the store and of delivering individualized promotion messages. We use the application to examine the effects of in-store behavior on purchases, and to conduct a between-subjects experiment that tests the effects of different location-based promotion types.

Unfortunately, no recording of this lecture is available. We apologise.

Dr. Hendrik Ballhausen

SMPC

Ludwig-Maximilians-Universität München

Secure Multiparty Computation - digital collaboration on private data

Date: Friday, January 24th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

The free flow of information is the lifeblood of digital industries and much of modern science. Yet collaboration is impeded by inefficiently low levels of data sharing, as proprietary data is kept in private silos. Often, data exchange is prevented by competition and lack of trust between data owners, by consumer and privacy concerns, and by data protection regulation. Is there a way to reconcile digital collaboration with data privacy?

Secure Multiparty Computation (SMPC), a disruptive technology, promises to do just that. It simulates a virtual trusted third party as a cryptographic network between the parties. The network only exists as long as all parties actively engage in the calculation, and the joint result is distributed to all. The individual private data, however, remains with the original data owners, and nothing can be learned by an external or internal attacker except for the intended result of the computation.

In many ways, secure multiparty computation is similar to the blockchain. Both work in trustless settings without a central authority. However, while distributed ledger technologies provide trust, reliability, and transparency, secure multiparty computation provides privacy, security, dynamic consent, and control over proprietary data.

This talk provides a brief non-technical introduction to secure multiparty computation. A proof-of-principle is presented in which proprietary patient data was jointly evaluated between two remote university hospitals.
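
To make the idea concrete, here is a minimal sketch of one textbook building block of secure multiparty computation, additive secret sharing, used to compute a joint sum. It is an illustration only and is not claimed to be the protocol used in the proof-of-principle mentioned above; the modulus `PRIME`, the function names `share` and `secure_sum`, and the example counts are all assumptions made for this sketch.

```python
import secrets

# Assumed modulus for this sketch: a large prime. Inputs must be
# non-negative integers whose sum stays below it.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_inputs):
    """Simulate a joint sum in which no party sees another party's input."""
    n = len(private_inputs)
    # Each party i splits its private input and sends share j to party j.
    distributed = [share(value, n) for value in private_inputs]
    # Each party j adds up the shares it received; every individual share
    # is a uniformly random number and carries no information by itself.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # The parties publish only their partial sums, which combine to the total.
    return sum(partial_sums) % PRIME


# Hypothetical case counts held by three institutions.
total = secure_sum([12, 30, 7])
```

In this simulation all shares live in one process; in a real deployment each party would run on its own machine and exchange shares over authenticated channels. The joint result itself can still reveal information: with only two parties, each can subtract its own input from the sum.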

Friedrich Summann

Universitätsbibliothek

Bielefeld LibTec

Data Science Techniques in Digital Humanities and Digital Knowledge Services

Date: January 31st 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

From 2015 to 2019, the Faculty of Linguistics and Literary Studies of Bielefeld University carried out the DFG project "Kinder und Jugendliteratur im Medienverbund 1933-1945". The University Library developed the technical infrastructure for data acquisition, data processing and data visualisation. Based on a complex data model that describes different types of media (especially films, radio broadcasts and print editions, but also theatre performances, records, television broadcasts and advertising material) as manifestations of literary material, the various linkages have to be worked out and visualised. The lecture introduces the approaches and methods used and their implementation in detail, and also reports on the use of these techniques in other digital services of the University Library. This includes in particular analyses of the global publication network in the context of bibliometric information.

Prof. Dr. Douglas M Bates

Statistics

University of Wisconsin-Madison

The R-Project

Programming Languages in Data Science

Date: February 14th, 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

At present, R and Python are the most widely used programming languages for data science. R, based on the earlier language S, which was developed at Bell Labs in the 1980s, is by design intended for data analysis and graphics applications.

Python, developed in the 1990s for general applications, has seen widespread adoption in the scientific computing and data science communities. Both of these are dynamic languages that can be used interactively in a REPL (read-eval-print loop), allowing for rapid prototyping of algorithms and on-the-fly analyses. Both languages also have well-developed infrastructure, including integrated development environments and thousands of user-contributed packages available in repositories. Both languages are open source and freely available.

As a member of the core development team for R, I have considerable experience using R and developing packages for R. I have also used Python, though not to the same extent. Despite a long history with these languages, I recently switched my development efforts to Julia, primarily for flexibility.

I will compare and contrast these languages and explain why I choose to use the less mature but more flexible one for data science.

The talks during the winter term 2018/19

Improving Health through Data Science

Prof. Dr. Christiane Fuchs

Data Science

Bielefeld University

Date: Friday, October 19th
Time: 11:00 a.m. (s.t.)
Location: G2-104 (CeBiTec)

Data science refers to the acquisition of knowledge from data. The field is increasingly important as more and more data is being collected in a rising number of areas and has great potential for optimizing processes. One of these areas is medicine. Medical data is growing in volume and complexity, for example through the accelerated digitization of patient data or improved high-throughput technologies in molecular biology. Processing such data requires both statistical and computational expertise as well as medical domain knowledge for correct interpretation.

I will present projects in which interdisciplinary cooperation between statistics/data science and medicine/biology led to a gain in knowledge and thus to improved diagnostics. The first examples come from risk prediction for prostate cancer patients and for childhood asthma: here we were able to use and combine machine learning techniques in such a way that the complex data available were used more effectively, and we were able to improve on previously used forecasting methods. The second example concerns the detection of transcriptional heterogeneity in leukemia patients: here we use an RNA measurement method that sequences small numbers of cells rather than single cells to reduce cost, effort, unwanted effects of cell isolation and, most importantly, technical errors. In the case of heterogeneous cell populations, the resulting data is a tangle of signals. A statistical algorithm developed by us can extract the single-cell information again.

Unfortunately there is no recording of this lecture available. Our apologies!

Supporting Meta-Analysis with Semantic Technologies: the case of managing pre-clinical evidence for the treatment of spinal cord injuries

Prof. Dr. Philipp Cimiano

Semantic Computing

Bielefeld University

Date: Friday, December 14th
Time: 11:00 a.m. (s.t.)
Location: C2-136

In many knowledge-intensive areas, experts are overwhelmed with information. As the number of publications in many fields is growing at an exponential rate, it is becoming harder and harder for experts to keep up and to distill the "evidence" from the available published knowledge. New approaches to structuring and managing evidence, in order to support insight generation and the answering of specific questions, are crucial. Machines can support the task of structuring the available evidence, but there are a number of challenges to face. First, knowledge is published in unstructured form, so we need to teach machines to extract the relevant insights from published articles. Second, we need ontologies or other knowledge representation approaches that support cross-document aggregation, as a single document rarely contains the complete answer or insight to a question.

We present results of the BMBF-funded project PSINK, which seeks to develop a novel approach to systematizing evidence in the field of medicine, in particular in the area dealing with the treatment of spinal cord injuries, for which no successful treatment exists as of today. The project builds on natural language processing techniques and semantic technologies to develop a knowledge base that should eventually contain all the pre-clinical knowledge available on the efficacy of spinal cord injury therapies. This will support a novel concept that we call "meta-analysis on demand".

Unfortunately no slides are available for this lecture. Our apologies!

A Glimpse into Microbial Dark Matter through the de.NBI Cloud

Dr. Alexander Sczyrba

Computational Metagenomics

Bielefeld University

Date: Friday, January 18th
Time: 11:00 a.m. (s.t.)
Location: C2-136

Microorganisms are the most abundant cellular life forms on Earth, occupying even the most extreme environments. The large majority of these organisms have not been obtained in pure culture and therefore have long been inaccessible to genome sequencing, which would provide blueprints for the evolutionary and functional diversity that shapes our biosphere. Since the advent of next-generation sequencing technologies, metagenomics and single cell genomics can shed light on the uncharted branches of the tree of life. While representing very complementary approaches, both technologies can recover microbial genomes from environmental samples, with their own strengths and weaknesses.

More than 100,000 metagenomic datasets of hundreds of Terabytes in size are currently available in public data repositories and can be mined for new representatives of candidate phyla to obtain genomes of the underrepresented branches of the tree of life. Also, these metagenomic data sets are invaluable in bioprospecting, an approach for screening for molecules and activities from environmental samples with biotechnological potential. However, for many small research labs these data remain inaccessible due to the lack of computational resources.

Cloud computing offers a solution, as it provides compute and storage capacities at scale. The CeBiTec at Bielefeld University is operating an OpenStack-based cloud computing infrastructure for the life science community within the German Network for Bioinformatics Infrastructure (de.NBI). The de.NBI Cloud (https://cloud.denbi.de/) is a full academic cloud federation, providing compute and storage resources free of charge for academic users. It provides a powerful IT infrastructure in combination with flexible bioinformatics workflows and analysis tools to the life science community in Germany. In my presentation I will show how the de.NBI Cloud can close the gap of missing computational resources for metagenomics research as one example.

Unfortunately no slides are available for this lecture. Our apologies!

Data, modeling, inferential and computational thinking: components in data literacy education

Prof. Dr. Karsten Lübke

Mathematical Economics and Statistics

FOM University of Applied Science

Date: Friday, January 25th
Time: 11:00 a.m. (s.t.)
Location: G2-104 (CeBiTec)

Statistical thinking and computational skills, together with real-world applications, are regarded as fundamental elements in data literacy education. Data modeling and simulation-based inference may facilitate conceptual understanding in all domains of data literacy. This can be achieved by rethinking the consensus-based curriculum. It is time to start, and time for a first review of the lessons learned.

Data Science on music data

Prof. Dr. Claus Weihs

Computer-Supported Statistics

TU Dortmund

Date: Wednesday, January 30th
Time: 10:00 a.m. (s.t.)
Location: C2-136

In this talk we structure the field of data science and substantiate our key premise that statistics is one of the most important disciplines in data science and the most important discipline to analyze and quantify uncertainty. As an application, the talk demonstrates data science methods on music data for automatic transcription and automatic genre determination, both on the basis of signal-based features from audio recordings of music pieces.

Literature:
Claus Weihs and Katja Ickstadt (2018): Data Science: The Impact of Statistics; International Journal of Data Science and Analytics 6, 189-194

Claus Weihs, Dietmar Jannach, Igor Vatolkin and Günter Rudolph (2017): Music Data Analysis: Foundations and Applications; CRC Press, Taylor & Francis, 675 pages

An introduction to the Socio-Economic Panel (SOEP) Study

Dr. Jan Goebel

German Socio-Economic Panel Study

German Institute for Economic Research

Date: Wednesday, May 22nd
Time: 10:15 a.m.
Location: X-E0-220

Nearly 15,000 households and about 30,000 persons participate in the SOEP survey. The SOEP provides a broad set of self-reported "objective" variables, such as income, age, gender, education, employment status, or grip strength, and a broad set of self-reported "subjective" variables, ranging from satisfaction with life through fairness and reciprocity perceptions to psychological measures such as the "Big Five".

Having run for 35 years already, the SOEP gathers information from a spectrum of birth cohorts. As such, it is a valuable empirical basis for researchers to explore long-term societal changes; relationships between early life events and later life outcomes; interdependencies between the individual and the family or household; mechanisms of inter-generational mobility and transmission; accumulation processes of resources; short- and long-term effects of institutional change and policy reforms; and the speed of convergence between East and West or between migrants and natives.

The talk will give an overview of the basic features of the SOEP, from the basic sampling strategy to the structure of the released data. It will explain how external users can access the data and in which ways the SOEP data can be enriched using auxiliary datasets such as geocoded data. Additionally, it will give an overview of the SOEP Innovation Sample (SOEP-IS) and how external researchers can submit proposals. SOEP-IS can accommodate not only short-term experiments but also longer-term survey modules that are not suitable for SOEP-Core, whether because the survey instruments are still relatively new or because of the specific issues dealt with in the research.

Lecture Series 2017/18

In the 2017/18 winter term BiCDaS offered a lecture series on selected data science topics. Six different experts delivered presentations that provided insight into their research in various fields of data science.

Although the lecture series is over, you can still watch most talks below. It is a great collection of resources and showcases the wide range of topics in data science.

Data Science and Digital Humanities, Models, Practices, and Interpretations

Dr. Silke Schwandt

Medieval History

Bielefeld University

Cryptographic Access Control—Reconciling Security and Privacy

Prof. Dr. Johannes Blömer

Codes and Cryptography

University of Paderborn

Auditory Data Science

Dr. Thomas Hermann

Ambient Intelligence

Bielefeld University

Unfortunately there is no recording of this lecture available. Our apologies!

Data? What for? Thoughts around research question based data collection

Dr. Odile Sauzet

Statistical Consulting Center

Bielefeld University

Enabling Data-Intensive Science

Prof. Dr. Achim Streit

Steinbuch Centre for Computing

Karlsruhe Institute of Technology

Big Data: Challenges and Some Solutions

Prof. Dr. Volker Markl

Database Systems and Information Management

Berlin Big Data Center

TU Berlin
