Universität Bielefeld

Bielefeld Center for Data Science

Lecture Series Data Science

In each winter term BiCDaS offers a lecture series on data science. It takes place on six different Fridays throughout the term.

We are proud to once again have convinced a highly skilled set of internal and external data science professionals to join us as speakers.

The talks during the winter term 2019/20:

Prof Dr Dietmar Bauer
Econometrics
Bielefeld University

  Are we there yet? Predicting travel times in urban street networks.

Date: Friday, December 13th 2019
Time: 11:00 a.m. (s.t.)
Location: C2-136

Knowing the travel time for a given route matters for logistics applications and for individual travel alike, as the predictions offered in Google Maps, for example, illustrate. In both applications we want to know not only the expected travel time but also the associated uncertainty, since being late typically incurs a larger penalty than being early.
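The point about unequal penalties can be made concrete with a small sketch. It is an illustration added here, not material from the talk. It assumes linear costs per minute of lateness and earliness (hypothetical values) and simulated, right-skewed travel times. Under those assumptions, the time budget that minimizes the expected penalty is a quantile of the travel time distribution rather than its mean.

```python
import math
import random


def optimal_quantile_level(cost_late: float, cost_early: float) -> float:
    """Quantile level that minimizes the expected linear penalty.

    Assumes each minute late costs `cost_late` and each minute early
    costs `cost_early` (both hypothetical, both positive).
    """
    return cost_late / (cost_late + cost_early)


def empirical_quantile(samples: list[float], level: float) -> float:
    """Return the ceil(level * n)-th smallest sample, for 0 < level < 1."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(level * len(ordered)))
    return ordered[rank - 1]


def expected_penalty(
    budget: float, samples: list[float], cost_late: float, cost_early: float
) -> float:
    """Average penalty over the samples when `budget` minutes are reserved."""
    total = 0.0
    for travel_time in samples:
        if travel_time > budget:
            total += cost_late * (travel_time - budget)
        else:
            total += cost_early * (budget - travel_time)
    return total / len(samples)


def simulate_travel_times(n: int, seed: int = 0) -> list[float]:
    """Draw right-skewed travel times in minutes (a stand-in for real data)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(3.0, 0.4) for _ in range(n)]


if __name__ == "__main__":
    times = simulate_travel_times(5000)
    cost_late, cost_early = 3.0, 1.0  # being late is three times as costly
    level = optimal_quantile_level(cost_late, cost_early)
    budget = empirical_quantile(times, level)
    mean_time = sum(times) / len(times)
    print(f"quantile level: {level:.2f}")
    print(f"quantile budget: {budget:.1f} min, mean: {mean_time:.1f} min")
    print("penalty at quantile budget:",
          round(expected_penalty(budget, times, cost_late, cost_early), 2))
    print("penalty at mean budget:",
          round(expected_penalty(mean_time, times, cost_late, cost_early), 2))
```

The more costly lateness is relative to earliness, the higher the quantile level and the larger the reserved buffer, which is why a point forecast of the expected travel time alone is not enough.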

A wide range of time series methods is applied to predicting travel times, drawing on a very diverse landscape of data sources, each with its own strengths and problems. The methods include many time series analysis techniques for univariate and multivariate data sets, using both linear and non-linear models.

In this talk I will describe how some of the underlying problems are solved using insights from domain knowledge and statistical data analysis methods. The main underlying theme is that brute-force, purely data-driven modelling does not work. Purely theory-driven modelling is typically not sufficient either. It is the combination of these two approaches that leads to success.

I will also hint at some of the current challenges, both technological and institutional. This will lead to my answers to the question: Are we there yet, can we provide reliable travel time predictions?

Prof Dr Andreas Henrich
Media Informatics
University of Bamberg

  Research data federation in the arts and humanities: background, concepts and services in DARIAH-DE

Date: Friday, January 10th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

A fundamental function of cultural heritage projects and institutions is the collection, cataloguing and persistent storage of material of human culture. Challenges emerge with the growing need to digitize such material and to provide it to research communities or the general public. A large number of cultural heritage collections are stored in depots and might never reach a digitally published state, mainly due to the necessary selectivity of digitization projects and the amount and diversity of artifacts that are being acquired and collected. In this respect, the openness, interoperability and sustainability of repositories are of particular importance, as these requirements form a foundation for the long-term preservation of digital cultural heritage along with the scholarly discourse.

In this talk, we discuss the value and benefits that adopting the FAIR principles [Footnote: findable, accessible, interoperable, reusable; see: https://www.go-fair.org/fair-principles] brings to research data management in the digital humanities. After a brief look at the current perspectives and requirements of funding organizations regarding the FAIR principles and their adoption, we present results of the Digital Research Infrastructure for the Arts and Humanities (DARIAH). This BMBF-funded initiative has developed strategies and services that facilitate the introduction of the FAIR principles to cultural heritage collections. In our talk, we focus in particular on methods that make contextual knowledge about existing collections explicit, so that their content can be interpreted in integrative contexts, thereby helping such collections become FAIR.

Prof Dr Tobias Schäfers
Marketing
Copenhagen Business School
Stefan Brinkhoff
Marketing
TU Dortmund

 Mobile In-Store Analytics: Examining Shopper Behavior and Reactions to Location-Based Mobile Promotions

Date: Friday, January 17th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

Sensors embedded in mobile phones make it possible to analyze contextual factors, such as consumers' current outdoor location, and to use them for targeting purposes, which has been shown to improve customer responses to mobile promotions. However, many customers also use their phones inside stores, which provides opportunities for better understanding in-store behavior and for location-based targeting during the actual shopping process. We present the results of a unique large-scale field test, which we conducted in cooperation with a fashion retailer. Specifically, we developed and employed a mobile application that is capable of tracking customers in the store and of delivering individualized promotion messages. We use the application to examine the effects of in-store behavior on purchases, and to conduct a between-subjects experiment that tests the effects of different location-based promotion types.

Dr Hendrik Ballhausen
SMPC
Ludwig-Maximilians-Universität München

  Secure Multiparty Computation - digital collaboration on private data

Date: Friday, January 24th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

The free flow of information is the lifeblood of digital industries and much of modern science. Yet collaboration is impeded by inefficiently low levels of data sharing, as proprietary data is kept in private silos. Often, data exchange is prevented by competition and lack of trust between data owners, consumer and privacy concerns, and data protection regulation. Is there a way to reconcile digital collaboration with data privacy?

Secure Multiparty Computation (SMPC), a disruptive technology, promises to do just that. It simulates a virtual trusted third party as a cryptographic network between the parties. The network only exists as long as all parties actively engage in the calculation, and the joint result is distributed to all. The individual private data, however, remains with the original data owners, and nothing can be learned by an external or internal attacker except for the intended result of the computation.
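One classic building block behind this idea is additive secret sharing. The sketch below is an illustration added here and is not necessarily the protocol used in the work presented. It computes a joint sum while each party's input stays hidden. The modulus, the function names and the single-process simulation are all assumptions; a real deployment would send shares over secure channels and would need further safeguards against dishonest parties.

```python
import secrets

# Public modulus agreed on by all parties (an illustrative choice).
# Inputs must be non-negative and their sum must stay below it.
MODULUS = 2**61 - 1


def make_shares(secret: int, n_parties: int) -> list[int]:
    """Split `secret` into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_inputs: list[int]) -> int:
    """Simulate a joint sum in which no party reveals its own input.

    Row i of `distributed` holds the shares created by party i; entry j of
    that row is what party i would send to party j.
    """
    n = len(private_inputs)
    distributed = [make_shares(value, n) for value in private_inputs]
    # Each party j adds up the shares it received. A single partial sum
    # looks uniformly random and reveals nothing about any one input.
    partial_sums = [
        sum(row[j] for row in distributed) % MODULUS for j in range(n)
    ]
    # Only the combination of all partial sums yields the intended result.
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    # Hypothetical patient counts held by three hospitals.
    print(secure_sum([120, 75, 310]))
```

Every party has to contribute its partial sum before the result can be reconstructed, which mirrors the statement above that the network only exists while all parties take part in the calculation.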

In many ways, secure multiparty computation is similar to the blockchain. Both work in trustless settings without a central authority. However, while distributed ledger technologies provide trust, reliability, and transparency, secure multiparty computation provides privacy, security, dynamic consent, and control over proprietary data.

This talk provides a brief non-technical introduction to secure multiparty computation. A proof-of-principle is presented in which proprietary patient data was jointly evaluated between two remote university hospitals.

Friedrich Summann
Universitätsbibliothek Bielefeld

  Data Science Techniques in Digital Humanities and Digital Knowledge Services

Date: Friday, January 31st 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

From 2015 to 2019, the Faculty of Linguistics and Literary Studies of Bielefeld University carried out the DFG project Kinder und Jugendliteratur im Medienverbund 1933-1945. The University Library developed the technical infrastructure for data acquisition, data processing and data visualisation. The project rests on a complex data model that describes different types of media (especially films, radio broadcasts and print editions, but also theatre performances, records, television broadcasts and advertising material) as manifestations of literary material; on this basis, the various linkages have to be worked out and visualised. The lecture introduces the approaches and methods used and their implementation in detail, and also reports on the use of these techniques in other digital services of the University Library. These include, in particular, analyses of the global publication network in the context of bibliometric information.

Prof Dr Douglas M Bates
Statistics
University of Wisconsin - Madison
The R-Project

  Programming Languages in Data Science

Date: Friday, February 14th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136

At present, R and Python are the most widely used programming languages for data science. R, based on the earlier language S, developed at Bell Labs in the 1980s, is, by design, intended for data analysis and graphics applications.

Python, developed in the 1990s for general applications, has seen widespread adoption in the scientific computing and data science communities. Both of these are dynamic languages that can be used interactively in a REPL (read-eval-print loop), allowing for rapid prototyping of algorithms and on-the-fly analyses. Both languages also have well-developed infrastructure, including integrated development environments and thousands of user-contributed packages available in repositories. Both languages are open source and freely available.

As a member of the core development team for R, I have considerable experience using R and developing packages for it. I have also used Python, though not to the same extent. Despite a long history with these languages, I recently switched my development efforts to Julia, primarily for flexibility.

I will compare and contrast these languages and explain why I choose to use the less mature but more flexible one for data science.

Lecture Series 2018/19

The talks during the winter term 2018/19:

Prof Dr Christiane Fuchs
Data Science
Bielefeld University

  Improving Health through Data Science

Date: Friday, October 19th
Time: 11:00 a.m. (s.t.)
Location: G2-104 (CeBiTec)

Data science refers to the acquisition of knowledge from data. The field is increasingly important as more and more data is being collected in a rising number of areas and has great potential for optimizing processes. One of these areas is medicine. Medical data is growing in volume and complexity, for example through the accelerated digitization of patient data or improved high-throughput technologies in molecular biology. Processing such data requires both statistical and computational expertise as well as medical domain knowledge for correct interpretation.

I will present projects in which interdisciplinary cooperation between statistics / data science and medicine / biology led to a gain in knowledge and thus to improved diagnostics. The first group concerns risk prediction for prostate cancer patients and for childhood asthma: here we combined machine learning techniques so that the complex data available was used more effectively, which improved on previously used forecasting methods. The second concerns the detection of transcriptional heterogeneity in leukemia patients: here we use an RNA measurement method that sequences small numbers of cells rather than single cells to reduce cost, effort, unwanted effects of cell isolation and, most importantly, technical errors. For heterogeneous cell populations, the resulting data is a tangle of signals. A statistical algorithm that we developed can recover the single-cell information.

Unfortunately there is no recording of this lecture available. Our apologies!

Prof Dr Philipp Cimiano
Semantic Computing
Bielefeld University

  Supporting Meta-Analysis with Semantic Technologies: the case of managing pre-clinical evidence for the treatment of spinal cord injuries

Date: Friday, December 14th
Time: 11:00 a.m. (s.t.)
Location: C2-136

In many knowledge-intensive areas, experts are overwhelmed with information. As the number of publications in many fields is growing at an exponential rate, it is becoming harder and harder for experts to keep up and distill the "evidence" from the available published knowledge. New approaches to structuring and managing evidence, so as to support insight generation and the answering of specific questions, are crucial. Machines can support the task of structuring the available evidence, but there are a number of challenges to face. First, knowledge is published in unstructured form, so we need to teach machines to extract the relevant insights from published articles. Second, we need ontologies or other knowledge representation approaches that support cross-document aggregation, since a single document rarely contains the complete answer to a question.

We present results of the BMBF-funded project PSINK, which seeks to develop a novel approach to systematizing evidence in the field of medicine, in particular in the area dealing with the treatment of spinal cord injuries, for which no successful treatment exists as of today. The project builds on natural language processing techniques and semantic technologies to develop a knowledge base that should eventually contain all the pre-clinical knowledge available on the efficacy of spinal cord injury therapies. This will support a novel concept that we call "meta-analysis on demand".

Unfortunately no slides are available for this lecture. Our apologies!

Tip: Click on "View in Panopto" for more video features!

Dr Alexander Sczyrba
Computational Metagenomics
Bielefeld University

  A Glimpse into Microbial Dark Matter through the de.NBI Cloud

Date: Friday, January 18th
Time: 11:00 a.m. (s.t.)
Location: C2-136

Microorganisms are the most abundant cellular life forms on Earth, occupying even the most extreme environments. The large majority of these organisms have not been obtained in pure culture and therefore have long been inaccessible to genome sequencing, which would provide blueprints for the evolutionary and functional diversity that shapes our biosphere. Since the advent of next-generation sequencing technologies, metagenomics and single cell genomics can shed light on the uncharted branches of the tree of life. While representing very complementary approaches, both technologies can recover microbial genomes from environmental samples, with their own strengths and weaknesses.

More than 100,000 metagenomic datasets of hundreds of Terabytes in size are currently available in public data repositories and can be mined for new representatives of candidate phyla to obtain genomes of the underrepresented branches of the tree of life. Also, these metagenomic data sets are invaluable in bioprospecting, an approach for screening for molecules and activities from environmental samples with biotechnological potential. However, for many small research labs these data remain inaccessible due to the lack of computational resources.

Cloud computing offers a solution, as it provides compute and storage capacities at scale. The CeBiTec at Bielefeld University is operating an OpenStack-based cloud computing infrastructure for the life science community within the German Network for Bioinformatics Infrastructure (de.NBI). The de.NBI Cloud (https://cloud.denbi.de/) is a full academic cloud federation, providing compute and storage resources free of charge for academic users. It provides a powerful IT infrastructure in combination with flexible bioinformatics workflows and analysis tools to the life science community in Germany. In my presentation I will show how the de.NBI Cloud can close the gap of missing computational resources for metagenomics research as one example.

Unfortunately no slides are available for this lecture. Our apologies!

Prof Dr Karsten Lübke
Mathematical Economics and Statistics
FOM University of Applied Science

  Data, modeling, inferential and computational thinking: components in data literacy education

Date: Friday, January 25th
Time: 11:00 a.m. (s.t.)
Location: G2-104 (CeBiTec)

Statistical thinking and computational skills, together with real-world applications, are regarded as fundamental elements of data literacy education. Data modeling and simulation-based inference may facilitate conceptual understanding in all domains of data literacy. This can be achieved by rethinking the consensus-based curriculum. It is time to start, and time for a first review of the lessons learned.

Prof Dr Claus Weihs
Computer-Supported Statistics
TU Dortmund

  Data Science on music data

Date: Wednesday, January 30th
Time: 10:00 a.m. (s.t.)
Location: C2-136

In this talk we structure the field of data science and substantiate our key premise that statistics is one of the most important disciplines in data science and the most important discipline to analyze and quantify uncertainty. As an application, the talk demonstrates data science methods on music data for automatic transcription and automatic genre determination, both on the basis of signal-based features from audio recordings of music pieces.

Claus Weihs and Katja Ickstadt (2018):
Data Science: The Impact of Statistics;
International Journal of Data Science and Analytics 6, 189-194

Claus Weihs, Dietmar Jannach, Igor Vatolkin and Günter Rudolph (2017):
Music Data Analysis: Foundations and Applications;
CRC Press, Taylor & Francis, 675 pages

Dr Jan Goebel
German Socio-Economic Panel Study
German Institute for Economic Research

  An introduction to the Socio-Economic Panel (SOEP) Study

Date: Wednesday, May 22nd
Time: 10:15 a.m.
Location: X-E0-220

Nearly 15,000 households and about 30,000 persons participate in the SOEP survey. The SOEP provides a broad set of self-reported "objective" variables, such as income, age, gender, education, employment status, or grip strength, and a broad set of self-reported "subjective" variables, ranging from satisfaction with life through perceptions of fairness and reciprocity to psychological measures such as the "Big Five."

Having run for 35 years already, SOEP gathers information from a spectrum of birth cohorts. As such, it is a valuable empirical basis for researchers to explore long-term societal changes; effects of early life events on later life outcomes; interdependencies between the individual and the family or household; mechanisms of inter-generational mobility and transmission; accumulation processes of resources; short- and long-term effects of institutional change and policy reforms; and the speed of convergence between East and West or between migrants and natives.

The talk will give an overview of the basic features of the SOEP, from the basic sampling strategy to the structure of the released data. It will explain how external users can access the data and in which ways the SOEP data can be enriched using auxiliary datasets such as geocoded data. Additionally, it will give an overview of the SOEP Innovation Sample (SOEP-IS) and how external researchers can submit proposals. SOEP-IS can accommodate not only short-term experiments but also longer-term survey modules that are not suitable for SOEP-Core, whether because the survey instruments are still relatively new or because of the specific issues dealt with in the research.

Lecture Series 2017/18

In the 2017/18 winter term BiCDaS offered a lecture series on selected Data Science topics. Six different experts delivered presentations that provided insight into their research in various fields of data science.

Although the lecture series is over, you can still watch most talks below. It is a great collection of resources and showcases the wide range of topics in Data Science.

Dr Silke Schwandt
Medieval History
Bielefeld University

Data Science and Digital Humanities, Models, Practices, and Interpretations

Prof Dr Johannes Blömer
Codes and Cryptography
University of Paderborn

Cryptographic Access Control—Reconciling Security and Privacy

Dr Thomas Hermann
Ambient Intelligence
Bielefeld University

Auditory Data Science

Unfortunately there is no recording of this lecture available. Our apologies!

Dr Odile Sauzet
Statistical Consulting Center
Bielefeld University

Data? What for? Thoughts around research question based data collection

Prof Dr Achim Streit
Steinbuch Centre for Computing
Karlsruhe Institute of Technology

Enabling Data-Intensive Science

Prof Dr Volker Markl
Database Systems and Information Management
Berlin Big Data Center
TU Berlin

Big Data: Challenges and Some Solutions