
Data management in the modern structural biology and biomedical research environment.

Zimmerman MD, Grabowski M, Domagalski MJ, Maclean EM, Chruszcz M, Minor W (2014) Methods Mol Biol 1140, 1-25. PMCID: PMC4086192

In this book chapter, Data and Dissemination Core researchers reflect back on the implementation of various data management systems in large-scale collaborative groups with structural genomics (SG) components. The public EFI-DB database and internal EFI-LabDB LIMS act as the circulatory system of the EFI, connecting various cores and bridging projects by ensuring the smooth transfer of relevant data as targets move through the discovery pipeline. Dr. Wladek Minor's group has been essential in facilitating the EFI in its continued effort to convert "data to information".


Modern high-throughput structural biology laboratories produce vast amounts of raw experimental data. The traditional method of data reduction is very simple: results are summarized in peer-reviewed publications, ideally in high-impact journals. By their nature, publications include only the most important results derived from experiments that may have been performed over the course of many years. The main content of a published paper is a concise compilation of these data, an interpretation of the experimental results, and a comparison of these results with those obtained by other scientists.

Due to the avalanche of structural biology manuscripts submitted to scientific journals, descriptions of experimental methodology (and sometimes even experimental results) are in many recent cases pushed to supplementary materials that are published only online and may not be reviewed as thoroughly as the main body of a manuscript. Trouble may arise when experimental results contradict those obtained by other scientists, which requires (in the best case) reexamination of the original raw data or independent repetition of the experiment according to its published description. There are reports that a significant fraction of results obtained in academic laboratories cannot be reproduced in an industrial environment (Begley CG & Ellis LM, Nature 483(7391):531-3, 2012). This is not an indication of scientific fraud but rather reflects inadequate descriptions of experiments performed on different equipment and on biological samples produced by disparate methods.
For that reason, the goal of a modern data management system is not simply to replace the paper laboratory notebook with an electronic one, but to create a sophisticated, internally consistent, scalable system that combines data from a variety of experiments performed by various individuals on diverse equipment. All data should be stored in a core database that custom applications can use to prepare internal reports and statistics and to perform other functions specific to the research pursued in a particular laboratory.

This chapter presents a general overview of the methods of data management and analysis used by structural genomics (SG) programs. In addition to a review of the existing literature on the subject, it draws on the authors' experience developing two SG data management systems, UniTrack and LabDB. The description is aimed at a general audience, as some technical details have been (or will be) published elsewhere. The focus is on "data management," the process of gathering, organizing, and storing data; also briefly discussed is "data mining," the process of analysis that ideally leads to an understanding of the data. In other words, data mining is the conversion of data into information. Clearly, effective data management is a precondition for any useful data mining. If done properly, gathering details on millions of experiments on thousands of proteins and making them publicly available for analysis, even after the projects themselves have ended, may turn out to be one of the most important benefits of SG programs.
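The core-database idea above can be sketched in a few lines. The following is a minimal illustration, assuming a hypothetical schema (the table and column names here are invented for this example and are not the actual LabDB or UniTrack schema): experiments of different kinds are recorded against targets in one database, and a "custom application" then derives a summary report from it.

```python
import sqlite3

# Hypothetical core database: one table of targets, one table of
# experiments of any kind, all linked to a target. (Illustrative
# schema only, not the real LabDB/UniTrack design.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL            -- e.g. a protein target identifier
);
CREATE TABLE experiment (
    id           INTEGER PRIMARY KEY,
    target_id    INTEGER NOT NULL REFERENCES target(id),
    kind         TEXT NOT NULL,   -- cloning, expression, purification, ...
    performed_by TEXT,            -- which individual ran the experiment
    result       TEXT             -- summary outcome for reporting
);
""")

# Record a few experiments of different kinds for one target.
conn.execute("INSERT INTO target (id, name) VALUES (1, 'TargetA')")
conn.executemany(
    "INSERT INTO experiment (target_id, kind, performed_by, result) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "cloning",      "alice", "success"),
        (1, "expression",   "bob",   "success"),
        (1, "purification", "alice", "failed"),
    ],
)

# A custom application can then compute internal reports/statistics,
# e.g. how many experiments each target has and how many succeeded.
rows = conn.execute("""
    SELECT t.name,
           COUNT(*)                AS n_experiments,
           SUM(e.result = 'success') AS n_success
    FROM target t
    JOIN experiment e ON e.target_id = t.id
    GROUP BY t.id
""").fetchall()
print(rows)  # [('TargetA', 3, 2)]
```

The point of the single shared schema is that every downstream tool (progress dashboards, per-target overview pages, statistics like those in the figures below) queries the same tables rather than each module keeping its own files.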




Figure 1: The architecture of the UniTrack data management system.

Figure 2: Fragment of an experiment tree displayed in the UniTrack-based CSGID interface.

Figure 3: A typical target overview page in the LabDB LIMS.

Figure 4: The fluorescence-based thermal shift assay module of the LabDB LIMS, showing the graphical representation of the imported experimental data.

Figure 5: Histogram showing the distribution of maximum mosaicity value (as fit during integration) of diffraction datasets collected on MCSG targets processed at the University of Virginia, as tracked by the hkldb module of HKL-3000m.

Figure 6: Example of a data dashboard: a plot of the cumulative progress for the MCSG center.

Figure 7: Example of a data mining "dashboard": a plot of Rfree vs resolution for structures determined by the CSGID.

Figure 8: Map showing locations of collaborators of the MCSG (Institutions of scientists who coauthored papers funded at least in part by the center).

Reprinted with permission from Zimmerman et al, Methods in Molecular Biology, Vol 1140, p 1-25. Copyright 2014 Springer.