Conference Agenda


 
 
Session Overview
Session
MS-82: Handling of big data in crystallography
Time: Friday, 20/Aug/2021, 2:45pm - 5:10pm

Session Chair: Wladek Minor
Session Chair: Brinda Vallat
Location: 223-4

60 2nd floor

Introduction
Presentations
2:45pm - 2:50pm

Introduction to session

Wladek Minor, Brinda Vallat



2:50pm - 3:20pm

IRRMC (https://proteindiffraction.org): Impact on quality of structures in PDB

Marek Grabowski, Marcin Cymborowski, David Cooper, Wladek Minor

UNIVERSITY OF VIRGINIA, Charlottesville, United States of America

Preservation and public accessibility of primary experimental data are cornerstones necessary for the reproducibility of empirical sciences. Many crystallography journals recommend that authors of manuscripts presenting a crystal structure deposit their primary experimental data (X-ray diffraction images) in one of the dedicated resources created in recent years. We present the Integrated Resource for Reproducibility in Molecular Crystallography (IRRMC). In its first five years, several hundred crystallographers have deposited over 9,000 datasets representing more than 5,700 diffraction experiments performed at over 60 different synchrotron beamlines or home sources all over the world. We describe several examples of the crucial role that diffraction data can play in improving previously determined protein structures. In addition to improving the resource and annotating and curating submitted data, we have been building a pipeline to extract or generate the metadata necessary for seamless, automated processing. Preliminary analysis shows that about 95% of the data received by our resource can be automatically reprocessed. A high rate of reprocessing success shows the feasibility of automated metadata extraction and automated processing as a validation step that ensures the correctness of raw diffraction images. The IRRMC is guided by the Findable, Accessible, Interoperable, and Reusable (FAIR) data management principles. Data from IRRMC have already enabled several novel research projects.



3:20pm - 3:50pm

A Gold Standard for the archiving of macromolecular diffraction data

Herbert J. Bernstein1, Andreas Förster2, Aaron S. Brewster3, Graeme Winter4

1Ronin Institute for Independent Scholarship, c/o NSLS II, Brookhaven National Laboratory, Upton, NY, USA; 2DECTRIS Ltd., Täfernweg 1, 5405 Baden-Dättwil, CH; 3Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; 4Diamond Light Source Ltd, Harwell Science and Innovation Campus, Didcot OX11 0DE, UK

Macromolecular crystallography (MX) is the dominant means of determining the three-dimensional structures of biological macromolecules. Over the last few decades, most MX data have been collected at synchrotron beamlines using a large number of different detectors produced by various manufacturers and taking advantage of various protocols and goniometry. These data came in their own formats, some proprietary, some open. The associated metadata rarely reached the degree of completeness required for data management according to Findability, Accessibility, Interoperability and Reusability (FAIR) principles. Efforts to reuse old data by other investigators or even by the original investigators some time later were often frustrated.

In the culmination of an effort dating back more than two decades, a large portion of the research community concerned with High Data-Rate Macromolecular Crystallography (HDRMX) agreed in 2020 to an updated specification of data and metadata for diffraction images produced at synchrotron light sources and X-ray free electron lasers (XFELs) [1]. This Gold Standard builds on the NeXus/HDF5 NXmx application definition and the International Union of Crystallography (IUCr) imgCIF/CBF dictionary and is compatible with major data processing programs and pipelines. It will ensure effortless automatic data processing, facilitate manual reprocessing of data independent of the facility at which they were collected, and enable data archiving according to FAIR principles, with a particular focus on interoperability and reusability.

Direct consequences of the Gold Standard are an unambiguous definition of the experimental geometry, a record of the synchrotron and beamline where the data were collected, and additional optional metadata that will make subsequent submission of the structural model to the PDB more straightforward. Just as with the IUCr CBF/imgCIF standard from which it arose and to which it is tied, the Gold Standard is intended to be applicable to all detectors used for crystallography. In particular, the application of the Gold Standard does not require the use of HDF5. Corresponding metadata definitions exist in CBF/imgCIF. All hardware and software developers in the field are encouraged to adopt and contribute to the standard.

The Gold Standard provides a convenient and consistent way to record the essential minimal data and metadata needed to process a wide range of macromolecular diffraction experiments, including single-axis, single-crystal rotation experiments using single-module detectors, XFEL serial crystallography experiments using powerful multi-module detectors producing tens of thousands of images from huge numbers of small crystals, and synchrotron experiments producing large numbers of wedges from micro-crystals. Examples from all of these and more will be discussed.

[1] Bernstein, H.J., Förster, A., Bhowmick, A., Brewster, A.S., Brockhauser, S., Gelisio, L., Hall, D.R., Leonarski, F., Mariani, V., Santoni, G., Vonrhein, C. and Winter, G. (2020). Gold Standard for macromolecular crystallography diffraction data. IUCrJ, 7(5), 784-792.

The work was supported in part by funding from Dectris Ltd., from the U.S. Department of Energy (BES KP1605010, KP1607011, DE-SC0012704), and from the U.S. National Institutes of Health (NIGMS P30GM133893, R01GM117126).



3:50pm - 4:15pm

Data evaluation on the fly: Auto-Rickshaw at the MX beamlines of the Australian Synchrotron

Santosh Panjikar

Australian Synchrotron, ANSTO, Clayton, Australia

Auto-Rickshaw [1,2] is a system for automated crystal structure determination. It provides computer-coded decision-makers for the successive, automated execution of a number of existing macromolecular crystallographic computer programs, thus forming a software pipeline for automated and efficient crystal structure determination.

Auto-Rickshaw (AR) is freely accessible to the crystallography community through the EMBL-Hamburg AR Server [3].

Recently, it has been installed on the ASCI cluster at the Australian Synchrotron, which uses Docker and Kubernetes to launch AR jobs in a high-throughput manner. The synchrotron AR server is accessible to users from the MX beamline computers.

AR at the MX beamlines can be invoked through the command line or through a web-based graphical user interface (GUI) for data and parameter input and for monitoring the progress of structure determination. It can also be invoked via automatic data processing if the parameter inputs have been preset in the AR-GUI during the X-ray diffraction experiment.

A large number of possible structure solution paths are encoded in the system, and the optimal path is selected as the structure solution evolves. The platform can carry out experimental (SAD, SIRAS, RIP or various MAD) and MR phasing, or a combination of experimental and MR phasing. The system has been extended extensively for evaluation of multiple datasets for various phasing protocols as well as for evaluation of ligand binding and fragment screening.

The new implementation and features will be discussed during the presentation.

References

[1] Panjikar, S., Parthasarathy, V., Lamzin, V. S., Weiss, M. S. & Tucker, P. A. (2005). Auto-Rickshaw - An automated crystal structure determination platform as an efficient tool for the validation of an X-ray diffraction experiment. Acta Cryst. D61, 449-457.

[2] Panjikar, S., Parthasarathy, V., Lamzin, V. S., Weiss, M. S. & Tucker, P. A. (2009). On the combination of molecular replacement and single-wavelength anomalous diffraction phasing for automated structure determination. Acta Cryst. D65, 1089-1097.

[3] http://www.embl-hamburg.de/Auto-Rickshaw



4:15pm - 4:40pm

Rapid response to biomedical challenges and threats

Wladek Minor1, Mariusz Jaskolski2, Alexander Wlodawer3, Zbigniew Dauter3, Joanna Macnar4, Dariusz Brzezinski5, David Cooper1, Marcin Kowiel7, Miroslaw Gilski2, Ivan Shabalin1, Marek Grabowski1, Bernhard Rupp6

1University of Virginia, Charlottesville, United States of America; 2A. Mickiewicz University, Poznan, Poland; 3National Cancer Institute, United States of America; 4University of Warsaw, Warsaw, Poland; 5Poznan University of Technology, Poznan, Poland; 6k.-k Hofkristallamt, United States of America; 7Polish Academy of Sciences, Poland

Structural information, mainly derived by X-ray crystallography and cryo-electron microscopy, is the quintessential prerequisite for structure-guided drug discovery. However, accurate structural information is only one piece of information necessary to understand the big picture of medical disorders. To provide a rapid response to emerging biomedical challenges and threats like COVID-19, we need to analyze medical data in the context of other in-vitro and in-vivo experimental results. Recent advancements in biochemical, spectroscopic, and bioinformatics methods may revolutionize drug discovery, albeit only when these data are combined and analyzed with an effective data management framework such as the Advanced Information System (AIS) proposed in 2017. Progress on AIS is too slow, but creating such a system is a Grand Challenge for biomedical sciences. By definition, a Grand Challenge is an extremely difficult long-term project that is not always appreciated by those looking for immediate returns.



4:40pm - 5:05pm

Development of an on-the-fly data processing with information-lossless compression for CITIUS detectors at SPring-8

Toshiyuki Nishiyama Hiraki1, Toshinori Abe1,2, Mitsuhiro Yamaga1,2, Takashi Sugimoto1,2, Kyosuke Ozaki1, Yoshiaki Honjo1, Yasumasa Joti1,2, Takaki Hatsui1

1RIKEN SPring-8 Center, Hyogo, Japan; 2Japan Synchrotron Radiation Research Institute, Hyogo, Japan

Diffraction-limited synchrotron radiation sources (DLSRs) using advanced accelerator technologies deliver high-brilliance X-rays at high repetition rates. The DLSRs are expected to provide X-ray diffraction (XRD) measurements with benefits such as a reduction of the total time required for a complete scan and an improvement of temporal resolution. At the proposed SPring-8-II facility [1], one of the DLSRs, anticipated experiments using XRD techniques require X-ray imaging detectors with a frame rate over 10 kHz, a high pixel count, a count rate over 100 Mcps/pixel, and single-photon sensitivity. To meet these demands, we have been developing a high-speed X-ray imaging detector, CITIUS (Charge Integration Type Imaging Unit with high-Speed extended-Dynamic-Range Detector) [2], for SPring-8 and SACLA. For SPring-8, our first milestone is to install a 20M-pixel CITIUS detector in 2023. It has a frame rate of 17.4 kHz and a raw data rate of 1.4 TB/s. Such a high raw data rate demands careful design of the data handling scheme, from transfer and on-the-fly processing through storage to post-analysis.
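As a rough consistency check on the figures quoted above, the raw data rate can be related to the pixel count and frame rate. The sketch below is only a back-of-envelope calculation, assuming that "20M-pixel" means 20 × 10^6 pixels and that 1 TB is 10^12 bytes; neither assumption is stated in the abstract, and the function name is illustrative.

```python
# Back-of-envelope check of the quoted CITIUS raw data rate.
# Assumptions (not from the abstract): "20M-pixel" = 20e6 pixels, 1 TB = 1e12 bytes.
PIXELS = 20e6                   # nominal pixel count
FRAME_RATE_HZ = 17.4e3          # quoted frame rate
RAW_RATE_BYTES_PER_S = 1.4e12   # quoted raw data rate


def implied_bytes_per_pixel(pixels, frame_rate_hz, raw_rate_bytes_per_s):
    """Return the per-pixel sample size implied by the quoted figures."""
    return raw_rate_bytes_per_s / (pixels * frame_rate_hz)


bytes_per_pixel = implied_bytes_per_pixel(PIXELS, FRAME_RATE_HZ, RAW_RATE_BYTES_PER_S)
print(f"implied sample size: {bytes_per_pixel:.2f} bytes/pixel")
```

Under these assumptions the quoted numbers correspond to roughly four bytes per pixel per frame. The actual sample format of the detector is not given in the abstract.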

In this presentation, we describe our plan for the data acquisition and analysis scheme and the current status of the development. Our baseline implementation of the data-processing flow is composed of two steps. In the first step, detector images undergo on-the-fly processing such as accumulation and a veto mechanism, which reduces the peak data-stream rate from 1.4 TB/s to ~400 GB/s. The processing algorithms are implemented on custom PCB boards (Data Framing Boards, DFBs). Each DFB has three field-programmable gate arrays (FPGAs). The processed data are then transferred via a PCI Express 3.0 bus to PC server memory. In the second step, the images are compressed by PC servers. We are investigating several information-lossless compression algorithms, including the one presented in [3]. The compression further reduces the peak data rate to ~10 GB/s. The compressed images are to be stored in cache storage with a capacity of about four days of measurements. The cached data are transferred to the high-performance computing system for post-analysis and long-term storage. We also present the results of an experiment using an X-ray photon correlation spectroscopy technique, and describe in detail the infrastructure that executes this flow.
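The reduction achieved by each step of this flow follows directly from the quoted rates. The sketch below works through the arithmetic, assuming decimal units (1 GB = 10^9 bytes). The cache estimate additionally assumes that the ~10 GB/s peak rate is sustained for the full four days, which is an assumption made here, not a statement from the abstract, so it should be read as an upper bound.

```python
# Reduction factors for the two-step flow, using the rates quoted in the abstract.
# Assumptions (not from the abstract): decimal units, and a sustained peak rate
# when estimating the cache size.
RAW_RATE = 1.4e12          # bytes/s from the detector
AFTER_DFB_RATE = 400e9     # bytes/s after on-the-fly processing (step 1)
AFTER_COMP_RATE = 10e9     # bytes/s after lossless compression (step 2)
CACHE_DAYS = 4             # quoted cache capacity in days of measurement
SECONDS_PER_DAY = 86_400


def reduction_factor(rate_before, rate_after):
    """Return how many times smaller the data stream becomes."""
    return rate_before / rate_after


step1 = reduction_factor(RAW_RATE, AFTER_DFB_RATE)
step2 = reduction_factor(AFTER_DFB_RATE, AFTER_COMP_RATE)
overall = reduction_factor(RAW_RATE, AFTER_COMP_RATE)

# Upper bound on the cache size: peak compressed rate held for the whole period.
cache_upper_bound_pb = AFTER_COMP_RATE * SECONDS_PER_DAY * CACHE_DAYS / 1e15

print(f"step 1: {step1:.1f}x, step 2: {step2:.0f}x, overall: {overall:.0f}x")
print(f"cache upper bound: {cache_upper_bound_pb:.2f} PB")
```

With these inputs, the first step reduces the stream by a factor of 3.5, the second by a factor of 40, and the two together by a factor of 140. The corresponding cache upper bound is about 3.5 PB; the real requirement depends on the duty cycle, which the abstract does not specify.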

[1] “SPring-8-II Conceptual Design Report” (Nov. 2014) http://rsc.riken.jp/eng/pdf/SPring-8-II.pdf.

[2] T. Hatsui, “New opportunities in photon science with high-speed X-ray imaging detector Citius, and associated data challenge”, Presentation at the 2nd R-CCS International Symposium (2020).

[3] R. Roy et al., the proceedings of CCGrid2021, accepted.