Conference Agenda

Session
MS-73: Machine learning in biological and structural sciences
Time:
Friday, 20/Aug/2021:
10:20am - 12:45pm

Session Chair: Rita Giordano
Session Chair: Harold Roger Powell
Location: Terrace 2B

100 2nd floor

Invited: Melanie Vollmar (UK), Sergei Grudinin (France)


Session Abstract

Recently, machine learning (ML) has become very popular in the fields of structural biology and chemical crystallography, throughout the pipeline from data collection and data processing through to structure solution and refinement. This technique can improve crystal structure prediction and classification, while ML and its tools (for example deep learning) have also been applied, inter alia, to drug discovery, powder diffraction and materials science. Experts in the field will discuss the background and recent advances in ML as applied to structural science.

For all abstracts of the session as prepared for Acta Crystallographica see PDF in Introduction, or individual abstracts below.


Introduction
Presentations
10:20am - 10:25am

Introduction to session

Rita Giordano, Harold Roger Powell



10:25am - 10:55am

Predicting experimental phasing success for data triaging

Melanie Vollmar1, Irakli Sikharulidze1, Gwyndaf Evans1,2

1Diamond Light Source, Didcot, United Kingdom; 2Rosalind Franklin Institute, Didcot, United Kingdom

In recent years there have been large advances in technologies at synchrotron facilities. Photon-counting detectors with high frame rates (several hundred fps) allow for rapid data acquisition. Robotic sample exchangers combined with automated sample centring enable high-throughput sample screening. Fully automated and unattended data collection set-ups offer the possibility to rapidly gather data. Taken together, these technologies produce vast amounts of data which need to be analysed and stored. Even for an expert crystallographer it can now be very challenging to assess the data gathered during an experimental session. For new or non-expert users, the amount of data may even feel overwhelming. Additionally, many research groups do not have access to high-performance computing infrastructure or large storage space to keep and analyse their data, and for research facilities like synchrotrons this infrastructure is limited too.

Here we present some initial results for a machine learning-based triaging system which is currently being trialled at Diamond. The aim is to refine the current brute-force experimental phasing pipelines by introducing data-driven triage and decision making. The system currently in place relies on data fulfilling certain metric thresholds before it is triggered and executes a number of experimental phasing programs in parallel. Each of these programs can run for hours, and up to a day, before producing an output, with no guarantee of success. Based on our initial results presented here, we now propose a machine learning-based decision maker which will estimate the chances of successful experimental phasing for the different software packages available within Diamond's automated data analysis pipelines. The outcome of the classification process is then used to execute subsets of the pipelines in a hierarchical fashion.
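
As a rough illustration of the hierarchical execution described above, the following sketch sorts phasing pipelines into tiers according to a predicted probability of success. The pipeline names, probabilities and thresholds are hypothetical placeholders and are not taken from the Diamond system.

```python
from typing import Dict, List


def triage(success_prob: Dict[str, float],
           run_first: float = 0.7,
           run_later: float = 0.3) -> List[List[str]]:
    """Split pipelines into tiers by predicted success probability.

    Tier 0 would be launched immediately, tier 1 only if tier 0 fails,
    and anything below `run_later` is skipped. Thresholds are illustrative.
    """
    ranked = sorted(success_prob, key=success_prob.get, reverse=True)
    tier0 = [p for p in ranked if success_prob[p] >= run_first]
    tier1 = [p for p in ranked if run_later <= success_prob[p] < run_first]
    return [tier0, tier1]


# Hypothetical classifier output for one dataset.
predicted = {"pipeline_a": 0.85, "pipeline_b": 0.40, "pipeline_c": 0.10}
tiers = triage(predicted)
```

With these made-up numbers only pipeline_a would start at once, pipeline_b would be held in reserve, and pipeline_c would not be run at all.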



10:55am - 11:25am

Deep learning entering the post-protein structure prediction era: new horizons for structural biology

Sergei Grudinin

Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

The potential of deep learning has been recognized in structural bioinformatics for some time, and became indisputable after the CASP13 (Critical Assessment of Structure Prediction) community-wide experiment in 2018. In CASP14, held in 2020, deep learning boosted the field to unexpected levels, reaching near-experimental accuracy. Its results demonstrate a dramatic improvement in computing the three-dimensional structure of proteins from amino acid sequence, with many models rivalling experimental structures. This success comes from advances transferred from several machine-learning areas, including computer vision and natural language processing. At the same time, the community has developed methods specifically designed to deal with protein sequences and structures, and their representations. Novel emerging approaches include (i) geometric learning, i.e. learning on non-regular representations such as graphs, 3D Voronoi tessellations, and point clouds; (ii) pre-trained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of big data, e.g. large meta-genome databases; (v) combining protein representations; and finally (vi) truly end-to-end architectures, i.e. single differentiable models starting from a sequence and returning a 3D structure. These observations suggest that deep learning approaches will also be effective for a range of related structural biology applications that will be discussed in this lecture.



11:25am - 11:45am

How machine learning can supplement traditional quality indicators - and the human eye: A case study

Andrea Thorn1, Kristopher Nolte1, Yunyun Gao1, Sabrina Stäb1, Philip Kollmannsberger2

1Universität Hamburg, Germany; 2Julius-Maximilians-Universität Würzburg, Germany

Detecting ice diffraction artifacts in single-crystal datasets can be very difficult once the data have been integrated, scaled and merged. Automatic tools are available in CTRUNCATE [1], phenix.xtriage [2] and AUSPEX [3]. Recently, the AUSPEX icefinder score was improved by Moreau and colleagues [4]. Automatic recognition of these artifacts would be highly beneficial, as macromolecular structure determination can be negatively impacted or even completely hindered by ice diffraction, but it remains difficult.

In 2017, we showed that inspection of plots of merged intensities against resolution permits easy identification of ice ring contamination in integrated data sets - by eye. However, this approach could be matched by automatic routines. This has led us to attempt identification using convolutional neural networks, which are exceptionally suited to classification of multi-dimensional arrays because they can retain spatial information of the input.

Here, we present our results from employing convolutional neural networks to detect ice artifacts in processed macromolecular diffraction data, resulting in a new automatic detection tool called “Helcaraxe”, which outperforms previous indicators. We will also discuss the scope this may offer for the structural biology community to tap into the vast amount of data the field has accumulated in 50 years of deposition to the Protein Data Bank.
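
To make the idea of a multi-dimensional array input concrete, the sketch below bins reflections on a small grid of resolution against intensity, which is one plausible way of turning an intensity-versus-resolution plot into something a convolutional network could read. The function name, the 1/d^2 resolution axis, the bin counts and the toy reflections are all assumptions made for illustration; this is not the Helcaraxe implementation.

```python
from typing import List, Sequence, Tuple


def intensity_histogram(reflections: Sequence[Tuple[float, float]],
                        d_range: Tuple[float, float],
                        i_range: Tuple[float, float],
                        n_res_bins: int = 4,
                        n_int_bins: int = 3) -> List[List[int]]:
    """Count reflections on a grid of (1/d^2, intensity) bins.

    `reflections` holds (d_spacing_in_angstrom, merged_intensity) pairs and
    `d_range` is (d_min, d_max). Rows are intensity bins, columns are
    resolution bins running from low to high resolution.
    """
    s_lo, s_hi = 1.0 / d_range[1] ** 2, 1.0 / d_range[0] ** 2
    i_lo, i_hi = i_range
    grid = [[0] * n_res_bins for _ in range(n_int_bins)]
    for d, intensity in reflections:
        s = 1.0 / d ** 2
        if not (s_lo <= s < s_hi and i_lo <= intensity < i_hi):
            continue  # outside the plotted window
        col = min(int((s - s_lo) / (s_hi - s_lo) * n_res_bins), n_res_bins - 1)
        row = min(int((intensity - i_lo) / (i_hi - i_lo) * n_int_bins), n_int_bins - 1)
        grid[row][col] += 1
    return grid


# Toy data: the last reflection lies outside the 2.0-4.0 angstrom window.
reflections = [(3.9, 50.0), (3.0, 10.0), (2.1, 290.0), (5.0, 10.0)]
grid = intensity_histogram(reflections, d_range=(2.0, 4.0), i_range=(0.0, 300.0))
```

An ice ring would show up in such a grid as a column whose counts are shifted towards the high-intensity rows compared with its neighbours.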

[1] Winn, M. D., Ballard, C. C., Cowtan, K. D., Dodson, E. J., Emsley, P., Evans, P. R., Keegan, R. M., Krissinel, E. B., Leslie, A. G. W., McCoy, A., McNicholas, S. J., Murshudov, G. N., Pannu, N. S., Potterton, E. A., Powell, H. R., Read, R. J., Vagin, A., & Wilson, K. S. (2011). Overview of the CCP4 suite and current developments. Acta Cryst. D67, 235–242. https://doi.org/10.1107/S0907444910045749

[2] Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C., & Zwart, P. H. (2010). PHENIX: A comprehensive Python-based system for macromolecular structure solution. Acta Cryst. D66, 213–221. https://doi.org/10.1107/S0907444909052925

[3] Thorn, A., Parkhurst, J., Emsley, P., Nicholls, R. A., Vollmar, M., Evans, G., & Murshudov, G. N. (2017). AUSPEX: A graphical tool for X-ray diffraction data analysis. Acta Cryst. D73, 729–737. https://doi.org/10.1107/S205979831700969X

[4] Moreau, D. W., Atakisi, H., & Thorne, R. E. (2021). Ice in biomolecular cryocrystallography. Acta Cryst. D77, 540–554. https://doi.org/10.1107/S2059798321001170



11:45am - 12:05pm

Learning structure-energy relationships for the prediction of molecular crystal structures

Graeme M Day

University of Southampton, Southampton, United Kingdom

The discovery of new functional materials can be guided by computational screening, particularly if the structure of a material can be reliably predicted from its chemical composition. For this application, we have been developing the use of energy-structure-function maps [1] of the crystal structures available to a molecule. These maps help understand the properties of predicted crystal structures and their energetic stabilities. However, the use of these methods is still limited by the computational cost of crystal structure prediction (CSP), most of which is associated with the calculation of the relative energies of predicted crystal structures using energy models that are sufficiently accurate to provide reliable energetic rankings. To accelerate these methods, we have been developing machine learning approaches to predict high quality energies (e.g. from solid state density functional theory) from structures that have been generated with computationally efficient energy models. These approaches rely on statistical models, in our case Gaussian Process Regression, to relate lattice energies to geometric descriptors of crystal structures. The talk will discuss two approaches that we have developed: learning of total energies calculated using solid state density functional theory [2,3], and a fragment-based approach [4] where we learn high level dimer energies, which are used to build up the total lattice energies of predicted structures.
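
The role of Gaussian Process Regression here can be sketched in a few lines: a kernel compares descriptor vectors, and the posterior mean predicts an energy for a new structure from a handful of reference calculations. The example below is a minimal zero-mean GP with a squared-exponential kernel written in plain Python. Learning the correction between a cheap and a high-level energy (rather than the total energy), the one-dimensional descriptors and all numerical values are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def rbf(x: Sequence[float], y: Sequence[float], length: float) -> float:
    """Squared-exponential kernel between two descriptor vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq / length ** 2)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gpr_fit(X, y, length=1.0, noise=1e-6):
    """Return the weights alpha = (K + noise * I)^-1 y."""
    k = [[rbf(xi, xj, length) + (noise if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    return solve(k, list(y))


def gpr_predict(X, alpha, x_new, length=1.0):
    """Posterior mean of the zero-mean GP at x_new."""
    return sum(a * rbf(xi, x_new, length) for a, xi in zip(alpha, X))


# Made-up training set: one descriptor value per structure, energies in kJ/mol.
descriptors = [[0.0], [1.0], [2.0]]
cheap = [-100.0, -98.0, -97.0]        # fast energy model
reference = [-101.5, -98.8, -97.2]    # expensive high-level energies
delta = [r - c for r, c in zip(reference, cheap)]

alpha = gpr_fit(descriptors, delta)
corrected = cheap[1] + gpr_predict(descriptors, alpha, descriptors[1])
```

At a training structure the corrected energy reproduces the reference value to within the small noise term; for an unseen structure the prediction is a kernel-weighted combination of the known corrections.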

[1] Day, G. M. and Cooper, A. I. (2018) Adv. Mater., 30, 1704944.

[2] Musil, F., De, S., Yang, J., Campbell, J. E., Day, G. M. and Ceriotti, M. (2018) Chem. Sci., 9, 1289–1300.

[3] Egorova, E., Hafizi, R., Woods, D. C. and Day, G. M. (2020) J. Phys. Chem. A, 124, 8065–8078.

[4] McDonagh, D., Skylaris, C.-K. and Day, G. M. (2019) J. Chem. Theory Comput., 15, 2743–2758.



12:05pm - 12:25pm

New generalized crystallographic descriptors for structural machine learning

James Cumby, Sohan Seth, Ruizhi Zhang

University of Edinburgh, Edinburgh, United Kingdom

The ever-growing amount of crystallographic data offers the potential to uncover a range of scientific discoveries, from rapidly predicting physical properties to suggesting new materials with desirable functional behaviours. This is further enhanced by the current growth in machine learning (ML) algorithm development and implementation. There is, however, a significant obstacle to this goal: standard crystallographic information is not a suitable input for ML algorithms. This arises from the inherent flexibility of crystallography, such as non-unique unit cell definitions and symmetry. To overcome this problem, significant progress has been made in devising ‘descriptors’ for crystallographic ML, compressing and standardising crystallographic information into a smaller feature space. Much of the existing focus has been on molecular crystals, where the finite extent of individual molecules imposes a limit on the size of feature vector required. A large number of approaches have been proposed but do not easily extrapolate to extended (i.e. inorganic) materials [1]. The descriptors that are suitable for extended solids tend to be either hand-crafted for a specific problem, or have so many dimensions that extremely large datasets must be used to train reliable ML models. In addition, many do not scale well with variable numbers of atomic species.

Here, we present two new descriptors for crystallographic materials which are generally applicable and invariant to compositional complexity. The first is based on a real-space view of the structure, the second on a reciprocal (or diffraction) space view. Both descriptions are invariant to atomic permutations and unit cell choice, and can be considered as an ‘extended’ (i.e. more information-rich) version of the atomic radial distribution function (RDF) and powder diffraction pattern, respectively. The more complete features offered by these descriptors result in better physical property predictions. For example, our ‘extended’ RDF can predict bulk modulus from crystal structures obtained from the Materials Project [2] with a much lower error than the ‘simple’ RDF using linear ridge regression (Figure 1). It is notable that the error approaches current state-of-the-art results [3] without any knowledge of the atom types involved.
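
The invariance to unit cell choice mentioned above can be checked on the ‘simple’ RDF that serves as the baseline. The sketch below computes a per-atom histogram of interatomic distances for an orthorhombic cell and compares a primitive simple-cubic cell with a doubled cell describing the same structure. It illustrates only the plain RDF, not the extended descriptor presented in the talk; the function name, cutoff and bin count are assumptions.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def rdf_histogram(frac: Sequence[Tuple[float, float, float]],
                  cell: Tuple[float, float, float],
                  r_max: float,
                  n_bins: int) -> List[float]:
    """Per-atom histogram of interatomic distances shorter than r_max.

    `frac` holds fractional coordinates and `cell` the edge lengths
    (a, b, c) of an orthorhombic unit cell, in the same units as r_max.
    """
    reach = [int(math.ceil(r_max / edge)) for edge in cell]
    shifts = list(itertools.product(*(range(-n, n + 1) for n in reach)))
    hist = [0.0] * n_bins
    for i, p in enumerate(frac):
        for j, q in enumerate(frac):
            for shift in shifts:
                if i == j and shift == (0, 0, 0):
                    continue  # skip the atom paired with itself
                d = math.sqrt(sum(((q[k] + shift[k] - p[k]) * cell[k]) ** 2
                                  for k in range(3)))
                if d < r_max:
                    b = min(int(d / r_max * n_bins), n_bins - 1)
                    hist[b] += 1.0 / len(frac)  # normalise per atom
    return hist


# The same simple-cubic structure described with two different unit cells.
primitive = rdf_histogram([(0.0, 0.0, 0.0)], (1.0, 1.0, 1.0), 1.6, 4)
doubled = rdf_histogram([(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)], (2.0, 1.0, 1.0), 1.6, 4)
```

Both calls return the same histogram, with six neighbours at the cell edge length and twelve at the face diagonal, and reordering the atom list does not change the result either.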

[1] Rossi, K. & Cumby, J. (2020). Int. J. Quantum Chem., 120, e26151.

[2] Jain, A., Ong, S. P., Hautier, G., et al. (2013). APL Mater., 1(1), 011002.

[3] Chen, C., Ye, W., Zuo, Y. et al. (2019). Chem. Mater., 31, 3564.



12:25pm - 12:45pm

Analysis of pre-edge XANES spectra of Fe:SiO4 system by using machine learning methods

Danil Pashkov, Alexander Guda, Sergey Guda, Alexander Soldatov

Southern Federal University, Rostov-on-Don, Russian Federation

The X-ray absorption near-edge structure (XANES) spectra of some nanostructures exhibit small peaks when the incident X-ray energy is lower than the main absorption edge energy. The energies of these peaks depend on the local environment, the valency of the chemical elements and the density of electronic states. Advanced quantitative analysis of the local atomic geometry around active catalytic sites requires novel experimental methods, e.g. the pre-edge structure of XANES spectra measured in the high-energy resolution fluorescence detected mode, the so-called HERFD-XANES. However, there is no widely used ab initio theoretical method that could be routinely applied to the analysis of such experimental data, except parametric multiplet calculations. One way to avoid the procedure of adjusting parameters is to use a local DFT Hamiltonian constructed on the basis of Wannier orbitals, the so-called multiplet ligand-field theory (MLFT) [1]. The pre-edge region of X-ray absorption spectra can be calculated using the XTLS code in the framework of multiplet ligand-field theory using maximally localized Wannier functions (MLWF).

Computing pre-edge XANES spectra with the MLFT approach is a complicated process that requires many software packages, such as Wien2k, Wannier90, the XTLS code, and some additional programs and scripts. We developed the «w2auto» program, which automates the whole process of pre-edge XANES computation. «w2auto» emulates work in the w2web interface of Wien2k and makes it possible to run all the necessary programs without user intervention. The launch of the necessary calculation steps is controlled through a configuration script written in Python. We also developed a simple GUI for users who do not have any experience of programming in Python. It helps to generate the configuration file in the form of a Python script.

In recent years machine learning has become a powerful instrument for solving scientific problems. It helps to classify and sort data, make approximations, and find latent dependencies and features. In this work we have applied machine learning methods to the analysis of the Fe:SiO4 pre-edge XANES spectra. As recently shown, machine learning methods have been successfully applied to the quantitative analysis of spectroscopic data in general and of X-ray near edge spectroscopy (XANES) in particular [2-4].

In the present work we show the applicability of machine learning methods to retrieving structural information in the Fe:SiO4 system. In this research we have collected 60 pre-edge XANES spectra for different coordinations (from 2-fold to 6-fold) and oxidation states (Fe2+ and Fe3+) using the «w2auto» program. We used this dataset to train and validate several machine learning methods (Decision Tree, ExtraTrees, SVM, Logistic regression and neural network) to determine both the coordination number and the oxidation state from a spectrum.
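
The train-and-validate step can be sketched with a deliberately simple stand-in. The example below uses a one-nearest-neighbour classifier, which is not one of the methods listed above, together with leave-one-out validation to predict the coordination number and the oxidation state jointly. The four-point "spectra" and their labels are invented for illustration and are not taken from the 60-spectrum dataset.

```python
from typing import List, Sequence, Tuple

Label = Tuple[int, str]  # (coordination number, oxidation state)


def nearest_label(train_x: Sequence[Sequence[float]],
                  train_y: Sequence[Label],
                  x: Sequence[float]) -> Label:
    """Label of the training spectrum closest to x (squared Euclidean distance)."""
    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[best]


def leave_one_out_accuracy(spectra: List[List[float]],
                           labels: List[Label]) -> float:
    """Fraction of spectra whose full label is recovered when held out."""
    hits = 0
    for i in range(len(spectra)):
        rest_x = spectra[:i] + spectra[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        hits += nearest_label(rest_x, rest_y, spectra[i]) == labels[i]
    return hits / len(spectra)


# Invented spectra sampled at four energies.
spectra = [
    [0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0],  # labelled 4-fold Fe2+
    [0.0, 0.1, 0.9, 0.2], [0.0, 0.0, 0.8, 0.3],  # labelled 6-fold Fe3+
]
labels = [(4, "Fe2+"), (4, "Fe2+"), (6, "Fe3+"), (6, "Fe3+")]
accuracy = leave_one_out_accuracy(spectra, labels)
```

Treating the two targets as one combined label is a design choice of this sketch; they could equally be predicted by two separate classifiers and scored independently.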

Acknowledgment

The work was supported by grant of President of Russia for young scientists (MK-2730.2019.2).

References

[1] E. Gorelov, A.A. Guda, M.A. Soldatov, S.A. Guda, D. Pashkov, A. Tanaka, S. Lafuerza, C. Lamberti, A.V. Soldatov, MLFT approach with p-d hybridization for ab initio simulations of the pre-edge XANES, Radiation Physics and Chemistry, 2018, DOI: 10.1016/j.radphyschem.2018.12.025.

[2] A. Martini, S. A. Guda, A. A. Guda, G. Smolentsev, A. S. Algasov, O. A. Usoltsev, M. A. Soldatov, A. L. Bugaev, Y. V. Rusalev, A. V. Soldatov, PyFitit: the software for quantitative analysis of XANES spectra using machine learning algorithms, Computer Physics Communications, 2019.

[3] A. A. Guda, S. A. Guda, K. A. Lomachenko, M. A. Soldatov, I. A. Pankin, A. V. Soldatov, L. Braglia, A. L. Bugaev, A. Martini, M. Signorile, E. Groppo, A. Piovano, E. Borfecchia, C. Lamberti, Quantitative structural determination of active sites from in situ and operando XANES spectra: From standard ab initio simulations to chemometric and machine learning approaches, Catalysis Today, V. 336, 2019, P. 3-21, DOI: 10.1016/j.cattod.2018.10.071.

[4] Guda, A. A., Guda, S. A., Martini, A., Bugaev, A., Soldatov, M. A., Soldatov, A. V. & Lamberti, C. (2019). Machine learning approaches to XANES spectra for quantitative 3D structural determination: The case of CO2 adsorption on CPO-27-Ni MOF. Radiation Physics and Chemistry. 108430. DOI: 10.1016/j.radphyschem.2019.108430.