Generalized Correlation for Biomolecular Dynamics

Oliver F. Lange and Helmut Grubmüller

Correlated motions in biomolecules, in particular proteins, are ubiquitous and often essential for biomolecular function. Correct assessment of correlated motions, both experimentally and from theory and simulations, is therefore crucial for a quantitative understanding of biomolecular function. The accurate characterization of correlated motions would also improve the interpretation of NMR experiments and X-ray diffuse scattering data. Here we describe how to obtain correlations from MD simulations. Any experiment that probes correlations in the motion of pairs of atoms does so in a way that is invariant to the definition of the Cartesian coordinate system. We therefore need a correlation measure that is likewise invariant to the chosen coordinate system.

Generalized versus Pearson correlation coefficients

Above diagonal: Matrix of generalized correlation coefficients for atomic motion of T4 lysozyme. Below diagonal: Matrix of Pearson correlation coefficients for the same motion. Note that the patterns below the diagonal are due to artifacts of this established measure.

Correlated motion of domains

(Generalized) correlated motion of domains mapped onto the structure of lysozyme. The two domains (red) move in a highly correlated manner.

Pearson correlation measure

The first approach, which generalizes the Pearson correlation coefficient to atomic displacement vectors, rests on the calculation of the normalized covariance matrix of atomic fluctuations,

$$C_{ij} = \frac{\langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle}{\sqrt{\langle \mathbf{x}_i^2 \rangle \, \langle \mathbf{x}_j^2 \rangle}},$$

where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the positional fluctuation vectors of atoms $i$ and $j$, respectively, in the molecular fixed frame. This established approach, however, misses a considerable fraction of the correlated motions and therefore usually underestimates atomic correlations [Lange, 2006]. This limitation is mainly due to three assumptions:

    • First, estimates of correlations from the Pearson coefficient are strictly valid only if the fluctuation vectors of the two atoms are co-linear.
    • Second, the coefficient rests on a linear approximation and thus misses non-linear correlations.
    • Third, the measure is not well-defined, because it is not invariant to rescaling.
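The first point can be illustrated with a minimal sketch. It assumes that the coefficient is the normalized average of the dot product of the two fluctuation vectors; the function name `vector_pearson` and the synthetic "trajectories" are hypothetical and are not taken from g_correlation. Atom j follows atom i exactly in both cases, yet the coefficient vanishes when the two atoms move along perpendicular axes.

```python
import math
import random


def vector_pearson(xi, xj):
    """Normalized covariance <xi.xj> / sqrt(<xi^2><xj^2>) of two fluctuation series.

    xi, xj: lists of (x, y, z) positions, one per frame, assumed to be already
    fitted to a molecule-fixed frame. Fluctuations are taken about the mean.
    """
    n = len(xi)
    mean_i = [sum(p[k] for p in xi) / n for k in range(3)]
    mean_j = [sum(p[k] for p in xj) / n for k in range(3)]
    dot = sq_i = sq_j = 0.0
    for p, q in zip(xi, xj):
        di = [p[k] - mean_i[k] for k in range(3)]
        dj = [q[k] - mean_j[k] for k in range(3)]
        dot += sum(a * b for a, b in zip(di, dj))
        sq_i += sum(a * a for a in di)
        sq_j += sum(b * b for b in dj)
    return dot / math.sqrt(sq_i * sq_j)


random.seed(0)
# Atom i fluctuates along x only (synthetic data, not from a simulation).
atom_i = [(random.gauss(0.0, 1.0), 0.0, 0.0) for _ in range(2000)]
# Atom j follows atom i exactly, along a parallel or a perpendicular axis.
atom_j_parallel = [(2.0 * x, 0.0, 0.0) for x, _, _ in atom_i]
atom_j_perpendicular = [(0.0, 2.0 * x, 0.0) for x, _, _ in atom_i]

c_parallel = vector_pearson(atom_i, atom_j_parallel)
c_perpendicular = vector_pearson(atom_i, atom_j_perpendicular)
print(f"parallel: {c_parallel:.3f}, perpendicular: {c_perpendicular:.3f}")
```

In the parallel case the coefficient is one; in the perpendicular case every dot product is zero, so the fully coupled motion goes undetected.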

    Generalized Correlation in a simple picture

    Imagine you observe the gray joint distribution of two variables. From it, compute the hypothetical joint distribution (black) that you would observe if the variables were uncorrelated. The volume difference (quantified in terms of entropy) between the two distributions, gray and black, gives you the generalized correlation measure.
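The picture can be sketched numerically for two one-dimensional variables. The example below is an illustration only: it uses a simple histogram estimate of the mutual information, which is an assumption made for brevity and not the estimator used in g_correlation, and the function names and synthetic data are hypothetical. The joint histogram plays the role of the gray distribution, and the product of its marginals plays the role of the black one.

```python
import math
import random


def mutual_information(xs, ys, bins=12):
    """Histogram estimate (in nats) of the mutual information of two samples."""
    def bin_index(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    n = len(xs)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    joint = {}  # counts of the observed ("gray") joint distribution
    for x, y in zip(xs, ys):
        key = (bin_index(x, x_lo, x_hi), bin_index(y, y_lo, y_hi))
        joint[key] = joint.get(key, 0) + 1
    px, py = {}, {}  # marginal counts; their product is the "black" distribution
    for (a, b), c in joint.items():
        px[a] = px.get(a, 0) + c
        py[b] = py.get(b, 0) + c
    # Sum over occupied cells of p(x,y) * ln[p(x,y) / (p(x) p(y))].
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def pearson(xs, ys):
    """Ordinary Pearson correlation coefficient of two scalar samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))


random.seed(0)
n = 5000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys_dependent = [x * x + random.gauss(0.0, 0.05) for x in xs]  # non-linear coupling
ys_independent = [random.uniform(0.0, 1.0) for _ in range(n)]  # no coupling

mi_dependent = mutual_information(xs, ys_dependent)
mi_independent = mutual_information(xs, ys_independent)
rho_dependent = pearson(xs, ys_dependent)
print(f"MI dependent: {mi_dependent:.2f}, MI independent: {mi_independent:.2f}, "
      f"Pearson dependent: {rho_dependent:.2f}")
```

The quadratic coupling leaves the Pearson coefficient close to zero, while the mutual information is clearly larger than for the independent pair. Histogram estimates carry a small positive bias, so the independent case does not give exactly zero.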

    Generalized Correlation

    • any correlation is captured
    • sound information theoretical basis
    • scaling invariant
    • can consistently be used for measuring correlation between groups of atoms of any size
    • a linearized version exists, which allows purely non-linear correlations to be separated from linear ones

    The generalized correlation measure rests on the fundamental definition of independence of random variables. Accordingly, two random variables are independent if and only if their joint distribution is the product of their marginal distributions,

    $$p(x, y) = p(x)\, p(y).$$

    The basic idea is to quantify the correlation between variables X, Y as the deviation between both sides of the above equation, i.e., by the deviation from the case of two independent random variables (see figure). This deviation is measured by the mutual information, as laid out in [Lange, 2006].
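For Gaussian (linearized) statistics the mutual information can be written in terms of covariance determinants, I_lin = 1/2 [ln det C_i + ln det C_j - ln det C_ij]. The sketch below assumes this form and the normalization r = sqrt(1 - exp(-2 I / d)) with d = 3; both should be checked against [Lange, 2006] before use. The function names and the synthetic data are hypothetical, and this is not the g_correlation implementation.

```python
import math
import random


def covariance(rows):
    """Covariance matrix of a list of equally long coordinate rows (one per frame)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / n
             for b in range(d)] for a in range(d)]


def log_det(matrix):
    """Log-determinant of a positive-definite matrix via Gaussian elimination."""
    m = [row[:] for row in matrix]
    total = 0.0
    for k in range(len(m)):
        total += math.log(m[k][k])  # pivots stay positive for such matrices
        for r in range(k + 1, len(m)):
            f = m[r][k] / m[k][k]
            for c in range(k, len(m)):
                m[r][c] -= f * m[k][c]
    return total


def linear_generalized_correlation(xi, xj):
    """r = sqrt(1 - exp(-2 I_lin / d)), with I_lin from covariance determinants."""
    d = len(xi[0])
    joint = [tuple(p) + tuple(q) for p, q in zip(xi, xj)]
    i_lin = 0.5 * (log_det(covariance(xi)) + log_det(covariance(xj))
                   - log_det(covariance(joint)))
    return math.sqrt(max(0.0, 1.0 - math.exp(-2.0 * i_lin / d)))


random.seed(1)
# Synthetic data: atom j copies atom i with its axes permuted (a rotation), plus noise.
atom_i = [tuple(random.gauss(0.0, 1.0) for _ in range(3)) for _ in range(3000)]
atom_j = [(z + random.gauss(0.0, 0.3),
           x + random.gauss(0.0, 0.3),
           y + random.gauss(0.0, 0.3)) for x, y, z in atom_i]

r_lin = linear_generalized_correlation(atom_i, atom_j)

# Dot-product (Pearson-type) coefficient for comparison; the means are close to
# zero by construction, so they are not subtracted here.
dot = sum(a * b for p, q in zip(atom_i, atom_j) for a, b in zip(p, q))
norm = math.sqrt(sum(a * a for p in atom_i for a in p)
                 * sum(b * b for q in atom_j for b in q))
pearson_type = dot / norm
print(f"linearized generalized: {r_lin:.2f}, Pearson-type: {pearson_type:.2f}")
```

The rotated copy is strongly coupled to the original motion, which the determinant-based coefficient reflects, whereas the dot-product coefficient stays near zero. For one-dimensional variables the expression reduces to the absolute value of the ordinary Pearson coefficient.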

    Software

    We contributed the tool g_correlation to the GROMACS framework; it computes both linear and non-linear generalized correlation coefficients. You need to install GROMACS first if you have not already done so. See the INSTALL file for instructions. In the subdirectory mfiles you will find some scripts for MATLAB: read_blitz.m reads the *.dat output of g_correlation and gives you the matrix of correlation coefficients in MATLAB. To plot a matrix as shown above, use plot_corr_matrix.m. If you have any questions, feel free to contact me.


    Changelog:

    • Ver 0.x: C++ version, abandoned due to several reports of installation problems
    • Ver 1.x: C-version
    • Ver 1.0.1: added Makefile_gmx321 to allow simple installation together with gromacs 3.2.1
    • Ver 1.0.2: removed problem with MPI due to deprecated #define statement (MPI job exits with signal 11)

    The software is free for everyone. However, if you use it for publications or presentations you should cite the original publication [Lange, 2006]. The current version applies an algorithm from [4], which should be cited, too. Please note that the software is distributed with NO WARRANTY OF ANY KIND. The author is not responsible for any losses or damages suffered directly or indirectly from the use of the software. Use it at your own risk. Please send your bug reports, comments and suggestions to Oliver Lange! Enjoy!
