# Generalized Correlation for Biomolecular Dynamics

Oliver F. Lange and Helmut Grubmüller

#### Generalized versus Pearson correlation coefficients

Above diagonal: matrix of generalized correlation coefficients for the atomic motion of T4 lysozyme. Below diagonal: matrix of Pearson correlation coefficients for the same motion. Note that the patterns below the diagonal are artifacts of this established measure.

#### Correlated motion of domains

(Generalized) correlated motion of domains mapped onto the structure of lysozyme. The motions of the two domains (red) are highly correlated.

#### Pearson correlation measure

The conventional approach, which generalizes the Pearson correlation coefficient to atomic motion, rests on the normalized covariance matrix of atomic fluctuations, $C_{ij} = \langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle / \sqrt{\langle \mathbf{x}_i^2 \rangle \langle \mathbf{x}_j^2 \rangle}$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the positional fluctuation vectors of atoms $i$ and $j$, respectively, in the molecular fixed frame. This established approach, however, misses a considerable fraction of the correlated motions and, therefore, usually underestimates atomic correlations [Lange, 2006]. This limitation is mainly due to three assumptions:

• First, estimates of correlations from the Pearson coefficient are strictly valid only if $\mathbf{x}_i$ and $\mathbf{x}_j$ are co-linear vectors.
• Second, it relies on a linear approximation. Thus, the Pearson correlation coefficient misses non-linear correlations.
• Third, the measure is not well-defined, because it is not invariant to rescaling.
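The first limitation can be made concrete with a small sketch. The Python code below is an illustration only and not part of g_correlation; the function name `pearson_matrix` and the toy trajectory are assumptions made for this example. It computes the normalized covariance of the fluctuation vectors, assuming the frames are already superimposed on a molecule-fixed frame, for three atoms: atom 1 moves parallel to atom 0, while atom 2 moves in perfect synchrony with atom 0 but perpendicular to it.

```python
import math


def pearson_matrix(traj):
    """Normalized covariance matrix of atomic fluctuations.

    traj[t][i] is the (x, y, z) position of atom i in frame t,
    assumed to be already fitted to a molecule-fixed reference frame.
    """
    n_frames = len(traj)
    n_atoms = len(traj[0])

    # Mean position of every atom over the trajectory.
    mean = [
        [sum(traj[t][i][k] for t in range(n_frames)) / n_frames for k in range(3)]
        for i in range(n_atoms)
    ]

    # Fluctuation vectors: position minus mean position.
    fluct = [
        [[traj[t][i][k] - mean[i][k] for k in range(3)] for i in range(n_atoms)]
        for t in range(n_frames)
    ]

    def avg_dot(i, j):
        # Trajectory average of the dot product of two fluctuation vectors.
        total = 0.0
        for t in range(n_frames):
            total += sum(fluct[t][i][k] * fluct[t][j][k] for k in range(3))
        return total / n_frames

    msf = [avg_dot(i, i) for i in range(n_atoms)]  # mean square fluctuations
    return [
        [avg_dot(i, j) / math.sqrt(msf[i] * msf[j]) for j in range(n_atoms)]
        for i in range(n_atoms)
    ]


# Toy trajectory (hypothetical data): one shared oscillation s drives all atoms.
traj = []
for s in (-1.0, 0.0, 1.0, 0.0):
    atom0 = (s, 0.0, 0.0)              # oscillates along x
    atom1 = (5.0 + 2.0 * s, 0.0, 0.0)  # parallel to atom 0, larger amplitude
    atom2 = (0.0, s, 3.0)              # same oscillation, but along y
    traj.append([atom0, atom1, atom2])

C = pearson_matrix(traj)
print("C[0][1] =", C[0][1])  # parallel motion
print("C[0][2] =", C[0][2])  # perpendicular motion
```

In this toy case the coefficient for the parallel pair is one, whereas the perpendicular pair receives a coefficient of zero although both atoms are driven by the same oscillation.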

#### Generalized Correlation in a simple picture

Imagine you observe the gray joint distribution of two variables. From this, compute the hypothetical joint distribution (black) that you would observe if the variables were uncorrelated. The volume difference (quantified in terms of entropy) between the two distributions, gray and black, gives you the generalized correlation measure.

#### Generalized Correlation

• any correlation is captured
• sound information theoretical basis
• scaling invariant
• can consistently be used for measuring correlation between groups of atoms of any size
• a linearized version exists, which allows purely non-linear correlations to be separated from linear ones

The generalized correlation measure rests on the fundamental definition of independence of random variables. Accordingly, two random variables $X$ and $Y$ are independent if and only if their joint distribution is the product of their marginal distributions, $p(x, y) = p(x)\,p(y)$.

The basic idea is to quantify the correlation between the variables $X$ and $Y$ as the deviation between both sides of the above equation, i.e., by the deviation from the case of two independent random variables (see figure). This deviation is measured by the mutual information, as laid out in [Lange, 2006].
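As a rough illustration of this idea, the following Python sketch estimates the mutual information $I[X;Y] = \iint p(x,y) \ln \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy$ from a simple histogram and maps it onto a coefficient between 0 and 1. Several parts are assumptions made for this example: the histogram (plug-in) estimator is a stand-in chosen for brevity and is not the estimator used by g_correlation; the mapping $r = \sqrt{1 - e^{-2I/d}}$ is assumed here because, for jointly Gaussian variables in $d = 1$, it recovers the absolute value of the Pearson coefficient; and all function names are hypothetical.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Ordinary Pearson correlation coefficient of two scalar samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mutual_information(xs, ys, bins=8):
    """Histogram (plug-in) estimate of the mutual information in nats."""

    def digitize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against constant input
        # The maximum value is clamped into the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = digitize(xs), digitize(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))  # counts for the joint distribution
    marg_x = Counter(bx)          # counts for the marginals
    marg_y = Counter(by)
    # Sum over occupied cells of p(x,y) * ln( p(x,y) / (p(x) p(y)) ).
    return sum(
        c / n * math.log(c * n / (marg_x[a] * marg_y[b]))
        for (a, b), c in joint.items()
    )


def generalized_correlation(mi, dim=1):
    """Map mutual information to [0, 1) via the assumed r = sqrt(1 - exp(-2 I / d))."""
    return math.sqrt(max(0.0, 1.0 - math.exp(-2.0 * mi / dim)))


# A purely non-linear dependence: y is fully determined by x, yet symmetric in x.
xs = [(i - 500) / 500 for i in range(1001)]
ys = [x * x for x in xs]

r_pearson = pearson(xs, ys)
r_general = generalized_correlation(mutual_information(xs, ys))
print("Pearson:    ", r_pearson)
print("generalized:", r_general)
```

For this parabola the Pearson coefficient vanishes by symmetry, while the mutual-information-based coefficient is clearly non-zero. The numerical value depends on the number of bins, which is one reason a histogram estimator is only a didactic substitute here.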

#### Software

We contributed the tool g_correlation to the GROMACS framework; it computes both linear and non-linear generalized correlation coefficients. You also need to install GROMACS if you have not already done so. Read the INSTALL file for instructions. In the subdirectory mfiles you will find some scripts for MATLAB: read_blitz.m reads the *.dat output of g_correlation and gives you a matrix of the correlation coefficients in MATLAB. To plot a matrix as shown above, you can use plot_corr_matrix.m. If you have any questions, feel free to contact me.

Changelog:

• Ver 0.x: C++ version, abandoned due to several reports of installation problems
• Ver 1.x: C version
• Ver 1.0.1: added Makefile_gmx321 to allow simple installation together with GROMACS 3.2.1
• Ver 1.0.2: removed problem with MPI due to deprecated #define statement (MPI job exits with signal 11)

The software is free for everyone. However, if you use it for publications or presentations you should cite the original publication [Lange, 2006]. The current version applies an algorithm from , which should be cited, too. Please note that the software is distributed with NO WARRANTY OF ANY KIND. The author is not responsible for any losses or damages suffered directly or indirectly from the use of the software. Use it at your own risk. Please send your bug reports, comments and suggestions to Oliver Lange! Enjoy!
