Klauda Lab Simulation Wiki

Pairwise 2D-RDF and Clustering Analysis

Author: Yalun Yu (alanyu17@terpmail.umd.edu; yalun.research@gmail.com)

Bug fix log:

08/24/2020: An asymmetric membrane only bug found in clustering.py (line # 757); status: code fixed and updated, example not updated.

cluster.tar.gz cluster-example.tar.gz

This new code allows pair-specific “Dcut” in clustering analysis. To find the proper pair/lipid-specific “Dcut”, please use the 2D-RDF code included in the same mini-package.

Background

Different lipids have different sizes, so that we want to allow pair/lipid-specific Dcuts. You can compare this to LJ radius, where Rmin = Rmin(a)/2 + Rmin(b)/2 (we call this combination rule. Apparently, this is atom-specific). Or, if you are using a NBFIX (pair-specific interaction) for the LJ interaction between atom type a & b, Rmin=Rmin (NBFIX between a & b). The concept of NBFIX can be found at:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2853947/

The new code is inspired by this. In the code, one can assign each lipid type a lipid-specific half-Dcut, while pair-specific Dcuts are also allowed for selected lipid pairs.

Why this is doable?

The DBSCAN algorithm used in the old clustering code uses the XY coordinates as inputs. However, DBSCAN also allows distance matrix as inputs, which means we can scale the elements in the real distance matrix as we want. For example, we have a matrix:

[Particle] A           B         C     ...
A          0           1.5       1.5
B          1.5         0         2
C          1.5         2         0

If we think A is a smaller particle (a “smaller” lipid in our case), then we might want to remove the bias by dividing the distances between A & other particles with a number smaller than 1, or by dividing the distances between any two particles that are not A by a number larger than 1. The number divided doesn’t matter, the only thing matters is the relative values. So, I use the Dcut(i-j) matrix as the denominator, where i, j can be interpreted as the residue indexes. After getting the scaled distance matrix, I feed it into DBSCAN using an eps of 1.0 to “cluster” them (see sklearn DBSCAN for definition of eps).

How to Use:

(1) Install conda (https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html) and create a conda environment named as you want (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands), and install the following packages for the conda environment:

numpy, mdtraj, scikit-learn, openmm, matplotlib

I recommend using bash instead of csh to avoid glitches when using conda! On DT2, you can change to bash by just typing “bash”, you can also change your default shell following the instructions on Deepthought2 website.

A quick one step command could be (after installing conda):

conda create -n cluster -c omnia python=3.6 openmm mdtraj numpy scikit-learn matplotlib

Using this command, we created a conda environment (env) called “cluster” and installed several packages into that environment. To use this env, type “conda activate cluster” in the command line, and you should see your command line changing from [login-1 ~]$ … to (cluster) [login-1 ~]$ …

(2) The RDF code

Two versions of the RDF code are in the “rdf” and “rdf-explicit” subfolders. The two are basically the same, and the only difference between them is the method to separate upper and lower leaflets. In “rdf-explicit”, a coordinate file where the membrane is centered at 0 in the z direction is needed as input (usually step5_assembly.psf from CHARMM-GUI meets this requirement). In the worst case, a coordinate file extracted from a re-centered trajectory can be used. In the other one (“rdf”), the user have to find the last atom index of the upper leaflet from the psf file. My experience is that the “rdf” template is good for membrane containing only phospholipids built from CHARMM-GUI. If you have any lipid built by patching two residues (for example, the glycolipids), then you should use the “rdf-explicit” template. To extract the coordinates from a re-centered trajectory (if you really need to), first get the re-centered trajectory following the standard steps in the EDP section, and then use the extract.inp in “ref-explicit” folder to get the .crd file. The instructions are all in-place within the extract.inp file.

To run the rdf analysis between any two selections, update the corresponding parts in the run_rdf.py file and submit the job by “sbatch run_rdf.sh”. Please refer to http://mdtraj.org/1.9.3/atom_selection.html for correct selection word format.

To results will be saved in the data folder after you run the job. First column is the radius in nm; second column is the avgerage RDF of two leaflets; third/fourth columns are the upper/lower leaflet RDFs.

Note this RDF code also allows self-RDF (just make the two atom selection words the same).

(3) The clustering code

a. In clustering.py:

Update/add the half-Dcut (in nanometer) for each head type. Discuss with Dr. Klauda for these numbers (please do self-RDFs and/or pairwise RDFs before

def radius(head):
    if head == 'pe':
        return 0.29
    elif head == 'pc':
        return 0.29
    elif head == 'pg':
        return 0.29
    ...

Update/add the atoms you want to use to get the (average) X/Y of lipids (you can refer to the old cluster code for this part):

def rpatom(head):
    phos = ['P', 'O11', 'O12', 'O13', 'O14']
    if head.lower() == 'cer':
        return ['NF']
    elif head.lower() == 'sito':
        return ["O3"]
    elif head.lower() == 'pe':
        return phos + ['C11', 'C12', 'N']
    elif head.lower() == 'pc':
        return phos + ['C11', 'C12', 'N']
    elif head.lower() == 'pg':
        return phos + ['C11', 'C12', 'C13', 'OC2', 'OC3']
    elif head.lower() == 'pi':
        return phos + ['C11', 'C12', 'C13', 'C14', 'O4', 'C15', 'C16']
    elif head.lower() == 'ps':
        return phos + ['C11', 'C12', 'C13', 'N']
    else:
        raise Exception('Not anticipated head type: {}'.format(head))

* use lower case for head group name but upper case for atom name.

b. In run_cluster.py, the pair specific Dcut can be implemented by the clfix argument when creating a ClusterLip instance (see the code and you will know what I mean). The minimum number of lipids in a cluster can be modified using the min_lipids argument.

c. Include ALL lipids in your system.

d. Other changes should follow the instructions in the code.

(4) The util.py

a. Make corresponding edits in util.py (could copy from run_cluster.py) before submitting job

b. Submit the clustering analysis by “sbatch run_cluster.sh”

Check the results:

* in topperc.png / botperc.png, you can find the composition of your membrane leaflet (labeled as composition) and the composition of the clusters (labeled ad in cluster). Same info can be found in the topperc.txt / botperc.txt files.

* in resname_vs_time_bot/top.txt, you can find the number of a specific residue type in clusters vs. time.

* in top_num.png / bot_num.png, you can find the number of all lipids in clusters vs. time.

* in topsize.png / botsize.png, you can find the distribution of the cluster size.

(5) Visualization

If you want to visualize the clusters, please use the visualize.py file.

a. activate conda environment

b. Make corresponding edits in visualize.py

c. define colors for different headgroups at the beginning of clustering.py.

def color_(head):
    # used for cluster visualization
    if head == 'pe':
        return 'yellow'
    elif head == 'pc':
        return 'yellow'
    elif head == 'pg':
        return 'yellow'
    elif head == 'pi':
        return 'yellow'
    elif head == 'ps':
        return 'yellow'
    elif head == 'sito':
        return 'b'
    elif head == 'cer':
        return 'r'

* Could also use “else:” command to minimize repetitive line entry for colors of lipids not of our interest

d. You can define what's included in the cluster plot by editing the top_res_info and bot_res_info in visualize.py. (Note: the code structure is identical to that in run_cluster.py, but you do not have to include all lipids in the system and just the lipids of interest.)

e. Usually it takes about 10 seconds to generate two cluster pictures (upper&lower leaflets) for one frame, so that I won't use slurm. So, just type python visualize.py and wait for a while (remember to activate the conda env you created). However, if you want to create pictures for more than 10 frames, please consider using slurm. Again, the instructions are within the code.