InMeRF is a tool to predict the pathogenicity of nonsynonymous SNVs (nsSNVs) using 150 discriminant models independently generated for all possible amino acid (AA) substitutions.
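
The figure of 150 matches the number of ordered amino acid pairs that can be interconverted by a single-nucleotide change under the standard genetic code, once synonymous changes and changes to or from stop codons are excluded. The sketch below illustrates that count; it is not code from InMeRF, and the function and variable names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_nucleotide_aa_substitutions():
    """Return the set of (reference AA, alternate AA) pairs reachable by
    changing exactly one base of a codon, ignoring stop codons."""
    substitutions = set()
    for codon, ref_aa in CODON_TABLE.items():
        if ref_aa == "*":
            continue  # skip stop codons as the starting point
        for pos, base in product(range(3), BASES):
            if base == codon[pos]:
                continue  # not a change
            alt_aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
            if alt_aa != "*" and alt_aa != ref_aa:
                substitutions.add((ref_aa, alt_aa))
    return substitutions


substitutions = single_nucleotide_aa_substitutions()
print(len(substitutions))  # 150
```

Each pair is directional, so for example arginine to histidine and histidine to arginine are counted as two separate substitutions.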

Publication

Materials and Methods

  1. A total of 72,556 pathogenic nsSNVs were extracted from the Human Gene Mutation Database (HGMD) Pro 2015.2 [CLASS = DM (disease-causing mutation)] included in dbNSFP v4.0a.
  2. A total of 166,161 common nsSNV candidates were extracted from dbNSFP v4.0a (based on dbSNP build 151) by requiring that at least one of the minor allele frequencies (MAFs) 1000Gp3_AF, UK10K_AF, ExAC_AF, gnomAD_exomes_AF and gnomAD_genomes_AF be > 0.001. Removing the candidates that were included in HGMD or annotated in dbNSFP v4.0a with “clinvar_clnsig = Pathogenic or Likely_pathogenic” left 162,918 common nsSNVs.
  3. Each nsSNV was classified into one of 150 different nonsynonymous AA substitutions. The pathogenic nsSNVs were sorted in ascending order of MAF, and the common nsSNVs were sorted in descending order of MAF. The same numbers of pathogenic and common nsSNVs were extracted for each AA substitution for random forest (RF) modeling.
  4. Among the 37 tools in dbNSFP v4.0a, 3 tools had very low nsSNV coverage in either the pathogenic or the common nsSNVs. The rank scores of the remaining 34 tools were therefore used as feature values (Table 1). nsSNVs lacking one or more of the 34 rank scores in dbNSFP v4.0a were excluded from RF modeling. Pathogenic and common nsSNVs were then discriminated using the machine learning library scikit-learn on Python 3.7, yielding a total of 150 RF models (Figure 1).
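
Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the InMeRF implementation: variants are plain dictionaries with hypothetical keys `aa_sub` and `label` (1 = pathogenic, 0 = common); the MAF used for sorting is taken to be the largest of the five frequency columns; the number extracted per class is taken to be the size of the smaller class; incomplete variants are dropped before balancing; and the random forest uses scikit-learn defaults, since the source does not state hyperparameters. Only two rank-score columns are used for brevity, and their names should be checked against the dbNSFP release in use.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Allele-frequency columns named in step 2.
MAF_COLUMNS = ["1000Gp3_AF", "UK10K_AF", "ExAC_AF",
               "gnomAD_exomes_AF", "gnomAD_genomes_AF"]
MAF_CUTOFF = 0.001
# Two illustrative rank-score columns; InMeRF uses 34 (Table 1).
FEATURES = ["CADD_raw_rankscore", "REVEL_rankscore"]


def max_maf(variant):
    """Largest available frequency; missing values count as 0 (assumption)."""
    return max((variant.get(col) or 0.0) for col in MAF_COLUMNS)


def is_common(variant):
    """Step 2: at least one frequency above the cutoff."""
    return max_maf(variant) > MAF_CUTOFF


def balance(pathogenic, common):
    """Step 3: equal class sizes, rarest pathogenic and most frequent common first."""
    n = min(len(pathogenic), len(common))  # assumption: size of the smaller class
    return (sorted(pathogenic, key=max_maf)[:n],
            sorted(common, key=max_maf, reverse=True)[:n])


def train_models(variants, features=FEATURES, seed=0):
    """Step 4: fit one random forest per AA substitution."""
    groups = {}  # aa_sub -> ([common], [pathogenic]), indexed by label
    for v in variants:
        if any(v.get(f) is None for f in features):
            continue  # lacks a rank score
        if v["label"] == 0 and not is_common(v):
            continue  # not frequent enough to count as common
        groups.setdefault(v["aa_sub"], ([], []))[v["label"]].append(v)

    models = {}
    for aa_sub, (common, pathogenic) in groups.items():
        pathogenic, common = balance(pathogenic, common)
        rows = pathogenic + common
        if not rows:
            continue  # one class is empty, nothing to discriminate
        X = np.array([[v[f] for f in features] for v in rows], dtype=float)
        y = np.array([v["label"] for v in rows])
        models[aa_sub] = RandomForestClassifier(random_state=seed).fit(X, y)
    return models


def toy(aa_sub, label, maf, cadd, revel):
    """Build a made-up variant record for demonstration only."""
    return {"aa_sub": aa_sub, "label": label, "gnomAD_exomes_AF": maf,
            "CADD_raw_rankscore": cadd, "REVEL_rankscore": revel}


toy_variants = [
    toy("R>H", 1, 0.0, 0.95, 0.90),
    toy("R>H", 1, 0.00001, 0.90, 0.85),
    toy("R>H", 1, 0.0005, 0.70, 0.60),
    toy("R>H", 0, 0.20, 0.10, 0.05),
    toy("R>H", 0, 0.01, 0.30, 0.20),
    toy("R>H", 0, 0.0005, 0.40, 0.30),  # below the MAF cutoff, dropped
    toy("A>V", 1, 0.0, 0.80, None),     # missing rank score, dropped
    toy("A>V", 1, 0.0, 0.85, 0.80),
    toy("A>V", 0, 0.05, 0.20, 0.10),
]
models = train_models(toy_variants)
# Probability that a new R>H variant is pathogenic, from the R>H model only.
score = models["R>H"].predict_proba([[0.9, 0.9]])[0, 1]
```

A new variant is scored by the model for its own AA substitution, which is what makes the 150 models independent of one another.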

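Table 1. 37 tools in dbNSFP v4.0a and their nsSNV coverages in all, pathogenic and common nsSNVs. O: used as a feature for the RF models; X: not used.

| Tool | Type | Rate (%) in all nsSNVs (n = 77,195,651) | Rate (%) in pathogenic nsSNVs (n = 72,556) | Rate (%) in common nsSNVs (n = 162,918) | Feature values used for RF models |
| --- | --- | --- | --- | --- | --- |
| SIFT | prediction | 92.65 | 97.31 | 89.62 | O |
| SIFT4G | prediction | 95.63 | 98.12 | 93.46 | O |
| Polyphen2_HDIV | prediction | 87.13 | 92.14 | 80.88 | O |
| Polyphen2_HVAR | prediction | 87.13 | 92.14 | 80.88 | O |
| LRT | prediction | 82.39 | 93.93 | 72.45 | O |
| MutationTaster | prediction | 96.89 | 99.94 | 95.72 | O |
| MutationAssessor | prediction | 82.45 | 89.32 | 76.07 | O |
| FATHMM | prediction | 88.83 | 98.27 | 87.35 | O |
| PROVEAN | prediction | 93.15 | 98.29 | 90.39 | O |
| VEST4 | prediction | 97.31 | 99.35 | 95.72 | O |
| MetaSVM | prediction | 95.82 | 99.40 | 94.08 | O |
| MetaLR | prediction | 95.82 | 99.40 | 94.08 | O |
| M-CAP | prediction | 95.90 | 97.39 | 37.24 | X |
| REVEL | prediction | 95.82 | 99.40 | 94.08 | O |
| MutPred | prediction | 90.22 | 81.09 | 6.21 | X |
| MVP | prediction | 97.80 | 99.12 | 73.85 | O |
| MPC | prediction | 83.00 | 91.76 | 75.79 | O |
| PrimateAI | prediction | 89.88 | 96.72 | 85.13 | O |
| DEOGEN2 | prediction | 91.13 | 94.52 | 86.73 | O |
| CADD | prediction | 99.97 | 100.00 | 100.00 | O |
| DANN | prediction | 99.41 | 100.00 | 100.00 | O |
| fathmm-MKL | prediction | 99.41 | 100.00 | 100.00 | O |
| fathmm-XF | prediction | 92.62 | 86.76 | 92.20 | O |
| Eigen | prediction | 92.49 | 87.62 | 92.02 | O |
| Eigen-PC | prediction | 92.49 | 87.62 | 92.02 | O |
| GenoCanyon | prediction | 99.41 | 100.00 | 100.00 | O |
| integrated_fitCons | prediction | 95.45 | 87.68 | 97.44 | O |
| LINSIGHT | prediction | 2.04 | 0.07 | 3.52 | X |
| GERP++ | conservation | 98.95 | 99.98 | 98.51 | O |
| phyloP100way_vertebrate | conservation | 99.99 | 100.00 | 99.97 | O |
| phyloP30way_mammalian | conservation | 99.96 | 100.00 | 99.94 | O |
| phyloP17way_primate | conservation | 99.92 | 100.00 | 99.90 | O |
| phastCons100way_vertebrate | conservation | 99.99 | 100.00 | 99.97 | O |
| phastCons30way_mammalian | conservation | 99.96 | 100.00 | 99.94 | O |
| phastCons17way_primate | conservation | 99.92 | 100.00 | 99.90 | O |
| SiPhy | conservation | 97.98 | 99.88 | 97.09 | O |
| bStatistic | conservation | 98.24 | 98.93 | 98.02 | O |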
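
The text describes the three excluded tools only as having "very low" coverage and gives no numeric cutoff. Judging from the values in Table 1, any cutoff above 37.24% and no higher than 72.45% on the lower of the pathogenic and common coverages would reproduce the same three exclusions. The sketch below uses a hypothetical cutoff of 50% on a handful of rows copied from Table 1; the cutoff and the function name are illustrative assumptions, not part of InMeRF.

```python
# (tool, % coverage in pathogenic nsSNVs, % coverage in common nsSNVs),
# a subset of rows copied from Table 1.
COVERAGE = [
    ("LRT", 93.93, 72.45),
    ("M-CAP", 97.39, 37.24),
    ("MutPred", 81.09, 6.21),
    ("MVP", 99.12, 73.85),
    ("CADD", 100.00, 100.00),
    ("LINSIGHT", 0.07, 3.52),
]
MIN_COVERAGE = 50.0  # hypothetical cutoff; the source does not state one


def split_by_coverage(rows, cutoff=MIN_COVERAGE):
    """Separate tools whose coverage falls below the cutoff in either set."""
    kept, dropped = [], []
    for tool, pathogenic, common in rows:
        target = dropped if min(pathogenic, common) < cutoff else kept
        target.append(tool)
    return kept, dropped


kept, dropped = split_by_coverage(COVERAGE)
print(dropped)  # ['M-CAP', 'MutPred', 'LINSIGHT']
```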



Figure 1. Overview of strategies for InMeRF and InMeRF-CADD. For InMeRF-CADD, the pathogenic and common nsSNVs were taken from CADD instead of HGMD and dbSNP, so that InMeRF could be compared with other tools under the same conditions.