Data:
  • Complete release data including the original motifs are available at Zenodo.
  • Harmonized list of human transcription factors and respective mouse orthologs based on the TFClass classification: tf_masterlist.tsv.
Tools:
  • MoLoTool - web interface for motif finding.
  • SPRY-SARUS tool for motif finding (Java): jar, readme
  • MACRO-APE tool for motif comparison, P-value and threshold estimation: jar, manual, website
  • PERFECTOS-APE tool for functional annotation of sequence variants overlapping TFBS: jar, manual, website
Contacts
Citation:
Ivan V. Kulakovskiy; Ilya E. Vorontsov; Ivan S. Yevshin; Ruslan N. Sharipov; Alla D. Fedorova; Eugene I. Rumynskiy; Yulia A. Medvedeva; Arturo Magana-Mora; Vladimir B. Bajic; Dmitry A. Papatsenko; Fedor A. Kolpakov; Vsevolod J. Makeev
Nucl. Acids Res., Database issue, gkx1106 (11 November 2017)
doi: 10.1093/nar/gkx1106
License: The HOCOMOCO motif collection is distributed under the WTFPL. If you prefer a more standard license, feel free to treat WTFPL as CC-BY.

Many practical motif applications require a set of motifs with reduced redundancy, i.e., one where similar motifs belonging to related transcription factors are grouped together and a single matrix represents each group. To this end, we have created the non-redundant set of HOCOMOCO v12 motifs, a derivative of the HOCOMOCO v12 CORE collection.

To build this set, we estimated the motif similarities with MacroAPE (see opera.autosome.org/macroape and doi:10.1186/1748-7188-8-23) at a motif P-value cutoff of 0.0005 and the default matrix discretization of 1. Whenever the similarity estimate obtained with the default discretization exceeded 0.01, the discretization was raised to 10 to improve precision.
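The two-pass rule described above can be sketched as follows. This is only an illustration of the control flow: `estimate` is a hypothetical callable standing in for a MacroAPE similarity run (it is not MacroAPE's actual interface), and the motif names and similarity values in the stub are made up.

```python
COARSE_DISCRETIZATION = 1   # default discretization from the text
FINE_DISCRETIZATION = 10    # finer discretization used for the second pass
REFINE_ABOVE = 0.01         # coarse similarities above this are recomputed


def two_pass_similarity(estimate, motif_a, motif_b):
    """Estimate similarity coarsely; recompute it finely only if it is non-negligible.

    `estimate(motif_a, motif_b, discretization)` is a hypothetical stand-in
    for whatever actually computes the similarity of two motifs.
    """
    coarse = estimate(motif_a, motif_b, COARSE_DISCRETIZATION)
    if coarse > REFINE_ABOVE:
        return estimate(motif_a, motif_b, FINE_DISCRETIZATION)
    return coarse


# Stub estimator with made-up values, recording every call it receives.
calls = []


def fake_estimate(motif_a, motif_b, discretization):
    calls.append((motif_a, motif_b, discretization))
    table = {
        ("m1", "m2", 1): 0.30,
        ("m1", "m2", 10): 0.27,
        ("m1", "m3", 1): 0.004,
    }
    return table[(motif_a, motif_b, discretization)]


close_pair = two_pass_similarity(fake_estimate, "m1", "m2")    # refined at discretization 10
distant_pair = two_pass_similarity(fake_estimate, "m1", "m3")  # coarse estimate kept
```

Only pairs that look similar in the coarse pass pay for the more expensive fine pass, which is presumably why the refinement is conditional.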

Using the pairwise motif similarity matrix, we performed hierarchical clustering with scikit-learn's agglomerative clustering ('average' linkage). The number of clusters was chosen to maximize the silhouette score, which gave 523 clusters at a silhouette score of 0.16.
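A minimal sketch of this step on a toy matrix is shown below. It assumes that similarities lie in [0, 1] and are turned into distances as 1 - similarity; the text does not state the conversion that was actually used, and the function name, the toy matrix and the range of candidate cluster counts are illustrative only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def cluster_by_silhouette(similarity, k_values):
    """Return (best_k, best_score, labels) for average-linkage clustering."""
    # Assumption: distance = 1 - similarity, with a zero diagonal.
    distance = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)

    best = None
    for k in k_values:
        try:
            # scikit-learn >= 1.2 names the parameter `metric`.
            model = AgglomerativeClustering(
                n_clusters=k, metric="precomputed", linkage="average"
            )
        except TypeError:
            # Older releases call the same parameter `affinity`.
            model = AgglomerativeClustering(
                n_clusters=k, affinity="precomputed", linkage="average"
            )
        labels = model.fit_predict(distance)
        score = silhouette_score(distance, labels, metric="precomputed")
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


# Toy data: six motifs forming two tight groups of three.
similarity = np.full((6, 6), 0.1)
similarity[:3, :3] = 0.9
similarity[3:, 3:] = 0.9
np.fill_diagonal(similarity, 1.0)

best_k, best_score, labels = cluster_by_silhouette(similarity, range(2, 5))
```

On the toy matrix the scan selects two clusters. For a full collection the candidate range would have to cover far larger values of k.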

For each cluster, a single representative motif was chosen: the one with the highest average similarity to the other motifs in the cluster. The annotation contains the list of motifs that constitute each cluster and the list of the respective TFs (UniProt IDs).
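The selection rule can be illustrated with a short pure-Python sketch. The motif names, labels and similarity values are hypothetical. The handling of singleton clusters (the motif represents itself) and of ties (the first member wins) are assumptions not specified in the text.

```python
def pick_representatives(similarity, labels, names):
    """Map each cluster label to the member with the highest mean similarity
    to the other members of the same cluster."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    representatives = {}
    for label, members in clusters.items():

        def mean_similarity(i):
            others = [j for j in members if j != i]
            if not others:
                # Assumption: a singleton cluster is represented by its only motif.
                return 0.0
            return sum(similarity[i][j] for j in others) / len(others)

        # max() keeps the first member on ties (an assumption, see lead-in).
        representatives[label] = names[max(members, key=mean_similarity)]
    return representatives


# Hypothetical example: three motifs in cluster 0, one motif in cluster 1.
names = ["motif_A", "motif_B", "motif_C", "motif_D"]
labels = [0, 0, 0, 1]
similarity = [
    [1.00, 0.80, 0.20, 0.05],
    [0.80, 1.00, 0.70, 0.05],
    [0.20, 0.70, 1.00, 0.05],
    [0.05, 0.05, 0.05, 1.00],
]

representatives = pick_representatives(similarity, labels, names)
```

Here `motif_B` represents cluster 0 because its mean similarity to the other two members (0.75) exceeds that of `motif_A` (0.50) and `motif_C` (0.45).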


HOCOMOCO v12 subcollections

Subcollection | H12CORE | H12INVIVO | H12INVITRO | H12RSNP
Number of motifs | 1443 (MOUSE subset: 1161) | 1443 (MOUSE subset: 1161) | 1427 (MOUSE subset: 1145) | 1443 (MOUSE subset: 1161)
Complete model annotation (including gene id mapping), all motifs | H12CORE_annotation.jsonl | H12INVIVO_annotation.jsonl | H12INVITRO_annotation.jsonl | H12RSNP_annotation.jsonl
Complete model annotation (including gene id mapping), MOUSE subset | H12CORE-MOUSE_annotation.jsonl | H12INVIVO-MOUSE_annotation.jsonl | H12INVITRO-MOUSE_annotation.jsonl | H12RSNP-MOUSE_annotation.jsonl
PWM, one file per matrix | H12CORE_pwm.tar.gz | H12INVIVO_pwm.tar.gz | H12INVITRO_pwm.tar.gz | H12RSNP_pwm.tar.gz
PWM, flat file | H12CORE_pwms.txt | H12INVIVO_pwms.txt | H12INVITRO_pwms.txt | H12RSNP_pwms.txt
PCM, one file per matrix | H12CORE_pcm.tar.gz | H12INVIVO_pcm.tar.gz | H12INVITRO_pcm.tar.gz | H12RSNP_pcm.tar.gz
PCM, flat file | H12CORE_pcms.txt | H12INVIVO_pcms.txt | H12INVITRO_pcms.txt | H12RSNP_pcms.txt
PFM, one file per matrix | H12CORE_pfm.tar.gz | H12INVIVO_pfm.tar.gz | H12INVITRO_pfm.tar.gz | H12RSNP_pfm.tar.gz
PFM, flat file | H12CORE_pfms.txt | H12INVIVO_pfms.txt | H12INVITRO_pfms.txt | H12RSNP_pfms.txt
Alignments | H12CORE_words.tar.gz | H12INVIVO_words.tar.gz | H12INVITRO_words.tar.gz | H12RSNP_words.tar.gz
Threshold to P-value map | H12CORE_thresholds.tar.gz | H12INVIVO_thresholds.tar.gz | H12INVITRO_thresholds.tar.gz | H12RSNP_thresholds.tar.gz
Matrices in other formats, JASPAR | H12CORE_jaspar_format.txt | H12INVIVO_jaspar_format.txt | H12INVITRO_jaspar_format.txt | H12RSNP_jaspar_format.txt
Matrices in other formats, MEME | H12CORE_meme_format.meme | H12INVIVO_meme_format.meme | H12INVITRO_meme_format.meme | H12RSNP_meme_format.meme
Matrices in other formats, TRANSFAC | H12CORE_transfac_format.txt | H12INVIVO_transfac_format.txt | H12INVITRO_transfac_format.txt | H12RSNP_transfac_format.txt
HOMER