HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) contains transcription factor (TF) binding motifs represented as classic Position Weight Matrices (PWMs, also known as Position-Specific Scoring Matrices, PSSMs).
The conversion from position count matrices (PCMs) to PWMs used in HOCOMOCO follows that of MACRO-APE; see the MACRO-APE manual, pages 20–21. A uniform background was used for this conversion and for estimating the downloadable threshold-to-P-value tables.
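As an illustration of what a PCM to PWM conversion with a uniform background can look like, here is a minimal Python sketch. It assumes the common log-odds form with a pseudocount, ln((count + a*q) / ((N + a)*q)), where N is the total count in the column and q = 0.25. The function name `pcm_to_pwm` and the default pseudocount a = ln(max(N, 2)) are assumptions made for this example; the MACRO-APE manual remains the authoritative description of the scheme.

```python
import math

def pcm_to_pwm(pcm, background=(0.25, 0.25, 0.25, 0.25), pseudocount=None):
    """Convert a position count matrix to a log-odds PWM.

    pcm: list of positions, each a list of four counts in A, C, G, T order.
    background: nucleotide probabilities (uniform by default).
    pseudocount: fixed pseudocount; if None, ln(max(N, 2)) is used per
        position (an assumption of this sketch, not taken from the text).
    """
    pwm = []
    for counts in pcm:
        total = sum(counts)
        a = math.log(max(total, 2)) if pseudocount is None else pseudocount
        # Log-odds of the smoothed frequency against the background.
        row = [math.log((x + a * q) / ((total + a) * q))
               for x, q in zip(counts, background)]
        pwm.append(row)
    return pwm

# Toy two-position PCM (hypothetical counts, not a HOCOMOCO motif).
toy_pcm = [[10, 0, 0, 0], [5, 5, 5, 5]]
toy_pwm = pcm_to_pwm(toy_pcm)
```

With a uniform background, a position with equal counts for all four nucleotides gets a score of zero for every letter, while an over-represented nucleotide gets a positive score and an absent one a negative score.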
HOCOMOCO motifs were constructed with ChIPMunk by systematic motif discovery from thousands of ChIP-Seq and HT-SELEX datasets. Please refer to the HOCOMOCO v12 paper for more details on the motif discovery procedure.
[Motif finding; Sequence scanning]
HOCOMOCO provides PWMs accompanied by precomputed score thresholds. The thresholds and P-values for HOCOMOCO v12 motifs are estimated against uniform background probabilities. To interactively visualize predicted TF binding sites (TFBS) in a small set of sequences, we provide MoLoTool. For large-scale analysis, we suggest using command-line tools such as our SPRY-SARUS or MEME's FIMO.
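To make the idea of threshold-based scanning concrete, the following Python sketch scores every window of a sequence on both strands and keeps the windows that reach a given threshold. It is an illustration only and does not reproduce the behavior or output format of SPRY-SARUS, FIMO, or MoLoTool. The function names, the toy matrix, and the threshold are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def score_window(pwm, window):
    """Sum the PWM scores of the letters in a window of motif width."""
    return sum(row[INDEX[base]] for row, base in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Return (start, strand, score) for every hit at or above threshold.

    start is the 0-based position of the window on the forward strand.
    Windows containing letters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in INDEX for base in window):
            continue
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = score_window(pwm, site)
            if score >= threshold:
                hits.append((start, strand, score))
    return hits

# Toy width-2 PWM that prefers "AC" (hypothetical, not a HOCOMOCO motif).
toy_pwm = [[1, -1, -1, -1], [-1, 1, -1, -1]]
toy_hits = scan(toy_pwm, "TTACGT", threshold=2)
```

In practice the threshold would be taken from the precomputed threshold-to-P-value table of the motif, so that each reported hit corresponds to a chosen P-value under the uniform background.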
[Motif benchmarking; Performance metrics]
To assemble the motif collection of HOCOMOCO v12, we used multiple benchmarking protocols that evaluate motif performance for TFBS recognition in genomic regions (in vivo data, ChIP-Seq), in artificial oligonucleotides (in vitro data, HT-SELEX), and for predicting regulatory single-nucleotide variants and polymorphisms (rSNPs). Please refer to the HOCOMOCO v12 paper for more details on the benchmarking protocols and the resulting performance metrics.
Each model in the collection has a quality rating from A to D, where A represents motifs with the highest confidence. A-quality motifs and subtypes were found in both HT-SELEX and ChIP-Seq, B-quality motifs were found in at least two different experiments of the same type, and C-quality motifs passed expert curation but were found in a single experiment. In the core collection, D quality marks subtypes that include only motifs inherited from HOCOMOCO v11; in v12 there are only a few such cases. In the sub-collections, D quality denotes all motifs not tested in the respective benchmark (ChIP-Seq for v12-invivo, HT-SELEX for v12-invitro, rSNP for v12-rsnp).
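The A to D rules for the core collection can be read as a simple decision procedure. The Python sketch below encodes that reading for illustration. It is not the procedure used to build HOCOMOCO: expert curation is not modeled, and the function name, the input format, and the data type labels are assumptions of this example.

```python
def core_quality(experiments, inherited_from_v11=False):
    """Assign a core-collection quality letter from supporting experiments.

    experiments: iterable of (experiment_id, data_type) pairs, where
        data_type is "ChIP-Seq" or "HT-SELEX" (labels assumed here).
    inherited_from_v11: True if the subtype contains only motifs carried
        over from HOCOMOCO v11.
    """
    if inherited_from_v11:
        return "D"
    ids_by_type = {}
    for experiment_id, data_type in experiments:
        ids_by_type.setdefault(data_type, set()).add(experiment_id)
    if not ids_by_type:
        raise ValueError("at least one supporting experiment is required")
    # A: support from both data types.
    if "ChIP-Seq" in ids_by_type and "HT-SELEX" in ids_by_type:
        return "A"
    # B: at least two different experiments of the same type.
    if any(len(ids) >= 2 for ids in ids_by_type.values()):
        return "B"
    # C: a single experiment (assumed to have passed expert curation).
    return "C"
```

For example, a motif supported by one ChIP-Seq and one HT-SELEX experiment would be rated A under this reading, while a motif supported by two ChIP-Seq experiments only would be rated B.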
Since v11, the alternative binding motifs of a particular TF are ranked from 0 (the primary model) to 1, 2, ... (the alternative motifs). The rank-0 motifs are the most 'general' variants, with the best performance across the available data in the benchmark (see the HOCOMOCO v12 paper for details).
HOCOMOCO v12 used two data types for motif discovery: ChIP-Seq and HT-SELEX. The latter came in two variants: traditional HT-SELEX and methyl-HT-SELEX with mCpGs. Additionally, in benchmarking, we used information on differential transcription factor binding to single-nucleotide variants obtained in SNP-SELEX and identified from ChIP-Seq (the allele-specific binding, see ADASTRA).