

1. HMCan-diff Algorithm

HMCan-diff is a method specifically designed to detect changes in histone modifications between ChIP-seq cancer samples, or between a cancer sample and a normal control. It explicitly corrects for copy number bias as well as for other ChIP-seq technical biases, such as GC-content and mappability biases and variable signal-to-noise levels across samples. HMCan-diff uses a three-state hidden Markov model to detect regions of differential histone modification.

Schematic workflow of HMCan-diff:

[Figure: HMCan-diff workflow]

2. Installation

2.1. Download

HMCan-diff and all files necessary to run it are available on Bitbucket. It can be downloaded using git with the following command:
$ git clone https://pyminer@bitbucket.org/pyminer/hmcan-diff.git

2.2. System Requirements

HMCan-diff runs on Linux and Mac OS systems. HMCan-diff requires:
  • GCC, to compile HMCan-diff.
  • The GNU Scientific Library (GSL). See https://www.gnu.org/software/gsl/ for installation instructions.
  • Samtools, available on the system path (optional; needed only if BAM files are used).
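
The requirements above can be checked before compiling. The following is a minimal sketch, not part of HMCan-diff: check_tool is a hypothetical helper, and the use of gsl-config (a script normally installed with the GSL development files) as evidence that GSL is present is an assumption about your installation.

```shell
# check_tool NAME
# Reports whether a command is available on the PATH.
# Hypothetical helper for illustration; not shipped with HMCan-diff.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# gcc and make are needed to compile; samtools only when the input is BAM.
check_tool gcc || true
check_tool make || true
check_tool samtools || true

# gsl-config prints the installed GSL version; if it is missing, the GSL
# development files are probably not installed.
if check_tool gsl-config; then
    gsl-config --version
fi
```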

2.3. Compilation

In order to compile HMCan-diff use the following commands:
  • $ cd hmcan-diff/src/
  • $ make
The compiled binary will be created in the hmcan-diff/src/ folder.

3. External Files

HMCan-diff requires three types of external files: FASTA files of the reference genome, a GC content profile, and a blacklist file.

3.1. Reference genome FASTA files

Reference genome FASTA files are used to construct fragment-level GC content profiles for the GC content bias normalization step. The files should be placed in a single folder, with each chromosome sequence in a separate file. File names should be identical to the chromosome names used in the alignment files, followed by the “.fa” suffix (e.g., chr1.fa, chr2.fa, etc.).
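
If the reference genome is available only as a single multi-sequence FASTA file, it can be split into the one-file-per-chromosome layout described above. The following is a minimal sketch: split_fasta is a hypothetical helper, and it assumes that the first word of each FASTA header (e.g. "chr1" in ">chr1 ...") is the chromosome name used in the alignment files.

```shell
# split_fasta GENOME.fa OUTDIR
# Writes each sequence of a multi-FASTA file to OUTDIR/<name>.fa, where
# <name> is the first word of the header line (">chr1 ..." -> chr1.fa).
# Hypothetical helper for illustration; not shipped with HMCan-diff.
split_fasta() {
    mkdir -p "$2"
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)      # keep only one file open at a time
            out = dir "/" substr($1, 2) ".fa"
        }
        out != "" { print > out }          # skip anything before the first header
    ' "$1"
}

# Example call (both paths are placeholders):
#   split_fasta genome.fa reference/
```

The resulting directory can then be passed to --genomePath.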

3.2. GC content profile file

This file contains GC content and mappability scores for large regions of DNA. This information is used to calculate the copy number profile of ChIP-seq data. We provide precalculated "GC_profile_100KbWindow_Mapp76_hg19.cnp" and "GC_profile_mm9.cnp" in the data directory. The provided files use a 100 kbp window size, so set the --largeBin option in HMCan-diff to 100000 when using them. If a different window size is needed, run the GCCount tool (http://www.cbrc.kaust.edu.sa/hmcan/GCCount.tar.gz) to construct a different GC content profile.

3.3. Blacklist file of genomic regions to mask

This file contains regions that will be excluded from HMCan-diff calculations. Usually these are repetitive regions and regions with low mappability. We provide “hg19_blacklist.bed” in the data directory. This file was created by the ENCODE consortium. The blacklist file is in BED format with three columns: chromosome, start, and end of the region.
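
When preparing a custom blacklist, it can help to confirm that every line follows the three-column layout described above. The following is a minimal sketch: check_bed3 is a hypothetical helper, and the additional requirement that start be smaller than end is an assumption based on the usual BED convention rather than on HMCan-diff's own parser.

```shell
# check_bed3 FILE
# Prints every line that does not have exactly three whitespace-separated
# columns with non-negative integer coordinates and start < end.
# Returns non-zero if at least one such line is found.
# Hypothetical helper for illustration; not shipped with HMCan-diff.
check_bed3() {
    awk '
        NF != 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example call (the path is a placeholder):
#   check_bed3 my_blacklist.bed && echo "blacklist looks well-formed"
```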

4. Required input

HMCan-diff expects aligned ChIP-seq reads in one of the following formats: BAM, SAM, or BED.
To work with the BAM format, Samtools must be installed and available on the system path.

5. Output files

HMCan-diff produces four types of output files:
  • Peaks: a BED file containing the coordinates of the differential peaks between the two conditions for a given mark.
  • Regions: a BED file containing the coordinates of the differential regions between the two conditions for a given mark, obtained by merging nearby differential peaks (see --mergeDistance).
  • Density: WIG files containing the normalized density of each data bin for each sample.
  • Posterior Probability: WIG files containing, for each state, the posterior probability of each bin given its density value.
The "Regions" file is useful when analyzing ChIP-seq data for marks with broad peaks, such as H3K27me3 and H3K36me3.

Example of HMCan-diff "peaks" and "regions" files:

Chromosome  Start      End        Name       Score      Strand  Differential state  log2(density fold change)
chr1        219297349  219301500  peak16239  54.009502  .       condition2          -1.762861
chr1        219406199  219410700  peak16246  56.637878  .       condition2          -1.665246
chr1        219515549  219520000  peak16251  55.166691  .       condition2          -2.050705
chr1        235478399  235484650  peak16886  42.370319  .       condition1          1.425381
chr1        235591299  235597300  peak16888  39.039383  .       condition1          1.344224
chr1        235704249  235710450  peak16896  70.569687  .       condition1          1.411959
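
The "peaks" and "regions" files can be filtered with standard command-line tools. The following is a minimal sketch: peaks_for_condition is a hypothetical helper, and it assumes whitespace-separated columns in the order shown in the example above (score in column 5, differential state in column 7).

```shell
# peaks_for_condition FILE LABEL [MIN_SCORE]
# Prints the rows whose differential state (column 7) equals LABEL and
# whose score (column 5) is at least MIN_SCORE (default 0), sorted by
# descending score.
# Hypothetical helper for illustration; not shipped with HMCan-diff.
peaks_for_condition() {
    awk -v label="$2" -v min="${3:-0}" \
        '$7 == label && $5 + 0 >= min + 0' "$1" |
        sort -k5,5nr
}

# Example call (the file name is a placeholder):
#   peaks_for_condition my_run_peaks.bed condition1 40
```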

The second type of file that HMCan-diff outputs is a fixed-step Wiggle (WIG) file containing the normalized density for each ChIP-seq replicate. Optionally, HMCan-diff can also produce three more WIG files containing the posterior probability of each state.

6. Test example

After you have successfully downloaded and compiled HMCan-diff, it is time to run a test example showing the utility of the HMCan-diff tool.
  1. Download the test files from http://www.cbrc.kaust.edu.sa/hmcan/hmcan-diff_example.tar.gz into the hmcan-diff directory.

  2. Extract example files using the following command:
    tar -xf hmcan-diff_example.tar.gz

  3. After extracting files to the test directory, run HMCan-diff using the following command:

    src/HMCan-diff --name hmcan-diff_example \
        --C1_ChIP hmcan-diff_example/C1_files.txt \
        --C2_ChIP hmcan-diff_example/C2_files.txt \
        --C1_Control hmcan-diff_example/C1_control.txt \
        --C2_Control hmcan-diff_example/C2_control.txt \
        --format SAM \
        --genomePath hmcan-diff_example/reference/ \
        --GCIndex data/GC_profile_100KbWindow_Mapp76_hg19.cnp \
        --C1_minLength 145 --C1_medLength 150 --C1_maxLength 155 \
        --C2_minLength 145 --C2_medLength 150 --C2_maxLength 155 \
        --blackListFile data/hg19-blacklist.bed

    Running the above command produces two files: hmcan-diff_example_peaks.bed and hmcan-diff_example_regions.bed
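
After the run, a quick check that both BED files were written can catch a failed or interrupted run. The following is a minimal sketch: check_outputs is a hypothetical helper, and it assumes that the output names are the --name prefix followed by "_peaks.bed" and "_regions.bed".

```shell
# check_outputs PREFIX
# Reports the line count of PREFIX_peaks.bed and PREFIX_regions.bed.
# Returns non-zero if either file is missing or empty.
# Hypothetical helper for illustration; not shipped with HMCan-diff.
check_outputs() {
    status=0
    for suffix in peaks regions; do
        f="${1}_${suffix}.bed"
        if [ -s "$f" ]; then
            echo "$f: $(wc -l < "$f" | tr -d ' ') lines"
        else
            echo "$f: missing or empty"
            status=1
        fi
    done
    return "$status"
}

# Example call, using the --name value from the test command:
#   check_outputs hmcan-diff_example
```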

7. HMCan-diff parameters description

--help Shows this help message and exits.
--version Shows program's version number and exits.
--name This option passes to HMCan-diff the prefix string for all output files.
--C1_label This option passes to HMCan-diff the label for condition 1. This label will appear as the name of the condition in the BED files and will also be included in the WIG file names for ChIP-seq signal.
--C2_label This option passes to HMCan-diff the label for condition 2. This label will appear as the name of condition in the BED files, also it will be included in the WIG file names for ChIP-seq signal.
--C1_ChIP This option passes the path for the files containing ChIP-seq replicates for condition 1 data.
--C2_ChIP This option passes the path for the files containing ChIP-seq replicates for condition 2 data.
--C1_Control This option passes the path for the files containing Input DNA files for condition 1 data.
--C2_Control This option passes the path for the files containing Input DNA files for condition 2 data.
--format This option passes the format of the alignment files for any run of HMCan-diff. HMCan-diff accepts BED, SAM, and BAM formats.
--genomePath This option passes the path for the directory containing chromosome files for the reference genome used for any HMCan-diff run. Each chromosome should be in a separate file with the .fa suffix.
--GCProfile This option passes the path for the GC content profile file used in the copy number estimation step.
--C1_minLength This option passes the minimum fragment length for condition 1 data.
--C1_medLength This option passes the median fragment length for condition 1 data.
--C1_maxLength This option passes the maximum fragment length for condition 1 data.
--C2_minLength This option passes the minimum fragment length for condition 2 data.
--C2_medLength This option passes the median fragment length for condition 2 data.
--C2_maxLength This option passes the maximum fragment length for condition 2 data.
--StepSize This option passes to HMCan-diff the size of the density step, which HMCan-diff uses to sample from the density profile. Lower values will increase profile resolution, while increasing execution time and memory consumption. Recommended range is [1, 1/2 of the minimal fragment length of DNA fragments in the library]. Default: 50
--largeBin This option passes the size of the large windows HMCan-diff uses to estimate copy number bias. This option value should match the values in the GC profile file. Default: 100000
--negativeBinomial This option enables using the negative binomial distribution to identify enriched regions. Default: use local Poisson distribution
--pvalueThreshold This option passes the P-value threshold for the one-sided Poisson exact test used to decide whether a bin is enriched. Default: 0.01
--mergeDistance This option passes the distance threshold used to merge nearby differential peaks into a single region. Default: 1000
--iterationThreshold This option passes the threshold for the differential peak score to consider peaks in the training phase of HMCan-diff. Peaks with scores less than this threshold will not be considered as signal in that iteration.
--finalThreshold This option passes the threshold for the differential peak score to be reported as an output. Default: 0
--maxIter This option passes the maximum number of iterations that HMCan-diff will run. Default: 20
--PosteriorProb This option passes the threshold for posterior probability to consider a density value to belong to differential state or background. Default: 0.7
--PrintWig This option enables HMCan-diff to report the WIG files for normalized density for each replicate.
--printPosterior This option enables HMCan-diff to report the WIG files for posterior probabilities for each state.
--blackListFile This option passes the path for file containing regions that should be excluded by HMCan-diff.
--fold_change This option passes the value for the fold change to consider whether a density value is differential or not. Default: 2
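
To illustrate what a merge distance such as --mergeDistance means, the following sketch merges intervals on the same chromosome that are separated by at most a given number of base pairs. It is not HMCan-diff's own implementation, which may apply further criteria (for example, the differential state of each peak): merge_within is a hypothetical helper that works on any three-column BED-like file.

```shell
# merge_within FILE DISTANCE
# Sorts intervals by chromosome and start, then merges consecutive
# intervals on the same chromosome whose gap is at most DISTANCE bp.
# Prints tab-separated chromosome, start, end.
# Hypothetical helper for illustration; not shipped with HMCan-diff.
merge_within() {
    sort -k1,1 -k2,2n "$1" |
    awk -v d="$2" '
        BEGIN { OFS = "\t" }
        NR > 1 && $1 == chrom && $2 - hi <= d {
            if ($3 + 0 > hi) hi = $3 + 0       # extend the current region
            next
        }
        NR > 1 { print chrom, lo, hi }         # gap too large: emit region
        { chrom = $1; lo = $2 + 0; hi = $3 + 0 }
        END { if (NR > 0) print chrom, lo, hi }
    '
}

# Example call with the default distance (the path is a placeholder):
#   merge_within my_peaks.bed 1000
```

With a distance of 1000, peaks at chr1:100-200 and chr1:900-1000 are reported as a single region chr1:100-1000, because the 700 bp gap between them is below the threshold.
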
King Abdullah University of Science and Technology / Computational Bioscience Research Center ©2017