Monday, November 16, 2015

ChIP-Seq analysis BaseSpace App



Introduction

Chromatin immunoprecipitation (ChIP) is a technique used to isolate protein-DNA complexes in order to define the function and genomic localization of transcription factors (TFs). Since the development of deep sequencing, the two procedures have been coupled (ChIP-Seq) to extend the analysis genome-wide. Despite its wide range of uses and the increasing interest in it, the analysis of sequencing data requires bioinformatics skills and time that not all biologists have. The advent of cloud computing has reduced the need for biologists to acquire dedicated hardware. A further step towards easy access to bioinformatics is the development of open-source and commercial workbenches. BaseSpace is a commercial workbench developed by Illumina that provides a user-friendly interface to bioinformatics tools. In the BaseSpace framework, complex bioinformatics pipelines are wrapped in graphical interfaces, called Apps, that are intuitive for biologists to use. Among them is ChIP-Seq analysis, a free App designed to simplify the identification and annotation of TF binding sites.

App Overview


This app detects and annotates significant TF binding sites (TFBS) within a sample and is based on the workflow proposed and used by Galli et al. (2015). The app performs the following steps:

  1. Peak calling using either MACS v1.4 or SICER v1.1, depending on the characteristics of the peaks (MACS for sharp peaks such as those of TFs, SICER for broad peaks such as those of histone marks),
  2. Association of each peak with the nearest gene available in the chosen UCSC genome release,
  3. Full annotation of the nearest gene,
  4. Saving of the results in a folder named data, which can be downloaded for further analysis. The content of the data folder is described in the README file inside it.
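
Step 2 can be illustrated with a minimal sketch. It is not the app's own code: the gene table, the function names and the choice of measuring the distance from the peak midpoint to the gene start are all assumptions made for the example, since the exact distance definition used by the app is not described here.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical gene table: (chromosome, gene start position, gene name).
genes = [
    ("chr1", 1000, "geneA"),
    ("chr1", 5000, "geneB"),
    ("chr2", 300, "geneC"),
]

def build_index(gene_table):
    """Group gene start positions by chromosome and sort them by position."""
    index = defaultdict(list)
    for chrom, start, name in gene_table:
        index[chrom].append((start, name))
    for chrom in index:
        index[chrom].sort()
    return index

def nearest_gene(index, chrom, peak_start, peak_end):
    """Return (gene name, signed distance) for the gene start closest to
    the peak midpoint, or None if the chromosome has no genes.
    Assumption: distance is measured from the peak midpoint."""
    entries = index.get(chrom)
    if not entries:
        return None
    mid = (peak_start + peak_end) // 2
    positions = [pos for pos, _ in entries]
    i = bisect_left(positions, mid)
    # Only the genes immediately left and right of the midpoint can be nearest.
    candidates = entries[max(i - 1, 0):i + 1]
    pos, name = min(candidates, key=lambda entry: abs(entry[0] - mid))
    return name, pos - mid

index = build_index(genes)
result = nearest_gene(index, "chr1", 4200, 4400)
print(result)
```

Sorting the gene starts once per chromosome lets each peak be resolved with a binary search instead of a scan over all genes, which matters when thousands of peaks are annotated.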

Inputs


The app accepts input files in BAM format (.bam), generated with any modern alignment tool, such as Bowtie, TopHat, SHRiMP or STAR. BaseSpace does not allow direct upload of BAM files; the samples must therefore be uploaded as FASTQ files and aligned with an existing app beforehand.

The analysis requires two conditions: treatment and background. Peak calling is performed on the treatment, using the background to model the noise of the system. For this reason, the background is usually generated from the same system as the treatment, but with an immunoglobulin unrelated to the one used for the treatment.
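
The role of the background can be made concrete with a deliberately simplified sketch. This is not the statistical model used by MACS or SICER; it only shows the idea of comparing treatment reads in a region with the background reads expected there once the two libraries are scaled to the same depth. The function name, the pseudocount and the example numbers are assumptions.

```python
def fold_enrichment(treat_reads, control_reads, treat_total, control_total,
                    pseudocount=1.0):
    """Ratio of treatment reads in a region to the background reads
    expected there after scaling the background to the treatment depth.
    The pseudocount (an assumption) avoids division by zero."""
    scale = treat_total / control_total
    expected = control_reads * scale
    return (treat_reads + pseudocount) / (expected + pseudocount)

# Hypothetical region: 120 treatment reads, 10 background reads,
# with the treatment library sequenced twice as deeply as the background.
enrichment = fold_enrichment(120, 10, 20_000_000, 10_000_000)
print(round(enrichment, 2))
```

A region where the treatment shows no more reads than the scaled background yields a ratio close to 1, which is why a background from the same system is needed to separate real binding from noise.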

Users only need to select the BAM files to work on (which must already be available in BaseSpace) and the output project. More advanced users can tweak the options of the peak-calling software. This is not mandatory, however, because all options are already set to defaults that should fit most analyses.

Outputs

The app packs the output files in a data folder containing the following items:
  • README: A file describing the content of the data folder
  • mypeaks.xls: All detected peaks alongside the nearest gene and its annotation
  • mytreat.counts: The total read count for the provided treatment file
  • mycontrol.counts: The total read count for the provided control/background file
  • peak_report.xls: Aggregate information regarding the peaks and their position relative to the nearest gene
  • chromosome_distribution.pdf: Barplot of the distribution of the peaks on the chromosomes
  • relative_position_distribution.pdf: Barplot of the distribution of the peak positions relative to their nearest gene
  • peak_width_distribution.pdf: Histogram of the distribution of the width of the peaks
  • distance_from_nearest_gene_distribution.pdf: Histogram of the distribution of the distance of each peak from its nearest gene
  • cumulative_coverage_total.pdf: Cumulative normalized gene coverage
  • cumulative_coverage_chrN.pdf: Cumulative normalized gene coverage for the specific chromosome
  • mycontrol_sorted.bw: bigWig file for UCSC Genome Browser visualization
  • mytreat_sorted.bw: bigWig file for UCSC Genome Browser visualization
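
As an illustration of the kind of summary behind chromosome_distribution.pdf and peak_width_distribution.pdf, the sketch below tallies peaks per chromosome and per width bin. It is not the app's code: the peak list, the function names and the bin size are assumptions, and the column layout of mypeaks.xls is not described here, so the README in the data folder should be consulted before loading real results.

```python
from collections import Counter

# Hypothetical peaks: (chromosome, start, end).
peaks = [
    ("chr1", 100, 350),
    ("chr1", 1000, 1120),
    ("chr2", 500, 790),
]

def chromosome_counts(peak_list):
    """Number of peaks found on each chromosome."""
    return Counter(chrom for chrom, _, _ in peak_list)

def width_histogram(peak_list, bin_size=100):
    """Number of peaks per width bin; each key is the lower bound of a bin."""
    return Counter(
        ((end - start) // bin_size) * bin_size for _, start, end in peak_list
    )

per_chromosome = chromosome_counts(peaks)
per_width = width_histogram(peaks)
print(dict(per_chromosome))
print(dict(per_width))
```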
     

Limitations

  • Each input sample needs to be provided as a single BAM file. The input form does not allow multiple files for each condition.
  • The app only supports peak calling in which the background noise is defined by an external source, i.e. a separate background sample.

Citations

The app is derived from the bioinformatics procedures described in Galli et al., Mol. Cell 2015.

Support

The developer is Matteo Carrara, PhD student in Complex Systems for Life Sciences at the University of Turin.
The maintainer is Raffaele Calogero, Associate Professor of Molecular Biology and Bioinformatics at the University of Turin. Users are kindly requested to send bug reports and feedback to raffaele[dot]calogero[at]unito[dot]it
Two demo datasets are also available: one to test the peak-calling procedure for TFs and sharp peaks with MACS, and one to test the procedure for histone marks and broad peaks with SICER. Each demo dataset contains two samples: treatment and background. All data have been aligned using the TopHat App, and the BAM files are available in the output folders under the name "alignments/X.alignments.bam".



Acknowledgement
This App was developed within the framework of the NGS-PTL FP7 project and the EPIGEN FLAG Project.