Pipeline scripts for identifying differentially enriched H4K16ac peaks
Greg Donahue, 11-11-2017

********************************************************************************
NOTE

All software presented here is ©2017 Greg Donahue, ©2017 the Berger and Zaret labs and ©2017 The University of Pennsylvania Perelman School of Medicine.

********************************************************************************
OVERVIEW

Presented in this directory are seventeen bash, Python, or R scripts, to be run in numbered order, which produce a table of H4K16ac peaks that may be sorted and filtered to identify differential enrichments between study groups. The pipeline was developed for use with the University of Pennsylvania's PMACS high-performance cluster, an LSF environment. The table it generates was used in Nativio et al, Nature Neuroscience 2018 (accepted).

The pipeline has the following dependencies:
LSF (bsub)
python-2.7.5, including scipy/numpy
MACS2
bedtools
R

The R, python, bedtools and MACS2 binaries must be visible to the PATH environment variable. Additionally, several annotation files are included:
SampleSheet.csv
SampleCovariates.csv
RefSeq.hg19.bed (annotation file downloaded from UCSC Genome Browser)
RefSeq.hg19.RNA_Only.bed (the same, filtered for genes in the RNA-seq)
RefFlat.hg19.txt (a flat file mapping transcript IDs to official gene symbols)

And there are supplementary tables for the RNA-seq data (of which the first is a placeholder):
DESeq Table.txt
DESeq Table.v2.txt

********************************************************************************
INSTALLATION

Download the zip file Nativio2018Pipeline.zip and unzip, then enter the folder:

$> unzip Nativio2018Pipeline.zip
$> cd Nativio2018

Add the folder to the PATH variable:

export PATH="$PATH":PATH_TO_FOLDER/Nativio2018

Add the folder to the PYTHONPATH variable:

export PYTHONPATH="$PYTHONPATH":PATH_TO_FOLDER/Nativio2018

********************************************************************************
USAGE

What follows is a description of what the pipeline does and how to use it.

Step 0: ./0.preprocess.sh
The user must have already prepared tag-aligned BED files for both the chIP and the corresponding input for each patient. For the Nativio et al data set, these will be made available upon request (though the corresponding FASTQs can be downloaded from NCBI GEO and aligned using the protocol in the methods of Nativio et al Nature Neuroscience 2018). BEDs are also expected to be filtered for unique (i.e., non-PCR duplicate) tags only; this is not technically part of the pipeline but can be accomplished using the provided uniquetags.py script.

Additionally, the user must have prepared the SampleSheet.csv file, a comma-separated value file encoding on each line the patient name, study group, study group code, the absolute path of the chIP BED file for that patient, and the absolute path of the input (or sonication efficiency control) BED file for that patient. See the example SampleSheet.csv provided.

Running 0.preprocess.sh creates a number of directories, including a BEDs directory that will contain symbolic links to the user-provided BED files. It also creates pooled BED files for each study group, and invokes MACS2 to call peaks for patient chIPs and the pooled chIPs.

Step 1: python 1.build_mtls.py
This step constructs multiple transcription factor loci (MTLs) from the peaks called in the previous step. It writes a number of files cross-referencing the pooled peaks against the per-patient peaks in the Associations directory, then writes MTLs into MTLs/MTLs.Expanded.bed.

Step 2: ./2.calculate_aucs.sh
This step calculates and writes into the AUCs directory the library size-adjusted, length-adjusted scores for each patient's chIP and input over (i) the MTLs and (ii) the study group peaks.

Step 3: python 3.build_tables.py Young,Old Old,AD Young,AD
This step finds the nearest RefSeq transcript to each study group peak (judging from the peak center to the TSS). It then creates a table for each study group, and for each peak called in that study group it adds this information, the average peak signal for chIP and input, the two-sided unequal variance t test (Welch's test) p-value for each comparison listed on the command-line, the DESeq adjusted counts for RNA-seq, and the DESeq-computed p-values for the same. The resulting table is deposited in the Peaks directory.

Step 4: ./4.reference_mtls.sh
This step simply cross-references study group peaks with MTLs. It generates files in the Associations directory.

Step 5: python 5.mtl_tables.py 
This step processes the three study group peak tables and the MTLs file and makes a new table (Multipeak Table.MTLs.txt) which produces a table similar to those from step 3 for the MTLs. It additionally includes a binary flag to indicate whether the peak is present or absent in each of the study groups. This table is the base for the rest of the analysis.

Step 6: python 6.get_correlated.py
This step identifies the 50,000 MTLs from the multipeak table most correlated to neuronal fraction and removes them from the multipeak table and from the expanded MTLs file.

Step 7: python 7.collapse.py Multipeak\ Table.MTLs.txt > Multipeak\ Table.MTLs.Collapsed.txt
This step creates a version of the multipeak table in which only the closest peak to each RefSeq transcript is kept.

Step 8: python 8.annotate_mtls.py
This step does some simple text processing, adding the peak locus to the third column and creating a new file called MTLs.Annotated.bed.

Step 9: python 9.assign_nearest.py
Assigns RefSeq transcripts to their nearest MTLs. The output is in Nearest/RefSeq.hg19.bed_MTLs.Annotated.bed.

Step 10: python 10.assign_nearest.py
Assigns MTLs to their nearest RefSeq transcripts. The output is in Nearest/MTLs.Expanded.bed_RefSeq.hg19.bed.

Step 11: python 11.correct_tables.py
Uses the MTL->gene lookup tables produced in steps 9 and 10 to correct the nearest gene to each MTL in the collapsed and original multipeak tables. New tables (Multipeak Table.MTLs.Corrected.txt and Multipeak Table.MTLs.Collapsed.Corrected.txt) are generated.

Step 12: python 12.get_anova.py
Computes a one-way ANOVA p-value for the MTL AUC measurements and deposits the results in ANOVA Table.txt.

Step 13: Rscript 13.crunch_q.r
Assesses Benjamini-Hochberg q-values for all results from step 12.

Step 14: python 14.merge_anova.py
Combines q-values with collapsed and original corrected tables, creating new files called *.ANOVA.txt.

Step 15: python 15.annotate.py
Adds official gene symbols to the tables.

Step 16: python 16.swap_RNA.py
Re-annotates the RNA-seq section with an updated RNA-seq data file.

At this stage, the multipeak tables are complete and should be ready for further analysis.