NTAP Manual

NTAP was developed in the process of our own generated NimbleGen tiling array data analysis. Specifically, the tiling array to survey the genome-wide binding sites of transcription factor HY5 in Arabidopsis (The Plant Cell 2007, 19:731-749.) and the genome-wide histone modifications/DNA methylation level in rice (The Plant Cell 2008, 20:259-276). To discover the biological knowledge from the tiling array data, we did lots of comparisons after the data normalization and significantly enriched feature identification. For example, we compared various histone modifications level between: the protein-coding genes and transposons; the high expressed genes and low expressed genes; the large size genes and short size genes...For each group, we used different criteria to define its gene members. NTAP contain functions to do those comparison for us once we decide the gene groups. And the analysis is as simple as putting our genes list into the "genelists" sub-directory and run a function of NTAP. Here is the manual of NTAP including input files preparation and data analysis using our own tiling array data set (O.sativa, 380k design). Tutorial using another public available sample dataset (C.elegans 2M design, NIH GEO ID: GSE14653) is also provided to show users how to adjust NTAP parameters to fit their own requirements.

If you have any further questions about the usage of NTAP, please feel free to contact us.

 I. Prepare input/configuration files for NATP

All the following files should be put under the NTAP program directory. The filenames are not mandatory, you can specify your own filenames and use them instead when using NTAP functions.

  1. Raw NimbleGen Data

Rename all the raw data files into "Chip_ID" + "Channel_Suffix" format. For example, "57991_cy3.mrg" and "57991_cy5.mrg" contains the data of cy5 and cy3 channel for the same chip "57991". The Chip_ID should be identical to the ids specified in the above "Ids.txt" file. "Channel_Suffix" is required by the readNimbleMrg() function through parameters "ipChannel" and "refChannel", see the sample data analysis below for details.

Put all the files into a sub-directory "data". The scatterplot figures will be generated into the same directory later.

  1. "SpotTypes.txt"

It contains the table to describe the different types of spots on the tiling array. Example:

SpotType     Probe_Class     Color
Oligo        *\+             black
Barcode      barcode         brown
barcodeM     barcode_mark    orange
barcodeE     barcode_empty   green
Control      control         blue

The file format is defined by Limma package to clarify different types of spots when do the plotting. It's not necessary for the subsequent analysis, but mix oligos with the biological irrelevant chip marker spots could introduce bias to the normalization procedure.

The second column here is the key data to mark different types of spots on the array. The raw data in this column should have power to distinguish different types of spots. Please don't change the title line in this file. Instead, plese customize your chosen column title in the "readNimbleMrg" function. For more information about the "SpotTypes.txt", please read the user guide for Limma.

  1. "Ids.txt"

It contains all the chip ids included in one single experiment. For example, there are 3 replicates in an experiment, and the ids for the 3 replicates are 57991, 59178 and 59302 respectively. Then the Ids.txt should have 3 lines and each line is a single chip id:

57991
59178
59302

The chip ids don't have to be numbers, you can use whatever name you like as the chip ids, but please make sure your raw data filenames contain the same id you specified here. Or, you must write down the exact chip id in your filenames if you don't want to rename your raw data files.

  1. "OligoSites.txt"

The version of genome data release is usually updated by the genome sequencing or annotation consortium. So that it's important to keep up-to-date oligos position list according to the most recent genome data release. You can either remap your oligos by BLAST/BLAT tools, or download the up-to-date mapping file from the chip ventor's user service website. The file need to be in such a format:

47288202    0      4
47288203    83     4
47288204    167    4
47288205    277    4
47288206    491    4
47288207    626    4
47288208    773    4
...

The first column is the oligo ID, which should be match the oligo ID in your raw data file.
The second column is the position of the corresponding oligo.
The third column is the chromosome id which the oligo was mapped on, make sure it's in identical format as in the "Genes.coords" file described below.

  1. "Genes.coords"

This file should contain all the gene coordinates information along the chromosome in following format:

LOC_Os04g12990  7148291  7146697 4
LOC_Os04g05680  2892064  2889121 4
  
The first column is the GeneIDs, second and third are gene coordinates, if a gene is on the positive strand, the number is bigger in the second column, otherwise, the number is bigger in the third column. The strand information is useful to determine whether two neigboring genes could share the same promoter region. The last column is the ChromosomeID, you can use either 4 or "IV" to indicate chromosome 4, but the later one is better if you are willing to present your result in GBrowse later.

Besides, make sure your "OligoSites.txt" and "Genes.coords" are using the same chromosome ids.

II. Using NTAP to process data and generate reports/figures

The R script below is included in the run.R file under NTAP directory. Most parameters are customizable.

# Specify the NTAP directory. "." indicates the current directory. If R is started outside of NTAP directory, please replace "." with the corresponding path.

wd <- "."
# Load NTAP source codes.
source("ntap.R")

# Get the SpotTypes, chip IDs and OligoSites on the chromosome. Please change the filename if you used another filename. 
types <- readSpotTypes("SpotTypes.txt")

# Get all the chip IDs from your input data file "Ids.txt", change it if you used another filename. 
ids <- scan(file="Ids.txt", what="character", quiet=T, comment.char="#")

# Import all the up-to-date oligo sites information.
oligoSites <- read.table("OligoSites.txt", header=F, sep="\t")

# Specify the column titles in your raw data file including (in order please):
# Oligo IDs, Row number on array, Column number on array, Position on chromosome, Description of the spot (usually used to character oligo types).
annotation <- c("FID", "X_BC","Y_BC", "Position", "Probe_Class")
# Import the raw data into NTAP, please notice that the raw data filename suffix should be assigned to parameters "refChannel" and "ipChannel" correspondingly. 
# "valueTitle" is the title of the intensity data column in the raw data files.
# Change the "skip" value if the title line is not in the first line of the data files.
dataRaw <- readNimbleMrg(path=paste(wd,"/data",sep=""), ids, skip=0, refChannel="_cy5.mrg", ipChannel="_cy3.mrg", sep="\t", valueTitle="Signal_Median")

# Calculate the correlation coefficient before normalization (Optional step, only if you want to check the raw data consistency)
plotCor(dataRaw, suffix="NN", path="plots")

# Use loess method to do normalization between replicates. User may choose different normalization method provided by limma package through the "method" parameter.
dataMA <- normalizeWithinArrays(dataRaw, method="loess")
# Convert the ratio (M) and average intensity (A) data back to intensity values. 
dataRG <- RG.MA(dataMA)
# Calculate the correlation coefficient between replicates after normalization to check if the normalization works. 
plotCor(dataRG, suffix="WT", path="plots")

# Merge the values from different replicates, write normalized data into text files for inspection and generate the data object "wtMerge" to following functions as well. 
wtMerge <- writeMerge(dataRG, suffix="WT", path="results")
# Update the oligo position information to the most up-to-date genome annotation version, return and generate the data object "draw".
draw <- sortChrom(oligoSites, wtMerge, path="results")
# Use wilcoxon rank-sum test to identify the significantly histone modifications enriched region on Chromosome 4.
pvalue <- wilcoxPvalue(ids, c("4"), draw, swStep=50, segLength=1e05, expName="ntap-example", path="results", alternative="greater", e=1e06)
# Using the gene coordinates information, assign corresponding oligos to specific genes and prepare the ".rec" file for sliding window plots. 
# Plese change the parameter according to your own filenames. 
pvalue2rec("ntap-example",c("4"), "Genes.coords", "ntap-example.rec", path="results")
# Read the gene lists to compare of the histone modifications pattern between.
less1kb <- read.table("./genelists/less1kb.gid")
one2two <- read.table("./genelists/1kb-2kb.gid")
gt2kb <- read.table("./genelists/gt2kb.gid")

# Specify the data files to compare. If you have multiple experiments to compare, read all them in by adding their names into the list. 
# For example, c("H3K4Me2-Leaf.rec", "H3K4Me2-Root.rec", "DNAMeth-Leat.rec").
recFiles <- c("ntap-example.rec")
# Specify the names to appear on the generated figures.
# For example, c("H3K4Me2 in leaf", "H3K4Me2 in root", "DNA Methylation in leaf") may be used to mark the above example data files. 
recFiles <- setLegendTag(recFiles, c("Sample"))

geneLists <- c(less1kb, one2two, gt2kb)
# Specify the names to appear on the generated figures.
geneLists <- setLegendTag(geneLists, c("Small", "Medium", "Large"))

# Do comparison and generate figures into the "results" subdirectory. 
# User may change the default "path" to save the figures into other directories. 
drawSW(recFiles, geneLists,"ntap-example", path="results")

The functions used to do data pre-processing are listed below:

ReadNimbleMrg: to read the raw NimbleGen data into R and convert to limma RG and MA data type.

plotCor: calculate the correlation coefficients between different replicates and generate the scatter plots for user to inspect.

writhMerge: calculate the ratio of Sample/Reference and return an object "merge" for further analysis.

FID         Sel_Crt   Position	    Cy5(Genomic)  Cy3(Histone)    Log(Cy3/Cy5)
47439226    5.65      4:+24227392   6742.77       1493            -2.17
47237021    10.00     10:+15462520  3198.66       2054.11         -0.63
47498068    1.00      4:+32696844   1293.55       996.44          -0.37
47441489    1.00      4:+24534439   5207.77       6570.77         0.33
......
	

sortChrom: read the "OligoSites.txt" and update the oligo position information to the most up-to-date genome annotation version, return and write the object "draw".

FID         Sel_Crt     Position  Cy5(Genomic)    Cy3(Histone)    Log(Cy3/Cy5) ChrNum
47288202    1.000000    0         453.44          415.66666       -0.125       4
47288203    1.000000    83        946.33          2197.7778       1.215	       4
47288204    1.000000    167       1138.88         1821.3334       0.677        4

wilcoxPvalue: use wilcoxon rank-sum test to identify the region where the sample channel intensities are significantly higher than the reference channel, return and write the object "pvalue". All replicates are used and recorded into the "pvalue" object.

Important parameters:

swStep - set the sliding window width and overlap size. If swStep=50, the sliding window width will be 100bp, and the overlap between two neighboring windows will be 50bp.
alternative - use "greater" to identify only enriched regions; use "less" to identify only depleted regions; use "two.sided" to get both.
s - if only interested with sub-region on the chromosome, specify the starting position of the sub-region by this parameter. Default value "0" will calculate from begining of the whole chromosome.
e - if only interested with sub-region on the chromosome, specify the ending position of the sub-region by this parameter. Default value "NULL" will calculate until the end of the whole chromosome.
segLength - Wilcoxon test is done for each sliding window. Therefore we can divide the whole chromosome into segments to increase the calculation speed. Default value is 1Mbps per segment.
chr - put the IDs of the chromosomes need to do Wilcox test into a list and assign to this parameter. For example, c("4","10") will calculate pvalues for both chromosome 4 and 10.

FID        SelCrt   Position  pvalue  MeanOfRatio  Cy5_57991  Cy5_59178  Cy5_59302  Cy3_57991  Cy3_59178  Cy3_59302  Ratio_57991  Ratio_59178  Ratio_59302   chr
47288202   1.0	    0         0.9     0.8          478.2      167.4      453.4      369        123.1      415.6      -0.3         -0.4         -0.1          4
47288203   1.0	    83        0.6     1.4          1082.3     373.2      946.3      1040.0     342.7      2197.7     -0.05        -0.12         1.2          4

Post-processing and comparison

pvalue2rec: the function use system() wrapper to load a perl script to generate .rec files required for the figure generation. User may also directly use the perl program to accomplish the task.

perl pvalue2rec.pl pvalueFileName  Genes.coordsFileName OutputFileName

Using the gene coordinates information, assign corresponding oligos to specific genes and prepare the ".rec" file for sliding window plots.

FID       Position   Position_to_TSS   Percentage_to_TSS   P-value  Averaged_Ratio  Sel_Crt  Locus_ID
47331376  7402893    416               0.083               0.39     1.14            1.2      LOC_Os04g13350

Oligo 47331376 is inside the locus LOC_Os04g13350's coding region; the relative position to LOC_Os04g13350's transcription start site (TSS) is 416 bps; the relative percentile of the distance to TSS compared to the whole gene's coding region is ~8.3% (0.08305...)

drawSW: plot the sliding window results (Figure Examples). The parameters for this function include: 1. the ".rec" file path; 2. the gene lists to compare among, for example, genes with no EST evidence (low expressed), with a large number of ESTs (high expressed) and all the genes (average).

III. Tutorial for analyzing another sample dataset

Dataset is downloaded from the GEO database. The experiment is GSE14653 series, including data files and the design file GPL8135_Celegans180_ChIP_v2_HX1.pos. The design file was used to generate the OligoSites.txt. Compared to our own sample dataset, their design is much more dense with 2M probes on a single array (compared to our 380k per array). Here we used the N2 SDC-3 HD chips (GSM365609_10213802 and GSM365612_10331202) as example to demonstrate the usage of NTAP. We add "Ce." to all the input files so that we can distinguish the C.elegans results from O.sativa ones.

1. Prepare the "Ce.SpotTypes.txt". From the raw data we found values for several parameters: there is one extra line in each file so that the "skip" parameter in readNimbleMrg should be "1". The titles for "annotation" parameter should be c("PROBE_ID", "X","Y", "POSITION", "SEQ_ID"), and the pattern for recognize non-oligos spots are "RANDOM*". The final "Ce.SpotTypes.txt" file is as following.

SpotType        Probe_Class     Color
Oligo              *        black
Random          RANDOM*         red

2. Prepare the "Ce.Ids.txt". The original data file was as following:

GSM365609_10213802_635.pair
GSM365609_10213802_532.pair
GSM365612_10331202_635.pair
GSM365612_10331202_532.pair

From GEO online information, we know the cy5 (635nm) channel is the ChIP channel and Cy3 (532nm) channel is Input DNA channel, please copy ONLY the chip IDs into the "Ce.Ids.txt" file, and the channel can be specified through the parameter of function "readNimbleMrg" (See the example script at the bottom).

GSM365609_10213802
GSM365612_10331202

They also have Mock IP experiments done. Mock IP control experiments usually follow the whole chromatin immunoprecipitation protocol to collect the mock IP sample except using non-specific antibody or no antibody instead of specific antibody. The immunoprecipitated samples will be hybridize against the same set of genomic DNA input as in the ChIP-chip experiments. To use the mock IP data, please put the mock IP experiment IDs into the Ħ°Ids.txtĦħ file and redo analysis after finished the ChIP-chip experiments analysis. Then only the regions specifically appeared in the ChIP-chip experiment should be considered as the true positive results.

3. Prepare the "Ce.OligoSites.txt". From the design data, we can easily extract the oligo positions information by using awk under Unix/Linux to extract the information into "Ce.OligoSites.txt".

cat GPL8135_Celegans180_ChIP_v2_HX1.pos | awk -F'\t' '{print $3"\t"$4"\t"$2}' | grep -v PROBE_ID > Ce.OligoSites.txt

Or if you are going to use NDF format NimbleGen design file, you can use following command to generate the "OligoSites.txt" file (the difference is only the column index used to extract the correct information):

cat  GPL8135_Celegans180_ChIP_v2_HX1.ndf | awk -F'\t' '{print $13"\t"$14"\t"$5}' | grep -v PROBE_ID > Ce.OligoSites.txt

4A. Prepare the "Ce.Genes.coords" file. We downloaded the realease 180 gene annotation from the WormBase website through the WormMart (WS180 is the assemble version what the tiling array design based on). The default filter is to download all gene names, we selected attributes including Chrom Name, Strand, Start(bp) and End(bp) to get the data we need.The downloaded TSV data named "mart_export.txt" looks like:

Gene WB ID  	Gene Public Name  	Chr Name  	Strand  	Start (bp)  	End (bp)
WBGene00000001 	aap-1 	I 	1 	5107846 	5110167
WBGene00000002 	aat-1 	IV 	-1 	9598963 	9601672
WBGene00000003 	aat-2 	V 	-1 	9244386 	9246334
WBGene00000004 	aat-3 	X 	1 	2554225 	2557708
WBGene00000005 	aat-4 	IV 	-1 	6272575 	6275697
WBGene00000006 	aat-5 	I 	-1 	2571783 	2576021
WBGene00000007 	aat-6 	V 	-1 	11466834 	11470640
WBGene00000008 	aat-7 	II 	1 	1386894 	1391573
WBGene00000009 	aat-8 	IV 	-1 	3862123 	3865024
WBGene00000010 	aat-9 	I 	-1 	11388738 	11396651

It's quite simple to convert it into NTAP formate Genes.coords file by:

cat mart_export.txt | awk -F'\t' '$4=="1"{print $1"\t"$5"\t"$6"\t"$3} $4=="-1"{print $1"\t"$6"\t"$5"\t"$3}' > Ce.Genes.coords
WBGene00000001  5107846 5110167 I
WBGene00000002  9601672 9598963 IV
WBGene00000003  9246334 9244386 V
WBGene00000004  2554225 2557708 X
WBGene00000005  6275697 6272575 IV
WBGene00000006  2576021 2571783 I
WBGene00000007  11470640        11466834        V
WBGene00000008  1386894 1391573 II
WBGene00000009  3865024 3862123 IV
WBGene00000010  11396651        11388738        I

4B. However, user may want to use the most recent genome release instead to take advantage of recent annotations. To achieve such a goal, user have to remap all oligos back to the newest genome sequences. Please skip this section if you still want to keep using WS185 instead.

First of all, prepare the FASTA format oligo sequences for mapping:

cat GPL8135_Celegans180_ChIP_v2_HX1.ndf | awk -F'\t' '$0 !~/PROBE_DESIGN/ {print ">"$13"\n"$6}' > oligos.fas

Next, download the corresponding version genome DNA sequences from ftp://ftp.wormbase.org/pub/wormbase/genomes/c_elegans/sequences/dna/c_elegans.WS195.dna.fa.gz (29M). The file is in FASTA format with each chromosome as one big sequence.

Then, use "BLAT" program to remap all oligos back onto the WS202 genome:

gunzip c_elegans.WS195.dna.fa.gz
blat  c_elegans.WS195.dna.fa oligos.fas oligos.psl

The oligos.psl would contain lines like those:

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes      qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
50      0       0       0       0       0       0       0       +       XFS004800112    50      0       50      X       17718854        4800113 4800163 1       50,     0,      4800113,
50      0       0       0       0       0       0       0       +       VFS017520288    50      0       50      V       20924143        17520458        17520508        1       50,     0,      17520458,
50      0       0       0       0       0       0       0       +       VFS011538557    50      0       50      V       20924143        11538728        11538778        1       50,     0,      11538728,
50      0       0       0       0       0       0       0       +       VFS013136768    50      0       50      V       20924143        13136939        13136989        1       50,     0,      13136939,
50      0       0       0       0       0       0       0       +       XFS012482593    50      0       50      X       17718854        12482593        12482643        1       50,     0,      12482593,

Use following command to generate the updated Oligos position file "Ce.OligoSites.txt":

cat oligos.psl | awk -F'\t' '$1==50{print $10"\t"$16"\t"$14}' > Ce.OligoSites.txt
At last , download the WS195 version genes annotation from WormMart and generate the "Ce.Genes.coords" following the 4A section instruction.

5. Prepare the gene lists to compare. We used x>2kb and 100bp<x<1kb as criteria to generate gene lists contain long and short genes. User may specify their own criteria and generate their own gene lists.

cat mart_export.txt | awk -F'\t' '$6-$5<1000 && $6-$5> 100{print $1}' > short.genes
cat mart_export.txt | awk -F'\t' '$6-$5> 2000{print $1}' > long.genes

6. Run the NTAP as following to finish the analysis.

wd <- "."
source("ntap.R")
types <- readSpotTypes("Ce.SpotTypes.txt")
ids <- scan(file="Ce.Ids.txt", what="character", quiet=T, comment.char="#")
oligoSites <- read.table("Ce.OligoSites.txt", header=F, sep="\t")
annotation <- c("PROBE_ID", "X","Y", "POSITION", "SEQ_ID")
dataRaw <- readNimbleMrg(path=paste(wd,"/data",sep="") ,ids, skip=1, refChannel="_635.pair", ipChannel="_532.pair", sep="\t", annotation=annotation, valueTitle="PM", status=T)

plotCor(dataRaw, suffix="NN", path="plots")

dataMA <- normalizeWithinArrays(dataRaw, method="loess")
dataRG <- RG.MA(dataMA)
plotCor(dataRG, suffix="WT", path="plots")
wtMerge <- writeMerge(dataRG, suffix="WT", path="results")

draw <- sortChrom(oligoSites, wtMerge, path="results")

# Use Chromosome III as example. 
pvalue <- wilcoxPvalue(ids, c("III"), draw, swStep=200, segLength=1e05, expName="ntap-example", path="results", alternative="greater", e=1e06)

pvalue2rec("Ce.SDC3-ChIP",c("III"), "Ce.Genes.coords", "Ce.SDC3-ChIP.rec", path="results")

ce.gt2k <- read.table("./genelists/long.genes")
ce.lt1k <- read.table("./genelists/short.genes")

recFiles <- c("Ce.SDC3-ChIP.rec")
recFiles <- setLegendTag(recFiles, c("C.Elegans SDC-3 HD"))


geneLists <- c(ce.gt2k, ce.lt1k)
geneLists <- setLegendTag(geneLists, c("Long", "Short"))

drawSW(recFiles, geneLists,"Ce.SDC3-ChIP", path="results")