SamQC

From Jie Wu Home Page

(Difference between revisions)
Jump to: navigation, search
(A quick example)
 
(4 intermediate revisions by one user not shown)
Line 13: Line 13:
 
Please install samtools, htseq-count, bedtools (2.17.0 preferred),  R and bwa (0.6.1 preferred) before running this script.
 
Please install samtools, htseq-count, bedtools (2.17.0 preferred),  R and bwa (0.6.1 preferred) before running this script.
  
These programs here are required if you want to run the full pipeline, samtools and bedtools are mandatory for basic functions.
+
Samtools and bedtools are mandatory for basic functions. The others are for advanced functions.
 +
 
 +
==A quick protocol==
 +
 
 +
First, use '''BWA''' to map the reads to the genome, and get the BAM format output file. Then run the following command line:
 +
 
 +
<pre>
 +
perl samQC.pl input.bam --config rn5.conf --out output --plotdist
 +
</pre>
 +
 
 +
Please modify the paths in "rn5.conf" so that the program can locate the annotation files accordingly.
  
 
==The options==
 
==The options==
Line 31: Line 41:
 
|-
 
|-
 
|InputBAMFile     
 
|InputBAMFile     
|bam format mapping result, mapped to the genome (not transcriptome). BWA, TopHat or OLego are recommended. Must be mapped to the same version of genome indicated by the configuration file.
+
|bam format mapping result, please map the reads to the genome (not transcriptome) using BWA or OLego. The reads must be mapped to the same version of genome indicated by the configuration file.
 
|-
 
|-
 
|bedfileName
 
|bedfileName
|transcriptome information, ensembl bed downloaded from UCSC.
+
|transcriptome information, ensembl bed format annotation downloaded from UCSC genome table browser.
 
|-
 
|-
 
| --rmsk
 
| --rmsk
Line 67: Line 77:
 
|-
 
|-
 
| --ensemblbam     
 
| --ensemblbam     
|bam input has ensembl chrom style
+
|bam input has ensembl chrom style (testing)
 
|-
 
|-
 
| --downsample     
 
| --downsample     
|down-sample to this number of reads [default: 1000000 ]
+
|down-sample to this number of reads [default: 1,000,000 ]
 
|-       
 
|-       
 
| --out           
 
| --out           
Line 118: Line 128:
 
  perl ~/scripts/pipeline/samQC/Summarize.pl  
 
  perl ~/scripts/pipeline/samQC/Summarize.pl  
  
in this folder
+
Then you will see some figures generated in this folder (if there is no bug..)
 
+
'''Read more if you are interested...'''
+
 
+
  
 
==The config file==
 
==The config file==

Latest revision as of 00:38, 15 October 2014

Contents

[edit] Overview

This script can be used to QC (Quality Control) bam files. By default, it downsamples the bam file and

  • (1) checks the distribution of the reads (exonic, intronic, UTR, intergenic, repetitive, etc.),checks rRNA by mapping to the rRNA fasta file;
  • (2) strand specificity
  • (2) checks the top 30 expressed genes and output the gene expressions;
  • (3) plot the read depths across transcripts (3' to 5' bias)

The script now uses MANY annotations, this complexity will be reduced in the future. For now, for simplicity, a configuration file can be used to load the locations of the annotations. Please see option --config. Precompiled configuration files are provided in the package I sent you.

[edit] Prepare to run the code

Please install samtools, htseq-count, bedtools (2.17.0 preferred), R and bwa (0.6.1 preferred) before running this script.

Samtools and bedtools are mandatory for basic functions. The others are for advanced functions.

[edit] A quick protocol

First, use BWA to map the reads to the genome, and get the BAM format output file. Then run the following command line:

 perl samQC.pl input.bam --config rn5.conf --out output --plotdist

Please modify the paths in "rn5.conf" so that the program can locate the annotation files accordingly.

[edit] The options

The detailed help information:

perl /path/to_samQC/samQC.pl InputBAMFile [ bedfileName ] [--options]

Basic options:

Option Description
InputBAMFile bam format mapping result, please map the reads to the genome (not transcriptome) using BWA or OLego. The reads must be mapped to the same version of genome indicated by the configuration file.
bedfileName transcriptome information, ensembl bed format annotation downloaded from UCSC genome table browser.
--rmsk This option is not mandatory. rmsk.txt file downloaded from UCSC genome browser table
--ensgtf supply this file for gene expression qc, ens gtf downloaded from UCSC
--enstxt supply this file for gene expression qc, ens txt downloaded from UCSC
--id2name supply this for conversion between gene ids and gene names, from biomart
--plotdist plot 5' to 3' read distribution
--toplistnum the number of top expressed transcripts used for --plotdist [default: 200]
--rRNAqc supply this fasta file for more accurate rRNA contamination report, sequences will be mapped to this file instead of using pre-aligned bam files.
--genomesize provide this to compute intergeic/exon density ratio
--o2 Use this output prefix for geneexp and coveragePos instead of --out
--keeptmp keep the tmp file, for debugging
--ensemblbam bam input has ensembl chrom style (testing)
--downsample down-sample to this number of reads [default: 1,000,000 ]
--out Output prefix [default: samQCresult ]
--config Use this config file to get options (bedfileName, --rmsk, --ensgtf, --enstxt, --id2name, these options will be ignored if they are specified in conf file)

[edit] Example output

 # total number of reads:        35550426
 # Downsampled to:       1000000
 # Mapped reads: 864117  86.412%
 # distn
 # CDS   UTR     UTR5    UTR3    Intron  flanking(+-3k)  intergenic
 335854  296241  71294   242919  122140  33174   76708
 38.867% 34.283% 8.251%  28.112% 14.135% 3.839%  8.877%
 # region size
 36243885        87461430        23543387        44487753        971740637       215515932
 # DNA   LINE    Low_complexity  LTR     Other   RC      RNA     rRNA    Satellite       scRNA   Simple_repeat    SINE    snRNA   srpRNA  tRNA    Unknown
 1026    5057    305     9327    28      0       7041    7017    464     16      1322    4348    3       1       4       529
 1.338%  6.593%  0.398%  12.159% 0.037%  0.000%  9.179%  9.148%  0.605%  0.021%  1.723%  5.668%  0.004%  0.001%  0.005%  0.690%
# Top30 expressed genes:
ENSMUSG00000052305      Hbb-bs  16239   1.630%
ENSMUSG00000073940      Hbb-bt  8767    0.880%
ENSMUSG00000006574      Slc4a1  4691    0.471%
ENSMUSG00000069919      Hba-a1  4171    0.419%
ENSMUSG00000035202      Lars2   3395    0.341%
.........
.........
#Total  Mapped  Mapped% CDS     UTR     UTR5    UTR3    Intron  flanking(+-3k)  intergenic      exon/int
ron     5/3utr  exon/intergenic sense/antisense rRNA%
35550426        864117  86.412  38.867% 34.283% 8.273%  28.517% 14.135% 3.839%  8.877%  58.472  0.550
131.129 0.008   1.438%

[edit] Summarizing script

After you run the scripts for different bam files, you can use Summarize.pl to summarize the results, it will generate figures for 3'/5' bias and correlation between gene expressions.

copy or link the .*geneexp.txt and *coveragePos.txt files to a folder

run:

perl ~/scripts/pipeline/samQC/Summarize.pl 

Then you will see some figures generated in this folder (if there is no bug..)

[edit] The config file

  • bedFileName: for option --bedfileName

ensembl BED annotation downloaded from ucsc table browser, this file defines the structure of the transcriptome. From this file, samqc checks the distribution of the reads in cds, utr, intron, flanking, intergenic. It doesn't have to be from ensembl annotation, refseq or ucsc genes also work.

  • rmskFileName: for option --rmsk

rmsk txt annotation downloaded from ucsc table browser, has to be the original TXT format. This file is not mandatory.

  • gtfFileName: for option --ensgtf

ensembl gtf annotation downloaded from ucsc table browser.

  • txtFileName: for option --enstxt

ensembl txt annotation downloaded from ucsc table browser

  • id2nameFileName: for option

ensembl id to gene name downloaded from http://www.ensembl.org/biomart/martview/

  • rRNAqc: for option --rRNAqc

a fasta file to be used to map the downsampled reads to find percentage of rRNA reads

  • genomesize: for option --genomesize

This is for exon/intergenic ratio estimation, use the following code for this:

   

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.size 

An example is here:

bedFileName     /home/Genomes/samQC/mm9.ens.bed
rmskFileName    /home/Genomes/samQC/mm9.rmsk.txt
gtfFileName     /home/Genomes/samQC/mm9.ens.gtf
txtFileName     /home/Genomes/samQC/mm9.ens.txt
id2nameFileName /home/Genomes/samQC/mm9.ens_id.name.txt
rRNAqc          /home/Genomes/samQC/mm9.rRNA.fa
genomesize      /home/Genomes/samQC/mm9.sizes

NOTE that the script will generate more supporting files for the first run.