SamQC

Revision as of 02:09, 12 October 2014

Overview

This script can be used to QC (Quality Control) bam files. By default, it downsamples the bam file and

(1) checks the distribution of the reads (exonic, intronic, UTR, intergenic, repetitive, etc.),checks rRNA by mapping to the rRNA fasta file;
(2) strand specificity
(2) checks the top 30 expressed genes and output the gene expressions;
(3) plot the read depths across transcripts (3' to 5' bias)

The script now uses MANY annotations, this complexity will be reduced in the future. For now, for simplicity, a configuration file can be used to load the locations of the annotations. Please see option --config. Precompiled configuration files are provided in the package I sent you.

Prepare to run the code

Please install samtools, htseq-count, bedtools (2.17.0 preferred), R and bwa (0.6.1 preferred) before running this script.

These programs here are required if you want to run the full pipeline, samtools and bedtools are mandatory for basic functions.

The options

The detailed help information:

perl /path/to_samQC/samQC.pl InputBAMFile [ bedfileName ] [--options]

Basic options:

Option	Description
InputBAMFile	bam format mapping result, mapped to the genome (not transcriptome). BWA, TopHat or OLego are recommended. Must be mapped to the same version of genome indicated by the configuration file.
bedfileName	transcriptome information, ensembl bed downloaded from UCSC.
--rmsk	This option is not mandatory. rmsk.txt file downloaded from UCSC genome browser table
--ensgtf	supply this file for gene expression qc, ens gtf downloaded from UCSC
--enstxt	supply this file for gene expression qc, ens txt downloaded from UCSC
--id2name	supply this for conversion between gene ids and gene names, from biomart
--plotdist	plot 5' to 3' read distribution
--toplistnum	the number of top expressed transcripts used for --plotdist [default: 200]
--rRNAqc	supply this fasta file for more accurate rRNA contamination report, sequences will be mapped to this file instead of using pre-aligned bam files.
--genomesize	provide this to compute intergeic/exon density ratio
--o2	Use this output prefix for geneexp and coveragePos instead of --out
--keeptmp	keep the tmp file, for debugging
--ensemblbam	bam input has ensembl chrom style
--downsample	down-sample to this number of reads [default: 1000000 ]
--out	Output prefix [default: samQCresult ]
--config	Use this config file to get options (bedfileName, --rmsk, --ensgtf, --enstxt, --id2name, these options will be ignored if they are specified in conf file)

Example output

 # total number of reads:        35550426
 # Downsampled to:       1000000
 # Mapped reads: 864117  86.412%
 # distn
 # CDS   UTR     UTR5    UTR3    Intron  flanking(+-3k)  intergenic
 335854  296241  71294   242919  122140  33174   76708
 38.867% 34.283% 8.251%  28.112% 14.135% 3.839%  8.877%
 # region size
 36243885        87461430        23543387        44487753        971740637       215515932
 # DNA   LINE    Low_complexity  LTR     Other   RC      RNA     rRNA    Satellite       scRNA   Simple_repeat    SINE    snRNA   srpRNA  tRNA    Unknown
 1026    5057    305     9327    28      0       7041    7017    464     16      1322    4348    3       1       4       529
 1.338%  6.593%  0.398%  12.159% 0.037%  0.000%  9.179%  9.148%  0.605%  0.021%  1.723%  5.668%  0.004%  0.001%  0.005%  0.690%
# Top30 expressed genes:
ENSMUSG00000052305      Hbb-bs  16239   1.630%
ENSMUSG00000073940      Hbb-bt  8767    0.880%
ENSMUSG00000006574      Slc4a1  4691    0.471%
ENSMUSG00000069919      Hba-a1  4171    0.419%
ENSMUSG00000035202      Lars2   3395    0.341%
.........
.........
#Total  Mapped  Mapped% CDS     UTR     UTR5    UTR3    Intron  flanking(+-3k)  intergenic      exon/int
ron     5/3utr  exon/intergenic sense/antisense rRNA%
35550426        864117  86.412  38.867% 34.283% 8.273%  28.517% 14.135% 3.839%  8.877%  58.472  0.550
131.129 0.008   1.438%

Summarizing script

After you run the scripts for different bam files, you can use Summarize.pl to summarize the results, it will generate figures for 3'/5' bias and correlation between gene expressions.

copy or link the .*geneexp.txt and *coveragePos.txt files to a folder

run:

perl ~/scripts/pipeline/samQC/Summarize.pl

in this folder

Read more if you are interested...

The config file

bedFileName: for option --bedfileName

ensembl BED annotation downloaded from ucsc table browser, this file defines the structure of the transcriptome. From this file, samqc checks the distribution of the reads in cds, utr, intron, flanking, intergenic. It doesn't have to be from ensembl annotation, refseq or ucsc genes also work.

rmskFileName: for option --rmsk

rmsk txt annotation downloaded from ucsc table browser, has to be the original TXT format. This file is not mandatory.

gtfFileName: for option --ensgtf

ensembl gtf annotation downloaded from ucsc table browser.

txtFileName: for option --enstxt

ensembl txt annotation downloaded from ucsc table browser

id2nameFileName: for option

ensembl id to gene name downloaded from http://www.ensembl.org/biomart/martview/

rRNAqc: for option --rRNAqc

a fasta file to be used to map the downsampled reads to find percentage of rRNA reads

genomesize: for option --genomesize

This is for exon/intergenic ratio estimation, use the following code for this:

   

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.size

An example is here:

bedFileName     /home/Genomes/samQC/mm9.ens.bed
rmskFileName    /home/Genomes/samQC/mm9.rmsk.txt
gtfFileName     /home/Genomes/samQC/mm9.ens.gtf
txtFileName     /home/Genomes/samQC/mm9.ens.txt
id2nameFileName /home/Genomes/samQC/mm9.ens_id.name.txt
rRNAqc          /home/Genomes/samQC/mm9.rRNA.fa
genomesize      /home/Genomes/samQC/mm9.sizes

NOTE that the script will generate more supporting files for the first run.

Search

Navigation