SamQC
From Jie Wu Home Page
(→A quick example) |
|||
(4 intermediate revisions by one user not shown) | |||
Line 13: | Line 13: | ||
Please install samtools, htseq-count, bedtools (2.17.0 preferred), R and bwa (0.6.1 preferred) before running this script. | Please install samtools, htseq-count, bedtools (2.17.0 preferred), R and bwa (0.6.1 preferred) before running this script. | ||
− | + | Samtools and bedtools are mandatory for basic functions. The others are for advanced functions. | |
+ | |||
+ | ==A quick protocol== | ||
+ | |||
+ | First, use '''BWA''' to map the reads to the genome, and get the BAM format output file. Then run the following command line: | ||
+ | |||
+ | <pre> | ||
+ | perl samQC.pl input.bam --config rn5.conf --out output --plotdist | ||
+ | </pre> | ||
+ | |||
+ | Please modify the paths in "rn5.conf" so that the program can locate the annotation files accordingly. | ||
==The options== | ==The options== | ||
Line 31: | Line 41: | ||
|- | |- | ||
|InputBAMFile | |InputBAMFile | ||
− | |bam format mapping result, | + | |bam format mapping result, please map the reads to the genome (not transcriptome) using BWA or OLego. The reads must be mapped to the same version of genome indicated by the configuration file. |
|- | |- | ||
|bedfileName | |bedfileName | ||
− | |transcriptome information, ensembl bed downloaded from UCSC. | + | |transcriptome information, ensembl bed format annotation downloaded from UCSC genome table browser. |
|- | |- | ||
| --rmsk | | --rmsk | ||
Line 67: | Line 77: | ||
|- | |- | ||
| --ensemblbam | | --ensemblbam | ||
− | |bam input has ensembl chrom style | + | |bam input has ensembl chrom style (testing) |
|- | |- | ||
| --downsample | | --downsample | ||
− | |down-sample to this number of reads [default: | + | |down-sample to this number of reads [default: 1,000,000 ] |
|- | |- | ||
| --out | | --out | ||
Line 118: | Line 128: | ||
perl ~/scripts/pipeline/samQC/Summarize.pl | perl ~/scripts/pipeline/samQC/Summarize.pl | ||
− | in this folder | + | Then you will see some figures generated in this folder (if there is no bug..) |
− | + | ||
− | + | ||
− | + | ||
==The config file== | ==The config file== |
Latest revision as of 00:38, 15 October 2014
Contents |
[edit] Overview
This script can be used to QC (Quality Control) bam files. By default, it downsamples the bam file and
- (1) checks the distribution of the reads (exonic, intronic, UTR, intergenic, repetitive, etc.),checks rRNA by mapping to the rRNA fasta file;
- (2) strand specificity
- (2) checks the top 30 expressed genes and output the gene expressions;
- (3) plot the read depths across transcripts (3' to 5' bias)
The script now uses MANY annotations, this complexity will be reduced in the future. For now, for simplicity, a configuration file can be used to load the locations of the annotations. Please see option --config. Precompiled configuration files are provided in the package I sent you.
[edit] Prepare to run the code
Please install samtools, htseq-count, bedtools (2.17.0 preferred), R and bwa (0.6.1 preferred) before running this script.
Samtools and bedtools are mandatory for basic functions. The others are for advanced functions.
[edit] A quick protocol
First, use BWA to map the reads to the genome, and get the BAM format output file. Then run the following command line:
perl samQC.pl input.bam --config rn5.conf --out output --plotdist
Please modify the paths in "rn5.conf" so that the program can locate the annotation files accordingly.
[edit] The options
The detailed help information:
perl /path/to_samQC/samQC.pl InputBAMFile [ bedfileName ] [--options]
Basic options:
Option | Description |
---|---|
InputBAMFile | bam format mapping result, please map the reads to the genome (not transcriptome) using BWA or OLego. The reads must be mapped to the same version of genome indicated by the configuration file. |
bedfileName | transcriptome information, ensembl bed format annotation downloaded from UCSC genome table browser. |
--rmsk | This option is not mandatory. rmsk.txt file downloaded from UCSC genome browser table |
--ensgtf | supply this file for gene expression qc, ens gtf downloaded from UCSC |
--enstxt | supply this file for gene expression qc, ens txt downloaded from UCSC |
--id2name | supply this for conversion between gene ids and gene names, from biomart |
--plotdist | plot 5' to 3' read distribution |
--toplistnum | the number of top expressed transcripts used for --plotdist [default: 200] |
--rRNAqc | supply this fasta file for more accurate rRNA contamination report, sequences will be mapped to this file instead of using pre-aligned bam files. |
--genomesize | provide this to compute intergeic/exon density ratio |
--o2 | Use this output prefix for geneexp and coveragePos instead of --out |
--keeptmp | keep the tmp file, for debugging |
--ensemblbam | bam input has ensembl chrom style (testing) |
--downsample | down-sample to this number of reads [default: 1,000,000 ] |
--out | Output prefix [default: samQCresult ] |
--config | Use this config file to get options (bedfileName, --rmsk, --ensgtf, --enstxt, --id2name, these options will be ignored if they are specified in conf file) |
[edit] Example output
# total number of reads: 35550426 # Downsampled to: 1000000 # Mapped reads: 864117 86.412% # distn # CDS UTR UTR5 UTR3 Intron flanking(+-3k) intergenic 335854 296241 71294 242919 122140 33174 76708 38.867% 34.283% 8.251% 28.112% 14.135% 3.839% 8.877% # region size 36243885 87461430 23543387 44487753 971740637 215515932 # DNA LINE Low_complexity LTR Other RC RNA rRNA Satellite scRNA Simple_repeat SINE snRNA srpRNA tRNA Unknown 1026 5057 305 9327 28 0 7041 7017 464 16 1322 4348 3 1 4 529 1.338% 6.593% 0.398% 12.159% 0.037% 0.000% 9.179% 9.148% 0.605% 0.021% 1.723% 5.668% 0.004% 0.001% 0.005% 0.690% # Top30 expressed genes: ENSMUSG00000052305 Hbb-bs 16239 1.630% ENSMUSG00000073940 Hbb-bt 8767 0.880% ENSMUSG00000006574 Slc4a1 4691 0.471% ENSMUSG00000069919 Hba-a1 4171 0.419% ENSMUSG00000035202 Lars2 3395 0.341% ......... ......... #Total Mapped Mapped% CDS UTR UTR5 UTR3 Intron flanking(+-3k) intergenic exon/int ron 5/3utr exon/intergenic sense/antisense rRNA% 35550426 864117 86.412 38.867% 34.283% 8.273% 28.517% 14.135% 3.839% 8.877% 58.472 0.550 131.129 0.008 1.438%
[edit] Summarizing script
After you run the scripts for different bam files, you can use Summarize.pl to summarize the results, it will generate figures for 3'/5' bias and correlation between gene expressions.
copy or link the .*geneexp.txt and *coveragePos.txt files to a folder
run:
perl ~/scripts/pipeline/samQC/Summarize.pl
Then you will see some figures generated in this folder (if there is no bug..)
[edit] The config file
- bedFileName: for option --bedfileName
ensembl BED annotation downloaded from ucsc table browser, this file defines the structure of the transcriptome. From this file, samqc checks the distribution of the reads in cds, utr, intron, flanking, intergenic. It doesn't have to be from ensembl annotation, refseq or ucsc genes also work.
- rmskFileName: for option --rmsk
rmsk txt annotation downloaded from ucsc table browser, has to be the original TXT format. This file is not mandatory.
- gtfFileName: for option --ensgtf
ensembl gtf annotation downloaded from ucsc table browser.
- txtFileName: for option --enstxt
ensembl txt annotation downloaded from ucsc table browser
- id2nameFileName: for option
ensembl id to gene name downloaded from http://www.ensembl.org/biomart/martview/
- rRNAqc: for option --rRNAqc
a fasta file to be used to map the downsampled reads to find percentage of rRNA reads
- genomesize: for option --genomesize
This is for exon/intergenic ratio estimation, use the following code for this:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \ "select chrom, size from hg19.chromInfo" > hg19.size
An example is here:
bedFileName /home/Genomes/samQC/mm9.ens.bed rmskFileName /home/Genomes/samQC/mm9.rmsk.txt gtfFileName /home/Genomes/samQC/mm9.ens.gtf txtFileName /home/Genomes/samQC/mm9.ens.txt id2nameFileName /home/Genomes/samQC/mm9.ens_id.name.txt rRNAqc /home/Genomes/samQC/mm9.rRNA.fa genomesize /home/Genomes/samQC/mm9.sizes
NOTE that the script will generate more supporting files for the first run.