SamQC

From Jie Wu Home Page

(Difference between revisions)
Jump to: navigation, search
Line 7: Line 7:
 
*(3) plot the read depths across transcripts (3' to 5' bias)
 
*(3) plot the read depths across transcripts (3' to 5' bias)
  
The script now uses MANY annotations, this complexity will be reduced in the future. For now, for simplicity, a configuration file can be used to load the locations of the annotations. Please see option --config. Precompiled configuration files are provided in the package I sent to you.
+
The script now uses MANY annotations, this complexity will be reduced in the future. For now, for simplicity, a configuration file can be used to load the locations of the annotations. Please see option --config. Precompiled configuration files are provided in the package I sent you.
 
+
==Quick Guide for BMC==
+
 
+
The following command will work for rous users
+
 
+
<pre>
+
 
+
module add r/2.15.3;
+
module add samtools;
+
module add python;
+
module add numpy;
+
module add bedtools/2.17.0;
+
perl ~jiewu/scripts/pipeline/samQC/samQC.pl filename.bam \
+
--config /home/Genomes/samQC/mm9.conf \
+
--out filename.samqc --plotdist --bmcmode 1>log 2>err
+
 
+
</pre>
+
 
+
The only thing you need to change is the species, replace mm9 with other species (hg19, ce10,rn5 and zv9 configuration files can be found in the same folder). You also need to change the "filename".
+
 
+
After you run the scripts for different bam files, you can use Summarize.pl to summarize the results, it will generate figures for 3'/5' bias and correlation between gene expressions.
+
 
+
copy or link the .*geneexp.txt and *coveragePos.txt files to a folder
+
 
+
run:
+
 
+
perl ~/scripts/pipeline/samQC/Summarize.pl
+
 
+
in this folder
+
 
+
'''Read more if you are interested...'''
+
  
 
==Prepare to run the code==
 
==Prepare to run the code==
  
Please load samtools, python, bedtools/2.17.0, numpy, r/2.15.3 and bwa/0.6.1 before running this script.
+
Please install samtools, htseq-count, bedtools (2.17.0 preferred), R and bwa (0.6.1 preferred) before running this script.
  
The programs here are required if you want to run the full pipeline, samtools and bedtools are mandatory.
+
These programs here are required if you want to run the full pipeline, samtools and bedtools are mandatory for basic functions.
  
 
==The options==
 
==The options==
 
  
 
The detailed help information:
 
The detailed help information:
  
 
<pre>
 
<pre>
perl /home/jiewu/scripts/pipeline/samQC/samQC.pl InputBAMFile [ bedfileName ] [--options]
+
perl /path/to_samQC/samQC.pl InputBAMFile [ bedfileName ] [--options]
 
</pre>
 
</pre>
  
Line 140: Line 108:
 
</pre>
 
</pre>
  
 +
==Summarizing script==
 +
 +
After you run the scripts for different bam files, you can use Summarize.pl to summarize the results, it will generate figures for 3'/5' bias and correlation between gene expressions.
 +
 +
copy or link the .*geneexp.txt and *coveragePos.txt files to a folder
 +
 +
run:
 +
 +
perl ~/scripts/pipeline/samQC/Summarize.pl
 +
 +
in this folder
 +
 +
'''Read more if you are interested...'''
  
  

Revision as of 02:09, 12 October 2014

Contents

Overview

This script can be used to QC (Quality Control) bam files. By default, it downsamples the bam file and

  • (1) checks the distribution of the reads (exonic, intronic, UTR, intergenic, repetitive, etc.),checks rRNA by mapping to the rRNA fasta file;
  • (2) strand specificity
  • (2) checks the top 30 expressed genes and output the gene expressions;
  • (3) plot the read depths across transcripts (3' to 5' bias)

The script now uses MANY annotations, this complexity will be reduced in the future. For now, for simplicity, a configuration file can be used to load the locations of the annotations. Please see option --config. Precompiled configuration files are provided in the package I sent you.

Prepare to run the code

Please install samtools, htseq-count, bedtools (2.17.0 preferred), R and bwa (0.6.1 preferred) before running this script.

These programs here are required if you want to run the full pipeline, samtools and bedtools are mandatory for basic functions.

The options

The detailed help information:

perl /path/to_samQC/samQC.pl InputBAMFile [ bedfileName ] [--options]

Basic options:

Option Description
InputBAMFile bam format mapping result, mapped to the genome (not transcriptome). BWA, TopHat or OLego are recommended. Must be mapped to the same version of genome indicated by the configuration file.
bedfileName transcriptome information, ensembl bed downloaded from UCSC.
--rmsk This option is not mandatory. rmsk.txt file downloaded from UCSC genome browser table
--ensgtf supply this file for gene expression qc, ens gtf downloaded from UCSC
--enstxt supply this file for gene expression qc, ens txt downloaded from UCSC
--id2name supply this for conversion between gene ids and gene names, from biomart
--plotdist plot 5' to 3' read distribution
--toplistnum the number of top expressed transcripts used for --plotdist [default: 200]
--rRNAqc supply this fasta file for more accurate rRNA contamination report, sequences will be mapped to this file instead of using pre-aligned bam files.
--genomesize provide this to compute intergeic/exon density ratio
--o2 Use this output prefix for geneexp and coveragePos instead of --out
--keeptmp keep the tmp file, for debugging
--ensemblbam bam input has ensembl chrom style
--downsample down-sample to this number of reads [default: 1000000 ]
--out Output prefix [default: samQCresult ]
--config Use this config file to get options (bedfileName, --rmsk, --ensgtf, --enstxt, --id2name, these options will be ignored if they are specified in conf file)

Example output

 # total number of reads:        35550426
 # Downsampled to:       1000000
 # Mapped reads: 864117  86.412%
 # distn
 # CDS   UTR     UTR5    UTR3    Intron  flanking(+-3k)  intergenic
 335854  296241  71294   242919  122140  33174   76708
 38.867% 34.283% 8.251%  28.112% 14.135% 3.839%  8.877%
 # region size
 36243885        87461430        23543387        44487753        971740637       215515932
 # DNA   LINE    Low_complexity  LTR     Other   RC      RNA     rRNA    Satellite       scRNA   Simple_repeat    SINE    snRNA   srpRNA  tRNA    Unknown
 1026    5057    305     9327    28      0       7041    7017    464     16      1322    4348    3       1       4       529
 1.338%  6.593%  0.398%  12.159% 0.037%  0.000%  9.179%  9.148%  0.605%  0.021%  1.723%  5.668%  0.004%  0.001%  0.005%  0.690%
# Top30 expressed genes:
ENSMUSG00000052305      Hbb-bs  16239   1.630%
ENSMUSG00000073940      Hbb-bt  8767    0.880%
ENSMUSG00000006574      Slc4a1  4691    0.471%
ENSMUSG00000069919      Hba-a1  4171    0.419%
ENSMUSG00000035202      Lars2   3395    0.341%
.........
.........
#Total  Mapped  Mapped% CDS     UTR     UTR5    UTR3    Intron  flanking(+-3k)  intergenic      exon/int
ron     5/3utr  exon/intergenic sense/antisense rRNA%
35550426        864117  86.412  38.867% 34.283% 8.273%  28.517% 14.135% 3.839%  8.877%  58.472  0.550
131.129 0.008   1.438%

Summarizing script

After you run the scripts for different bam files, you can use Summarize.pl to summarize the results, it will generate figures for 3'/5' bias and correlation between gene expressions.

copy or link the .*geneexp.txt and *coveragePos.txt files to a folder

run:

perl ~/scripts/pipeline/samQC/Summarize.pl 

in this folder

Read more if you are interested...


The config file

  • bedFileName: for option --bedfileName

ensembl BED annotation downloaded from ucsc table browser, this file defines the structure of the transcriptome. From this file, samqc checks the distribution of the reads in cds, utr, intron, flanking, intergenic. It doesn't have to be from ensembl annotation, refseq or ucsc genes also work.

  • rmskFileName: for option --rmsk

rmsk txt annotation downloaded from ucsc table browser, has to be the original TXT format. This file is not mandatory.

  • gtfFileName: for option --ensgtf

ensembl gtf annotation downloaded from ucsc table browser.

  • txtFileName: for option --enstxt

ensembl txt annotation downloaded from ucsc table browser

  • id2nameFileName: for option

ensembl id to gene name downloaded from http://www.ensembl.org/biomart/martview/

  • rRNAqc: for option --rRNAqc

a fasta file to be used to map the downsampled reads to find percentage of rRNA reads

  • genomesize: for option --genomesize

This is for exon/intergenic ratio estimation, use the following code for this:

   

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.size 

An example is here:

bedFileName     /home/Genomes/samQC/mm9.ens.bed
rmskFileName    /home/Genomes/samQC/mm9.rmsk.txt
gtfFileName     /home/Genomes/samQC/mm9.ens.gtf
txtFileName     /home/Genomes/samQC/mm9.ens.txt
id2nameFileName /home/Genomes/samQC/mm9.ens_id.name.txt
rRNAqc          /home/Genomes/samQC/mm9.rRNA.fa
genomesize      /home/Genomes/samQC/mm9.sizes

NOTE that the script will generate more supporting files for the first run.