测序数据的深度、覆盖度等计算

测序数据的深度、覆盖度等计算

Coverage Depth

coverage_depth

覆盖深度 mapping depth

基因组被测序片段(短读 short reads)“覆盖”的强度有多大?

每一碱基的覆盖率是基因组碱基被测序的平均次数。 基因组的覆盖深度是通过与基因组匹配的所有短读的碱基数目除以该基因组的长度来计算的。 它通常表示为1X、2X、3X、...(1、2或3倍覆盖)。

此处通常被称为测序深度(sequencing depth)或者覆盖深度(depth of coverage)。

Depth 计算公式

原始测序数据的深度,Depth = (# sequenced bases) / (# bases of reference) 或者

比对上的数据的深度,Depth= (# bases of all mapped reads) / (# bases of reference).

覆盖范围 covered length

短读“覆盖”了基因组的多少区域?是否有些区域没有任何的短读覆盖?

覆盖范围是参考基因组被一定深度覆盖的碱基的百分比。例如,一个基因组的90%区域在1X深度覆盖,另外仍然有70%的区域被覆盖了5X深度。

此处通常称为覆盖度(genome covarage)或者覆盖范围(breadth of coverage)。

Coverage 计算公式

Coverage = (# area covered by mapped reads) / (# area of reference).

例子

example

对于全基因组

Depth = (6 * 28nt) / 112nt = 1.2 fold

Coverage = (46nt - 5nt) / 112nt = 36.6%

对于target区域

Depth = (6 * 28nt) / 46nt = 3.7 fold

Coverage = (46nt - 5nt) / 46nt = 89.1%

对于position

Depth = 6 fold

reads数

下机的fq文件中到底有多少条reads?可以使用软件readfq来统计:

git clone https://github.com/billzt/readfq

gcc -o kseq_fastq_base kseq_fastq_base.c -lz

kseq_fastq_base samp.fq

#FASTQ files could be gzipped.

#The statistic results would be printed on STDOUT.

功能单一,优化了速度:

It could deal with ~4M reads (1G base) in less than 40 seconds, ~50M reads (14G base) in less than 5 minutes!!!

现在已有很多工具都会提供这个功能,间接或者直接计算测序深度和覆盖度,比如:samtools、sambamba、bedtools、qualimap、gatk等等,下面对各个工具的使用和结果做一点点汇整。

计算测序深度和覆盖度

samtools or sambamba

depth可以得到每一个碱基位点上的测序深度:

samtools depth samp.bam > base_depth.txt

sambamba depth -t 10 -w 500 samp.bam > 500base_depth.txt

# -w breadth of the window, in bp, 计算窗口的平均深度

stats对比对后的bam文件进行全面的统计

samtools stats --threads 10 samp.bam > samp_bamstat.txt

grep "^SN" samp_bamstat.txt

SN raw total sequences: 17870982

SN filtered sequences: 0

SN sequences: 17870982

SN is sorted: 0

SN 1st fragments: 8935491

SN last fragments: 8935491

SN reads mapped: 17870982

SN reads mapped and paired: 17870982 # paired-end technology bit set + both mates mapped

SN reads unmapped: 0

SN reads properly paired: 5208824 # proper-pair bit set

SN reads paired: 17870982 # paired-end technology bit set

SN reads duplicated: 0 # PCR or optical duplicate bit set

SN reads MQ0: 556465 # mapped and MQ=0

SN reads QC failed: 0

SN non-primary alignments: 0

SN total length: 1804969182 # ignores clipping

SN bases mapped: 1804969182 # ignores clipping

SN bases mapped (cigar): 1804969182 # more accurate

SN bases trimmed: 0

SN bases duplicated: 0

SN mismatches: 9941932 # from NM fields

SN error rate: 5.508089e-03 # mismatches / bases mapped (cigar)

SN average length: 101

SN maximum length: 101

SN average quality: 35.3

SN insert size average: 1330.1

SN insert size standard deviation: 1341.6

SN inward oriented pairs: 1418095

SN outward oriented pairs: 564408

SN pairs with other orientation: 6745980

SN pairs on different chromosomes: 207008

接下来可以使用plot-bamstats进行结果的可视化,plot-bamstats是samtools中包含的一个软件,可以在samtools的根目录找到,查看使用:

plot-bamstats --help

About: Parses output of samtools stats (former bamcheck) and calls gnuplot to create graphs.

Usage: plot-bamstats [OPTIONS] file.bam.bc

plot-bamstats -p outdir/ file.bam.bc

Options:

-k, --keep-files Do not remove temporary files.

-l, --log-y Set the Y axis scale of the Insert Size plot to log 10.

-m, --merge Merge multiple bamstats files and output to STDOUT.

-p, --prefix The output files prefix, add a slash to create new directory.

-r, --ref-stats Optional reference stats file with expected GC content (created with -s).

-s, --do-ref-stats Calculate reference sequence GC for later use with -r

-t, --targets Restrict -s to the listed regions (tab-delimited chr,from,to. 1-based, inclusive)

-h, -?, --help This help message.

例子如下:

plot-bamstats -s /home/luna/Desktop/database/homo_bwa/hsa.fa > gc

plot-bamstats -r gc -p ./test samp_bamstat.txt

先计算基因组的gc含量分布,再给出一系列文件和图片,还有一个html文件:

ll test* gc

-rw-rw-r-- 1 luna luna 1910 Oct 14 16:04 gc

-rw-rw-r-- 1 luna luna 4266 Oct 14 16:04 test-acgt-cycles.gp

-rw-rw-r-- 1 luna luna 16707 Oct 14 16:04 test-acgt-cycles.png

-rw-rw-r-- 1 luna luna 429 Oct 14 16:04 test-coverage.gp

-rw-rw-r-- 1 luna luna 10497 Oct 14 16:04 test-coverage.png

-rw-rw-r-- 1 luna luna 4736 Oct 14 16:04 test-gc-content.gp

-rw-rw-r-- 1 luna luna 21895 Oct 14 16:04 test-gc-content.png

-rw-rw-r-- 1 luna luna 1026 Oct 14 16:04 test-gc-depth.gp

-rw-rw-r-- 1 luna luna 23312 Oct 14 16:04 test-gc-depth.png

-rw-rw-r-- 1 luna luna 4606 Oct 14 16:04 test.html

-rw-rw-r-- 1 luna luna 3528 Oct 14 16:04 test-indel-cycles.gp

-rw-rw-r-- 1 luna luna 28786 Oct 14 16:04 test-indel-cycles.png

-rw-rw-r-- 1 luna luna 1739 Oct 14 16:04 test-indel-dist.gp

-rw-rw-r-- 1 luna luna 29595 Oct 14 16:04 test-indel-dist.png

-rw-rw-r-- 1 luna luna 261710 Oct 14 16:04 test-insert-size.gp

-rw-rw-r-- 1 luna luna 16580 Oct 14 16:04 test-insert-size.png

-rw-rw-r-- 1 luna luna 5939 Oct 14 16:04 test-quals2.gp

-rw-rw-r-- 1 luna luna 21727 Oct 14 16:04 test-quals2.png

-rw-rw-r-- 1 luna luna 76931 Oct 14 16:04 test-quals3.gp

-rw-rw-r-- 1 luna luna 84648 Oct 14 16:04 test-quals3.png

-rw-rw-r-- 1 luna luna 2228 Oct 14 16:04 test-quals.gp

-rw-rw-r-- 1 luna luna 47463 Oct 14 16:04 test-quals-hm.gp

-rw-rw-r-- 1 luna luna 34730 Oct 14 16:04 test-quals-hm.png

-rw-rw-r-- 1 luna luna 15932 Oct 14 16:04 test-quals.png

可以打开html文件展示如下:

test

点击小图,也会在旁边展示大图,可以更清晰查看。

BAMStats

BAMStats是建立在Picard Java API上的简单软件工具,可以计算并以图形方式显示从SAM / BAM文件进行质量控制评估时得到的各种指标。

BAMStats为覆盖范围,起始位置,MAPQ值,映射的读取长度和编辑距离提供描述性统计信息。 如果提供了合适的bed或gtf文件,还可以为单个特征(例如外显子或诱饵区域)生成统计指标。

BAMStats需要分配2至6GB的内存(Java程序可以使用-Xmx来设置);另外SAM / BAM文件都可以作为输入,但是必须是排序后的文件。

下载

wget https://jaist.dl.sourceforge.net/project/bamstats/BAMStats-1.25.zip

unzip BAMStats-1.25.zip

rm BAMStats-1.25.zip

cd BAMStats-1.25 && ls

## 两个jar分别是命令行(CLI)和图形用户(GUI)

java -jar BAMStats-1.25/BAMStats-1.25.jar --help

USAGE:

======

Option Description

------ -----------

-d, --distances If selected, edit distance statistics

will also be displayed as a separate

table (optional).

-f, --features Feature file specifying specific

locations to calculate coverage

metrics for (optional)

-h, --help Generate help.

-i, --infile SAM or BAM infile (must be sorted).

-l, --lengths If selected, mapped read length

statistics will also be displayed as

a separate table (optional).

-m, --mapped If selected, coverage statistics, for

mapped regions only, will also be

displayed as a separate table

(optional).

-o, --outfile Outfile (optional).

-q, --qualities If selected, mapping quality (MAPQ)

statistics will also be displayed as

a separate table (optional).

-s, --starts If selected, start position statistics

will also be displayed as a separate

table (optional).

-v, --view View option for output format

(currently accepts 'simple' or

'html'; default, simple).

使用

sambamba sort -o smap.sort.bam -t 20 -p samp.bam

java -jar -Xmx30g BAMStats-1.25.jar -i samp.sort.bam -o BAMStats --view simple

cat BAMStats

ref

N

mean

median

sd

q1

q3

2.50%

97.50%

min

max

chr1

249,250,621

0.63

0

11.29

0

1

0

3

0

15051

chr2

243,199,373

0.63

0

1.43

0

1

0

3

0

706

chr3

198,022,430

0.64

0

1

0

1

0

3

0

102

chr4

191,154,276

0.58

0

1.13

0

1

0

3

0

527

...

chr22

51,304,566

0.36

0

0.91

0

0

0

3

0

207

chrX

155,270,560

0.59

0

1.29

0

1

0

3

0

832

chrY

59,373,566

0.07

0

4.6

0

0

0

0

0

1859

chrMT

16,571

26.14

22

15.47

16

32

7

69.7

0

98

改为html输出,可以看看:

java -jar -Xmx30g BAMStats-1.25.jar -d -l -m -q -s -i samp.sort.bam -o BAMStats2 --view html

会产生一个文件夹BAMStats2.data和一个html文件BAMStats2,进入查看:

cd BAMStats2.data && ls | wc -l

# 531

ll chr1_*

-rw-rw-r-- 1 luna luna 5056 Oct 14 16:43 chr1_Coverage_boxAndWhisker.png

-rw-rw-r-- 1 luna luna 11480 Oct 14 16:43 chr1_Coverage_cumulativeHistogram.png

-rw-rw-r-- 1 luna luna 12513 Oct 14 16:43 chr1_Coverage_histogram.png

-rw-rw-r-- 1 luna luna 753 Oct 14 16:43 chr1_Coverage.html

-rw-rw-r-- 1 luna luna 4528 Oct 14 16:44 chr1_EditDistances_boxAndWhisker.png

-rw-rw-r-- 1 luna luna 10601 Oct 14 16:44 chr1_EditDistances_cumulativeHistogram.png

-rw-rw-r-- 1 luna luna 11156 Oct 14 16:44 chr1_EditDistances_histogram.png

-rw-rw-r-- 1 luna luna 770 Oct 14 16:44 chr1_EditDistances.html

-rw-rw-r-- 1 luna luna 5852 Oct 14 16:43 chr1_MappedCoverage_boxAndWhisker.png

-rw-rw-r-- 1 luna luna 12311 Oct 14 16:43 chr1_MappedCoverage_cumulativeHistogram.png

-rw-rw-r-- 1 luna luna 12688 Oct 14 16:43 chr1_MappedCoverage_histogram.png

-rw-rw-r-- 1 luna luna 784 Oct 14 16:43 chr1_MappedCoverage.html

-rw-rw-r-- 1 luna luna 5107 Oct 14 16:44 chr1_MappingQuality_boxAndWhisker.png

-rw-rw-r-- 1 luna luna 11297 Oct 14 16:44 chr1_MappingQuality_cumulativeHistogram.png

-rw-rw-r-- 1 luna luna 11698 Oct 14 16:44 chr1_MappingQuality_histogram.png

-rw-rw-r-- 1 luna luna 771 Oct 14 16:44 chr1_MappingQuality.html

-rw-rw-r-- 1 luna luna 4979 Oct 14 16:44 chr1_ReadLengths_boxAndWhisker.png

-rw-rw-r-- 1 luna luna 10979 Oct 14 16:44 chr1_ReadLengths_cumulativeHistogram.png

-rw-rw-r-- 1 luna luna 10979 Oct 14 16:44 chr1_ReadLengths_histogram.png

-rw-rw-r-- 1 luna luna 773 Oct 14 16:44 chr1_ReadLengths.html

-rw-rw-r-- 1 luna luna 463 Oct 14 16:43 chr1_StartPositions.html

查看html文件BAMStats2,重命名,同时下载html和文件夹BAMStats2.data到本地,放在同一个目录下,以后使用浏览器打开:

mv BAMStats2 BAMStats2.html

tar -cf Stats2.tar BAMStats2*

sz Stats2.tar

# 本地解压

打开BAMStats2.html,部分结果展示:

test2

也可以到文件夹BAMStats2.data中查看每一条染色体:

chr1raw

chr1mapped

GUI界面也可以看下图:

GUI

GATK DepthOfCoverage

gatk3和gatk4中都有这个工具,DepthOfCoverage,本处使用gatk 4.1.8.1:

which gatk

/home/luna/Desktop/Software/gatk/gatk-4.1.8.1/gatk

# gatk4

gatk DepthOfCoverage -R /home/luna/Desktop/database/homo_bwa/hg19.fa -I samp.sort.bam -O gatk -L chr1 --summary-coverage-threshold 1 --summary-coverage-threshold 4

# gatk3

java -jar /home/luna/Desktop/Software/gatk/GenomeAnalysisTK.jar -T DepthOfCoverage -I samp.sort.bam -R /home/luna/Desktop/database/homo_bwa/hg19.fa --out gatk3 --summaryCoverageThreshold 1 --summaryCoverageThreshold 4 -L chr1

# -L 后面可以接单个染色体的名称,也可以多次使用,或者bed文件包含基因组的多个区域作为限定

# summary Coverage Threshold,可以随意设置数值(N),表示%_above_N,即深度超过N的coverage of breadth.

运行结束会生成多个文件,主要是查看一下summary文件:

no suffix: per locus coverage

_summary: total, mean, median, quartiles, and threshold proportions, aggregated over all bases

_statistics: coverage histograms (# locus with X coverage), aggregated over all bases

_interval_summary: total, mean, median, quartiles, and threshold proportions, aggregated per interval

_interval_statistics: 2x2 table of # of intervals covered to >= X depth in >=Y samples

_gene_summary: total, mean, median, quartiles, and threshold proportions, aggregated per gene

_gene_statistics: 2x2 table of # of genes covered to >= X depth in >= Y samples

_cumulative_coverage_counts: coverage histograms (# locus with >= X coverage), aggregated over all bases

_cumulative_coverage_proportions: proprotions of loci with >= X coverage, aggregated over all bases

## 查看

cat gatk.sample_summary

sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_1,%_bases_above_4

SRX247249,155122543,0.69,2,1,1,41.8,2.2

Total,155122543,0.69,N/A,N/A,N/A,N/A,N/A

BamStats

此处有多个bamstats的程序,所以作为区分,列出来。

Biopet BamStats

Biopet BamStats is a package that contains tools to generate stats from a BAM file, merge those stats for multiple samples, and validate the generated stats files.

Mode - Generate

Generate reports clipping stats, flag stats, insert size and mapping quality on a BAM file.

The output of the JSON file is organized in a sample - library - readgroup tree structure. If readgroups in the BAM file are not annotated with sample (SM) and library (LB) tags an error will be thrown. This can be fixed by using samtools addreplacerg or picard AddOrReplaceReadGroups.

Mode - Merge

This module will merge bamstats files together and keep the sample/library/readgroup structure. Values for the same readgroups will be added. It will also validate the resulting file.

Mode - Validate

Validates a BamStats file. If aggregation values can not be regenerated the file is considered corrupt. This should only happen when the file has been manually edited.

website: https://biopet.github.io/bamstats/1.0.1/index.html

wget https://github.com/biopet/bamstats/releases/download/v1.0.1/BamStats-assembly-1.0.1.jar

mv BamStats-assembly-1.0.1.jar BamStats-biopet.jar

mkdir biopet

java -jar /home/luna/Desktop/Software/StatsBAM/BamStats-biopet.jar generate -R /home/luna/Desktop/database/homo_bwa/hg19.fa -o biopet -b samp.sort.bam --tsvOutputs

# 会在biopet生成文件

3prime_clipping.histogram.tsv

5prime_clipping.stats.tsv

clipping.histogram.tsv

insertsize.histogram.tsv

left_clipping.stats.tsv

right_clipping.histogram.tsv

3prime_clipping.stats.tsv

bamstats.json

clipping.stats.tsv

insertsize.stats.tsv

mappingQuality.histogram.tsv

right_clipping.stats.tsv

5prime_clipping.histogram.tsv

bamstats.summary.json

flagstats.tsv

left_clipping.histogram.tsv

mappingQuality.stats.tsv

结果很详细,但是对于json文件我没有找到合适的后续处理,这里仅记录。

guigolab bamstats

wget -O - https://github.com/guigolab/bamstats/releases/download/v0.3.5/bamstats-v0.3.5-linux-x86_64.tar.gz | tar xz -C ./ bamstats

/home/luna/Desktop/Software/StatsBAM/bamstats/bamstats -h

bamstats - compute mapping statistics

Usage:

bamstats [flags]

Flags:

-a, --annotaion string element annotation file

-c, --cpu int number of cpus to be used (default 80)

-h, --help help for bamstats

-i, --input string input file (required)

--loglevel string logging level (default "warn")

--max-buf int maximum number of buffered records (default 1000000)

-o, --output string output file (default "-")

-n, --reads int number of records to process (default -1)

-u, --uniq output genomic coverage statistics for uniqely mapped reads too

--version version for bamstats

Bamstats can currently compute the following mapping statistics:

general

genome coverage

RNA-seq

使用,具体上github查看,https://github.com/guigolab/bamstats/blob/master/scripts/

/home/luna/Desktop/Software/StatsBAM/bamstats/bamstats -i samp.sort.bam --cpu 20 --uniq --output guigolab.bamstats

bamdst

Bamdst is a lightweight tool to stat the depth coverage of target regions of bam file(s).

Bam file(s) should be properly sorted, and the probe file (bed file) and the output dir must be specified in the first place.

下载安装

git clone https://github.com/shiquan/bamdst.git

cd bamdst/

make

使用

在使用该软件之前,我们首先制作一个bed文件:

cat chr1.region.bed

chr1 64909787 160398769

## PS中间空白是TAB,不是空格

查看使用说明

/home/luna/Desktop/Software/StatsBAM/bamdst/bamdst -h

bamdst version: 1.0.9

USAGE : bamdst [OPTION] -p -o [in1.bam [in2.bam ... ]]

or : bamdst [OPTION] -p -o -

Option -o and -p are mandatory:

-o, --outdir output dir

-p, --bed probe or target regions file, the region file will

be merged before calculate depths

测试

mkdir ./bamdst

/home/luna/Desktop/Software/StatsBAM/bamdst/bamdst -p chr1.region.bed -o ./bamdst samp.sort.bam

cd ./bamdst

ls

# 列出生成的文件

-coverage.report This file contains all the coverage information of target and flank region, and reads stat information of the input file(s).

-cumu.plot Depth distrbution for plot.

-insert.plot Inferred insert size distribution for plot.

-chromosome.report Depth and coverage information of each chromosome.

-region.tsv.gz For each region in probe file (in.bed), average depth, median depth and coverage of these regions will be listed in the file.

-depth.tsv.gz For each position in the probe file(in.bed), three kinds depth value will be calculated, including raw depth, rmdup depth and the coverage depth.

-uncover.bed This bed file contains the bad covered or uncovered region in the input bam file(s) against the probe file.

查看coverage.report 和 chromosome.report

cat coverage.report

## The file was created by bamdst

## Version : 1.0.9

## Files : samp.sort.bam

[Total] Raw Reads (All reads) 17870982

[Total] QC Fail reads 0

[Total] Raw Data(Mb) 1804.97

[Total] Paired Reads 17870982

[Total] Mapped Reads 17870982

[Total] Fraction of Mapped Reads 100.00%

[Total] Mapped Data(Mb) 1804.97

[Total] Fraction of Mapped Data(Mb) 100.00%

[Total] Properly paired 5208824

[Total] Fraction of Properly paired 29.15%

[Total] Read and mate paired 17870982

[Total] Fraction of Read and mate paired 100.00%

[Total] Singletons 0

[Total] Read and mate map to diff chr 414016

[Total] Read1 8935491

[Total] Read2 8935491

[Total] Read1(rmdup) 8935491

[Total] Read2(rmdup) 8935491

[Total] forward strand reads 8940264

[Total] backward strand reads 8930718

[Total] PCR duplicate reads 0

[Total] Fraction of PCR duplicate reads 0.00%

[Total] Map quality cutoff value 20

[Total] MapQuality above cutoff reads 17045397

[Total] Fraction of MapQ reads in all reads 95.38%

[Total] Fraction of MapQ reads in mapped reads 95.38%

[Target] Target Reads 548096

[Target] Fraction of Target Reads in all reads 3.07%

[Target] Fraction of Target Reads in mapped reads 3.07%

[Target] Target Data(Mb) 55.36

[Target] Target Data Rmdup(Mb) 48.10

[Target] Fraction of Target Data in all data 3.07%

[Target] Fraction of Target Data in mapped data 3.07%

[Target] Len of region 95488982

[Target] Average depth 0.58

[Target] Average depth(rmdup) 0.50

[Target] Coverage (>0x) 32.72%

[Target] Coverage (>=4x) 1.87%

[Target] Coverage (>=10x) 0.09%

[Target] Coverage (>=30x) 0.01%

[Target] Coverage (>=100x) 0.00%

[Target] Target Region Count 1

[Target] Region covered > 0x 0

[Target] Fraction Region covered > 0x 0.00%

[Target] Fraction Region covered >= 4x 0.00%

[Target] Fraction Region covered >= 10x 0.00%

[Target] Fraction Region covered >= 30x 0.00%

[Target] Fraction Region covered >= 100x 0.00%

[flank] flank size 200

[flank] Len of region (not include target region) 200

[flank] Average depth 1.09

[flank] flank Reads 4

[flank] Fraction of flank Reads in all reads 0.00%

[flank] Fraction of flank Reads in mapped reads 0.00%

[flank] flank Data(Mb) 0.00

[flank] Fraction of flank Data in all data 0.00%

[flank] Fraction of flank Data in mapped data 0.00%

[flank] Coverage (>0x) 55.50%

[flank] Coverage (>=4x) 2.00%

[flank] Coverage (>=10x) 0.00%

[flank] Coverage (>=30x) 0.00%

[flank] Coverage (>=100x) 0.00%

cat chromosomes.report

#Chromosome

DATA(%)

Avg depth

Median

Coverage%

Cov 4x %

Cov 10x %

Cov 30x %

Cov 100x %

chr1

100

0.58

0

32.72

1.87

0.09

0.01

0

Qualimap

Qualimap是功能比较全的一款质控软件,提供GUI界面和命令行界面,可以对bam文件,RNA-seq,Counts数据质控,也支持比对数据,counts数据和表观数据的比较和聚类。

Qualimap:http://qualimap.bioinfo.cipf.es/doc_html/index.html

命令行文档:http://qualimap.bioinfo.cipf.es/doc_html/command_line.html#command-line

下载安装

wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip

unzip qualimap_v2.2.1.zip

正常使用qualimap,需要安装部分R包:

# 安装相关的R包

# NOISeq,Repitools, Rsamtools, GenomicFeatures, rtracklayer

if (!requireNamespace("BiocManager", quietly = TRUE))

install.packages("BiocManager")

BiocManager::install("NOISeq", version = "3.8")

BiocManager::install("Repitools", version = "3.8")

BiocManager::install("Rsamtools", version = "3.8")

BiocManager::install("GenomicFeatures", version = "3.8")

BiocManager::install("rtracklayer", version = "3.8")

使用

/home/luna/Desktop/Software/qualimap/qualimap bamqc --java-mem-size=40G -c -nt 30 -outdir ./ -outformat PDF -os -outfile samp.bamqc.pdf -bam samp.sort.bam -gd hg19 -nr 2000

mosdepth

Mosdepth是一种新的命令行工具,可以快速计算全基因组测序覆盖率。输入BAM或CRAM文件,计算在基因组中每个核苷酸位置或基因组区域的深度;也可将基因组区域指定为被BED文件覆盖的指定区域,或者指定为CNV calling所需的固定大小窗口。测试来看,Mosdepth比大多数现有工具更快。

下载安装

## github上有编译好的最新版的二进制文件,可以下载使用

wget https://github.com/brentp/mosdepth/releases/download/v0.3.1/mosdepth

chmod +x mosdepth

./mosdepth --help

mosdepth 0.3.1

Usage: mosdepth [options]

Arguments:

outputs: `{prefix}.mosdepth.dist.txt`

`{prefix}.mosdepth.summary.txt`

`{prefix}.per-base.bed.gz` (unless -n/--no-per-base is specified)

`{prefix}.regions.bed.gz` (if --by is specified)

`{prefix}.quantized.bed.gz` (if --quantize is specified)

`{prefix}.thresholds.bed.gz` (if --thresholds is specified)

the alignment file for which to calculate depth.

Common Options:

-t --threads number of BAM decompression threads [default: 0]

-c --chrom chromosome to restrict depth calculation.

-b --by optional BED file or (integer) window-sizes.

-n --no-per-base dont output per-base depth. skipping this output will speed execution

substantially. prefer quantized or thresholded values if possible.

-f --fasta fasta file for use with CRAM files [default: ].

--d4 output per-base depth in d4 format.

Other options:

-F --flag exclude reads with any of the bits in FLAG set [default: 1796]

-i --include-flag only include reads with any of the bits in FLAG set. default is unset. [default: 0]

-x --fast-mode dont look at internal cigar operations or correct mate overlaps (recommended for most use-cases).

-q --quantize write quantized output see docs for description.

-Q --mapq mapping quality threshold. reads with a quality less than this value are ignored [default: 0]

-T --thresholds for each interval in --by, write number of bases covered by at

least threshold bases. Specify multiple integer values separated

by ','.

-m --use-median output median of each region (in --by) instead of mean.

-R --read-groups only calculate depth for these comma-separated read groups IDs.

-h --help show help

参数:

-t 线程数,推荐4个,After about 4 threads, there is no benefit for additional threads

-c 指定染色体去计算深度

-b 指定基因组区域,比如基因的位置,就能计算基因的覆盖度

-F 排除含有该FLAG的reads。默认是1796,代表着:

read unmapped (0x4)

not primary alignment (0x100)

read fails platform/vendor quality checks (0x200)

read is PCR or optical duplicate (0x400)

-Q mapping quality,默认是0,一般可以按照自己的需要进行调整为5或者10

-x 快速模式

-T 按照阀值(比如1X,2X,10X)输出达到指定阀值的核苷酸数目

使用

# To calculate the coverage in each exome capture region

mosdepth --by capture.bed sample-output sample.exome.bam

# coverage thresholds

mosdepth --by capture.bed --thresholds 1,3,6,10 sample-output sample.exome.bam

# WGS example For 500-base windows

mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram

indexcov

indexcov是goleft内的一个小工具,goleft is a collection of bioinformatics tools distributed under MIT license in a single static binary。

注意到这个软件,是因为mosdepth和indexcov的作图,可以上github上查看,https://github.com/brentp/goleft/tree/master/indexcov

Quickly estimate coverage from a whole-genome bam or cram index. A bam index has 16KB resolution so that's what this gives, but it provides what appears to be a high-quality coverage estimate in seconds per genome.

The output is scaled to around 1. So a long stretch with values of 1.5 would be a heterozygous duplication. This is useful as a quick QC to get coverage values across the genome.

In our tests, we can estimate depth across 60X genomes for 30 samples in 30 seconds.

测序深度和测序覆盖度,很多时候对于不同的研究有着不同的要求,下面给出网上的一个表格:

Coverage depth recommendations, for human ~3Gb

Category

Detection or Application

Recommended Depth (x) or Reads (millions)

References

Whole genome sequencing

Homozygous SNVs

15x

Bentley et al., 2008

Heterozygous SNVs

33x

Bentley et al., 2008

INDELs

60x

Feng et al., 2014

Genotype calls

35x

Ajay et al., 2011

CNV

1-8x

Xie et al., 2009; Medvedev at al., 2010

Whole exome sequencing

Homozygous SNVs

100x (3x local depth)

Clark et al., 2011; Meynert et al., 2013

Heterozygous SNVs

100x (13x local depth)

Clark et al., 2011; Meynert et al., 2013

INDELs

not recommended

Feng et al., 2014

Transcriptome Sequencing

Differential expression profiling

10-25M

Liu Y. et al., 2014; ENCODE 2011 RNA-Seq

Alternative splicing

50-100M

Liu Y. et al., 2013; ENCODE 2011 RNA-Seq

Allele specific expression

50-100M

Liu Y. et al., 2013; ENCODE 2011 RNA-Seq

De novo assembly

>100M

Liu Y. et al., 2013; ENCODE 2011 RNA-Seq

DNA Target-Based Sequencing

ChIP-Seq

10-14M (sharp peaks); 20-40M (broad marks)

Rozowsky et al., 2009; ENCODE 2011 Genome; Landt et al., 2012

Hi-C

100M

Belton, J.M et al., 2012

4C (Circularized Chromosome Confirmation Capture)

1-5M

van de Weken, H.J.G. et al., 2012

5C (Chromosome Carbon Capture Carbon Copy)

15-25M

Sanyal A. et al., 2012

ChIA-PET (Chromatin Interaction Analysis by Paired-End Tag Sequencing)

15-20M

Zhang, J. et al., 2012

FAIRE-Seq

25-55M

ENCODE 2011 Genome; Landt et al., 2012

DNAse 1-Seq

25-55M

Landt et al., 2012

DNA Methylation Sequencing

CAP-Seq

>20M

Long, H.K. et al., 2013

MeDIP-Seq

60M

Taiwo, O. et al., 2012

RRBS (Reduced Representation Bisulfite Sequencing)

10X

ENCODE 2011 Genome

Bisulfite-Seq

5-15X; 30X

Ziller, M.J et al., 2015; Epigenomics Road Map

RNA-Target-Based Sequencing

CLIP-Seq

10-40M

Cho J. et al., 2012; Eom T. et al., 2013; Sugimoto Y. et al., 2012

iCLIP

5-15M

Sugimoto Y. et al., 2012; Rogelj B. et al., 2012

PAR-CLIP

5-15M

Rogelj B. et al., 2012

RIP-Seq

5-20M

Lu Z. et al., 2014

Small RNA (microRNA) Sequencing

Differential Expression

~1-2M

Metpally RPR et al., 2013; Campbell et al., 2015

Discovery

~5-8M

Metpally RPR et al., 2013; Campbell et al., 2015

References

Recommended Coverage and Read Depth

bamdst

—— dulunar 后记于 2020.10

相关文章

用友t3财务软件如何新建账套 用友t3建立新账套步骤
猫咪公母怎么分?教你轻松辨别猫咪性别的方法与技巧
365bet体育在线下载

猫咪公母怎么分?教你轻松辨别猫咪性别的方法与技巧

📅 06-28 👁️ 2660
阿米奴:尼日利亚篮球领军人物,非洲锦标赛冠军
365bet体育在线下载

阿米奴:尼日利亚篮球领军人物,非洲锦标赛冠军

📅 08-23 👁️ 2575
妻子不让碰多久可以起诉离婚
365bet体育在线下载

妻子不让碰多久可以起诉离婚

📅 09-02 👁️ 4036
手机记天数app排行榜TOP10推荐
365bet怎么提款

手机记天数app排行榜TOP10推荐

📅 07-14 👁️ 3826
恐惧英文怎么说?恐惧英文例句大整理!
365bet怎么提款

恐惧英文怎么说?恐惧英文例句大整理!

📅 09-19 👁️ 5804