Installing and Using EDTA, a TE Annotation Tool

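EDTA (Extensive de novo TE Annotator) is a toolkit that annotates transposable elements (TEs) de novo across a whole genome and can also evaluate the quality of an existing TE library. Its main job is to filter out misannotated entries among the raw TE candidates, producing a high-quality, non-redundant TE library for the genome. Some of the program choices in the toolkit were made using only the manually curated TE library of the rice genome and may not suit other genomes well, so take particular care when applying the toolkit to other species.

The authors tested the pipeline against a curated TE annotation (v6.9.5) of the rice genome (TIGR7/MSU7 release).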

Installation (and updating)

There are four main options; pick any one of them.

Quick install with conda

conda install -c bioconda edta

Quick install with Singularity (suited to clusters)

Install

singularity build --sandbox EDTA.sif docker://kapeel/edta

Usage

singularity exec {path}/EDTA.sif /EDTA/EDTA.pl --genome genome.fa [other parameters]

{path} is the directory in which you built the EDTA Singularity image.

Quick install with Docker (suited to users with root access)

Install

docker pull kapeel/edta

Usage

docker run kapeel/edta --genome genome.fa [other parameters]

Step-by-step install with conda

conda create -n EDTA
conda activate EDTA
conda config --env --add channels anaconda --add channels conda-forge --add channels bioconda
conda install -n EDTA -y cd-hit repeatmodeler muscle mdust blast java-jdk perl perl-text-soundex multiprocess regex tensorflow=1.14.0 keras=2.2.4 scikit-learn=0.19.0 biopython pandas glob2 python=3.6 tesorter genericrepeatfinder genometools-genometools ltr_retriever ltr_finder
git clone https://github.com/oushujun/EDTA
./EDTA/EDTA.pl

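Input files

Required

A genome file in FASTA format (sequence names must be no longer than 15 characters and use only simple characters: letters, digits, or underscores)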
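
The naming rule above is easy to check before starting a long run. The following is a minimal sketch and not part of EDTA: `check_fasta_ids` is a hypothetical helper, and it assumes that the sequence name is the text between `>` and the first whitespace on each header line.

```shell
# check_fasta_ids FILE
# Hypothetical helper (not shipped with EDTA): prints every sequence name that
# is longer than 15 characters or contains anything other than letters,
# digits, or underscores. Assumption: the name is the first
# whitespace-delimited token of the header line, without the leading ">".
check_fasta_ids() {
    awk '/^>/ {
        id = substr($1, 2)
        if (length(id) > 15 || id !~ /^[A-Za-z0-9_]+$/) print id
    }' "$1"
}
```

Calling `check_fasta_ids genome.fa` prints nothing when every name passes; any name it prints should be shortened or simplified before the genome is given to EDTA.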

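Optional

Output files

$genome.mod.EDTA.TElib.fa: the non-redundant TE library. If a curated TE library was supplied as input, its sequences are also included in this file.

Other output files

Usage

Detailed parameter description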

perl EDTA.pl [options]
  -genome	[File]	The genome FASTA
  -species [Rice|Maize|others]	Specify the species for identification of TIR candidates. Default: others
  -step	[all|filter|final|anno] Specify which steps you want to run EDTA.
			all: run the entire pipeline (default)
			filter: start from raw TEs to the end.
			final: start from filtered TEs to finalizing the run.
			anno: perform whole-genome annotation/analysis after TE library construction.
  -overwrite	[0|1]	If previous results are found, decide to overwrite (1, rerun) or not (0, default).
  -cds	[File]	Provide a FASTA file containing the coding sequence (no introns, UTRs, nor TEs) of this genome or its close relative.
  -curatedlib	[file]	Provide a curated library to keep consistent naming and classification for known TEs.
			All TEs in this file will be trusted 100%, so please ONLY provide MANUALLY CURATED ones here.
			This option is not mandatory. It's totally OK if no file is provided (default).
  -sensitive	[0|1]	Use RepeatModeler to identify remaining TEs (1) or not (0, default).
			This step is very slow and MAY help to recover some TEs.
  -anno	[0|1]	Perform (1) or not perform (0, default) whole-genome TE annotation after TE library construction.
  -evaluate	[0|1]	Evaluate (1) classification consistency of the TE annotation. (-anno 1 required). Default: 0.
			This step is slow and does not affect the annotation result.
  -exclude	[File]	Exclude bed format regions from TE annotation. Default: undef. (-anno 1 required).
  -threads|-t	[int]	Number of threads to run this script (default: 4)
  -help|-h	Display this help info

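1. Generate the raw TE library for the genome

Each TE type can be run separately by specifying -type ltr|tir|mite|helitron; to get all TE types, simply specify -type all.
The test below uses the sample data, the rice genome (download location: rice genome).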

perl /raid8/cuixb/tools/biosoft/EDTA/EDTA_raw.pl -genome rice_genome.fasta -species Rice -threads 6

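2. Complete the remaining analysis

Specifying -overwrite 0 here makes the software detect the existing results in the working directory automatically instead of rerunning them.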

perl EDTA.pl -overwrite 0 -genome rice_genome.fasta -species Rice 
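
The two steps can also be written down as one sequence of commands. The sketch below only prints the commands and does not run anything. `edta_two_step` is a hypothetical helper, it assumes that EDTA_raw.pl and EDTA.pl sit in the current directory, and it uses only options that appear in this post (-genome, -species, -type, -threads, -overwrite). Running the four types one by one is just one way to organize the first step; -type all is the alternative described above.

```shell
# edta_two_step GENOME SPECIES THREADS
# Hypothetical dry-run helper: prints one EDTA_raw.pl command per TE type,
# followed by the EDTA.pl command that reuses the results (-overwrite 0).
# Assumption: both scripts are in the current directory.
edta_two_step() {
    for type in ltr tir mite helitron; do
        echo "perl EDTA_raw.pl -genome $1 -species $2 -type $type -threads $3"
    done
    echo "perl EDTA.pl -overwrite 0 -genome $1 -species $2 -threads $3"
}
```

For example, `edta_two_step rice_genome.fasta Rice 6` prints five commands that can be reviewed and then run by hand or submitted as separate cluster jobs.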

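Evaluating an existing TE library

If you already have a TE library and want to compare its annotation performance with that of other methods, run the following two steps:

Annotate your genome with the existing TE library

Using rice as an example: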

RepeatMasker -pa 36 -q -no_is -norna -nolow -div 40 -lib custom.TE.lib.fasta -cutoff 225 rice_genome.fasta

Test the annotation performance for a given TE category

perl lib-test.pl -genome genome.fasta -std genome.stdlib.RM.out -tst genome.testlib.RM.out -cat [options]
    -genome	[file]	FASTA format genome sequence
    -std	[file]	RepeatMasker .out file of the standard library
    -tst	[file]	RepeatMasker .out file of the test library
    -cat	[string]	Testing TE category. Use one of LTR|nonLTR|LINE|SINE|TIR|MITE|Helitron|Total|Classified
    -N	[0|1]	Include Ns in total length of the genome. Default: 0 (not include Ns).
    -unknown	[0|1]	Include unknown annotations to the testing category. This should be used when
                    the test library has no classification and you assume they all belong to the
                    target category specified by -cat. Default: 0 (not include unknowns)

For example:

perl lib-test.pl -genome rice_genome.fasta -std ./EDTA/database/Rice_MSU7.fasta.std6.9.5.out -tst rice_genome.fasta.test.out -cat LTR

References

  1. Ou, S., Su, W., Liao, Y., Chougule, K., Peterson, T., Jiang, N., Hirsch, C., & Hufford, M. (2019). Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline. bioRxiv. doi:10.1101/657890.
  2. EDTA GitHub