Sequences and HMM data
enVhogs (protein clusters)
Raw data
Tables
- Annotation tables 1.7Gb: Tables describing contigs, enVhogs, and taxonomy. This archive contains the following files, each listed with its column headers:
contigs_infos_class.tsv: Information and taxonomy for each contig, plus the number of enVhogs of each category on the contig
ID
Contig
Length: contig length in bp
Source: data source (Collected/IMGVR/GLUVAB/eFAM/PHROGS/RefSeq_viral)
nb.Proteins: number of proteins identified on the contig
Taxonomy.rank
Organism
Taxonomy
nbProti: Number of proteins on the contig
nbEnv_c0_V: Number of enVhogs of category V on the contig at iteration 0
nbEnv_c0_S: Same, for category S
nbEnv_c0_U: Same, for category U
nbEnv_c0_C: Same, for category C
nbEnv_c3_V: Number of enVhogs of category V on the contig at iteration 3
nbEnv_c3_S: Same, for category S
nbEnv_c3_U: Same, for category U
nbEnv_c3_C: Same, for category C
envhog_infos.tsv: Information and annotation for each enVhog
enVhog: unique identifier
prot: number of proteins in the enVhog
dedup: number of deduplicated proteins in the enVhog
shallow: number of shallow clusters in the enVhog
standard: number of standard clusters in the enVhog
pMaxLen: maximum protein length among cluster members
pAvgLeng: average protein length
pStdDev: standard deviation of protein lengths
NEFF: Diversity of the cluster (average entropy of the alignment)
bestPfamAc: Pfam accession number of the best Pfam hit
bestPfamId: Pfam ID of the best Pfam hit
bestPfamAn: Pfam annotation of the best Pfam hit
bestPhrog: PHROG ID of the best PHROG hit
bestPhrogAn: PHROG annotation of the best PHROG hit
PhrogCat: PHROG category of the best PHROG hit
bestKegg: KEGG gene ID of the best KEGG hit
nbClades: Number of viral clades containing a homologous protein
nbVir: Sum over clades of the proportion of viruses containing a homolog in each clade
nbHost: Sum over species of the proportion of hosts containing a homolog in each cellular species
viralness: V_p
hostness: C_p
VQ: Viral quotient
cat: Category U/S/C/V based on VQ and hostness
hallmark: 1 if viral hallmark gene
catMin: Category U/S/C/V based on VQmin and hostness
catHall: Category U/S/C/V based on VQmin, hostness and hallmark gene (this category is used by the iterative algorithm)
VQmin: Variant of VQ with C_p shifted by 5%
prot_dedup_shallow_standard_cluster_envhog_contig.tsv: Links between unique identifiers of proteins, contigs, shallow clusters, clusters and enVhogs.
Protein ID
Deduplicated representative ID
Shallow cluster representative ID
Standard cluster representative ID
enVhog cluster representative ID
contig ID
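As an illustration of how the per-category counts in contigs_infos_class.tsv can be used, the sketch below computes, for each contig, the fraction of its enVhogs that fall in category V at iteration 3. It assumes the file is tab-separated with a header row whose names match the columns listed above; the function name `viral_fraction` and the two sample rows are hypothetical and only stand in for the real file.

```python
import csv
import io

CATEGORIES = ("V", "S", "U", "C")


def viral_fraction(row, iteration=3):
    """Fraction of a contig's enVhogs in category V at the given iteration.

    Returns None when the contig carries no enVhog at all.
    Assumes columns named nbEnv_c<iteration>_<category>, as described above.
    """
    counts = {c: int(row[f"nbEnv_c{iteration}_{c}"]) for c in CATEGORIES}
    total = sum(counts.values())
    return counts["V"] / total if total else None


# Synthetic stand-in for contigs_infos_class.tsv (only the columns used here).
sample = io.StringIO(
    "Contig\tLength\tnbEnv_c3_V\tnbEnv_c3_S\tnbEnv_c3_U\tnbEnv_c3_C\n"
    "contig_A\t5000\t3\t1\t0\t0\n"
    "contig_B\t1200\t0\t0\t0\t0\n"
)

fractions = {
    row["Contig"]: viral_fraction(row)
    for row in csv.DictReader(sample, delimiter="\t")
}
print(fractions)
```

With the real archive, the `io.StringIO` sample would be replaced by an opened file handle, for example `open("contigs_infos_class.tsv", newline="")`.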
- enVhog categories 31Mb: Classification of each enVhog using the Bayesian iterative algorithm
Header columns:
envhog
cluster_class_X: category V/C/S/U at iteration X
lr_X: Log odds ratio of category V vs category C at iteration X
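A quick way to summarise this table is to tally how many enVhogs fall in each category at a given iteration. The sketch below assumes the file is tab-separated with a header row, and that the X in `cluster_class_X` is replaced by the iteration number (for example `cluster_class_0`); both points are assumptions that should be checked against the actual header. The helper name `category_counts` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def category_counts(handle, iteration):
    """Count enVhogs per category (V/C/S/U) at one iteration."""
    column = f"cluster_class_{iteration}"  # assumed column naming
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


# Synthetic stand-in for the enVhog categories table.
sample = io.StringIO(
    "envhog\tcluster_class_0\tlr_0\n"
    "envhog_1\tV\t4.2\n"
    "envhog_2\tU\t0.1\n"
    "envhog_3\tV\t2.7\n"
)

counts = category_counts(sample, iteration=0)
print(counts)
```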
- enVhog contig categories 31Mb: Classification of each contig using the Bayesian iterative algorithm
Header columns:
contig
contig_class_X: category V/C/H/U at iteration X
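Because the table stores one class per iteration, it can be used to find contigs whose classification changed during the iterative procedure. The sketch below compares iterations 0 and 3; it assumes a tab-separated file with a header row and columns named `contig_class_0` and `contig_class_3`, which is a guess at how the X placeholder is expanded. The function name `reclassified` and the sample rows are hypothetical.

```python
import csv
import io


def reclassified(handle, first=0, last=3):
    """Yield (contig, old_class, new_class) for contigs whose class
    differs between the two iterations."""
    old_col = f"contig_class_{first}"  # assumed column naming
    new_col = f"contig_class_{last}"
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[old_col] != row[new_col]:
            yield row["contig"], row[old_col], row[new_col]


# Synthetic stand-in for the contig categories table.
sample = io.StringIO(
    "contig\tcontig_class_0\tcontig_class_3\n"
    "contig_A\tU\tV\n"
    "contig_B\tC\tC\n"
)

changes = list(reclassified(sample))
print(changes)
```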
Scripts
The scripts used to create the enVhog database are available in this Git repository.
Citation
When using this work, please cite:
EnVhog: an extended view of the viral protein space on Earth. Perez-Bucio R., Enault F., Galiez C. bioRxiv 2024.06.25.600602; doi: https://doi.org/10.1101/2024.06.25.600602
Contact: Clovis Galiez