Dragon Metagenomic Analysis Platform
For the Impatient, DMAP available at http://www.cbrc.kaust.edu.sa/dmap is a public resource providing annotations and comparison of genomic or metagenomic sequence data through its Annotation and Compare Modules, see figure 1 below for DMAP workflow. Phase I metagenomic analysis provided results for over 275 million genes annotated from several existing and new metagenomic gene catalogues, summarized results are available as Gene Information Tables (GIT) containing extended taxonomic, functional role and industrial application assignments. These results are available for download, online viewing, (see Table 1 below for a complete list and links) or analysis through DMAP Comparison Module project 119. We also provide assemblies from over 27,000 metagenomic samples, see supplementary data here.
Table of Contents
Access to DMAP and Example Use Case
CAMI data, a use case for DMAP
Gene Catalogues and samples in DMAP
DMAP Annotation Module (DAM)
Input to AAMG
Input sequence types and options
BLAST bitscore cutoff
AAMG Job Submission and Job History
Gene Annotation Thresholds
Gene Annotation and Parallel Computing
Gene Annotation Limitation
Purpose built databases
Industrial Proteins of Interest (POIs)
Combining AAMG based Annotation and WWW visualization
AAMG Results in GFF, FASTA and TSV
Annotated Gene Subsets as Filters
Integration of Gene Abundance or Read Mapping data
AAMG Results through Interactive Visualizations
Contig Binning and Metagenomic Species Binning
Functional Role Assignments
Annotation Data Files
DMAP Comparison Module (DCM)
Analyze samples in a Project
Single Sample exploration
Browse Sample for Taxonomy, Pathways, Gene Ontology, Enzymes
Multiple Sample Comparisons
Visualization of Metagenomic Sample Comparison
Heatmap plot (and fold change)
Stacked Bar Plot (and fold change)
Statistical Tests (two samples)
Interactive KEGG Pathway Module Completion Test
Examples of Metagenomic Comparisons
Tips for using DMAP Compare Module
Samples and Sample Groups
Examples from TARA ocean
Human Integrated Gut Microbiome.
ENA ecological metagenomes.
Gene catalogues from ecological metagenomes
Dragon Metagenomic Analysis Platform (DMAP) has variety of useful resources for genomic and metagenomic analysis, categorized into two easy to understand modules namely Annotation module and comparison module. Users interested in already annotated gene catalogues or samples can use DMAP Comparison Module (DCM) for browsing or comparison of samples at various biological hierarchies such as taxonomy, gene ontology (GO), Kegg Pathways, Kegg Orthologs (KO), Enzymes, etc. Other users interested in the analysis of their own genomic or metagenomic samples can make use of DMAP Annotation module (DAM) available via improved version of the our pipeline for Automatic Annotation of Microbial/MetaGenomes (AAMG, Alam et al., 2013). DMAP workflow is shown in Figure 1 below.
Figure 1. DMAP Workflow.
The following documentation will help you explore DMAP Annotation and Comparison modules through some use cases. We first describe access to DMAP and public projects.
DMAP is accessible free of charge without any need to login at the following URL: http://www.cbrc.kaust.edu.sa/dmap , following is a snapshot of the mainpage:
Figure 2. DMAP main page showing DMAP options to access this system.
Clicking on ‘Public Access’ on the main page takes you to project area showing publicly available projects and an option to annotate your own sample data through AAMG, as shown in Figure 2 below.
Figure 3. DMAP Project area for Public Access
We recommend creating a user account if you wish to use your own data, otherwise Public Access is sufficient to understand the capabilities of DMAP and to access public projects. To create a user account, click on ‘Register’ link on the main page, fill up all the required details and submit. For account activation, please send an email to us at email@example.com , once your account is activated you can log in using your username and password to submit your data for annotation via AAMG, access your annotation jobs history or access private or public projects assigned to you for browsing or sample comparisons.
We are using CAMI (Sczyrba et al., 2017) data as a use case to explain capabilities of DMAP using metagenomic contigs as input. To this end we annotated already available golden assemblies from Low, Medium and High complexity CAMI metagenomic samples using AAMG pipeline available in DMAP. For browsing and sample comparison, a gene information table, referred as AAMG Tab Separated Values (TSV) based Annotation results. These data are lucene indexed and uploaded into DMAP Comparison Module (DCM). The DMAP Comparison Module (DCM) is based on an improved version of Metarep (Goll et al., 2010).
The peptide annotations generated through DMAP annotation module can be downloaded in AAMG TSV and other formats (see section on AAMG output) for use in analysis through other metagenomic sample comparison systems.
For local use or analysis through any other metagenomic sample comparison system, annotations generated through DMAP annotation module can be downloaded in AAMG TSV and other formats, see section on AAMG output.
We provide access to DMAP based gene information tables containing extended annotations for over 275 million genes obtained from several existing public metagenomic gene catalogues alongside a number of new gene catalogues obtained from 36 different ecological metagenome types obtained from European Nucleotide Archive (ENA).
Large public studies
Several public metagenomic gene catalogues from microbial or micro eukaryotic origin were annotated using AAMG protocol. This includes large gene catalogues from TARA Ocean microbial and micro eukaryotic studies, Metagenomic Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) and Human Gut Microbiome. In case of microbial gene catalogues, Coding DNA Sequence (CDS) input option in AAMG was used. However in case of micro eukaryotic gene catalogues, protein sequence input was used. Access links to complete AAMG output, AAMG TSV and Industrial POIs are provided in the table below.
ENA public metagenomes
In our first pilot study, with a long term aim to provide gene information tables and mining of industrial POIs for all public metagenomic samples deposited to European Nucleotide Archive (ENA), we started downloading fastq files for different types of ecological metagenomes from ENA. We produce gene catalogues for every metagenome type by first assembling individual metagenomes, followed by gene prediction and clustering of full length genes. Reads are mapped from each sample to the relevant gene catalogue to obtain gene abundance information. Gene catalogues are annotated using DAMP Annotation module. As of September 2018, we provide gene information tables and complete annotation for 36 metagenome types, see DMAP supplementary table for included gene catalogues and links to annotation resources.
DMAP compare module provides access to Graphic User Interface (GUI) for metagenomic data view, browsing, or comparison. All gene catalouges we annotated are available for access in DMAP Project 119, http://www.cbrc.kaust.edu.sa/dmap/projects/view/119.
User submitted assembled metagenomes or gene catalogues can be annotated through DMAP and upon user request such datasets can be indexed and provided for public or private access. A number of metagenomic samples were annotated and indexed for user specific projects with private access. These projects will become public on user request most probably at the time of publication.
We use CAMI data in our use case to describe different aspects of two DMAP modules. First the Annotation Module and secondly the Comparison Module.
An overview of DMAP module is provided in Figure 1, showing types of input accepted, mechanism of gene annotation and integration of results. More details are provided in the relevant sections below.
DMAP Annotation Module is accessible to users through the AAMG tab visible in DMAP interface at the top left side, see Figure 3, after public or private access. Clicking on the AAMG tab shows an interface to submit user sequence data for annotation, or explore job history, see Figure 4 below. Job history is not available for public access or guest user.
Figure 4. AAMG input sequence submission options.
AAMG accepts Coding DNA sequences (CDS) and protein sequences from any source. AAMG also accepts genome or metagenome annotation data, these data are required to be chromosomes or contig sequences from a microbial source. This is because the gene prediction methods employed by AAMG at present are “microbial only” gene predictions methods.
AAMG accepts input sequence data from any source in case of proteins or Coding DNA Sequences (CDS). However for a genome or metagenome annotation it requires chromosomes or contig sequences from a microbial source due to microbial only gene prediction methods implemented at present. As shown in Figure 4, AAMG submission interface provides three input types and additional options accordingly.
There are two additional fields which are needed, one is Locus Tag and the other as BLAST bitscore cutoff.
Locus Tag is a unique string you assign to your annotation run so that you can identify the type of data you annotated. E.g. if your annotation is for a dataset such as CAMI, the low complexity metagenome and you choose Prodigal as a gene prediction method, you could set ‘CamiLowProdigal’, as string from Locus Tag option.
There are several ways to restrict the list of BLAST based matches when comparing a protein sequence obtained from an input to AAMG against reference databases such as UniProt and KEGG KOs. One could use either an E-value, Percent Identity or BLAST bitscore. We use BLAST bitscore inclusive lower bound cutoff of 50 to obtain a reasonable list of matches. Later we provide a mechanism to restrict the results in DMAP compare module based on E-value, Percent Identity and/or Coverage. In this BLAST bitscore cutoff option at AAMG submission stage a user can select BLAST bitscore of 50 or higher.
When all options are provided at the AAMG submission page, the user can upload their input sequence file and click on Submit Dataset, as shown in an example below in Figure 5. After a datasets has been uploaded, the webpage displays the summary information such as file size of the uploaded data. This is the stage the user can now submit the annotation job for processing by clicking on ‘submit for processing’. This page is updated with the information on the sequences submitted and provides a job id for this run.
AAMG annotation job history is not available when in public access mode. However, when a user is logged in using their DMAP account, link to ‘View History’ provides the list of successful AAMG jobs submissions with links to view AAMG output for each Job Id.
Figure 5. AAMG interface job submission section.
Automatic Annotation of Microbial or Meta Genomes is the core of DMAP annotation module. Input and options accepted through DMAP annotation module are passed to AAMG for comparison to public RNA or protein reference databases. It also includes profiling of genes to our purpose built Bioactives and Industrial Proteins of Interest (POI) datasets. Results from comparison to multiple sources are analyzed and integrated to provide robust taxonomic and functional role annotations. The following sections provide specific details of AAMG processing of input data through different stages of annotation.
In case of input as assembled sequence data, in the form of chromosomes or contigs from a microbial source, AAMG first kicks in RNA prediction and masks the input DNA sequences ready for protein coding gene prediction. The RNA prediction stage includes tRNA genes such using tRNAscanSE (Schattner et al., 2005) and rRNA using RNAmmer (Lagesen et al., 2007). Recently improved AAMG includes high throughput metagenomic rRNA prediction using (Huang et al., 2009) and also non-coding RNA (ncRNA) prediction using Infernal (Cole et al., 2013). RNA prediction is followed by user selected protein coding open reading frames (ORFs) methods. Available choices are Prodigal (Hyatt et al., 2010), FragGeneScan (Rho et al., 2010) or MetaGeneMark (Zhu et al., 2010).
In other input cases, both RNA or protein ORF gene prediction steps are not carried out. In these cases sequences are accepted from any source, for example a non-redundant gene sequence catalogue from several metagenomic samples, or gene sequences from a non-microbial source, etc. Only additional step in case of input being protein CoDing Sequece (CDS), is the standard translation to protein sequence. Once protein version of gene sequences become available AAMG gene annotation process kicks in.
Annotation of genes is performed by comparing predicted genes to reference databases. In order to obtain closest possible taxonomic information, AAMG performs a blastn search for Ribosomal RNA genes to RNA genes obtained from NCBI and similarly Infernal based ncRNA genes are compared to RNA Family (RFAM) reference sequences from EBI.
Protein coding ORFs comparison to reference databases is carried out using highly parallel blastp implementation segmenting the input into equal size fragments. ORFs are first compared with UniProtKB database (www.uniprot.org) to seek relative taxonomic information, eggNOG (http://eggnogdb.embl.de) references and generic functional role information. For a more robust functional role assignments, ORFs are compared to subset of KEGG (www.kegg.jp) database sequences where KEGG Ortholog (KO) identifier information is readily available. Finally, ORFs are compared to protein functional signature databases in InterPro using InterProScan (Jones et al., 2014) leading to Gene Ontology (GO), Pfam domains, TigrFAM domains and Prosite signatures assignments.
To control the quality of matches during gene annotation process, AAMG applies a user provided BLAST alignment or bitscore threshold, with minimum allowed bitscore of 50 and BLAST e-value of 1e-2. Later at the stage of parsing and integrating BLAST based results from multiple sources, a minimum percent identity of 30 and minimum percent query coverage of 50 is enforced.
The process of gene annotation consists of three main steps, first the computationally most expensive step of alignment to reference databases, secondly processing of the alignment reports and finally combining annotations from multiple sources. The time complexity of these steps increases with the increasing number of query sequences. The alignment step becomes computationally very demanding when the size of reference databases is large, e.g. UniProtKB provides around 100 million sequences. To speed up our alignment step, we produce equal size (e.g. 2MB) subsets from the query set of sequences and for all subsets we perform BLAST or InterProscan jobs in parallel.
As shown in the above Figure, taking example of TARA oceans gene catalog with 40 million genes, we produced ~6000 subsets leading to the same number of jobs for e.g. BLAST running in parallel. Average run time of jobs varied slightly but all BLAST jobs reached completion in less than 24 hours. To speed up the subsequent BLAST report parsing, we process reports from all subsets in parallel, read processed results in memory and write the final filtered output for further processing. This requires large computing resources and we make use of KAUST’s high performance computer cluster, Ibex with around 50,000 cores, and KAUST’s supercomputer, Shaheen II with around 200, 000 cores.
Please note that AAMG uses KAUST’s Ibex cluster to process user data free of charge. In cases where predicted genes exceed 1,000,000 AAMG is able to carry out expensive annotation steps using KAUST’s Shaheen II resource equipped with ~200,000 processors. In order to process large datasets, where access to Shaheen II is required, a special arrangement can be made upon user’s request. Otherwise it is recommended to split your large datasets into reasonable sizes such that estimated number of genes are below 1 million.
We prepared two purpose built databases containing references to important genes with a potential of a bioactive peptide or relevance to a particular industry. User data is profiled against these databases to pinpoint important genes present in a genome or metagenome being annotated through AAMG.
We search UniProtKB data for genes curated with certain types of keywords relevant to bioactive peptides such as antimicrobial, antiviral, antibiotic biosynthesis, bacteriocin, toxin, bacteriolytic enzyme, polyketide synthase, secreted, etc. This keyword information is placed in the sequence header of UniProtKB sequences and relevant bioactive keyword groups are reported during gene annotation process based on BLAST results for genome or metagenome under AAMG processing.
We searched several resources, including research papers and public databases, e.g. brenda, to obtain a collection of enzymes or proteins, with potential role in or relevance to a particular industry, linking KO ids or enzyme classification numbers to relevant industries, we call this resource Industrial Proteins of Interest (POIs). There are several type of industries for which we have a collection of relevant genes. Example of industries are shown in the Figure 6, below.
Figure 6. Groups of Industrial Proteins of Interest (POIs).
During the integration of gene annotations, we lookup our POIs resource and associate relevant information to genes from a genome or metagenome annotated by AAMG.
Results obtained through gene prediction followed by gene annotation process are analyzed to obtain both taxonomic and functional role assignments. Initially, BLAST or InterProscan results from multiple sources are parsed to keep first level of result files preferably in GFF format. Related specific taxonomic or functional assignment results are saved per reference database, see Figure 13 for snapshot of example result files. Finally, all of the derived results are integrated into AAMG TSV, a comprehensive 19 columns table, see Table 1 for an example. This data integration is based on annotations from multiple sources for each gene considering minimum percent identity of 30 and minimum percent query coverage of 50.
AAMG TSV is similar to MetaRep (Goll et al., 2010) tab format data file. AAMG TSV stores annotation information for a genome or metagenome using unique gene and a sample ids. In cases where multiple values are available for a particular annotation column, values are compiled together using the “||” delimiter for later use.
Table. AAMG TSV table columns and examples
Taxonomic assignments based on UniProtKB or KEGG BLAST results, we call Gene Based Taxonomy (GBT) assignments, are obtained using both BLAST best hit approach and also the Lowest common ancestor (LCA) approach but AAMG TSV shows only the LCA based taxonomic assignment. Our GBT implementation first sorts BLAST results based on query length normalized percent identity. In the best hit approach the taxonomic information for the top hit is saved. However, in LCA approach top alignments, with a difference of no more than 10% of the best hit bitscore, are processed to obtain a taxonomic level where taxonomic information from top alignments converge. BLAST LCA based results are saved in the AAMG TSV file however both best hit and LCA results are available separately in files marked as GBT in the name. AAMG TSV columns such as Taxon ID, Evalue, Percent Identity and Percent Coverage are populated using hits, in the order of higher to lower stringency, from either UniProtKB or KEGG.
Robust functional role assignments obtained, through KO identifiers available for best hits from KEGG KO BLAST results, are utilized to populate AAMG TSV columns such as Gene Name, Gene Name Source, EC ID, EC ID Source, KO ID and KO ID Source. If no such assignments can be obtained, generic gene descriptions for Gene Name and Source column are obtained from UniProtKB BLAST results. COG and eggNOG columns in AAMG TSV are populated with relevant information, when available, from eggNOG cross-references to UniProtKB hits. Additionally, InterProScan based results are used to populate GO ID, GO ID Source and InterPro ID fields.
Gene Name Source column captures all sources of annotation e.g. source chromosome or contig on which gene was predicted, source hit IDs from UniProtKB, KEGG or InterPro. In addition it stores the BLAST scores for KEGG based hits. Usually AAMG TSV taxon Id, e-value, percent identity and percent coverage information is obtained from UniProtKB, unless KEGG based hits show better scores.
To analyze explosion of gene annotations from multiple sources, AAMG provides an easy way to study a more focussed subset of genes at a time. In AAMG analyses we produce subsets or ‘Filters’ based on grouping annotations into several categories, marked here with alphabets from A-J, as shown below. These categories can have sub categories for studying the data with more resolution. Information of ‘Filter’ assignment to each gene is recorded in column 14 of AAMG TSV.
These filters can be used to focus on subset of genes for a particular analysis in mind. Our DMAP Sample Comparison Module, see relevant section below, provides exploration or comparison of samples where focus on subset of genes is possible by selection of filters, a few examples with a filter selected are shown in Figures 17, 21, 25.
AAMG TSV has a specific column, called weight, to include numerical gene abundance or read mapping information, see column 17 in Table 1. This weight is counted as 1, if no information is provided. Otherwise this field can be populated using normalized read counts e.g. FPKM, etc. This is very useful in comparing multiple genomes, transcriptomes or metagenomes. It becomes more useful data reduction and comparison approach when multiple related metagenomes are being compared based on a common gene catalogue, see section on gene catalogue based metagenomic comparisons below.
AAMG computational processing is carried out on our High Performance Computing (HPC) cluster, Ibex. During the annotation process BLAST and InterProscan results are processed and saved for various taxonomic and functional role assignment visualizations, see breakdown of these in the following section. Once the processing of a annotation project is complete, all results are transferred to our AAMG visualiation server, http://www.cbrc.kaust.edu.sa/aamg/job.id.
Each annotation project receives a unique job id based on ‘LocusTag’ and user credentials. Such a job id becomes a basis for the unique www url for each individual project. In our use case here, we used a locus tag ‘CamiLowProdigal’ and AAMG generated the following job id, 1502720332816_CamiLowProdigalV2_40.0_intikhab. Complete annotation result files alongside various visualizations for this and other two example projects are saved at the following URL:
A few example annotations and visualizations are explained the following section. As shown in Figure 7 below, AAMG provides a user friendly interface to results for an annotation project. It shows the following four sections or tab, namely ‘Annotation Features’, ‘Taxonomy Assignments’, ‘Functional Assignments’ and ‘Annotation data files’.
The ‘Annotation Features’ tab provides counts of different kinds of genes predicted and their annotation through several sources including UniProt, KEGG, COG, InterPro and Gene Ontology alongside a count of total annotated and unassigned genes. Three additional features available on this page are a) Retrieve Sequences, b) Perform Contig Binning and c) Perform COGs Binning.
Figure 7. A snapshot of AAMG output web page.
To obtain sequences from an AAMG annotation project, it requires a project id and a list of contig or gene identifiers. Clicking on ‘Retrieve Sequences’ link, from the main ‘Annotation Features’ tab of a project, opens an interface to accept project Id and list of sequence identifiers leading to download of related DNA or protein sequences. Sequence identifiers are available while exploring AAMG result data files.
Pushing the ‘Perform Contig Binning’ button while selecting a particular taxonomic level on the ‘Feature Counts’ page initiates the contig binning process of the corresponding project. During this process, Gene Based Taxonomy (GBT) information is summarized for each contig in such a way that contig bins are formed where majority of genes on a contig belong to a common taxonomy, as shown in Figure 8, at the user selected taxonomic level e.g. phylum or class, etc. Those contigs are considered clean where majority, above 65%, of genes converge on a common taxonomic assignment while others are marked novel or potentially chimeric for further investigation.
Figure 8. An example contig binning report
A contig binning report, as shown in Figure 8, explains for each contig bin the total length of contigs, number of contigs and percent of genes assigned to taxonomy of this bin at the user selected level. In case of gene catalogues, e.g. from TARA ocean or Human Gut Microbiome or else, this binning is entirely based on genes not contigs and it results in full collection of diverse set of genes which belong to taxonomic groups at a selected taxonomic level, leading to potential Metagenomic Species (MS) bins. A list of gene catalogs annotated using DMAP with potential MS bins is available at http://www.cbrc.kaust.edu.sa/aamg/DMAP_Data/ .
Another binning option provided by AAMG is COGs binning. In this binning approach, AAMG annotated genes with COG assignments are collected and summarized into higher COG functional classes, as shown in Figure 9, from the corresponding genome or metagenome project. The COGs binning results from multiple genomes or metagenomes can become the simplest approach to compare functional potential.
Figure 9. An example COGs binning report.
Furthermore, Taxonomy Assignment and Functional Assignment tabs provide access to each sub type of annotations in the form of top ten hits, links to interactive Krona graphs and their background summary and full tables, a snapshot of these tabs is shown in Figure 10 below. The ‘Annotation Data Files’ tab provide access to download result files.
Figure 10A. Taxonomy Assignments.
Figure 10B. Functional Assignments.
Figure 10C. Annotation Data Files.
Taxonomic detail of a dataset is provided based on Gene Based Taxonomic (GBT) annotations, see Figure 10 A. In case of CDS or proteins as input, only protein coding genes based results are provided. However in case of chromosomes or contigs as input, RNA based taxonomic assignments are also included as shown in Figure 10A and also indicated in the workflow of input to DMAP Annotation module in Figure 1. Combined taxonomic assignments section shows results from the integration of taxonomic assignments from all sources. An example from such an integration is shown as interactive Krona graph in Figure 11 below. Clicking on this image or on the hyperlink in the figure legend will lead to the actual interactive example. These are searchable interactive graphs allowing a user to select depth of a hierarchical level under investigation for which a breakdown of assignments can be explored.
Figure 11. An interactive Krona Graph for taxonomic assignments.
Complete background summary count data, used in creating such an interactive Krona graph, is available for download. A non-summarized version of this data is also available as raw TSV file, see for reference the last section of Figure 10 A. Similarly other taxonomic assignment tabs provide links to actual data and interactive graphs in relevant categories, see example links in the Table below.
Taxonomy assignments and example hyperlink
Combined Gene Based Taxonomy (GBT) assignment example.
Protein Coding GBT assignment example.
RNA GBT assignment example.
RFam GBT assignment example.
(Please click on Taxonomic Assignments tab once you reach the above hyperlinks).
Robust functional role assignments in AAMG are obtained through comparison of gene sequences to specific gene function reference databases and available relevant cross references therein. There are several categories of functional role assignments available in AAMG results, including GO, Industrial POIs, COG, KEGG, Bioactives and InterPro domains. An example interactive graph for KEGG classification of our example use case CAMI data is shown in Figure 12 below.
Figure 12. An interactive example of KEGG functional hierarchies graph
Users are able to see interactive krona graphs for each of these categories alongside options to download background summary counts and unsummarized gene counts data as TSV tables, see lower section of Figure 10B and the following hyperlinked categories from our use case:
(Please click on Functional Assignments tab once you reach the above hyperlinks).
Complete AAMG annotation results data is available either as individual files or a tar.gz archive from ‘Annotation Data Files’ tab, see Figure 10C for an example. A click on ‘result files’ link on this tab leads to an interactive Krona graph showing available files for download such as BLAST results in GFF, InterProScan results as domains gff, AAMG TSV and other specific GFF or Fasta files, see Figure 13 for an example. Clicking on a category representing a file name shows a link to download and size of the file.
Figure 13. Annotation result files graph
Exploring DMAP based results from a single genome or a metagenome in itself is very interesting showing the breadth of taxonomic and functional repertoire. For comparison of multiple samples, genomes or metagenomes, it requires a comprehensive sample comparison system. Passing raw data through a single annotation pipeline with reference databases updated on the similar timescale makes multiple sample comparisons more comparable without any technical bias. See section on DMAP Sample Comparison Module below for an insight into multiple samples based analysis.
DMAP Annotation Module provides comprehensive annotation of genes based on several reference datasets and its AAMG TSV format table output can easily be used for local analysis or an exploration based on user’s preferred analysis system. We provide exploration or comparison of multiple genomes, metagenomes or transcriptomes through DMAP Comparison Module, powered by MetaRep (Goll et al., 2010). DCM is deployed on several high memory computers and it provides a few additional features on standard MetaRep including interactive KEGG Pathway module completion exploration, other interactive heatmaps and implementation of magnitude or fold change filter.
To understand how to access DMAP, please refer to section on public or private access DMAP above. Once a user is able to get access into DMAP, initial screen shows list of available public or private projects, see Figure 3 for an example. DCM analysis options are accessible once a project is selected through ‘view project’ link.
Clicking on ‘view’ hyperlink in the project list page leads to ‘View’ section of a project and the ‘View Project’ page shows, see Figure 14, sample metadata as well as options for analyses..
Figure 14. View section of a project showing access to Analyze options.
The ‘Analyze’ menu in the View Project allows a user to either view, search or browse a single sample for hierarchical classification of taxonomy, kegg pathways, enzymes or gene ontology, etc. Compare option in ‘Analyze’ menu provides users with options to select multiple samples and compare their taxonomic or functional profiles to produce a comparative table or a plot. Here below we show a few examples of these options.
Options available to explore a single sample are view, search and browse. Sample View option gives access to annotations for all genes, showing at least 20 genes on every page, see Figure 15 for an example.
Figure 15. A snapshot of sample ‘view’.
This page also gives access to look at ‘Summary’ of annotations for the complete sample in terms of assigned taxonomic, Gene Ontology, Enzyme, KEGG Ortholog or total genes. Similarly summary of taxonomic assignments from different domains of life is also shown.
Figure 16. View summary of annotations for a sample.
In addition, on this ‘View’ sample page, one is able to view counts of abundant taxonomic of functional elements in their respective category, see Figure 16. The ‘Filter’ option allows to look at subset of data by selecting a value from the ‘Filter’ menu. For an example, see Figure 17, to look at function and taxonomic summaries for genes assigned with KEGG Ortholog (KO) id, filter ‘KO’ can be selected. Similarly, other ‘Filters’ display results specific to set of genes where the selected filter is available.
Figure 17. Summary of annotations for ‘Filter’ KO.
The second option available Analyze submenu is to perform search into a particular sample. By default it shows a result based on *.* in the search box. Default results are split into three sections namely top 10 classifications, top 10 pie charts and all search results in text. A help dialogue box, shown on top of the default search results, explains different search options and search procedure. See example search results in Figure 18 below.
For example, to search fixed database IDs e.g ko_id or GO_tree, etc., a search terms like the following will work:
ko_id:K01692 AND go_tree:3824
To restrict the above query, e.g. to bacteria, you would extend the search terms as the following:
(ko_id:K01692 AND go_tree:3824) AND blast_tree:2
Figure 18. DCM example showing search annotations of a sample
Browse facility in DCM educates a user to understand hierarchical classification of biological data from Taxonomy, Pathways, Enzymes or Gene Ontology annotations. It provides an interface to select a particular hierarchical level for which results are shown in terms of pie charts and top 10 classifications. For an example see Figure 19 where using our use case CAMI sample when taxonomic level bacteria is selected, it shows breakdown of child taxonomy as a list with counts of genes and as proportions in the main pie chart. It also shows top 10 functional classifications and their respective pie charts. A change in selection of taxonomic class updates results instantly.
Figure 19. An example result from Browse Taxonomy.
Similarly it shows breakdown of sub categories and classes for Pathways, Enzymes and Gene Ontology, try selecting relevant option from Analyze menu of a sample in Project View page or click on example hyperlinks from the table below.
Browse (hyperlinked) examples
Browse Taxonomy example
Browse KEGG Pathways example
Browse MetaCyc Pathways example
Browse Enzymes example
Browse Gene Ontology example
The most exciting feature of DCM is Compare Samples. This is a standard feature available in metarep but here we included additional sub features mentioned below. Compare feature is available from the Analyze menu on the View project page. Once selected, it shows an interface to add or remove samples, apply search patterns and terms and view results as table or figures.
Figure 20. DCM Compare Samples Interface.
In this figure 20 a breakdown of taxonomy is shown for selected samples at the phylum level with visualization option selected as ‘absolute counts’. In this example case or in case of a genome or metagenome, this absolute count value represents number of genes found in a particular sample. In case of transcriptomes or gene catalog based read counts, this absolute count value represents read counts which may already be pre-normalized.
Also, please note the text box in ‘Filter Dataset’ section in the figure above where 16S_rRNA is selected, this is the section that allows us to focus on specific types of annotation subsets, explained as filters in the DMAP Annotation Module.
Comparison of samples at DCM can also be visualized using several graphs or plots as listed in options menu on the compare page, namely hierarchical cluster plot, heatmap, multidimensional scaling or stacked bar plot, etc. Different graphical views of the same starting data help analyze a sample comparison more efficiently.
Let us use the example of sample comparison from figure 20, showing results as a gene counts table. Selection of heatmap plot from the options menu, instead of absolute counts, produces a heatmap, see Figure 21 for this example. Heatmap result page offers options to change the underlying ‘distance matrix’, ‘clustering method’, ‘dataset label’, color scheme of the plot and recently added options for ‘interactive heatmap’ and fold change thresholds.
Figure 21. Example of sample comparison as a heatmap
The aspect of fold change is quite interesting to study the magnitude of difference in selected samples in terms of diverse gene or expression counts for taxonomic or functional categories under investigation. Using our example sample comparison from Figure 21, when we apply a fold change value of 4 it results in, see Figure 22, filtering out taxonomic phyla with gene counts below average or below 4 fold magnitude.
Figure 22. Example of heatmap with fold change applied.
Static heatmaps enable use to visually see differences in abundance of gene or expression counts, however to dig deeper e.g. in case of fold change estimates among each pair of samples being compared, we added interactive heatmap option, see example Figure 23 for data presented in Figure 22. Interactive heatmaps option, using jheatmap (Deu-Pons et al., 2014), is available via a hyperlink placed just above the plot window. A click on any cell of the interactive heatmap shows a set of samples being compared and shows the scale of fold change annotation categories under investigation.
Figure 23. Interactive Heatmaps to study binary fold change information.
To look at sample comparisons from a different perspective we recently added stacked bar plot option. Stacked bar plot offers to present a comparative view of samples showing proportion of contributions from different subclasses for a selected level of taxonomic or functional class, see example figure 24. Items contributing less than 5% of contribution are placed in category, other. The fold change option is also included. Interactive heatmaps are also available here to find out exact fold change or magnitude differences for every set of samples and every functional class being compared. Links to interactive heatmap, marked as ‘view heatmap’, are kept right on top of the window showing stacked bar plot.
Figure 24. Compare Samples visualization through stacked bar plot
DMAP Comparison module built on standard metarep (Goll et al., 2010) framework provides extensive functional aspect of sample comparisons. This includes KEGG Pathway (KO), KEGG Pathway (EC), MetaCyc Pathway (EC), exploration of Enzymes, KEGG Orthologs and Gene Ontology (GO). Figure 26 shows an example of KEGG Pathway (KO) gene counts based comparison of our use case CAMI metagenomic datasets with Low, Medium and High taxonomic Complexity. In this example number of different KO assigned genes from all samples are shown for different categories of pathways at root level. Other available pathway levels to view are super pathways and pathways. Results can be visualized as tables or plots similar to taxonomic visualization option. It is very interesting feature of standard metarep and DCM that for the same data set, e.g. genes assigned with say KO information can easily be compared across samples for taxonomy, pathways, gene ontology and enzymes by just clicking on relevant tab in the result panel of the compare page.
Figure 25. KEGG Pathway KO comparison of samples.
There are several different procedures or options available in DCM to analyze multiple samples from a genome or metagenome for taxonomic or functional annotation categories. One interesting method implemented here is to compare two samples by applying one of the statistical tests from available fisher exact test or equality of proportion test. Figure 26 shows an example of Fisher Exact Test applied for KEGG Pathway (KO) category of comparisons at level pathway. It shows counts and statistics for two samples compared for each of the pathway detected in samples under investigation.
Figure 26. KEGG Pathway (KO) Fisher exact test
Although these statistical tests can be applied to both taxonomic and functional comparison, there is one additional feature when pathways are compared, see a small figure at the end of the resulting table, clicking on this opens up a comparative analysis page showing two sections, see Figure 27 for an example. The top section of this result shows KEGG Pathway diagram mapped with KO gene counts (number or expression counts), as light or dark colors, detected in one or both samples. Density of the color is related to the density of gene counts or expression data. The second section shows a table of statistics for two samples with gene or expression counts alongside other statistics.
Figure 27. Statistical comparison of two samples for a pathway
Standard metarep based pathway comparison is very interesting but it requires expert knowledge of a pathway to tell whether it is complete or functional in a sample. In simple terms we do not know whether we are able to find all critical genes required for a pathway in a sample or more simply what are those critical genes without which a pathway is inactive or incomplete. In order to distinguish complete and incomplete pathway modules, we describe our implementation of an interactive pathway module completion test, see relevant section below.
Given expression or gene presence absence data for genes in a pathway for a genome or metagenome, it is not straightforward to mark active or inactive parts (e.g. modules) of a pathways due to expert knowledge required in identifying critical genes. Comparing multiple samples in order to see complete or incomplete modules is even more difficult. We developed an interactive heatmap test to study pathway module status (whether complete or incomplete) across a set of samples being compared, an example of such an interactive heatmap is shown in Figure 28 below.
Figure 28. An interactive heatmap showing Pathway module completion status
To construct such a heatmap, we make use of KO gene expression or presence absence information available from samples and KO genes based logical expression information available under definition section for modules in KEGG pathways. If a sample passes the logical expression test for a particular module, satisfying the requirement of critical or alternative KO genes as present in a sample, this module is marked as complete for sample under comparison, see M00048 module (Inosine monophosphate biosynthesis) example in Figure 28. A list of modules marked complete, in at least one of many samples being compared, is then presented as an interactive heatmap. Here, colored cells represent complete module while no color shows incomplete modules.
Clicking on any cell of such an interactive module completion heatmap shows a link to KEGG website, see module M00048 example in Figure 28. Clicking on this link (use right click and open link in new tab) performs mapping of KO genes present in corresponding sample on to the module in question at KEGG website, see Figure 29 for this example that explains how this module was considered complete or incomplete by highlighting KO genes from our sample in red. It shows without previous knowledge it is not easy to mark a KEGG module complete. If a KO id is not highlighted and it is the only KO id in a block of a module, e.g. K00764, this module is considered incomplete since it could not satisfy requirement of such a critical gene. In other cases with multiple KO ids shown for a block in a module, presence of one of those KOs may satisfy completion of that block of the module e.g. the case of K01945 or K01923 or K00602 in the Figure 29.
Figure 29. Example KEGG module completion highlight KO genes for selected sample
Access to interactive heatmap for pathway module completion test is available in compare page. To perform this test select a few samples and select KEGG Pathways (KO) page from the result panel section. For an example see Figure 25 where this test completed and a link (‘view KEGG (KO) module fulfillment’ highlighted in yellow) is shown to view interactive heatmap. In the background upon clicking KEGG Pathway (KO) tab, we obtain KO genes presence absence information for samples being compared and perform pathway module completion test.
A fundamental set of questions is asked about microbes while analyzing metagenomic samples, ‘Who are they?’ and ‘What exactly are they doing?’ translated into taxonomic and functional profiling of samples, respectively. Given above 1 trillion species estimated to be present on planet earth, huge studies and combined community efforts are needed to understand the secrets of microbial world. Earth microbiome project recently published taxonomic profiling of large number of metagenomic samples from 7 continents, 43 countries and 21 biomes with a next step to capture functional profiling of samples.
We proposed use of gene information tables containing taxonomic and extended functional annotations for every gene from a sample or a gene catalogue as this can be very useful for data exchange, conversion and comparisons across analysis platforms. DMAP is able to produce such gene information tables, termed AAMG TSV, that we also index for interactive visualization and comparison of samples through DMAP Compare Module.
A recommended way of looking at gene catalogues, samples are sample groups datasets in DMAP Compare Module is dependent on the type of analysis in mind. For taxonomic profiling of samples, one should select or use one or more relevant single copy genes as a proxy for 16S genes. For functional role comparisons, one could select a filter e.g. KO genes and compare samples for pathways, gene ontology, kegg orthologs, etc. A functional role analysis obtained for a set of samples can be limited to certain taxonomic classes using e.g. to look at a functional analysis limited to only bacterial species, one should include a taxonomic filtering term e.g. blast_tree:2. Similarly, to look up a single or many genes of interest where e.g. KO id is known, one could use search terms ko_id:K00446 OR ko_id:K00448 OR ko_id:K00449. The nice feature of metarep style framework is that for the same data, it is possible to lookup taxonomic or extended functional annotation information just by clicking on a relevant tab in the compare pages.
Sample groups from TARA and Human Gut Microbiome project
By expanding DMAP capability to use more computer nodes with higher memory, taking TARA gene catalog as a common reference alongside gene abundance estimates from each sample, we are able to provide all individual 243 TARA ocean samples online for exploration or comparisons through DMAP compare module, see project 117. To compare a reduced set of samples, we also grouped similar samples considering filter size, ocean type and sampling depth. Of the total provided 47 sample groups produced there are 24 sample groups from prokaryotic enriched (eProkaryote) samples, 13 sample groups from Girus (eGirus) enriched and 12 sample groups from Virus (eVirus) enriched samples.
Regarding Human Gut Integrated Gene Catalog (HG_ IGC) study, there were 1267 published samples for which we collected metadata to label sample names with disease status, geographical origin, gender, age, BMI values and original sample id. An example such sample name looks like Crohns_disease.E.Spanish.F.Age.25.BMI.18.ID.V1.CD1-0. Using portions of HG_IGC gene catalogue, individual samples based gene information tables were produced, based on gene abundance availability of its sample, and indexed all of these 1267 samples for comparison and exploration through DMAP compare module, see Project 118. Furthermore, to analyze reduced set of sample we created groups of samples by aggregating gene information table for unique set of genes belonging to devised sample groups. Sample groups were devised based on disease status, geographic origin and obesity status, as shown in this example sample group name, T2D_E_nonObese where E, A or C represent Europe, America or China. This strategy produced 25 sample groups, see project 115.
First to look at the summary pages of TARA ocean gene catalogue, a snapshot of the interactive summary page from DMAP Compare Module is shown in Figure TARA-1. It shows different specific categories of information e.g. summary, taxonomy, kegg orthologues, pathways etc to look number of genes assigned to relevant categories, explore this and other gene catalogues through DMAP project 119, after clicking on the ‘public access’ button. This projects provides access to gene catalogs from TARA microbial, TARA eukaryotic (TARAeu), Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) and for completeness we include Integrated Gut gene catalog from Human (Human_IGC). A snapshot of top 20 taxons from these gene catalogs, is shown in Figure TARA-2, considering the most abundant single copy gene applied with least common ancestor based taxonomic assignment. DMAP project 119 also contain several other interesting gene catalogs we constructed from ecological metagenomes downloaded through European Nucleotide Archive (ENA).
For comparison of samples or gene catalogs, DMAP provides all standard features of MetaRep. Additional features include interactive heatmaps to either compare pathway modules across samples with confirmed incomplete or complete status (see figure TARA-3) or to explore magnitude difference of unique gene counts or expression from samples being compared (see figures TARA-4,5 ).
Figure TARA-1. Summary of TARA gene catalog for taxonomy or functional role assignments.
To make data analysis interesting we provide several labels, called filters, to focus on one or more subsets of annotations at a time e.g. KO assigned genes or Single Copy Genes (SCGs), or Industrial Proteins of Interest (POIs) and many more. Filter information is stored in the gene information table and also available when exploring or comparing samples using DMAP Compare module, see figures TARA-2,4 where results are shown from selected SCG filter.
Figure TARA-2. Top 20 taxons detected in non-redundant gene catalogs from TARA microbial (TARA_GCv2), Human Gut Microbial (HG_IGCv2), TARA eukaryotic (TARAecCatV2) and Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) considering the least common ancestor approach and the most abundant single copy genes. For exploration or comparison see relevant gene catalogs in DMAP project 119.
By expanding DMAP capability to use more computer nodes with higher memory, taking TARA gene catalog as a common reference alongside gene abundance estimates from each sample, we are able to provide all individual 243 TARA ocean samples online for exploration or comparisons through DMAP compare module. The number of samples here is high so in order to study groups of samples, we also grouped non redundant unique gene counts for similar samples, considering filter size, ocean type and sampling depth. Of the total provided 47 sample groups there are 24 sample groups from prokaryotic enriched (eProkaryote) samples, 13 sample groups from Girus (eGirus) enriched and 13 sample groups from Virus (eVirus) enriched samples, see figures below for example taxonomic and functional role comparisons from TARA sample groups.
Figure TARA-3. Interactive (link) KEGG pathway module test for TARA Ocean sample groups
An interactive heatmap of all TARA sample groups with results from KEGG pathway modules completion test is shown in Figure TARA-3 where a colored cell represents a module is designated the status complete. Example section of this interactive heatmap shows complete or incomplete aromatics degradation modules from KEGG, follow the link provided to explore the status and detail of all the modules. A click on any cell of the heatmap triggers mapping of the underlying sample KO gene information to pathway module diagram at KEGG website leading to clear cut evidence explaining how a module of a pathway is considered complete or incomplete in our interactive heatmap.
Figure TARA-4. A heatmap taxonomic breakdown of enriched prokaryotic sample groups
Figure TARA-5. Interactive heatmap (link) showing fold change (of unique gene counts) for each binary sample comparison for a selected taxonomic level.
The above shown example snapshots from DAMP system entail how the complexity of metagenomic analysis for big studies like TARA ocean can be made slightly easier for non computational scientist with an encouragement to contribute, without worrying about required computational resources, towards a deeper understanding and benefits of the microbial world from planet earth. More true and success stories can be developed from such a huge resource of genetic information from diverse environments and above all everybody from scientific community can contribute using free of cost and online exploration of derived data of this scale.
Human Integrated Gut gene catalogue was established using ~1200 metagenomic samples from healthy subjects and others with certain diseases primarily type II diabetes. This gene catalogue comes with around 10 million unique genes obtained through assembly gene prediction and clustering of genes across 1267 samples covering europe, china and american geography. This is very interesting work with several papers coming out explaining different aspects of microbes on human health and disease and geographic patterns most probably related to diet preferences. A summary of taxonomic and functional role categories of all Human Gut genes is shown figure HG_IGC1. To interactively explore complete or incomplete modules from KEGG pathways in Human Gut sample groups (control vs disease, geographic origin, etc), a snapshot is shown with a link in figure HG_IGC2.
Figure HG_IGC-1. Summary of Human Integrated Gene Catalog (HG_IGC) showing proportion of unique genes available with taxonomy or functional role assignments
Figure HG_IGC-2. Interactive (link) KEGG pathway module test for Human Gut sample groups
European Nucleotide Archive (ENA) provides access to all public Nucleotide data alongside extensive metadata of samples. Metagenomic studies published were all based on best methods available of analysis at that time. Earlier studies were all based on direct shart reads based (mostly incomplete gene prediction) analysis however, with significant improvement in metagenomic assembly methods, recent works are all based on assembled metagenomes leading to recovery of full length genes.
The task for revisiting all previous metagenomic samples, based on recent improvements in assembly methods and best practices in metagenomic analyses, is challenging considering the required computational resources. However, If accomplished, this could lead to an unbiased impressive pool of derived metagenomic data (including full length genes) from different environments of planet earth, making the task of comparative metagenomics and novel gene discovery relatively straightforward.
EBI metagenomics annotation pipeline processes data from ENA with their efficient methods splitting RNA based taxonomic profile and InterPro based functional profile of individual samples. When assembly of a metagenomic sample is available, without any competition EBI metagenomic analysis can be complemented for every predicted gene with features, in addition to InterPro domains, such as taxonomic affiliation from UniProtKB and functional role affiliation based on KEGG Orthology (KO) datasets and associated cross references e.g. see an example table below. Computationally demanding task but if such a gene table could be obtained for all metagenomic samples from planet earth, it can lead to unbiased comparative metagenomics and novel gene discovery.
Table: An example gene table for a sample showing extended annotation (category) columns.
ENA provides an efficient advance search interface to lookup metagenomic samples based on the available metadata. In order to produce gene tables for ENA metagenomic samples, we conducted a pilot study. Here a number of samples were downloaded from ENA where sequencing platform was illumina paired end, protocol was WGS and taxonomic label of samples was ecological metagenomes.
A Gene catalogue, similar to ones from TARA and Human microbiome, serves as a common reference to compare large number of related samples. In our pilot study we construct gene catalogues, one for each of the types of ecological metagenomes obtained from ENA, see DMAP project 119. Individual samples are first preprocessed using bbduk and assembled using MegaHit assembler, followed by prediction of complete genes using Prodigal and clustering of genes into a non-redundant set using CD-Hit producing one gene catalogue for each type of ecological metagenome. These gene catalogues are then annotated using high throughput implementation of BLAST and InterProscan in Automatic Annotation of Microbial or Metagenomes (AAMG) to produce gene information tables, called AAMG TSV gene tables, one for each gene catalogue. In order to produce gene abundance estimates and AAMG TSV per sample, paired end read from each sample are mapped using bbmap onto their respective ecological metagenome gene catalogues. A list of all gene catalogues included in DMAP is presented in a table in a section above, Gene Catalogues and samples in DMAP .
Alam, I., Antunes, A., Kamau, A.A., Kalkatawi, M., Stingl, U. and Bajic, V.B., 2013. INDIGO–INtegrated Data Warehouse of MIcrobial GenOmes with examples from the red sea extremophiles. PloS one, 8(12), p.e82210.
Sczyrba, A., Hofmann, P., Belmann, P., Koslicki, D., Janssen, S., Dröge, J., Gregor, I., Majda, S., Fiedler, J., Dahms, E. and Bremges, A., 2017. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nature methods, 14(11), p.1063.
Goll, J., Rusch, D.B., Tanenbaum, D.M., Thiagarajan, M., Li, K., Methé, B.A. and Yooseph, S., 2010. METAREP: JCVI metagenomics reports—an open source tool for high-performance comparative metagenomics. Bioinformatics, 26(20), pp.2631-2632.
Schattner, P., Brooks, A.N. and Lowe, T.M., 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic acids research, 33(suppl_2), pp.W686-W689.
Lagesen, K., Hallin, P., Rødland, E.A., Stærfeldt, H.H., Rognes, T. and Ussery, D.W., 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic acids research, 35(9), pp.3100-3108.
Huang, Y., Gilna, P. and Li, W., 2009. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics, 25(10), pp.1338-1340.
Cole, J.R., Wang, Q., Fish, J.A., Chai, B., McGarrell, D.M., Sun, Y., Brown, C.T., Porras-Alfaro, A., Kuske, C.R. and Tiedje, J.M., 2013. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic acids research, 42(D1), pp.D633-D642.
Deu-Pons, J., Schroeder, M. P., & Lopez-Bigas, N. (2014). jHeatmap: an interactive heatmap viewer for the web. Bioinformatics, 30(12), 1757-1758.
Hyatt, D., Chen, G.L., LoCascio, P.F., Land, M.L., Larimer, F.W. and Hauser, L.J., 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 11(1), p.119.
Rho, M., Tang, H. and Ye, Y., 2010. FragGeneScan: predicting genes in short and error-prone reads. Nucleic acids research, 38(20), pp.e191-e191.
Zhu, W., Lomsadze, A. and Borodovsky, M., 2010. Ab initio gene identification in metagenomic sequences. Nucleic acids research, 38(12), pp.e132-e132.
Jones, P., Binns, D., Chang, H.Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G. and Pesseat, S., 2014. InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), pp.1236-1240.
Li, J., Jia, H., Cai, X., Zhong, H., Feng, Q., Sunagawa, S., Arumugam, M., Kultima, J.R., Prifti, E., Nielsen, T. and Juncker, A.S., 2014. An integrated catalog of reference genes in the human gut microbiome. Nature biotechnology, 32(8), p.834.
Sczyrba, A., Hofmann, P., Belmann, P., Koslicki, D., Janssen, S., Dröge, J., Gregor, I., Majda, S., Fiedler, J., Dahms, E. and Bremges, A., 2017. Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software. Nature methods, 14(11), p.1063.
Sunagawa, S., Coelho, L.P., Chaffron, S., Kultima, J.R., Labadie, K., Salazar, G., Djahanschiri, B., Zeller, G., Mende, D.R., Alberti, A. and Cornejo-Castillo, F.M., 2015. Structure and function of the global ocean microbiome. Science, 348(6237), p.1261359.