KAUST Metagenomic Analysis Platform

For the Impatient, KMAP available at http://www.cbrc.kaust.edu.sa/kmap is a public resource providing annotations and comparison of genomic or metagenomic sequence data through its Annotation and Compare Modules, see figure 1 below for KMAP workflow. Results for over 275 million genes annotated from several existing and new metagenomic gene catalogues are available as gene information AAMG TSV tables containing extended taxonomic, functional role and industrial application assignments. These results are available for download, online viewing, (see Table 1 below for a complete list and links) or analysis through KMAP Comparison Module project 119.  

Table of Contents

Introduction

Access to KMAP and Example Use Case

Public Access

Private Access

CAMI data, a use case for KMAP

Gene Catalogues and samples in KMAP

KMAP Modules

KMAP Annotation Module (KAM)

Input to AAMG

Input sequence types and options

Locus Tag

BLAST bitscore cutoff

AAMG Job Submission and Job History

AAMG Overview

Gene Prediction

Gene Annotation

Gene Annotation Thresholds

Gene Annotation and Parallel Computing

Gene Annotation Limitation

Purpose built databases

Bioactive peptides

Industrial Proteins of Interest (POIs)

Combining AAMG based Annotation and WWW visualization

AAMG Results in GFF, FASTA and TSV

Annotated Gene Subsets as Filters

Integration of Gene Abundance or Read Mapping data

AAMG Results through Interactive Visualizations

Annotation Features

Retrieve Sequences

Contig Binning and Metagenomic Species Binning

COGs Binning

Taxonomic Assignments

Functional Role Assignments

Annotation Data Files

KMAP Comparison Module (KCM)

Analyze samples in a Project

Single Sample exploration

View Sample

Search Sample

Browse Sample for Taxonomy, Pathways, Gene Ontology, Enzymes

Multiple Sample Comparisons

Absolute Counts

Visualization of Metagenomic Sample Comparison

Heatmap plot (and fold change)

Interactive Heatmaps

Stacked Bar Plot (and fold change)

Pathway Comparisons

Statistical Tests (two samples)

Interactive KEGG Pathway Module Completion Test

Examples of Metagenomic Comparisons

Tips for using KMAP Compare Module

Samples and Sample Groups

Examples from TARA ocean

Human Integrated Gut Microbiome.

ENA ecological metagenomes.

Gene catalogues from ecological metagenomes

References


Introduction

KAUST Metagenomic Analysis Platform (KMAP) has a variety of  useful resources for genomic and metagenomic analysis, categorized into two easy to understand modules namely Annotation module and comparison module. Users interested in already annotated gene catalogues or samples can use KMAP Comparison Module (KCM) for browsing or comparison of samples at various biological hierarchies  such as taxonomy, gene ontology (GO), Kegg Pathways, Kegg Orthologs (KO), Enzymes, etc. Other users interested in the analysis of their own genomic or metagenomic samples can make use of KMAP Annotation module (KAM) available via improved version of our pipeline for Automatic Annotation of Microbial/MetaGenomes (AAMG, Alam et al., 2013). KMAP workflow is shown in Figure 1 below.

Figure 1. KMAP Workflow.

The following documentation will help you explore KMAP Annotation and Comparison modules through some use cases. We first describe access to KMAP and public projects.


Access to KMAP and Example Use Case

KMAP is accessible free of charge without any need to login at the following URL: http://www.cbrc.kaust.edu.sa/kmap , following is a snapshot of the mainpage:

Figure 2. KMAP main page showing KMAP options to access this system.

Public Access

Clicking on ‘Public Access’ on the main page takes you to project area showing publicly available projects and an option to annotate your own sample data through AAMG, as shown in Figure 2 below.

Figure 3. KMAP Project area for Public Access

Private Access

We recommend creating a user account if you wish to use your own data, otherwise Public Access is sufficient to understand the capabilities of KMAP and to access public projects.  To create a user account, click on ‘Register’ link on the main page, fill up all the required details and submit. For account activation, please send an email to us at allan.kamau@kaust.edu.sa , once your account is activated you can log in using your username and password to submit your data for annotation via AAMG, access your annotation jobs history or access private or public projects assigned to you for browsing or sample comparisons.

CAMI data, a use case for KMAP

We are using CAMI (Sczyrba et al., 2017) data as a use case to explain capabilities of KMAP using metagenomic contigs as input. To this end we annotated already available golden assemblies from Low, Medium and High complexity CAMI metagenomic samples using AAMG pipeline available in KMAP. For browsing and sample comparison, a gene information table, referred as AAMG Tab Separated Values (TSV) based Annotation results. These data are lucene indexed and uploaded into KMAP Comparison Module (KCM). The KMAP Comparison Module (KCM) is based on an improved version of Metarep (Goll et al., 2010).

The peptide annotations generated through KMAP annotation module can be downloaded in AAMG TSV and other formats (see section on AAMG output) for use in analysis through other metagenomic sample comparison systems.

 For local use or analysis through any other metagenomic sample comparison system, annotations generated through KMAP annotation module can be downloaded in AAMG TSV and other formats, see section on AAMG output.

Gene Catalogues and samples in KMAP

We provide access to KMAP based gene information tables containing extended annotations for over 275 million genes obtained from several existing public metagenomic gene catalogues alongside a number of new gene catalogues obtained from 36 different  ecological metagenome types obtained from European Nucleotide Archive (ENA).

Large public studies

Several public metagenomic gene catalogues from microbial or micro eukaryotic origin were annotated using AAMG protocol. This includes large gene catalogues from TARA Ocean microbial and micro eukaryotic studies, Metagenomic Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) and Human Gut Microbiome. In case of microbial gene catalogues, Coding DNA Sequence (CDS) input option in AAMG was used. However in the case of micro eukaryotic gene catalogues, protein sequence input was used. Access links to complete AAMG output, AAMG TSV and Industrial POIs are provided in the table below.

ENA public metagenomes

In our first pilot study, with a long term aim to provide gene information tables and mining of industrial POIs for all public metagenomic samples deposited to European Nucleotide Archive (ENA), we started downloading fastq files for different types of ecological metagenomes from ENA. We produce gene catalogues for every metagenome type by first assembling individual metagenomes, followed by gene prediction and clustering of full length genes. Reads are mapped from each sample to the relevant gene catalogue to obtain gene abundance information. Gene catalogues are annotated using KAMP Annotation module. As of July 2018, we provide gene information tables and complete annotation for 36 metagenome types, see KMAP supplementary table for included gene catalogues and links to annotation resources.

KMAP compare module provides access to Graphic User Interface (GUI) for metagenomic data view, browsing, or comparison. All gene catalouges we annotated are available for access in KMAP Project 119, http://www.cbrc.kaust.edu.sa/kmap/projects/view/119.

User samples

User submitted assembled metagenomes or gene catalogues can be annotated through KMAP and upon user request such datasets can be indexed and provided for public or private access. A number of metagenomic samples were annotated and indexed for user specific projects with private access. These projects will become public on user request most probably at the time of publication.


KMAP Modules

We use CAMI data in our use case to describe different aspects of two KMAP modules. First the Annotation Module and secondly the Comparison Module.

KMAP Annotation Module (KAM)

An overview of KMAP module is provided in Figure 1, showing types of input accepted, mechanism of gene annotation and integration of results. More details are provided in the relevant sections below.

KMAP Annotation Module is accessible to users through the AAMG tab visible in KMAP interface at the top left side, see Figure 3, after public or private access. Clicking on the AAMG tab shows an interface to submit user sequence data for annotation, or explore job history, see Figure 4 below. Job history is not available for public access or guest user.

        

Figure 4. AAMG input sequence submission options.

Input to AAMG

AAMG accepts Coding DNA sequences (CDS) and protein sequences from any source. AAMG also accepts genome or metagenome annotation data, these data are required to be chromosomes or contig sequences from a microbial source. This is because the gene prediction methods employed by AAMG at present are “microbial only” gene predictions methods.

AAMG accepts input sequence data from any source in case of proteins or Coding DNA Sequences (CDS). However for a genome or metagenome annotation it requires chromosomes or contig sequences from a microbial source due to microbial only gene prediction methods implemented at present. As shown in Figure 4, AAMG submission interface provides three input types and additional options accordingly.

                        Input sequence types and options

  1. Microbial complete genome or contigs as input. This input type shows an option to select the source of data either as Bacteria, Archaea, Viruses or a Metagenome. In addition it shows gene prediction method choice to select one of Prodigal, FragGeneScan or MetaGeneMark. Taxonomic lineage, as much as known, is also required.
  2. Coding DNA Sequences (CDS) as input. This option accepts input gene CDSes from any source, be it from a microbial source or eukaryotes, for example a gene catalog obtained from several related metagenomes.  CDSes will directly be translated into proteins for processing through AAMG. Taxonomic lineage, as much as known, is also required. This option is useful to annotate large gene catalogues obtained through clustering of genes from several related metagenomes.
  3. Protein sequences as input. This option accepts input proteins from any source e.g. gene catalogues obtained from several related metagenomes, etc. Taxonomic lineage, as much as known, is also required. In case of Metagenomes, users can write the lineage information as the following example: Metagenomes;Ocean;Atlantic Ocean;Time Series; AOTS. This aption allows users to annotate protein sequences obtained from

There are two additional fields which are needed, one is Locus Tag and the other as BLAST bitscore cutoff.

Locus Tag

Locus Tag is a unique string you assign to your annotation run so that you can identify the type of data you annotated. E.g. if your annotation is for a dataset such as CAMI, the low complexity metagenome and you choose Prodigal as a gene prediction method, you could set ‘CamiLowProdigal’, as string from Locus Tag option.

BLAST bitscore cutoff

There are several ways to restrict the list of BLAST based matches when comparing a protein sequence obtained from an input to AAMG against reference databases such as UniProt and KEGG KOs.  One could use either an E-value, Percent Identity or BLAST bitscore. We use BLAST bitscore inclusive lower bound cutoff of 50 to obtain a reasonable list of matches. Later we provide a mechanism to restrict the results in KMAP compare module based on E-value, Percent Identity and/or Coverage. In this BLAST bitscore cutoff option at AAMG submission stage a user can select BLAST bitscore of 50 or higher.

AAMG Job Submission and Job History

When all options are provided at the AAMG submission page, the user can upload their input sequence file and click on Submit Dataset, as shown in the example below in Figure 5. After a datasets has been uploaded, the webpage displays the summary information such as file size of the uploaded data. This is the stage the user can now submit the annotation job for processing by clicking on ‘submit for processing’. This page is updated with the information on the sequences submitted and provides a job id for this run.

AAMG annotation job history is not available when in public access mode. However, when a user is logged in using their KMAP account, link to ‘View History’ provides the list of successful AAMG jobs submissions with links to view AAMG output for each Job Id.

Figure 5. AAMG interface job submission section.

AAMG Overview

Automatic Annotation of Microbial or Meta Genomes is the core of KMAP annotation module. Input and options accepted through KMAP annotation module are passed to AAMG for comparison to public RNA or protein reference databases. It also includes profiling of genes to our purpose built Bioactives and Industrial Proteins of Interest (POI) datasets. Results from comparison to multiple sources are analyzed and integrated to provide robust taxonomic and functional role annotations. The following sections provide specific details of AAMG processing of input data through different stages of annotation.

Gene Prediction

In case of input as assembled sequence data, in the form of chromosomes or contigs from a microbial source, AAMG first kicks in RNA prediction and masks the input DNA sequences ready for protein coding gene prediction. The RNA prediction stage includes tRNA genes such using tRNAscanSE (Schattner et al., 2005) and rRNA using RNAmmer (Lagesen et al., 2007). Recently improved AAMG includes high throughput metagenomic rRNA prediction using (Huang et al., 2009)  and also non-coding RNA (ncRNA) prediction using Infernal (Cole et al., 2013).  RNA prediction is followed by user selected protein coding open reading frames (ORFs) methods. Available choices are Prodigal (Hyatt et al., 2010), FragGeneScan (Rho et al., 2010) or MetaGeneMark (Zhu et al., 2010).

In other input cases, both RNA or protein ORF gene prediction steps are not carried out. In these cases sequences are accepted from any source, for example a non-redundant gene sequence catalogue from several metagenomic samples, or gene sequences from a non-microbial sources, etc. Only additional step in case of  input being protein CoDing Sequece (CDS), is the standard translation to protein sequence. Once protein version of gene sequences become available AAMG gene annotation process kicks in.

Gene Annotation

Annotation of genes is performed by comparing predicted genes to reference databases. In order to obtain the closest possible taxonomic information, AAMG performs a blastn search for Ribosomal RNA genes to RNA genes obtained from NCBI and similarly Infernal based ncRNA genes are compared to RNA Family (RFAM) reference sequences from EBI.

Protein coding ORFs comparison to reference databases is carried out using highly parallel blastp implementation segmenting the input into equal size fragments. ORFs are first compared with UniProtKB database (www.uniprot.org) to seek relative taxonomic information, eggNOG (http://eggnogdb.embl.de) references and generic functional role information. For a more robust functional role assignments, ORFs are compared to subset of KEGG (www.kegg.jp) database sequences where KEGG Ortholog (KO) identifier information is readily available. Finally, ORFs are compared to protein functional signature databases in InterPro using InterProScan (Jones et al., 2014) leading to Gene Ontology (GO), Pfam domains, TigrFAM domains and Prosite signatures assignments.

Gene Annotation Thresholds

To control the quality of matches during gene annotation process, AAMG applies a user provided BLAST alignment or bitscore threshold, with minimum allowed bitscore of 50 and BLAST e-value of 1e-2. Later at the stage of parsing and integrating BLAST based results from multiple sources,  a minimum percent identity of 30 and minimum percent query coverage of 50 is enforced.

Gene Annotation and Parallel Computing

The process of gene annotation consists of three main steps, first the computationally most expensive step of alignment to reference databases, secondly processing of the alignment reports and finally combining annotations from multiple sources. The time complexity of these steps increases with the increasing number of query sequences. The alignment step becomes computationally very demanding when the size of reference databases is large, e.g. UniProtKB provides around 100 million sequences. To speed up our alignment step, we produce equal size (e.g. 2MB) subsets from the query set of sequences and for all subsets we perform BLAST or InterProscan jobs in parallel.

Slide09.jpg

As shown in the above Figure, taking example of TARA oceans gene catalog with 40 million genes, we produced ~6000 subsets leading to the same number of jobs for e.g. BLAST running in parallel. Average run time of jobs varied slightly but all BLAST jobs reached completion in less than 24 hours. To speed up the subsequent BLAST report parsing, we process reports from all subsets in parallel, read processed results in memory and write the final filtered output for further processing. This requires large computing resources and we make use of KAUST’s high performance computer cluster, Ibex with around 50,000 cores, and KAUST’s supercomputer, Shaheen II with around 200, 000 cores.  

Gene Annotation Limitation

Please note that AAMG uses KAUST’s Ibex cluster to process user data free of charge. In cases where predicted genes exceed 1,000,000 AAMG is able to carry out expensive annotation steps using KAUST’s Shaheen II resource equipped with ~200,000 processors. In order to process large datasets, where access to Shaheen II is required, a special arrangement can be made upon user’s request. Otherwise it is recommended to split your large datasets into reasonable sizes such that estimated number of genes are below 1 million.

Purpose built databases

We prepared two purpose built databases containing references to important genes with a potential of a bioactive peptide or relevance to a particular industry. User data is profiled against these databases to pinpoint important genes present in a genome or metagenome being annotated through AAMG.  

Bioactive peptides

We search UniProtKB data for genes curated with certain types of keywords relevant to bioactive peptides such as antimicrobial, antiviral, antibiotic biosynthesis, bacteriocin, toxin, bacteriolytic enzyme, polyketide synthase, secreted, etc. This keyword information is placed in the sequence header of UniProtKB sequences and relevant bioactive keyword groups are reported during gene annotation process based on BLAST results for genome or metagenome under AAMG processing.

Industrial Proteins of Interest (POIs)

We searched several resources, including research papers and public databases, e.g. brenda, to obtain a collection of enzymes or proteins, with potential role in or relevance to a particular industry, linking KO ids or enzyme classification numbers to relevant industries, we call this resource Industrial Proteins of Interest (POIs). There are several types of industries for which we have a collection of relevant genes. Example of industries are shown in Figure 6, below.

Figure 6. Groups of Industrial Proteins of Interest (POIs).

During the integration of gene annotations, we lookup our POIs resource and associate relevant information to genes from a genome or metagenome annotated by AAMG.

            Combining AAMG based Annotation and WWW visualization

AAMG Results in GFF, FASTA and TSV

Results obtained through gene prediction followed by gene annotation process are analyzed to obtain both taxonomic and functional role assignments. Initially, BLAST or InterProscan results from multiple sources are parsed to keep first level of result files preferably in GFF format. Related specific taxonomic or functional assignment results are saved per reference database, see Figure 13 for snapshot of example result files. Finally, all of the derived results are integrated into AAMG TSV, a comprehensive 19 columns table, see Table 1 for an example. This data integration is based on annotations from multiple sources for each gene considering minimum percent identity of 30 and minimum percent query coverage of 50.  

AAMG TSV is similar to MetaRep (Goll et al., 2010) tab format data file. AAMG TSV stores annotation information for a genome or metagenome using unique gene and a sample ids. In cases where multiple values are available for a particular annotation column, values are compiled together using the “||” delimiter for later use.

 

Table. AAMG TSV table columns and examples

Taxonomic assignments based on UniProtKB or KEGG BLAST results, we call Gene Based Taxonomy (GBT) assignments, are obtained using both BLAST best hit approach and also the Lowest common ancestor (LCA) approach but AAMG TSV shows only the LCA based taxonomic assignment. Our GBT implementation first sorts BLAST results based on query length normalized percent identity. In the best hit approach the taxonomic information for the top hit is saved. However, in LCA  approach top alignments, with a difference of no more than 10% of the best hit bitscore, are processed to obtain a taxonomic level where taxonomic information from top alignments converge. BLAST LCA based results are saved in the AAMG TSV file however both best hit and LCA results are available separately in files marked as GBT in the name. AAMG TSV columns such as Taxon ID, Evalue, Percent Identity and Percent Coverage are populated using hits, in the order of higher to lower stringency, from either UniProtKB or KEGG.

Robust functional role assignments obtained, through KO identifiers available for best hits from KEGG KO BLAST results, are utilized to populate AAMG TSV columns such as Gene Name, Gene Name Source, EC ID, EC ID Source, KO ID and KO ID Source. If no such assignments can be obtained, generic gene descriptions for Gene Name and Source column are obtained from UniProtKB BLAST results. COG and eggNOG columns in AAMG TSV are populated with relevant information, when available, from eggNOG cross-references to UniProtKB hits. Additionally, InterProScan based results are used to populate GO ID, GO ID Source and InterPro ID fields.  

Gene Name Source column captures all sources of annotation e.g. source chromosome or contig on which gene was predicted, source hit IDs from UniProtKB, KEGG or InterPro. In addition it stores the BLAST scores for KEGG based hits. Usually AAMG TSV taxon Id, e-value, percent identity and percent coverage information is obtained from UniProtKB, unless KEGG based hits show better scores.

Annotated Gene Subsets as Filters

To analyze explosion of gene annotations from multiple sources, AAMG provides an easy way to study a more focussed subset of genes at a time. In AAMG analyses we produce subsets or ‘Filters’ based on grouping annotations into several categories, marked here with alphabets from A-J, as shown below. These categories can have sub categories for studying the data with more resolution. Information of ‘Filter’ assignment to each gene is recorded in column 14 of AAMG TSV.

  1. Annotated or notAnnotated
  2. KO
  3. Enzyme
  4. InterPro
  5. COG
  6. Bioactives with sub categories
  7. POIs with sub categories
  8. eggNOG
  9. Single Copy Genes (SCG) with sub categories e.g. Universal, Prokaryotic, etc.
  10. Viral Marker Domains with sub categories.

These filters can be used to focus on a subset of genes for a particular analysis in mind. Our KMAP Sample Comparison Module, see the relevant section below, provides exploration or comparison of samples where focus on subset of genes is possible by selection of filters, a few examples with a filter selected are shown in Figures 17, 21, 25.

Integration of Gene Abundance or Read Mapping data

AAMG TSV has a specific column, called weight, to include numerical gene abundance or read mapping information, see column 17 in Table 1. This weight is counted as 1, if no information is provided. Otherwise this field can be populated using normalized read counts e.g. FPKM, etc. This is very useful in comparing multiple genomes, transcriptomes or metagenomes. It becomes more useful data reduction and comparison approach when multiple related metagenomes are being compared based on a common gene catalogue, see the section on gene catalogue based metagenomic comparisons below.

 

                AAMG Results through Interactive Visualizations

AAMG computational processing is carried out on our High Performance Computing (HPC) cluster, Ibex. During the annotation process BLAST and InterProscan results are processed and saved for various taxonomic and functional role assignment visualizations, see breakdown of these in the following section. Once the processing of an annotation project is complete, all results are transferred to our AAMG visualiation server, http://www.cbrc.kaust.edu.sa/aamg/{job.id}.

Each annotation project receives a unique job id based on ‘LocusTag’ and user credentials. Such a job id becomes a basis for the unique www url for each individual project. In our use case here, we used a locus tag ‘CamiLowProdigal’ and AAMG generated the following job id,  1502720332816_CamiLowProdigalV2_40.0_intikhab. Complete annotation result files alongside various visualizations for this and other two example projects are saved at the following URL:

http://www.cbrc.kaust.edu.sa/aamg/1502720332816_CamiLowProdigalV2_40.0_intikhab/

http://www.cbrc.kaust.edu.sa/aamg/1502720420145_CamiMediumProdigalV2_40.0_intikhab/

http://www.cbrc.kaust.edu.sa/aamg/1502720475383_CamiHighProdigalV2_40.0_intikhab/

A few example annotations and visualizations are explained the following section. As shown in Figure 7 below, AAMG provides a user friendly interface to results for an annotation project. It shows the following four sections or tab, namely ‘Annotation Features’, ‘Taxonomy Assignments’, ‘Functional Assignments’ and ‘Annotation data files’.

 

Annotation Features

The ‘Annotation Features’ tab provides counts of different kinds of genes predicted and their annotation through several sources including UniProt, KEGG, COG, InterPro and Gene Ontology alongside a count of total annotated and unassigned genes. Three additional features available on this page are a) Retrieve Sequences, b) Perform Contig Binning and c) Perform COGs Binning.

Figure 7. A snapshot of AAMG output web page.

Retrieve Sequences

To obtain sequences from an AAMG annotation project, it requires a project id and a list of contig or gene identifiers. Clicking on ‘Retrieve Sequences’ link, from the main ‘Annotation Features’ tab of a project, opens an interface to accept project Id and list of sequence identifiers leading to download of related DNA or protein sequences. Sequence identifiers are available while exploring AAMG result data files.

Contig Binning and Metagenomic Species Binning

Pushing the ‘Perform Contig Binning’ button while selecting a particular taxonomic level on the ‘Feature Counts’ page initiates the contig binning process of the corresponding project. During this process, Gene Based Taxonomy (GBT) information is summarized for each contig in such a way that contig bins are formed where the majority of genes on a contig belong to a common taxonomy, as shown in Figure 8,  at the user selected taxonomic level e.g. phylum or class, etc. Those contigs are considered clean where majority, above 65%, of genes converge on a common taxonomic assignment while others are marked novel or potentially chimeric for further investigation.

Figure 8. An example contig binning report

A contig binning report, as shown in Figure 8, explains for each contig bin the total length of contigs, number of contigs and percent of genes assigned to taxonomy of this bin at the user selected level. In case of gene catalogues, e.g. from TARA ocean or Human Gut Microbiome or else, this binning is entirely based on genes not contigs and it results in full collection of diverse set of genes which belong to taxonomic groups at a selected taxonomic level, leading to potential Metagenomic Species (MS) bins. A list of gene catalogs annotated using KMAP with potential MS bins is available at http://www.cbrc.kaust.edu.sa/aamg/KMAP_Data/ .

 

COGs Binning

Another binning option provided by AAMG is COGs binning. In this binning approach, AAMG annotated genes with COG assignments are collected and summarized into higher COG functional classes, as shown in Figure 9, from the corresponding genome or metagenome project. The COGs binning results from multiple genomes or metagenomes can become the simplest approach to compare functional potential.

Figure 9. An example COGs binning report.

Furthermore, Taxonomy Assignment and Functional Assignment tabs provide access to each sub type of annotations in the form of top ten hits, links to interactive Krona graphs and their background summary and full tables, a snapshot of these tabs is shown in Figure 10 below. The ‘Annotation Data Files’ tab provide access to download result files.

Figure 10A. Taxonomy Assignments.

Figure 10B. Functional Assignments.

Figure 10C. Annotation Data Files.    

Taxonomic Assignments

Taxonomic detail of a dataset is provided based on Gene Based Taxonomic (GBT) annotations, see Figure 10 A. In the case of CDS or proteins as input, only protein coding genes based results are provided. However in the case of chromosomes or contigs as input, RNA based taxonomic assignments are also included as shown in Figure 10A and also indicated in the workflow of input to KMAP Annotation module in Figure 1. Combined taxonomic assignments section shows results from the integration of taxonomic assignments from all sources. An example from such an integration is shown as interactive Krona graph in Figure 11 below. Clicking on this image or on the hyperlink in the figure legend will lead to the actual interactive example. These are searchable interactive graphs allowing a user to select depth of a hierarchical level under investigation for which a breakdown of assignments can be explored.

Figure 11. An interactive Krona Graph for taxonomic assignments.

Complete background summary count data, used in creating such an interactive Krona graph, is available for download. A non-summarized version of this data is also available as raw TSV file, see for reference the last section of Figure 10 A. Similarly other taxonomic assignment tabs provide links to actual data and interactive graphs in relevant categories, see example links in the Table below.

Taxonomy assignments and example hyperlink

1

Combined Gene Based Taxonomy (GBT) assignment example.

2

Protein Coding GBT assignment example.

3

RNA GBT assignment example.

4

RFam GBT assignment example.

(Please click on Taxonomic Assignments tab once you reach the above hyperlinks).

Functional Role Assignments

Robust functional role assignments in AAMG are obtained through comparison of gene sequences to specific gene function reference databases and available relevant cross references therein. There are several categories of functional role assignments available in AAMG results, including GO, Industrial POIs, COG, KEGG, Bioactives and InterPro domains. An example interactive graph for KEGG classification of our example use case CAMI data is shown in Figure 12 below.

Figure 12. An interactive example of KEGG functional hierarchy graph

 

Users are able to see interactive krona graphs for each of these categories alongside options to download background summary counts and unsummarized gene counts data as TSV tables, see lower section of Figure 10B and the following hyperlinked categories from our use case:

(Please click on Functional Assignments tab once you reach the above hyperlinks).

Annotation Data Files

Complete AAMG annotation results data is available either as individual files or a tar.gz archive from ‘Annotation Data Files’ tab, see Figure 10C for an example. A click on ‘result files’ link on this tab leads to an interactive Krona graph showing available files for download such as BLAST results in GFF, InterProScan results as domains gff, AAMG TSV and other specific GFF or Fasta files, see Figure 13 for an example. Clicking on a category representing a file name shows a link to download and size of the file.

Figure 13. Annotation result files graph

Exploring KMAP based results from a single genome or metagenome in itself is very interesting showing the breadth of taxonomic and functional repertoire. For comparison of multiple samples, genomes or metagenomes, it requires a comprehensive sample comparison system. Passing raw data through a single annotation pipeline with reference databases updated on a similar timescale makes multiple sample comparisons more comparable without any technical bias. See section on KMAP Sample Comparison Module below for an insight into multiple samples based analysis.


        KMAP Comparison Module (KCM)

KMAP Annotation Module provides comprehensive annotation of genes based on several reference datasets and its AAMG TSV format table output can easily be used for local analysis or an exploration based on user’s preferred analysis system. We provide exploration or comparison of multiple genomes, metagenomes or transcriptomes through KMAP Comparison Module, powered by MetaRep (Goll et al., 2010). KCM is deployed on several high memory computers and it provides a few additional features on standard MetaRep including interactive KEGG Pathway module completion exploration, other interactive heatmaps and implementation of magnitude or fold change filter.

To understand how to access KMAP, please refer to section on public or private access KMAP above. Once a user is able to get access into KMAP, initial screen shows a list of available public or private projects, see  Figure 3 for an example. KCM analysis options are accessible once a project is selected through ‘view project’ link.

Analyze samples in a Project

Clicking on ‘view’ hyperlink in the project list page leads to ‘View’ section of a project and the ‘View Project’ page shows, see Figure 14, sample metadata as well as options for analyses..

Figure 14. View section of a project showing access to Analyze options.

The ‘Analyze’ menu in the View Project allows a user to either view, search or browse a single sample for hierarchical classification of taxonomy, kegg pathways, enzymes or gene ontology, etc. Compare option in ‘Analyze’ menu provides users with options to select multiple samples and compare their taxonomic or functional profiles to produce a comparative table or a plot. Here below we show a few examples of these options.

Single Sample exploration

View Sample

Options available to explore a single sample are view, search and browse. Sample View option gives access to annotations for all genes, showing at least 20 genes on every page, see Figure 15 for an example.

Figure 15. A snapshot of sample ‘view’.

This page also gives access to look at ‘Summary’ of annotations for the complete sample in terms of assigned taxonomic, Gene Ontology, Enzyme, KEGG Ortholog or total genes. Similarly summary of taxonomic assignments from different domains of life is also shown.

Figure 16. View summary of annotations for a sample.

In addition, on this ‘View’ sample page, one is able to view counts of abundant taxonomic of functional elements in their respective category, see Figure 16. The ‘Filter’ option allows to look at subset of data by selecting a value from the ‘Filter’ menu. For an example, see Figure 17, to look at function and taxonomic summaries for genes assigned with KEGG Ortholog (KO) id, filter ‘KO’ can be selected. Similarly, other ‘Filters’ display results specific to a set of genes where the selected filter is available.

Figure 17. Summary of annotations for ‘Filter’ KO.

Search Sample

The second option available Analyze submenu is to perform search into a particular sample. By default it shows a result based on *.* in the search box. Default results are split into three sections namely top 10 classifications, top 10 pie charts and all search results in text. A help dialogue box, shown on top of the default search results, explains different search options and search procedure. See example search results in Figure 18 below.

For example, to search fixed database IDs e.g ko_id or GO_tree, etc., a search terms like the following will work:

ko_id:K01692 AND go_tree:3824

To restrict the above query, e.g. to bacteria, you would extend the search terms as the following:

(ko_id:K01692 AND go_tree:3824) AND blast_tree:2

 

Figure 18. KCM example showing search annotations of a sample

Browse Sample for Taxonomy, Pathways, Gene Ontology, Enzymes

Browse facility in KCM educates a user to understand hierarchical classification of biological data from Taxonomy, Pathways, Enzymes or Gene Ontology annotations. It provides an interface to select a particular hierarchical level for which results are shown in terms of pie charts and top 10 classifications.  For an example see Figure 19 where using our use case CAMI sample when taxonomic level bacteria is selected, it shows the breakdown of child taxonomy as a list with counts of genes and as proportions in the main pie chart. It also shows top 10 functional classifications and their respective pie charts. A change in the selection of taxonomic class updates results instantly.

Figure 19. An example result from Browse Taxonomy.

Similarly it shows breakdown of sub categories and classes for Pathways, Enzymes and Gene Ontology, try selecting relevant option from Analyze menu of a sample in Project View page or click on example hyperlinks from the table below.

Browse (hyperlinked) examples

1

Browse Taxonomy example

2

Browse KEGG Pathways example

3

Browse MetaCyc Pathways example

4

Browse Enzymes example

5

Browse Gene Ontology example

Multiple Sample Comparisons

The most exciting feature of KCM is Compare Samples. This is a standard feature available in metarep but here we included additional sub features mentioned below.  Compare feature is available from the Analyze menu on the View project page. Once selected, it shows an interface to add or remove samples, apply search patterns and terms and view results as a table or figures.

Figure 20. KCM Compare Samples Interface.

Absolute Counts

In this figure 20 a breakdown of taxonomy is shown for selected samples at the phylum level with visualization option selected as ‘absolute counts’. In this example case or in case of a genome or metagenome, this absolute count value represents number of genes found in a particular sample. In case of transcriptomes or gene catalog based read counts, this absolute count value represents read counts which may already be pre-normalized.

Also, please note the text box in ‘Filter Dataset’ section in the figure above where 16S_rRNA is selected, this is the section that allows us to focus on specific types of annotation subsets, explained as filters in the KMAP Annotation Module.

Visualization of Metagenomic Sample Comparison

Comparison of samples at KCM can also be visualized using several graphs or plots as listed in the options menu on the compare page, namely hierarchical cluster plot, heatmap, multidimensional scaling or stacked bar plot, etc. Different graphical views of the same starting data help analyze a sample comparison more efficiently.

                Heatmap plot (and fold change)

Let us use the example of sample comparison from figure 20, showing results as a gene counts table. Selection of heatmap plot from the options menu, instead of absolute counts, produces a heatmap, see Figure 21 for this example. Heatmap result page offers options to change the underlying ‘distance matrix’, ‘clustering method’, ‘dataset label’, color scheme of the plot and recently added options for ‘interactive heatmap’ and fold change thresholds.

Figure 21. Example of sample comparison as a heatmap

The aspect of fold change is quite interesting to study the magnitude of difference in selected samples in terms of diverse gene or expression counts for taxonomic or functional categories under investigation. Using our example sample comparison from Figure 21, when we apply a fold change value of 4 it results in, see Figure 22, filtering out taxonomic phyla with gene counts below average or below 4 fold magnitude.

Figure 22. Example of heatmap with fold change applied.

Interactive Heatmaps

Static heatmaps enable use to visually see differences in abundance of gene or expression counts, however to dig deeper e.g. in case of fold change estimates among each pair of samples being compared, we added interactive heatmap option, see example Figure 23 for data presented in Figure 22. Interactive heatmaps option, using jheatmap (Deu-Pons et al., 2014), is available via a hyperlink placed just above the plot window. A click on any cell of the interactive heatmap shows a set of samples being compared and shows the scale of fold change annotation categories under investigation.

   

Figure 23. Interactive Heatmaps to study binary fold change information.

                Stacked Bar Plot (and fold change)

To look at sample comparisons from a different perspective we recently added stacked bar plot option. Stacked bar plot offers to present a comparative view of samples showing proportion of contributions from different subclasses for a selected level of taxonomic or functional class, see example figure 24. Items contributing less than 5% of contributions are placed in category, other. The fold change option is also included. Interactive heatmaps are also available here to find out exact fold change or magnitude differences for every set of samples and every functional class being compared. Links to interactive heatmap, marked as ‘view heatmap’, are kept right on top of the window showing stacked bar plot.

 

Figure 24. Compare Samples visualization through stacked bar plot

Pathway Comparisons

KMAP Comparison module built on standard metarep (Goll et al., 2010) framework provides extensive functional aspect of sample comparisons. This includes KEGG Pathway (KO), KEGG Pathway (EC), MetaCyc Pathway (EC), exploration of Enzymes, KEGG Orthologs and Gene Ontology (GO). Figure 26 shows an example of KEGG Pathway (KO) gene counts based comparison of our use case CAMI metagenomic datasets with Low, Medium and High taxonomic Complexity. In this example number of different KO assigned genes from all samples are shown for different categories of pathways at root level. Other available pathway levels to view are super pathways and pathways. Results can be visualized as tables or plots similar to taxonomic visualization option. It is a very interesting feature of standard metarep and KCM that for the same data set, e.g. genes assigned with say KO information can easily be compared across samples for taxonomy, pathways, gene ontology and enzymes by just clicking on the relevant tab in the result panel of the compare page.

Figure 25. KEGG Pathway KO comparison of samples.

Statistical Tests (two samples)

There are several different procedures or options available in KCM to analyze multiple samples from a genome or metagenome for taxonomic or functional annotation categories. One interesting method implemented here is to compare two samples by applying one of the  statistical tests from available fisher exact test or equality of proportion test. Figure 26 shows an example of Fisher Exact Test applied for KEGG Pathway (KO) category of comparisons at level pathway. It shows counts and statistics for two samples compared for each of the pathway detected in samples under investigation.

Figure 26. KEGG Pathway (KO) Fisher exact test

Although these statistical tests can be applied to both taxonomic and functional comparison, there is one additional feature when pathways are compared, see a small figure at the end of the resulting table, clicking on this opens up a comparative analysis page showing two sections, see Figure 27 for an example. The top section of this result shows KEGG Pathway diagram mapped with KO gene counts (number or expression counts), as light or dark colors, detected in one or both samples. Density of the color is related to the density of gene counts or expression data. The second section shows a table of statistics for two samples with gene or expression counts alongside other statistics.

Figure 27. Statistical comparison of two samples for a pathway

Standard metarep based pathway comparison is very interesting but it requires expert knowledge of a pathway to tell whether it is complete or functional in a sample. In simple terms we do not know whether we are able to find all critical genes required for a pathway in a sample or more simply what are those critical genes without which a pathway is inactive or incomplete. In order to distinguish between complete and incomplete pathway modules, we describe our implementation of an interactive pathway module completion test, see relevant section below.

Interactive KEGG Pathway Module Completion Test

Given expression or gene presence absence data for genes in a pathway for a genome or metagenome, it is not straightforward to mark active or inactive parts (e.g. modules) of a pathways due to expert knowledge required in identifying critical genes. Comparing multiple samples in order to see complete or incomplete modules is even more difficult. We developed an interactive heatmap test to study pathway module status (whether complete or incomplete) across a set of samples being compared, an example of such an interactive heatmap is shown in Figure 28 below.

Figure 28. An interactive heatmap showing Pathway module completion status

To construct such a heatmap, we make use of KO gene expression or presence absence information available from samples and KO genes based logical expression information available under definition section for modules in KEGG pathways. If a sample passes the logical expression test for a particular module, satisfying the requirement of critical or alternative KO genes as present in a sample, this module is marked as complete for sample under comparison, see M00048 module (Inosine monophosphate biosynthesis) example in Figure 28. A list of modules marked complete, in at least one of many samples being compared, is then presented as an interactive heatmap. Here, colored cells represent complete module while no color shows incomplete modules.

Clicking on any cell of such an interactive module completion heatmap shows a link to KEGG website, see module M00048 example in Figure 28. Clicking on this link (use right click and open link in new tab) performs mapping of KO genes present in corresponding sample on to the module in question at KEGG website, see Figure 29 for this example that explains how this module was considered complete or incomplete by highlighting KO genes from our sample in red. It shows without previous knowledge it is not easy to mark a KEGG module complete. If a KO id is not highlighted and it is the only KO id in a block of a module, e.g. K00764, this module is considered incomplete since it could not satisfy the requirement of such a critical gene. In other cases with multiple KO ids shown for a block in a module, presence of one of those KOs may satisfy completion of that block of the module e.g. the case of K01945 or K01923 or K00602 in Figure 29.

Figure 29. Example KEGG module completion highlight KO genes for selected sample

Access to interactive heatmap for pathway module completion test is available in compare page. To perform this test select a few samples and select KEGG Pathways (KO) page from the result panel section. For an example see Figure 25 where this test completed and a link (‘view KEGG (KO) module fulfillment’ highlighted in yellow) is shown to view interactive heatmap. In the background upon clicking KEGG Pathway (KO) tab, we obtain KO genes presence absence information for samples being compared and perform pathway module completion test.

 

Examples of Metagenomic Comparisons

A fundamental set of questions is asked about microbes while analyzing metagenomic samples, ‘Who are they?’ and ‘What exactly are they doing?’ translated into taxonomic and functional profiling of samples, respectively. Given above 1 trillion species estimated to be present on planet earth, huge studies and combined community efforts are needed to understand the secrets of microbial world. Earth microbiome project recently published taxonomic profiling of large number of metagenomic samples from 7 continents, 43 countries and 21 biomes with a next step to capture functional profiling of samples.

We proposed use of gene information tables containing taxonomic and extended functional annotations for every gene from a sample or a gene catalogue as this can be very useful for data exchange, conversion and comparisons across analysis platforms. KMAP is able to produce such gene information tables, termed AAMG TSV, that we also index for interactive visualization and comparison of samples through KMAP Compare Module.

Tips for using KMAP Compare Module

A recommended way of looking at gene catalogues, samples are sample groups datasets in KMAP Compare Module is dependent on the type of analysis in mind. For taxonomic profiling of samples, one should select or use one or more relevant single copy genes as a proxy for 16S genes. For functional role comparisons, one could select a filter e.g. KO genes and compare samples for pathways, gene ontology, kegg orthologs, etc. A functional role analysis obtained for a set of samples can be limited to certain taxonomic classes using e.g. to look at a functional analysis limited to only bacterial species, one should include a taxonomic filtering term e.g. blast_tree:2. Similarly, to look up a single or many genes of interest where e.g. KO id is known, one could use search terms ko_id:K00446 OR ko_id:K00448 OR ko_id:K00449. The nice feature of metarep style framework is that for the same data, it is possible to lookup taxonomic or extended functional annotation information just by clicking on the relevant tab in the compare pages.

Samples and Sample Groups

Sample groups from TARA and Human Gut Microbiome project

By expanding KMAP capability to use more computer nodes with higher memory, taking TARA gene catalog as a common reference alongside gene abundance estimates from each sample, we are able to provide all individual 243 TARA ocean samples online for exploration or comparisons through KMAP compare module, see project 117. To compare a reduced set of samples, we also grouped similar samples considering filter size, ocean type and sampling depth. Of the total provided 47 sample groups produced there are 24 sample groups from prokaryotic enriched (eProkaryote) samples, 13 sample groups from Girus (eGirus) enriched and 12 sample groups from Virus (eVirus) enriched samples.

Regarding Human Gut Integrated Gene Catalog (HG_ IGC) study, there were 1267 published samples for which we collected metadata to label sample names with disease status, geographical origin, gender, age, BMI values and original sample id. An example such sample name looks like Crohns_disease.E.Spanish.F.Age.25.BMI.18.ID.V1.CD1-0. Using portions of HG_IGC gene catalogue, individual samples based gene information tables were produced, based on gene abundance availability of its sample, and indexed all of these 1267 samples for comparison and exploration through KMAP compare module, see Project 118. Furthermore, to analyze reduced set of sample we created groups of samples by aggregating gene information table for unique set of genes belonging to devised sample groups. Sample groups were devised based on disease status, geographic origin and obesity status, as shown in this example sample group name, T2D_E_nonObese where E, A or C represent Europe, America or China. This strategy produced 25 sample groups, see project 115.

Examples from TARA ocean

First to look at the summary pages of TARA ocean gene catalogue, a snapshot of the interactive summary page from KMAP Compare Module is shown in Figure TARA-1. It shows different specific categories of information e.g. summary, taxonomy, kegg orthologues, pathways etc to look number of genes assigned to relevant categories, explore this and other gene catalogues through KMAP project 119, after clicking on the ‘public access’ button. This projects provides access to gene catalogs from  TARA microbial, TARA eukaryotic (TARAeu), Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) and for completeness we include Integrated Gut gene catalog from Human (Human_IGC). A snapshot of top 20 taxons from these gene catalogs, is shown in Figure TARA-2, considering the most abundant single copy gene applied with least common ancestor based taxonomic assignment. KMAP project 119 also contain several other interesting gene catalogs we constructed from ecological metagenomes downloaded through European Nucleotide Archive (ENA).

For comparison of samples or gene catalogs, KMAP provides all standard features of MetaRep. Additional features include interactive heatmaps to either compare pathway modules across samples with confirmed incomplete or complete status (see figure TARA-3) or to explore magnitude difference of unique gene counts or expression from samples being compared (see figures TARA-4,5 ).

Figure TARA-1. Summary of TARA gene catalog for taxonomy or functional role assignments.

To make data analysis interesting we provide several labels, called filters, to focus on one or more subsets of annotations at a time e.g. KO assigned genes or Single Copy Genes (SCGs), or Industrial Proteins of Interest (POIs) and many more. Filter information is stored in the gene information table and also available when exploring or comparing samples using KMAP Compare module, see figures TARA-2,4 where results are shown from selected SCG filter.

Figure TARA-2. Top 20 taxons detected in non-redundant gene catalogs from TARA microbial (TARA_GCv2), Human Gut Microbial (HG_IGCv2), TARA eukaryotic (TARAecCatV2) and Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) considering the least common ancestor approach and the most abundant single copy genes. For exploration or comparison see relevant gene catalogs in KMAP project 119.

 

By expanding KMAP capability to use more computer nodes with higher memory, taking TARA gene catalog as a common reference alongside gene abundance estimates from each sample, we are able to provide all individual 243 TARA ocean samples online for exploration or comparisons through KMAP compare module. The number of samples here is high so in order to study groups of samples, we also grouped non redundant unique gene counts for similar samples, considering filter size, ocean type and sampling depth. Of the total provided 47 sample groups there are 24 sample groups from prokaryotic enriched (eProkaryote) samples, 13 sample groups from Girus (eGirus) enriched and 13 sample groups from Virus (eVirus) enriched samples, see figures below for example taxonomic and functional role comparisons from TARA sample groups.

Figure TARA-3. Interactive (link) KEGG pathway module test for TARA Ocean sample groups

An interactive heatmap of all TARA sample groups with results from KEGG pathway modules completion test is shown in Figure TARA-3 where a colored cell represents a module is designated the status complete. Example section of this interactive heatmap shows complete or incomplete aromatics degradation modules from KEGG, follow the link provided to explore the status  and details of all the modules. A click on any cell of the heatmap triggers mapping of the underlying sample KO gene information to pathway module diagram at KEGG website leading to clear cut evidence explaining how a module of a pathway is considered complete or incomplete in our interactive heatmap.

Figure TARA-4. A heatmap taxonomic breakdown of enriched prokaryotic sample groups

Figure TARA-5. Interactive heatmap (link) showing fold change (of unique gene counts) for each binary sample comparison for a selected taxonomic level.

The above shown example snapshots from KAMP system entail how the complexity of metagenomic analysis for big studies like TARA ocean can be made slightly easier for non computational scientist with an encouragement to contribute, without worrying about required computational resources, towards a deeper understanding and benefits of the microbial world from planet earth. More true and success stories can be developed from such a huge resource of genetic information from diverse environments and above all everybody from scientific community can contribute using free of cost and online exploration of derived data of this scale.

   


Human Integrated Gut Microbiome.

Human Integrated Gut gene catalogue was established using ~1200 metagenomic samples from healthy subjects and others with certain diseases primarily type II diabetes. This gene catalogue comes with around 10 million unique genes obtained through assembly gene prediction and clustering of genes across 1267 samples covering europe, china and american geography. This is very interesting work with several papers coming out explaining different aspects of microbes on human health and disease and geographic patterns most probably related to diet preferences. A summary of taxonomic and functional role categories of all Human Gut genes is shown figure HG_IGC1. To interactively explore complete or incomplete modules from KEGG pathways in Human Gut sample groups (control vs disease, geographic origin, etc), a snapshot is shown with a link in figure HG_IGC2.

Figure HG_IGC-1. Summary of Human Integrated Gene Catalog (HG_IGC) showing proportion of unique genes available with  taxonomy or functional role assignments

Figure HG_IGC-2. Interactive (link) KEGG pathway module test for Human Gut sample groups

ENA ecological metagenomes.

European Nucleotide Archive (ENA) provides access to all public Nucleotide data alongside extensive metadata of samples. Metagenomic studies published were all based on best methods available of analysis at that time. Earlier studies were all based on direct shart reads based (mostly incomplete gene prediction) analysis however, with significant improvement in metagenomic assembly methods,  recent works are all based on assembled metagenomes leading to recovery of full length genes.

The task for revisiting all previous metagenomic samples, based on recent improvements in assembly methods and best practices in metagenomic analyses, is challenging considering the required computational resources. However, If accomplished, this could lead to an unbiased impressive pool of derived metagenomic data (including full length genes) from different environments of planet earth, making the task of comparative metagenomics and novel gene discovery relatively straightforward.

EBI metagenomics annotation pipeline processes data from ENA with their efficient methods splitting RNA based taxonomic profile and InterPro based functional profile of individual samples. When assembly of a metagenomic sample is available, without any competition EBI metagenomic analysis can be complemented for every predicted gene with features, in addition to InterPro domains, such as taxonomic affiliation from UniProtKB and functional role affiliation based on KEGG Orthology (KO) datasets and associated cross references e.g. see an example table below. Computationally demanding task but if such a gene table could be obtained for all metagenomic samples from planet earth, it can lead to unbiased comparative metagenomics and novel gene discovery.

Table: An example gene table for a sample showing extended annotation (category) columns.

 

ENA provides an efficient advanced search interface to lookup metagenomic samples based on the available metadata. In order to produce gene tables for ENA metagenomic samples, we conducted a pilot study. Here a number of samples were downloaded from ENA where sequencing platform was illumina paired end, protocol was WGS and taxonomic label of samples was ecological metagenomes.

Gene catalogues from ecological metagenomes  

A Gene catalogue, similar to ones from TARA and Human microbiome, serves as a common reference to compare large number of related samples. In our pilot study we construct gene catalogues, one for each of the types of ecological metagenomes obtained from ENA, see KMAP project 119. Individual samples are first preprocessed using bbduk and assembled using MegaHit assembler, followed by prediction of complete genes using Prodigal and clustering of genes into a non-redundant set using CD-Hit producing one gene catalogue for each type of ecological metagenome.  These gene catalogues are then annotated using high throughput implementation of BLAST and InterProscan in Automatic Annotation of Microbial or Metagenomes (AAMG) to produce gene information tables, called AAMG TSV gene tables, one for each gene catalogue. In order to produce gene abundance estimates and AAMG TSV per sample, paired end read from each sample are mapped using bbmap onto their respective ecological metagenome gene catalogues. A list of all gene catalogues included in KMAP is presented in a table in a section above, Gene Catalogues and samples in KMAP .

References

Alam, I., Antunes, A., Kamau, A.A., Kalkatawi, M., Stingl, U. and Bajic, V.B., 2013. INDIGO–INtegrated Data Warehouse of MIcrobial GenOmes with examples from the red sea extremophiles. PloS one, 8(12), p.e82210.

Sczyrba, A., Hofmann, P., Belmann, P., Koslicki, D., Janssen, S., Dröge, J., Gregor, I., Majda, S., Fiedler, J., Dahms, E. and Bremges, A., 2017. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nature methods, 14(11), p.1063.

 

Goll, J., Rusch, D.B., Tanenbaum, D.M., Thiagarajan, M., Li, K., Methé, B.A. and Yooseph, S., 2010. METAREP: JCVI metagenomics reports—an open source tool for high-performance comparative metagenomics. Bioinformatics, 26(20), pp.2631-2632.

Schattner, P., Brooks, A.N. and Lowe, T.M., 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic acids research, 33(suppl_2), pp.W686-W689.

Lagesen, K., Hallin, P., Rødland, E.A., Stærfeldt, H.H., Rognes, T. and Ussery, D.W., 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic acids research, 35(9), pp.3100-3108.

Huang, Y., Gilna, P. and Li, W., 2009. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics, 25(10), pp.1338-1340.

Deu-Pons, J., Schroeder, M. P., & Lopez-Bigas, N. (2014). jHeatmap: an interactive heatmap viewer for the web. Bioinformatics, 30(12), 1757-1758.

Cole, J.R., Wang, Q., Fish, J.A., Chai, B., McGarrell, D.M., Sun, Y., Brown, C.T., Porras-Alfaro, A., Kuske, C.R. and Tiedje, J.M., 2013. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic acids research, 42(D1), pp.D633-D642.

Hyatt, D., Chen, G.L., LoCascio, P.F., Land, M.L., Larimer, F.W. and Hauser, L.J., 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 11(1), p.119.

Rho, M., Tang, H. and Ye, Y., 2010. FragGeneScan: predicting genes in short and error-prone reads. Nucleic acids research, 38(20), pp.e191-e191.

Zhu, W., Lomsadze, A. and Borodovsky, M., 2010. Ab initio gene identification in metagenomic sequences. Nucleic acids research, 38(12), pp.e132-e132.

Jones, P., Binns, D., Chang, H.Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G. and Pesseat, S., 2014. InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), pp.1236-1240.

Li, J., Jia, H., Cai, X., Zhong, H., Feng, Q., Sunagawa, S., Arumugam, M., Kultima, J.R., Prifti, E., Nielsen, T. and Juncker, A.S., 2014. An integrated catalog of reference genes in the human gut microbiome. Nature biotechnology, 32(8), p.834.

Sczyrba, A., Hofmann, P., Belmann, P., Koslicki, D., Janssen, S., Dröge, J., Gregor, I., Majda, S., Fiedler, J., Dahms, E. and Bremges, A., 2017. Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software. Nature methods, 14(11), p.1063.

Sunagawa, S., Coelho, L.P., Chaffron, S., Kultima, J.R., Labadie, K., Salazar, G., Djahanschiri, B., Zeller, G., Mende, D.R., Alberti, A. and Cornejo-Castillo, F.M., 2015. Structure and function of the global ocean microbiome. Science, 348(6237), p.1261359.