2011 Greengenes Taxonomy

In 2011, the 16S rRNA gene database Greengenes announced its updated taxonomy in articles published in the Nature Publishing Group journal ISME J. Second Genome has adopted this taxonomy for annotating sequence data and probe hybridization data. If you would like to learn more about this taxonomy, or to create a reference set on your laptop or compute cluster, you have found the right web page. If you need help with a custom annotation or classification project, let us know.

2011 Greengenes Papers

An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. Daniel McDonald, Morgan N Price, Julia Goodrich, Eric P Nawrocki, Todd Z DeSantis, Alexander Probst, Gary L Andersen, Rob Knight and Philip Hugenholtz, ISME J.

The article is open access and can be downloaded from:

The 2011 Greengenes taxonomy has been shown to yield systematic improvements over other popular reference sets for classification of 16S rRNA gene sequences collected from environmental and clinical samples. See the related paper: Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys. Jeffrey J Werner, Omry Koren, Philip Hugenholtz, Todd Z DeSantis, William A Walters, J Gregory Caporaso, Largus T Angenent, Rob Knight and Ruth E Ley, 2011, ISME J. Nature.com has not labeled this article as open access, so feel free to contact us if you would like a PDF version.

2011 Greengenes Newick Trees

  • 16S_all_gg_2011_1.tree.gz All leaves correspond to sequences greater than or equal to 1,250 bases.
  • 16S_candiv_gg_2011_1.tree.gz All leaves correspond to sequences over 1,250 bases, or to shorter sequences from candidate divisions for which long sequences are yet to be discovered.

2011 Greengenes Taxonomy Tables

2011 Greengenes Fasta Files

  • sequences_16S_all_gg_2011_1_unaligned.fasta.gz These sequence records correspond only to sequences greater than or equal to 1,250 bases.
  • Advanced Users Only: sequences_16S_gg_2011_1.sel4cni.inf.aln.masked.fasta.gz This is the Infernal alignment of only the 1,281 conserved positions that fit into Infernal's secondary structure model. In other words, the sequences do not contain the same strings as the corresponding records in the unaligned fasta file above. This data file should not be used for BLAST databases or Bayesian classifiers. It can be used as an input to tree construction programs.
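
After downloading the unaligned fasta file, it can be useful to confirm that every record meets the 1,250-base minimum stated above. The following is a minimal sketch in Python using only the standard library. The function names (read_fasta, ids_shorter_than, check_file) are hypothetical helpers written for this illustration and are not part of Greengenes or any Second Genome tool. It assumes a standard fasta layout in which each record starts with a ">" header line.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted text lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def ids_shorter_than(lines: Iterable[str], min_len: int = 1250) -> List[str]:
    """Return the ids of records whose sequence has fewer than min_len bases."""
    short = []
    for header, seq in read_fasta(lines):
        if len(seq) < min_len:
            # Assumption: the id is the first whitespace-delimited header token.
            short.append(header.split()[0] if header.strip() else header)
    return short


def check_file(path: str, min_len: int = 1250) -> List[str]:
    """Run the length check on a gzipped fasta file without unpacking it."""
    with gzip.open(path, "rt") as handle:
        return ids_shorter_than(handle, min_len)
```

Calling check_file("sequences_16S_all_gg_2011_1_unaligned.fasta.gz") should return an empty list if the file matches its description. The check is not meaningful for the masked Infernal alignment, whose records are restricted to the 1,281 conserved positions.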

2011 Greengenes mothur-ready files

  • GG98_FL.fna.gz Contains representatives from 98% identical clusters.
  • GG98_FL.taxonomy.gz Contains corresponding taxonomic strings for each sequence in GG98_FL.fna. Some nomenclature has been normalized from the original publication (above).
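
Because the taxonomy file is meant to pair with the sequences in GG98_FL.fna, a quick consistency check before training a classifier can catch a mismatched download. The sketch below assumes the usual mothur taxonomy layout: one record per line, with a sequence id, a tab, and then a semicolon-delimited lineage. That layout is an assumption here and should be confirmed against the actual file. The function names are hypothetical and written only for this illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_taxonomy_line(line: str) -> Tuple[str, List[str]]:
    """Split one taxonomy record into (sequence id, list of rank names)."""
    seq_id, _, lineage = line.strip().partition("\t")
    # Drop empty fields, such as the one after a trailing semicolon.
    ranks = [name.strip() for name in lineage.split(";") if name.strip()]
    return seq_id, ranks


def load_taxonomy(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build an id -> lineage mapping from taxonomy file lines."""
    return dict(parse_taxonomy_line(line) for line in lines if line.strip())


def ids_without_taxonomy(
    fasta_ids: Iterable[str], taxonomy: Dict[str, List[str]]
) -> List[str]:
    """Return fasta ids that have no matching taxonomy record, sorted."""
    return sorted(set(fasta_ids) - set(taxonomy))
```

With both files unpacked, load_taxonomy(open("GG98_FL.taxonomy")) gives the mapping, and ids_without_taxonomy can then be applied to the ids collected from the headers of GG98_FL.fna. An empty result suggests the two files belong together.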