MLST – MultiLocus Sequence Typing

Not a shiny new procedure, MLST has been around since 1998 [1].

MLST allows for the unambiguous characterization of isolates from infectious agents (predominantly bacteria, some fungi) using sequences of internal fragments of (usually) seven housekeeping genes. Gene regions of approximately 450–500 bp are sequenced and those found unique within a species are assigned an allele number. Each isolate is then characterized by the alleles at each of the seven loci, which constitute its allelic profile or sequence type (ST).

Each isolate of a species is therefore unambiguously characterised by a series of seven integers which correspond to the alleles at the seven house-keeping loci.

In MLST the number of nucleotide differences between alleles is ignored: sequences are given different allele numbers whether they differ at a single nucleotide site or at many sites. The rationale is that a single genetic event can create a new allele either by a point mutation (altering only a single nucleotide site) or by a recombinational replacement (which will often change multiple sites). Weighting by the number of nucleotide differences would therefore overstate how different two alleles are, compared with treating the nucleotide changes as a single genetic event.
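A minimal sketch of this idea follows. The sequences are invented toy fragments, and `assign_allele_numbers` and `loci_differing` are hypothetical helpers written for illustration, not part of any MLST tool. Allele numbers are handed out purely on sequence identity, so a one-base change and a three-base change each count as exactly one new allele.

```python
def assign_allele_numbers(sequences):
    """Give each distinct sequence the next free integer, in order of first appearance."""
    allele_of = {}
    for seq in sequences:
        if seq not in allele_of:
            allele_of[seq] = len(allele_of) + 1
    return allele_of


def loci_differing(profile_a, profile_b):
    """Count the loci at which two allelic profiles carry different alleles."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Toy fragments for a single locus (real fragments are roughly 450-500 bp).
reference = "ATGGCTAAGCTTGACC"
point_mutant = "ATGGCTAAGCTTGACT"  # differs from the reference at 1 site
recombinant = "ATGGTTCAGATTGACC"   # differs from the reference at 3 sites

alleles = assign_allele_numbers([reference, point_mutant, recombinant, reference])

# Three isolates that are identical at the other six loci (allele 1 everywhere).
isolate_x = (alleles[reference],) + (1,) * 6
isolate_y = (alleles[point_mutant],) + (1,) * 6
isolate_z = (alleles[recombinant],) + (1,) * 6

# Both variants are one locus away from the reference, regardless of base count.
print(loci_differing(isolate_x, isolate_y), loci_differing(isolate_x, isolate_z))
```

Comparing profiles locus by locus, rather than base by base, is what keeps a recombination event from counting more heavily than a point mutation.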

Most bacterial species have sufficient variation within house-keeping genes to provide many alleles per locus, allowing billions of distinct allelic profiles to be distinguished using seven house-keeping loci. For example, an average of 30 alleles per locus gives 30^7, or roughly 22 billion, possible genotypes.
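The size of that figure follows from multiplying the allele counts across loci. A quick check, assuming exactly 30 alleles at each of the seven loci (real schemes have uneven counts per locus):

```python
LOCI = 7
ALLELES_PER_LOCUS = 30  # the average used in the example above

# Each locus is scored independently, so the counts multiply.
possible_profiles = ALLELES_PER_LOCUS ** LOCI
print(f"{possible_profiles:,} possible allelic profiles")  # 21,870,000,000
```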

The allelic profiles of isolates are then compared to existing databases.
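As a rough sketch of what that comparison involves, the lookup below matches a seven-integer profile against a table of known profiles. The table contents and the function names `lookup_st` and `closest_sts` are invented for illustration; real databases publish their own profile tables and tools.

```python
# Hypothetical profile table: allelic profile (7 allele numbers) -> sequence type.
PROFILE_TO_ST = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (2, 1, 1, 1, 1, 1, 1): 2,
    (3, 4, 1, 2, 1, 5, 1): 3,
}


def lookup_st(profile, table=PROFILE_TO_ST):
    """Return the ST for an exact profile match, or None if the profile is not in the table."""
    return table.get(tuple(profile))


def closest_sts(profile, table=PROFILE_TO_ST):
    """Return (shared_loci, [STs]) for the known profiles sharing the most alleles with the query."""
    best, hits = -1, []
    for known, st in table.items():
        shared = sum(a == b for a, b in zip(profile, known))
        if shared > best:
            best, hits = shared, [st]
        elif shared == best:
            hits.append(st)
    return best, sorted(hits)


known_isolate = (3, 4, 1, 2, 1, 5, 1)
novel_isolate = (2, 1, 1, 1, 1, 9, 1)  # allele 9 at the sixth locus is not in the table

print(lookup_st(known_isolate))    # exact match
print(lookup_st(novel_isolate))    # no exact match
print(closest_sts(novel_isolate))  # nearest known profile(s)
```

An isolate with no exact match is a candidate new ST, and the nearest known profiles give a first hint of which lineage it sits closest to.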


  • MLST allows accurate assessment of species, and sometimes down to strain level.


  • Selection of housekeeping loci necessitates a reference genome.
  • You need an MLST database.

Good reading:

[1] Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q, Zhou J, Zurth K, Caugant DA, Feavers IM, Achtman M, Spratt BG. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A. 1998 Mar 17;95(6):3140-5.


[3] Pérez-Losada M, Cabezas P, Castro-Nallar E, Crandall KA. Pathogen typing in the genomics era: MLST and the future of molecular epidemiology. Infect Genet Evol. 2013 Jun;16:38-53.



  • The ever helpful tseemann has a nice MLST tool, handily called mlst.
  • Web tool at the CGE Server.
  • Links from this page:
  • SRS2

Genome assembly with “cheap” data

The current standard thinking is that in order to assemble a genome, you need lots of different libraries, e.g. some short Illumina reads, some mate-pairs, and to top it off some PacBio long reads. Then along came the Broad Institute, who released DISCOVAR, which only needs 250 base paired-end PCR-free Illumina reads, nothing else. DISCOVAR has been around since 2013 but there are regular and important improvements to the algorithm. DISCOVAR works by initial alignment of the reads to the genomic regions, followed by careful local assembly. DISCOVAR is intended for assembly so accurate that SNPs and small structural variants can be called.

DISCOVAR de novo was released in July 2014. DISCOVAR de novo can capture (mammalian levels of) polymorphism (note that DISCOVAR aims at draft genomes for low polymorphism species).

Good references:

  • PAG abstract
  • T. Sharpe, I. MacCallum, N.I. Weisenfeld, E.S. Lander, and D.J. Jaffe. Long-range genome reconstruction from short-range data. Submitted, 2015
  • Nature blog
  • Post from Nick Loman

Academia vs industry – one viewpoint

I love Twitter. If you filter out the fluff and the kittens migrating over from the internet (although I am happy to look at a few kittens), there is lots of cool stuff. Twitter led me to BabyAttachMode, who has several blog posts on moving from academia to industry. She had a link to an interesting blog article on how working in industry can be like being in the belly of a large ship. This inspired me to write about my job. I work for a not-for-profit. I am not in the belly of the ship; on the contrary, I can sit on the bridge next to the captain if I like, and she (note SHE!) will chat to me about where the ship is going, what icebergs lie ahead and difficulties with the rudder. Then I can go back to the rest of the crew and we can talk about icebergs and big waves and new countries to be discovered. I love it. Working for a small company gives me the chance to have a crack at pretty much everything, and I am offered self-development and growth (Tony Robbins thinks things can’t get much better than this). It is inspiring and engaging.

More to life than 16S – fungi and ITS!

I wanted to write on the ITS target and came across a wonderful blog post written because the authors “wish to share our love of mycology with others!” – I love science and how it inspires people!

Their post raises a few interesting points:

  • Amplification of the ITS region is good for taxonomy, not so much for phylogeny
  • There are multiple, potentially different, copies of ITS per species, and different primers target slightly differently, so certain species may not be picked up
  • The de novo (or open) OTU picking strategy is most appropriate for fungal studies
  • ITS1 rather than ITS4 results in fewer chimeras
  • The software ITSx, which I hadn’t heard of previously (perl ITSx -i infile -o outfile -t F is my kind of command line!)

The post generated some very interesting comments, including one on phylogenetic trees and QIIME.

Two papers are highlighted: the reference for ITSx, and Smith and Peay (2014), who use UPARSE to pick OTUs. We already use UPARSE, and it is excellent!

Variation in RNAseq

The AGRF supports training in bioinformatics!

The AGRF produces a MASSIVE amount of RNAseq data and, as well as QCing all of the data, the bioinformatics team also analyses a fair chunk of it.

One aspect of the analysis that must always be taken into account is variation within the data, particularly with the emerging technology of single cell sequencing. Some of this variation can be minimized, e.g. by using skilled and experienced personnel to do the RNA extraction and library preparation, using the same concentration of cDNA for sequencing per sample (very important!), sequencing biological replicates (minimum three, especially for highly variable human samples), sequencing as many of your samples of interest as possible in one go, and reducing the possibility of lane bias artifacts by indexing and splitting samples across multiple lanes.

There are some very nice papers that address variation in RNA-seq, e.g. RNA-seq: technical variability and sampling, and for *big* studies, Detecting and correcting systematic variation in large-scale RNA sequencing data. There are a number of bioinformatic ways to address the numerous sources of technical variation within an RNA-seq library, for example by constructing a linear model that allows for samples that have been sequenced more than once, or by using a specific method such as RUV (removing unwanted variation), handily implemented in Bioconductor.

GBS – the solution to working with large or complex genomes.

There are a number of situations where standard next generation sequencing is not currently living up to its potential. One example is species with large, complex genomes, e.g. plants, which often have abundant sequence repeats and genome duplications.

Therefore a number of techniques have been developed to reduce genome complexity, for example by filtering out highly repetitive regions. Methods that reduce genome complexity by digesting genomic DNA with restriction enzymes include RAD-seq and Genotyping by Sequencing (GBS), which is similar to RAD-seq but has fewer steps. GBS allows breeders to use genomic selection without developing any prior molecular tools such as markers and maps.
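How much a digest reduces complexity depends on how often the enzyme cuts. The back-of-envelope sketch below estimates the expected spacing between recognition sites, assuming all four bases are equally frequent and independent, which real genomes are not. The two recognition sequences (GCWGC for ApeKI, CTGCAG for PstI) are well-known examples chosen for illustration, and the function names and genome size are hypothetical.

```python
# Number of bases each IUPAC code can stand for.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def expected_site_spacing(recognition_site):
    """Expected number of bases between sites, assuming uniform random sequence."""
    probability = 1.0
    for code in recognition_site.upper():
        probability *= IUPAC_DEGENERACY[code] / 4
    return 1 / probability


def expected_site_count(genome_size, recognition_site):
    """Rough expected number of recognition sites in a genome of the given size."""
    return genome_size / expected_site_spacing(recognition_site)


GENOME_SIZE = 2_000_000_000  # hypothetical 2 Gb plant genome

for name, site in [("ApeKI", "GCWGC"), ("PstI", "CTGCAG")]:
    spacing = expected_site_spacing(site)
    count = expected_site_count(GENOME_SIZE, site)
    print(f"{name} ({site}): one site every ~{spacing:.0f} bp, ~{count:,.0f} sites")
```

Under these assumptions the five-base degenerate site occurs eight times more often than the six-base site, so the more frequent cutter samples many more loci. Base composition, repeats and methylation sensitivity all shift the real numbers.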

There are several advantages to using the GBS method:

  • There is no random shearing or size selection.
  • It is an inexpensive method
  • Many samples can be studied simultaneously

GBS uses a simple bar-coding system that adds a short stretch of DNA sequence to one of the sequencing adaptors that is ligated to each DNA fragment. The length of the barcode sequence is varied to ensure that the first 12 bp contains random sequence, which is a requirement for the sequencing software. If barcode lengths were not modulated, all sequences would contain an identical stretch of nucleotides at the same position, corresponding to the restriction enzyme recognition site. Control over complexity reduction is achieved by choosing enzymes that cut at different frequencies in the genome. The choice will ultimately depend on the application and the properties of the genome being studied. The AGRF has evaluated several enzymes and offers GBS as a complete package – from raw sample submission to a final bioinformatic report.
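The effect of variable-length barcodes can be sketched with a toy demultiplexer. Everything here is an assumption for illustration: the barcodes and reads are invented, `demultiplex` is a hypothetical helper, and the cut-site remnants (CAGC or CTGC) assume an ApeKI digest. Because the barcodes differ in length, the remnant starts at a different read position for each sample.

```python
# Hypothetical barcodes of varying length, one per sample.
BARCODES = {
    "sample_A": "CTCC",
    "sample_B": "TGCGA",
    "sample_C": "ACAAGT",
    "sample_D": "GGTTGACA",
}

# Assumed cut-site remnants that follow the barcode (ApeKI overhang CWGC).
CUT_SITE_REMNANTS = ("CAGC", "CTGC")


def demultiplex(read, barcodes=BARCODES, remnants=CUT_SITE_REMNANTS):
    """Return (sample, insert) where insert starts at the cut-site remnant, or (None, None)."""
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    for sample, barcode in sorted(barcodes.items(), key=lambda kv: -len(kv[1])):
        if read.startswith(barcode):
            insert = read[len(barcode):]
            if insert.startswith(remnants):
                return sample, insert
    return None, None


reads = [
    "CTCCCTGCAAGTTTGACCA",      # sample_A: remnant starts at read position 4
    "TGCGACAGCTTAGGCATTG",      # sample_B: remnant starts at read position 5
    "GGTTGACACAGCGGATTCA",      # sample_D: remnant starts at read position 8
    "AAAAAAAAAAAAAAAAAAA",      # no barcode match
]

for read in reads:
    print(demultiplex(read))
```

Requiring the remnant immediately after the barcode is a cheap sanity check that the match is a real barcode rather than a chance prefix.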

Studying microbial communities – post 1 of 2

There has been a lot of chatter in the blogosphere about metagenomics vs 16S.  The AGRF handles both and we provide summaries of both approaches to help our clients achieve their goals.

Metagenomics is the study of the collective genomes of the members of a microbial community.

Firstly the sequencing must be appropriate to enable genome assembly – HiSeq (the platform of choice for metagenomics), long reads, paired end. High-complexity metagenomes need extremely high coverage (billions of reads). The multiple genomes of the various species are then assembled (de novo metagenome assembly). If for some reason you are unable to assemble the reads, then they can be aligned against 16S databases to give a good approximation of the community content.

Tools commonly used include MetaVelvet, EBI Metagenomics and MG-RAST.