How to Install and Run the PATRIC AdaBoost Classification Software

Getting Started

This installation guide will walk you through the necessary steps for getting started with our code for using AdaBoost to generate AMR classifiers¹. This software requires the k-mer counting code KMC², so we will walk through installing this as well. Please cite both projects in your work.

There are some general considerations if you wish to use our code for building AMR classifiers. First, we coded AdaBoost from scratch; it is a bespoke implementation designed for finding the k-mers that are indicative of resistance. Second, the code is hardwired to classify based on the presence (1) or absence (0) of a k-mer in a genome; classifications based on raw counts are not supported. Finally, we do not look for or vote on the highly ranked k-mers that may be signatures of "susceptibility", and the code does not return these. If you want a more general version of AdaBoost, or one that can do multiclass classification, we recommend using the Python scikit-learn package.
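To make the presence/absence encoding concrete, here is a minimal Python sketch. It is not part of kmerge; the helper names and toy sequences are made up for illustration, and it ignores details such as reverse complements that a real k-mer counter handles.

```python
def kmers(sequence, k):
    """Return the set of distinct k-mers in a sequence (hypothetical helper)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def presence_row(kmer, genome_kmer_sets):
    """Encode one k-mer as 1 (present) or 0 (absent) in each genome."""
    return [1 if kmer in kmer_set else 0 for kmer_set in genome_kmer_sets]


# Toy sequences standing in for genomes; real genomes come from FASTA files.
genomes = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
kmer_sets = [kmers(genome, 4) for genome in genomes]

# "ACGT" occurs twice in the first genome, but it is still encoded as 1.
print(presence_row("ACGT", kmer_sets))
```

A k-mer that appears once and a k-mer that appears many times in a genome are treated identically under this encoding.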

We recommend running the software with at least 100 susceptible and 100 resistant genomes. It works with fewer genomes only when the AMR mechanism is easy to find. We also recommend balancing the sets so that you classify with the same number of resistant and susceptible genomes. Finally, memory usage grows with the number of genomes, so a large run will probably not fit on a laptop.
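One way to balance the two sets is to randomly downsample the larger one. The Python sketch below shows the idea under that assumption; the function name and genome identifiers are hypothetical, and random downsampling is only one possible balancing strategy.

```python
import random


def balance(resistant_ids, susceptible_ids, seed=0):
    """Downsample the larger list so both classes have the same size."""
    n = min(len(resistant_ids), len(susceptible_ids))
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return (sorted(rng.sample(resistant_ids, n)),
            sorted(rng.sample(susceptible_ids, n)))


# Hypothetical genome identifiers, for illustration only.
resistant = ["R1", "R2", "R3", "R4", "R5"]
susceptible = ["S1", "S2", "S3"]

balanced_r, balanced_s = balance(resistant, susceptible)
print(len(balanced_r), len(balanced_s))
```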

Installing the Software

1. You are going to need g++ to compile the code described below. We highly recommend gcc-4.9.3, and do not know if you will be successful using other versions. You need at least 4.9 to install KMC described below. Next, if you don't already have it, you must download and install boost from http://www.boost.org. This is a package of C++ libraries. If you don't have these set up already, and you are feeling timid, you might want to consult your local computer guru.

2. Next, you should create a directory to house the k-mer code and KMC. Navigate to that directory. I made a directory called KMER_Install and entered that directory:

$ mkdir KMER_Install

$ cd KMER_Install/

3. Our software is called kmerge. It builds a matrix of k-mer counts by reading a set of precomputed k-mer database files that are generated with the KMC software package. We will download KMC first by cloning it from GitHub. If you don't have git set up, it's easy and worthwhile: https://help.github.com/articles/set-up-git/. Otherwise you can download manually from the webpage for each repo. If you have git set up, type:

$ git clone https://github.com/refresh-bio/KMC.git

This will clone all of the KMC files from github in a directory called KMC. Enter that directory and compile the files by typing make.

$ cd KMC/

$ make

If this fails, you need to edit the "makefile" file so that it matches your current installation of g++. To do this, change:

CC = g++

to:

CC = path to your g++

If you are trying to install on a Mac, there are special instructions on the KMC GitHub page that you should follow.

When this finishes, you will see working kmc and kmc_dump executables in the bin directory.

4. Next we will traverse back to our KMER_Install/ directory by typing:

$ cd ../

We will then clone our k-mer code from the SEED repo:

$ git clone https://github.com/TheSEED/close_kmers.git

If you do a ls, you will see the KMC directory and the close_kmers directory in your KMER_Install directory. This repo contains the relevant code "kmerge", as well as a bunch of other code that you get for free, which we won't describe here.

$ ls

close_kmers KMC

You will need to enter the close_kmers directory and edit the "Makefile". Change the "BUILD_TOOLS" and "BOOST" variables to match the paths to your installations of g++ and boost. You will see that our settings are idiosyncratic to the PATRIC runtime environment. If you already had KMC installed and the "KMC" directory is not immediately adjacent to the "close_kmers" directory, you will also need to change the variables "KMC_DIR", "KMC_LIB", and "KMC_INC" in the Makefile to suit your needs. Once you have done this, type:

$ make

$ make kmerge

The first command compiles the software in the directory. The second command compiles kmerge specifically. You can double check that kmerge runs by typing:

$ ./kmerge -h

If it compiled, you should get a help statement that looks like this:

Usage: ./kmerge [options] resistant-file susceptible-file

Allowed options:

-h [ --help ]                     show this help message

--use-kmer-counts                 Use kmer counts in the matrix instead of booleans. Disables the inversions used in boolean tables.

-r [ --rounds ] arg (=10)         Number of rounds of Adaboost to run

-a [ --adaboost ]                 Run Adaboost on the binary matrix

--no-header                       Do not write the 'labels' header line

--max-files arg (=-1)             Max number of files to process per data set

-d [ --kmer-dir ] arg (=KMERS)    Directory in which kmer files are found

-o [ --output-file ] arg          Write output to this file instead of stdout

--resistant-file arg              File containing list of kmer files for resistant genomes

--susceptible-file arg            File containing list of kmer files for susceptible genomes

5. Finally, you will want to add kmerge and the KMC programs (kmc and kmc_dump) to your path.

Running the Software

In this section we will do a very brief demonstration of how to run the kmerge software to generate an AdaBoost file. We are going to use a very small number of genomes in this example just to run the code, so don't get too excited if the k-mers end up being meaningless. We will download the genomes and metadata from PATRIC.

We need a set of resistant and susceptible genomes. We currently keep a file of every genome with AMR metadata on the PATRIC ftp server. I am going to use curl to download this file and write it to a file called "PATRIC_genomes_AMR.txt".

$ curl ftp://ftp.patricbrc.org/patric2/current_release/RELEASE_NOTES/PATRIC_genomes_AMR.txt > PATRIC_genomes_AMR.txt

At the time of writing, this file has ~83,000 lines in it. If you look at the file, you will see that for each genome, it has a line for each antibiotic and the metadata associated with each antibiotic. For this demonstration, I will grab 10 Acinetobacter baumannii genomes that are carbapenem resistant and 10 Acinetobacter baumannii genomes that are carbapenem susceptible. First I will make a file with a list of 10 resistant genomes.

$ grep "Acinetobacter baumannii" PATRIC_genomes_AMR.txt |grep "carbapenem" |grep "Resistant" | head >AB.resistant.list

Next I will make a list with 10 susceptible genomes.

$ grep "Acinetobacter baumannii" PATRIC_genomes_AMR.txt |grep "carbapenem" |grep "Susceptible" | head >AB.susceptible.list
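Because the metadata file has one line per genome per antibiotic, it can be worth confirming that a selection contains distinct genomes. The Python sketch below mirrors the grep pipeline and additionally removes repeated genome IDs. It assumes only that the file is tab-delimited with the genome ID in the first column (as the later cut -f1 step implies); the function name and demo rows are made up, and the real file has more columns than shown here.

```python
import os
import tempfile


def select_genomes(path, required_terms, limit=10):
    """Return up to `limit` distinct first-column IDs from lines containing every term."""
    selected = []
    seen = set()
    with open(path) as handle:
        for line in handle:
            if all(term in line for term in required_terms):
                genome_id = line.split("\t", 1)[0].strip()
                if genome_id and genome_id not in seen:
                    seen.add(genome_id)
                    selected.append(genome_id)
                    if len(selected) == limit:
                        break
    return selected


# Tiny made-up table standing in for PATRIC_genomes_AMR.txt.
demo_rows = [
    "G1\tAcinetobacter baumannii\tcarbapenem\tResistant",
    "G1\tAcinetobacter baumannii\tcarbapenem\tResistant",  # repeated genome
    "G2\tAcinetobacter baumannii\tcarbapenem\tSusceptible",
    "G3\tEscherichia coli\tcarbapenem\tResistant",
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(demo_rows) + "\n")
    demo_path = tmp.name

terms = ["Acinetobacter baumannii", "carbapenem"]
resistant_ids = select_genomes(demo_path, terms + ["Resistant"])
susceptible_ids = select_genomes(demo_path, terms + ["Susceptible"])
os.remove(demo_path)

print(resistant_ids, susceptible_ids)
```

Like grep, the matching here is a case-sensitive substring test on the whole line, so it is only as precise as the search terms.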

Now we need to download the genomes in these lists, so I will create a directory called "Contigs" and add each genome to the directory. I will do this using a perl one-liner.

$ mkdir Contigs

$ cut -f1 AB.resistant.list | perl -e 'while (<>){chomp; system "curl ftp\:\/\/ftp\.patricbrc\.org\/patric2\/patric3\/genomes\/$_\/$_\.fna >Contigs/$_.contig";}'

In the above example, the one-liner pulls the first column of AB.resistant.list, uses curl to fetch the corresponding fna file from the PATRIC ftp site, and saves that file in our Contigs directory with the name "id.contig", where id is the genome ID. We will do the same thing to get the susceptible genomes.

$ cut -f1 AB.susceptible.list | perl -e 'while (<>){chomp; system "curl ftp\:\/\/ftp\.patricbrc\.org\/patric2\/patric3\/genomes\/$_\/$_\.fna >Contigs/$_.contig";}'

Now we have 20 contig files in the Contigs directory.
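Before counting k-mers, it can help to confirm that every download actually produced a FASTA file, since a failed transfer may leave an empty or malformed file behind. The following Python sketch is one way to check; the function name is hypothetical, and the demonstration runs on a throwaway directory rather than on the real Contigs directory.

```python
import os
import tempfile


def bad_fasta_files(directory, suffix=".contig"):
    """Return names of files that are empty or do not start with a FASTA '>' header."""
    bad = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(suffix):
            continue
        with open(os.path.join(directory, name)) as handle:
            first_char = handle.read(1)
        if first_char != ">":
            bad.append(name)
    return bad


# Demonstration on a temporary directory with one good and one empty file.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "good.contig"), "w") as handle:
        handle.write(">contig1\nACGT\n")
    open(os.path.join(demo_dir, "empty.contig"), "w").close()
    problems = bad_fasta_files(demo_dir)

print(problems)
```

To check the real downloads you would call bad_fasta_files("Contigs") and re-download anything it reports.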

The next step is to convert the contigs into k-mer files using kmc. First we will make a directory called KMERS.

$ mkdir KMERS

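I am going to wrap the kmc call "kmc -k15 -fm -cs16777215 -ci1 Contigs/Genome ID.contig KMERS/Genome ID KMERS" within another Perl one-liner. Here -k15 sets the k-mer length to 15, -fm tells kmc that the input is a multi-FASTA file, -ci1 keeps k-mers that occur only once, and -cs16777215 sets the maximum counter value; the final argument (KMERS) is the working directory that kmc uses for temporary files. We keep singleton k-mers because we are working with assembled genomes. This will take a few minutes. The command line options for kmc can be tricky, and we refer you to the KMC user manual to master them.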

$ ls Contigs | perl -e 'while (<>){chomp; s/\.contig//g; system "kmc -k15 -fm -cs16777215 -ci1 Contigs/$_.contig KMERS/$_ KMERS";}'

The above command lists each file in the Contigs directory with ls. The Perl one-liner then strips the ".contig" suffix from each file name, submits the genome ID to kmc, and puts the output in the KMERS directory. You will have 40 files in your KMERS directory: one with a ".kmc_suf" suffix and one with a ".kmc_pre" suffix for each genome. These outputs are in the binary kmc output format, so you won't be able to read them directly. If you want human-readable output, you should run the kmc_dump command.
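If you do dump a database to text, the result can be loaded into a set of k-mers for later lookups. The Python sketch below assumes the dump has one k-mer per line followed by its count, separated by whitespace; check this against your own kmc_dump output before relying on it. The function name is hypothetical and the two example lines are made up.

```python
import io


def read_kmer_dump(handle):
    """Collect the k-mers (first field of each non-empty line) into a set."""
    kmers = set()
    for line in handle:
        fields = line.split()
        if fields:
            kmers.add(fields[0])
    return kmers


# In-memory stand-in for a dumped text file; counts are ignored here because
# the classifier only uses presence or absence.
example_dump = io.StringIO("AAAACTTTTTATCAC\t1\nGGAGCAACATATGCA\t3\n")
genome_kmers = read_kmer_dump(example_dump)
print(sorted(genome_kmers))
```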

Now it's time to start working with kmerge. First we want to create a log file of just the genome identifiers for each genome. We will simply cut the first column of our list files to do this.

$ cut -f1 AB.resistant.list >AB.R.log

$ cut -f1 AB.susceptible.list >AB.S.log

We then launch the kmerge job as follows:

$ kmerge -a -r 10 -d KMERS --resistant-file AB.R.log --susceptible-file AB.S.log >AB.adaboost

The -a option tells kmerge to run AdaBoost. The -r option sets the number of rounds of boosting. We always use 10 rounds of boosting because the alpha value drops pretty quickly when you have a good classifier.

You can use the other program options to return a matrix, which you may find useful in other circumstances. Note that our matrix will be inverted relative to what you might need for scikit-learn. That is, the k-mers are the rows and the genomes are the columns. This is a holdover from our original Perl-based implementation of this algorithm. In that implementation we sorted the lists of k-mers to make the merging step more efficient.
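Since scikit-learn expects one row per sample (genome) and one column per feature (k-mer), a matrix in this orientation has to be transposed first. The Python sketch below shows the transposition on a small made-up matrix that has already been parsed into lists of numbers; it does not read the actual kmerge matrix file, whose exact layout is not covered here.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*rows)]


# Made-up presence/absence values: 2 k-mers (rows) by 3 genomes (columns).
kmer_rows = [
    [1, 0, 1],
    [0, 0, 1],
]

# After transposing: 3 genomes (rows) by 2 k-mers (columns).
genome_rows = transpose(kmer_rows)
print(genome_rows)
```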

The program prints status messages to standard error so you can follow its progress. The resulting AdaBoost file is tab-delimited and looks like this:

$ cut -f1,2,3 AB.adaboost

0.15 0.867301 GGAGCAACATATGCA

0.22549 0.616977 AGGTTGGTAGATATC

0.259494 0.524301 AAAACTTTTTATCAC

0.300818 0.421704 ATCCCGCCAGTAAGG

0.467088 0.0659191 TAGTGCTGATTCAGA

0.50154 0.00307911 ATTGATTCTTCGCTA

0.511655 0.0233132 CTTGTGCCGAAGGCC

0.522375 0.0447806 GAGCAACATATGCAC

0.539183 0.0785278 CAACATACGCACTAC

0.572746 0.146531 CCCATCATATTCGCC

The first column is a weight value. It is used internally, and you probably won't need it. The second column is the alpha value, which is what is used for voting. The remaining columns are the significant k-mers.
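As an aside, the two numeric columns in the sample output above appear to be linked by the standard AdaBoost relation between a weak learner's weighted error w and its vote, alpha = |0.5 * ln((1 - w) / w)|. This is an observation about the printed numbers, not a statement about how kmerge computes them internally. The Python sketch below checks the relation against the ten sample rows; the function name is hypothetical.

```python
import math

# (first column, second column) copied from the sample AdaBoost output.
sample_rows = [
    (0.15, 0.867301), (0.22549, 0.616977), (0.259494, 0.524301),
    (0.300818, 0.421704), (0.467088, 0.0659191), (0.50154, 0.00307911),
    (0.511655, 0.0233132), (0.522375, 0.0447806), (0.539183, 0.0785278),
    (0.572746, 0.146531),
]


def alpha_from_weight(weight):
    """Textbook AdaBoost vote for a weighted error, taken as an absolute value."""
    return abs(0.5 * math.log((1.0 - weight) / weight))


for weight, reported_alpha in sample_rows:
    computed = alpha_from_weight(weight)
    print(weight, reported_alpha, round(computed, 6))
```

Read this way, a first-column value near 0.5 corresponds to an alpha near zero, which fits the note that alpha drops quickly over the rounds.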

This was a particularly bad example because we used too few genomes, so the results are hard to interpret. Once you get up near 100 genomes of each class, you will see interpretable results. In general, assuming that the experiment is set up correctly, a higher AdaBoost (alpha) value usually indicates a better classifier.

Finally, to classify a genome, go through each line of the AdaBoost file and ask whether the target genome has any of the k-mers on that line. If it has at least one, add that line's alpha value to the genome's score. You only count one vote per line, no matter how many of the line's k-mers are present. If the genome has none of the k-mers on a line, subtract that line's alpha value from the score; again, you only count once. After you read through all 10 lines of the file (or however many lines you choose), a positive score predicts that the genome is resistant, and a negative score predicts that it is susceptible.
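The voting procedure can be sketched in a few lines of Python. This is an illustration of the steps described above, not code from the kmerge repository. The function names are hypothetical, the genome's k-mers are assumed to be available as a set produced the same way as the training k-mers, and the "undetermined" label for a score of exactly zero is an assumption of this sketch, since that case is not covered above.

```python
def parse_adaboost(lines):
    """Return (alpha, kmers) pairs from tab-delimited AdaBoost lines."""
    rules = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        # Column 1 is the internal weight, column 2 is alpha, the rest are k-mers.
        rules.append((float(fields[1]), fields[2:]))
    return rules


def score_genome(rules, genome_kmers):
    """Add alpha when any k-mer on a line is present, subtract it otherwise."""
    total = 0.0
    for alpha, kmers in rules:
        if any(kmer in genome_kmers for kmer in kmers):
            total += alpha  # one vote per line, however many k-mers match
        else:
            total -= alpha
    return total


def predict(rules, genome_kmers):
    """Map the score to a label; the zero case is an assumption of this sketch."""
    total = score_genome(rules, genome_kmers)
    if total > 0:
        return "resistant"
    if total < 0:
        return "susceptible"
    return "undetermined"


# First three lines of the sample AdaBoost output shown earlier.
adaboost_lines = [
    "0.15\t0.867301\tGGAGCAACATATGCA",
    "0.22549\t0.616977\tAGGTTGGTAGATATC",
    "0.259494\t0.524301\tAAAACTTTTTATCAC",
]
rules = parse_adaboost(adaboost_lines)

# Made-up genomes represented as sets of k-mers.
genome_a = {"GGAGCAACATATGCA", "AGGTTGGTAGATATC"}
genome_b = {"GGAGCAACATATGCA"}
print(predict(rules, genome_a), predict(rules, genome_b))
```

In this toy run, genome_a matches the first two lines and scores 0.867301 + 0.616977 - 0.524301, which is positive, while genome_b matches only the first line and ends up with a negative score.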

References

1 Davis, J. J. et al. Antimicrobial resistance prediction in PATRIC and RAST. Scientific Reports 6 (2016).

2 Deorowicz, S., Kokot, M., Grabowski, S. & Debudaj-Grabysz, A. KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics 31, 1569-1576 (2015).