--- title: "How to train your model" author: "Legana Fingerhut" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How to train your model} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE ) ``` By default `ampir` predicts the probability of a protein to be an antimicrobial peptide (AMP) or not based on a *trained SVM model* with as input known AMP sequences corresponding to a wide diversity of organisms. However, within the `predict_amps` function there is a `model` argument that allows users to pass their own trained model object. Using a different trained model might be useful when users wish to e.g. use a taxonomic specific model to predict AMPs in a restricted group of taxa. This vignette will go through a mock example of how you can train your own model using the `caret` package. For more information on how to use `caret` and the functions used within this example, please see the [extensive documentation](http://topepo.github.io/caret/index.html) made by the author, Max Kuhn. ### Step 1: Obtain input data First, a positive and negative dataset have to be obtained. In this example, we want to predict AMPs in bats and decide to train a model using protein sequences found in bats. The positive dataset are AMPs and the negative dataset are random sequences. Both datasets were obtained from [UniProt](https://www.uniprot.org/): - positive by using the search term keyword:antimicrobial taxonomy:"Chiroptera [9397]" - negative by using the search term taxonomy:"Chiroptera [9397]" For the positive dataset: - read data - add "positive" label - remove non standard amino acids ```{r} library(ampir) ``` ```{r} bat_pos <- read_faa(system.file("extdata/bat_positive.fasta.gz", package = "ampir")) bat_pos$Label <- "Positive" bat_pos <- remove_nonstandard_aa(bat_pos) ``` For the negative dataset: - read data - add "negative" lavel - remove non standard amino acids - remove sequences (if any) that are also present in the positive dataset - randomly select the same number of sequences that are in the positive dataset ```{r} bat_neg <- read_faa(system.file("extdata/bat_negative.fasta.gz", package = "ampir")) bat_neg$Label <- "Negative" bat_neg <- remove_nonstandard_aa(bat_neg) bat_neg <- bat_neg[!bat_neg$seq_aa %in% bat_pos$seq_aa,] bat_neg <- bat_neg[sample(nrow(bat_neg),78),] ``` Combine the positive and negative dataset ```{r} bats <- rbind(bat_pos, bat_neg) ``` Calculate features on the combined positive and negative dataset and add the label column ```{r} bats_features <- calculate_features(bats) bats_features$Label <- as.factor(bats$Label) rownames(bats_features) <- NULL ``` ```{r} library(caret) ``` Split feature set data and create train and test set with `caret` ```{r} trainIndex <-createDataPartition(y=bats_features$Label, p=.7, list = FALSE) bats_featuresTrain <-bats_features[trainIndex,] bats_featuresTest <-bats_features[-trainIndex,] ``` Resample method using repeated cross validation and adding in a probability calculation with `caret` ```{r} trctrl_prob <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE) ``` Train model using a support vector machine with radial kernel with `caret`. *Note: Other classification models are supported too. For example, to use a [random forest model in `caret`](https://topepo.github.io/caret/train-models-by-tag.html#random-forest), `method` could be changed from "svmRadial" to "ranger".* ```{r} my_bat_svm_model <- train(Label~., data = bats_featuresTrain[,-1], # excluding seq_name column method="svmRadial", trControl = trctrl_prob, preProcess = c("center", "scale")) ``` Test model to get an indication of how well the model performs on test data with `caret` ```{r} my_bat_pred <- predict(my_bat_svm_model, bats_featuresTest) cm <- confusionMatrix(my_bat_pred, bats_featuresTest$Label, positive = "Positive") ``` *Subset from `cm$byClass`* | `r names(cm$byClass[11])` | `r names(cm$byClass[5])` | `r names(cm$byClass[6])`| `r names(cm$byClass[7])` | | :-------------: | :-------------: | :-------------: | :-------------: | | `r round(cm$byClass[[11]], digits = 2)` | `r round(cm$byClass[[5]], digits = 2)` | `r round(cm$byClass[[6]], digits = 2)` | `r round(cm$byClass[[7]], digits = 2)` | Convert the bat feature test data to the original FASTA type format containing just the sequence name and sequence as this is the required input data for `ampir` ```{r} bat_test_set <- bats[bats$seq_name %in% bats_featuresTest$seq_name,][,-3] rownames(bat_test_set) <- NULL ``` Use the trained bat model in `ampir`'s `predict_amps` function on the bat test set ```{r} my_bat_AMPs <- predict_amps(bat_test_set, min_len = 5, model = my_bat_svm_model) ``` *`my_bat_AMPs` sample* ```{r, echo=FALSE} my_bat_AMPs$seq_aa <- paste(substring(my_bat_AMPs$seq_aa,1,10),"...",sep="") my_bat_AMPs$seq_name <- paste(substring(my_bat_AMPs$seq_name,4,9),"...",sep="") my_bat_AMPs$prob_AMP <- round(my_bat_AMPs$prob_AMP, digits = 3) my_bat_AMPs <- my_bat_AMPs[c(1:3,44:46),] knitr::kable(my_bat_AMPs) ```