---
title: "How to train your model"
author: "Legana Fingerhut"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to train your model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE
)
```


By default `ampir` predicts the probability of a protein to be an antimicrobial peptide (AMP) or not based on a *trained SVM model* with as input known AMP sequences corresponding to a wide diversity of organisms. However, within the `predict_amps` function there is a `model` argument that allows users to pass their own trained model object. Using a different trained model might be useful when users wish to e.g. use a taxonomic specific model to predict AMPs in a restricted group of taxa.

This vignette will go through a mock example of how you can train your own model using the `caret` package. For more information on how to use `caret` and the functions used within this example, please see the [extensive documentation](http://topepo.github.io/caret/index.html) made by the author, Max Kuhn. 

### Step 1: Obtain input data

First, a positive and negative dataset have to be obtained. In this example, we want to predict AMPs in bats and decide to train a model using protein sequences found in bats. The positive dataset are AMPs and the negative dataset are random sequences.
Both datasets were obtained from [UniProt](https://www.uniprot.org/):

  - positive by using the search term keyword:antimicrobial taxonomy:"Chiroptera [9397]" 
  - negative by using the search term taxonomy:"Chiroptera [9397]" 
  

For the positive dataset:

 - read data
 - add "positive" label
 - remove non standard amino acids
 
```{r}
library(ampir)
```


```{r}
bat_pos <- read_faa(system.file("extdata/bat_positive.fasta.gz", package = "ampir"))
bat_pos$Label <- "Positive"
bat_pos <- remove_nonstandard_aa(bat_pos)
```

For the negative dataset:

  - read data
  - add "negative" lavel
  - remove non standard amino acids
  - remove sequences (if any) that are also present in the positive dataset
  - randomly select the same number of sequences that are in the positive dataset

```{r}
bat_neg <- read_faa(system.file("extdata/bat_negative.fasta.gz", package = "ampir"))
bat_neg$Label <- "Negative"
bat_neg <- remove_nonstandard_aa(bat_neg)
bat_neg <- bat_neg[!bat_neg$seq_aa %in% bat_pos$seq_aa,]
bat_neg <- bat_neg[sample(nrow(bat_neg),78),]
```

Combine the positive and negative dataset
```{r}
bats <- rbind(bat_pos, bat_neg)
```

Calculate features on the combined positive and negative dataset and add the label column
```{r}
bats_features <- calculate_features(bats)
bats_features$Label <- as.factor(bats$Label)
rownames(bats_features) <- NULL
```

```{r}
library(caret)
```

Split feature set data and create train and test set with `caret`

```{r}
trainIndex <-createDataPartition(y=bats_features$Label, p=.7, list = FALSE)
bats_featuresTrain <-bats_features[trainIndex,]
bats_featuresTest <-bats_features[-trainIndex,]
```

Resample method using repeated cross validation and adding in a probability calculation with `caret`

```{r}
trctrl_prob <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                            classProbs = TRUE)
```

Train model using a support vector machine with radial kernel with `caret`.
*Note: Other classification models are supported too. For example, to use a [random forest model in `caret`](https://topepo.github.io/caret/train-models-by-tag.html#random-forest), `method` could be changed from "svmRadial" to "ranger".* 

```{r}
my_bat_svm_model <- train(Label~.,
                       data = bats_featuresTrain[,-1], # excluding seq_name column
                       method="svmRadial",
                       trControl = trctrl_prob,
                       preProcess = c("center", "scale"))
```

Test model to get an indication of how well the model performs on test data with `caret`

```{r}
my_bat_pred <- predict(my_bat_svm_model, bats_featuresTest)
cm <- confusionMatrix(my_bat_pred, bats_featuresTest$Label, positive = "Positive")
```

*Subset from `cm$byClass`*

| `r names(cm$byClass[11])` | `r names(cm$byClass[5])` | `r names(cm$byClass[6])`| `r names(cm$byClass[7])` |
| :-------------: | :-------------: | :-------------: | :-------------: |
| `r round(cm$byClass[[11]], digits = 2)` |  `r round(cm$byClass[[5]], digits = 2)` | `r round(cm$byClass[[6]], digits = 2)` | `r round(cm$byClass[[7]], digits = 2)` |

Convert the bat feature test data to the original FASTA type format containing just the sequence name and sequence as this is the required input data for `ampir`
```{r}
bat_test_set <- bats[bats$seq_name %in% bats_featuresTest$seq_name,][,-3]
rownames(bat_test_set) <- NULL
```

Use the trained bat model in `ampir`'s `predict_amps` function on the bat test set

```{r}
my_bat_AMPs <- predict_amps(bat_test_set, min_len = 5, model = my_bat_svm_model)
```

*`my_bat_AMPs` sample*

```{r, echo=FALSE}
my_bat_AMPs$seq_aa <- paste(substring(my_bat_AMPs$seq_aa,1,10),"...",sep="")
my_bat_AMPs$seq_name <- paste(substring(my_bat_AMPs$seq_name,4,9),"...",sep="")
my_bat_AMPs$prob_AMP <- round(my_bat_AMPs$prob_AMP, digits = 3)
my_bat_AMPs <- my_bat_AMPs[c(1:3,44:46),]

knitr::kable(my_bat_AMPs)
```