Foundations of AI and ML

Classifying Horseshoe Bats

Introduction

A team of researchers has dedicated its time to studying bats in southern Europe. They are especially interested in horseshoe bats, a genus of insect-eating bats that owe their name to the particular shape of their noses. The researchers have gathered the following data during their time in the field:

  • Species (Acuminate/Blasius’s/Guinean)
  • Wingspan [mm]
  • Tail area [mm²]
  • Echolocation frequency [Hz]
  • Color of fur (Light brown/Dark brown/Red)
  • Forearm length [mm]
  • Body mass [g]

In total, the research team has observed 1,399 bats.

The team hires you to analyze the data that they have collected. You start your analysis by identifying the properties of the variables in the dataset.

Question A: Variable Types

Which of the variables in the list above are numerical?

Question B: Model Type

The research team is interested in the differences in attributes between the horseshoe bat species. They wish to build a model that can be used to predict the species of a horseshoe bat based on its forearm length and body mass. You are in charge of finding a suitable model.

Would you in this case build a regression or a classification model?

Question C: Choosing $k$

We will now apply the k-NN method to the problem introduced above.

Below, the k-NN decision boundaries obtained with the bat training data are visualised for two distinct values of $k$. Based on the figures, which value of $k$ is most suitable?

[Figure: k-NN decision boundaries using $k = 3$]
[Figure: k-NN decision boundaries using $k = 25$]

Select $k$:

Arrays

It is common to store data in arrays. We can store a dataset of the type considered above in a two-dimensional array, where each row is an observation and each column is a feature. If we have a dataset with $n$ observations and $p$ features, we will get a data array of size $n \times p$.
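As a minimal sketch of this layout, the snippet below builds a small data array with NumPy. The values are made up for illustration (they are not taken from the bat dataset), and the choice of the two columns, forearm length and body mass, is an assumption.

```python
import numpy as np

# Hypothetical toy data: 4 observations (rows) and 2 features (columns).
# Column 0: forearm length [mm], column 1: body mass [g]. Values are made up.
D = np.array([
    [44.5, 10.3],
    [47.1, 12.0],
    [39.8,  8.4],
    [51.2, 14.9],
])

# The shape of the array is (n, p): number of observations, number of features.
n, p = D.shape
print(D.shape)
```

Here `n` is 4 and `p` is 2, so the array has size $4 \times 2$.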

[Figure: 1-dimensional array]

[Figure: 2-dimensional array]

In Python, we can realize arrays using NumPy. We access an element in a NumPy array by using its row and column indices. For example, if we want to access the $j^{th}$ feature of the $i^{th}$ observation in our data array D, we would access it by D[i, j]. We use ":" to access a full row or column. The command D[:, j] would give us a one-dimensional array of the $j^{th}$ feature for all of the observations in the original array. Note that Python uses zero-based indexing, meaning that the first element in an array is located at index 0.
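The indexing rules above can be tried out on a small array. This is only a sketch: the array D and its values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data array: 3 observations, 2 features (made-up values).
D = np.array([
    [44.5, 10.3],
    [47.1, 12.0],
    [39.8,  8.4],
])

# Single element: row index 0 (first observation), column index 1 (second feature).
first_obs_mass = D[0, 1]

# Full row: all features of the observation at row index 1.
second_row = D[1, :]

# Full column: the feature at column index 1 for all observations.
mass_column = D[:, 1]

print(first_obs_mass)
print(second_row)
print(mass_column)
```

Because of zero-based indexing, `D[0, 1]` refers to the first row and the second column.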

We can use NumPy to create arrays of higher dimension than two. We access elements in such multi-dimensional arrays similarly to how we access them in a two-dimensional array. For instance, we would access the element at position $(i, j, k)$ in a three-dimensional array D by D[i, j, k]. The word tensor is sometimes used to refer to multi-dimensional arrays, although the formal definitions differ.
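A small sketch of indexing in three dimensions is given below. The array T is a hypothetical example filled with the integers 0 to 23, chosen only so that each element is easy to identify.

```python
import numpy as np

# Hypothetical three-dimensional array of shape (2, 3, 4),
# filled with the integers 0, 1, ..., 23 in row-major order.
T = np.arange(24).reshape(2, 3, 4)

# Access the element at position (i, j, k) = (1, 2, 3).
element = T[1, 2, 3]

# A colon selects everything along one axis: here a 3 x 4 slice.
first_block = T[0, :, :]

print(T.ndim)             # number of dimensions
print(element)
print(first_block.shape)
```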

In NumPy, we can refer to the dimensions of an array using axes. A two-dimensional array has two axes: axis 0 indexes the rows (it runs vertically, from one row to the next) and axis 1 indexes the columns (it runs horizontally, from one column to the next). A three-dimensional array has three axes, as illustrated in the figure below.

[Figure: 3-dimensional array]
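To make the notion of axes concrete, the sketch below inspects the shape of a small made-up array and applies a reduction, here np.sum, along each axis. Summing along an axis collapses that axis.

```python
import numpy as np

# Hypothetical 2 x 3 array: 2 rows (axis 0) and 3 columns (axis 1).
A = np.array([[1, 2, 3],
              [4, 5, 6]])

rows = A.shape[0]   # length along axis 0
cols = A.shape[1]   # length along axis 1

# Summing along axis 0 collapses the rows: one sum per column.
column_sums = np.sum(A, axis=0)

# Summing along axis 1 collapses the columns: one sum per row.
row_sums = np.sum(A, axis=1)

# A three-dimensional array has three axes: 0, 1 and 2.
T = np.zeros((2, 3, 4))

print(column_sums)
print(row_sums)
print(T.ndim)
```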

Element-wise operations

An operation between two arrays A and B is said to be element-wise if it is performed separately for every entry in the arrays. Let $\mathbf{A}$ and $\mathbf{B}$ be two-dimensional. The element-wise operation $\mathbf{A} + \mathbf{B}$ would for every pair $i,j$ sum the element at position $(i, j)$ in array $\mathbf{A}$ with the element in the same position in array $\mathbf{B}$ according to $(\mathbf{A}+\mathbf{B})_{ij}=\mathbf{A}_{ij}+\mathbf{B}_{ij}$. In the example below, both $\mathbf{A}$ and $\mathbf{B}$ have shape $2\times 2$.

[Figure: Example of element-wise sum]
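In NumPy, the operator + between two arrays of the same shape is element-wise. The sketch below uses two made-up $2\times 2$ arrays; the values are not those of the figure.

```python
import numpy as np

# Two hypothetical arrays of the same shape (2, 2).
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

# Element-wise sum: C[i, j] = A[i, j] + B[i, j].
C = A + B

# Other arithmetic operators are element-wise as well.
# Note that * is the element-wise product, not the matrix product.
P = A * B

print(C)
print(P)
```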

In general, element-wise operations will not work unless arrays $\mathbf{A}$ and $\mathbf{B}$ have the same shape. However, in some cases when the shapes are not the same, NumPy can broadcast the smaller array to match the shape of the larger array. An example for when $\mathbf{A}$ is of size $2\times 2$ and $\mathbf{B}$ of size $1\times 2$ is shown below.

[Figure: Example of array broadcasting]

Notice that the first element of $\mathbf{B}$ is added to all of the elements in the first column of $\mathbf{A}$ and that the second element of $\mathbf{B}$ is added to all of the elements in the second column of $\mathbf{A}$.
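The same behaviour can be reproduced in code. The sketch below uses made-up values, with A of shape (2, 2) and B of shape (1, 2), and also shows a pair of shapes that NumPy cannot broadcast.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])        # shape (2, 2)
B = np.array([[10, 20]])      # shape (1, 2)

# B is broadcast along axis 0, so its single row is added to every row of A.
C = A + B

# Shapes (2, 2) and (1, 3) are incompatible: the column counts differ
# and neither of them is 1, so NumPy raises a ValueError.
try:
    A + np.array([[1, 2, 3]])
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(C)
print(broadcast_failed)
```

The first column of C is the first column of A plus 10, and the second column is the second column of A plus 20.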

Question D: New Observation

One day, the research team gets a call from a colleague who is currently at a research station not far from where the team operates. She has observed a horseshoe bat near the station and she cannot decide what species it belongs to. Luckily, your model is ready for use. The bat in question has a forearm length of 44.5 mm and a body mass of 10.3 g.

Fill in the missing lines in the code below to predict the species of the newly observed bat. Specifically, you need to compute the distance between the input vector of the newly observed bat and those of all the training data points, in order to determine which of the training data points are the nearest neighbors. The common notion of distance (that is, the length of a straight line between two points) is in mathematical terms referred to as the Euclidean distance. To compute it in the code below, it is convenient to use the broadcasting feature of arrays as described above, together with functions for computing the square (**2) and square root (np.sqrt) of all elements of an array, as well as for summing (np.sum) the elements of an array. For the last function, the axis parameter can be used to sum the elements of an array along a certain axis. As an example, np.sum(A, axis=0) will sum the elements of array A along the first axis; for a two-dimensional array, this gives one sum per column.

Use the value of $k$ that you found to be suitable in the previous question.
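The steps described above can be sketched on a tiny made-up dataset. This is not the course's code template and not the bat data: the arrays X_train and y_train, the labels "A" and "B", and the variable names are all assumptions for illustration. The value k = 3 is used only because the toy dataset is so small; it is not meant as the answer to Question C.

```python
import numpy as np

# Hypothetical toy training data (made-up values, NOT the course dataset).
# Columns: forearm length [mm], body mass [g].
X_train = np.array([
    [44.0, 10.0],
    [45.0, 10.5],
    [44.8, 10.1],
    [50.0, 14.0],
    [51.0, 14.5],
    [49.5, 13.8],
])
# Hypothetical class labels, one per training point.
y_train = np.array(["A", "A", "A", "B", "B", "B"])

# The new observation from the text: forearm length 44.5 mm, body mass 10.3 g.
x_new = np.array([44.5, 10.3])
k = 3  # illustration only

# Broadcasting: x_new (shape (2,)) is subtracted from every row of X_train.
diff = X_train - x_new

# Euclidean distance: square, sum over the features (axis 1), take the root.
distances = np.sqrt(np.sum(diff**2, axis=1))

# Indices of the k training points with the smallest distances.
nearest = np.argsort(distances)[:k]

# Majority vote among the labels of the k nearest neighbors.
labels, counts = np.unique(y_train[nearest], return_counts=True)
prediction = labels[np.argmax(counts)]

print(prediction)
```

Note that the sum is taken along axis 1, so that the result contains one distance per training point rather than one value per feature.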

Question E: Test Accuracy

You have now used your model to predict the species of a bat, but how do you know if your model is trustworthy? To evaluate your k-NN model, you can make predictions for a new set of observed bats that were not used to build the model and for which you already know the true species. This type of testing will give you an idea of how well the model is expected to perform when put into production. We will return to the process of model evaluation and study it in more detail in section 4, but it is useful to get acquainted with the idea now.

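Based on this idea, there are several evaluation metrics that you can use to evaluate your model. One commonly used evaluation metric is accuracy, which is equal to the proportion of correct predictions among the new set of observations $\left\{\mathbf{x}_i, y_i \right\}_{i=1}^{n_t}$:

$$\text{accuracy} = \frac{1}{n_t} \sum_{i=1}^{n_t} \mathbb{I}\left\{\widehat{y}(\mathbf{x}_i) = y_i\right\}$$

where $\widehat{y}(\mathbf{x}_i)$ denotes the model's prediction for the input $\mathbf{x}_i$.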

Here, we have added a subscript $t$ on $n_t$ to indicate that this is the number of test data points. Recall that the indicator function $\mathbb{I}$ is equal to 1 if the statement is true, and zero otherwise.

Finalize the code below to calculate the accuracy of your model using the test data set of the remaining 269 bat observations that were not used for training.
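The proportion of correct predictions can be computed with an element-wise comparison followed by a mean. The snippet below is only a sketch: the arrays y_true and y_pred and the labels in them are made up, and in the exercise the predictions would come from your k-NN model applied to the test data.

```python
import numpy as np

# Hypothetical true labels and model predictions for 5 test points.
y_true = np.array(["A", "B", "A", "C", "B"])
y_pred = np.array(["A", "B", "C", "C", "A"])

# Element-wise comparison gives a boolean array that plays the role
# of the indicator function: True where the prediction is correct.
correct = (y_pred == y_true)

# The mean of a boolean array is the fraction of True values,
# i.e. the number of correct predictions divided by the number of test points.
accuracy = np.mean(correct)

print(accuracy)
```

In this toy example 3 out of 5 predictions are correct, so the accuracy is 0.6.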

Conclusion

Well done, you have now implemented and evaluated a k-NN model for classification and used it to predict the species of a horseshoe bat! In the next example, you will see how the k-NN model can be applied to a regression task.

This webpage contains the course materials for the course TDDE56 Foundations of AI and Machine Learning.
The content is licensed under Creative Commons Attribution 4.0 International.
Copyright © 2022 Linköping University