Foundations of AI and ML
Classifying Traffic Signs
Introduction
Autonomous vehicles have the potential to change human transportation in many ways. We cannot know the exact effects of completely replacing human drivers with autonomous vehicles, but there is great potential to increase road safety and reduce environmental impact. Autonomous vehicles could also make life easier for people who cannot drive today, such as the elderly or people with disabilities.
Building fully autonomous cars is, however, a very hard task. Scientists and engineers have worked on such systems for at least 50 years. Recent advances in sensor systems and software have sped up progress, resulting in the first large-scale tests of autonomous fleets on public roads. Despite this progress, a lot of research and development remains before autonomous vehicles become a common sight on the roads. In particular, guaranteeing the safety of these systems can be very hard. At the same time, doing so is an absolute necessity if they are ever to see widespread use.
Together with hardware advances, modern machine learning techniques (read: deep learning) are great accelerators of autonomous driving technology. Deep learning enables the on-board computers of vehicles to handle the large amount of data coming from sensors. Neural networks are also flexible enough to learn very complex patterns in the data, the kind of patterns a computer system needs to recognize in order to navigate a busy street.
For a computer system to navigate in traffic, many different tasks have to be solved at the same time. Some form of machine learning plays a role in most of these sub-systems. Examples include:
- Determining the position of the car on the road.
- Classifying objects around the vehicle (such as other vehicles, buildings, pedestrians, animals, etc.).
- Predicting the future trajectories of other vehicles and pedestrians.
- Finding a fast path to the destination.
- Detecting and classifying road signs.
In this task we will give a small taste of how one can approach the last problem: classifying road signs. Using a large dataset of images of road signs we will train a Convolutional Neural Network (CNN) classifier to solve this task.
Before we introduce the dataset and the model, let’s remind ourselves how CNN layers work.
Question A: Convolutions
Below you will find the input, filter and resulting output of a CNN layer. These are plotted both as just grayscale images and with the exact pixel values written out. For the pixels on the very edge of the output image, any coordinates outside of the input image are assumed to be 0 (zero-padding). Some of the values in the output are missing. Compute the output values for pixels a, b and c using the given input and convolutional filter.
What are the values of pixels a, b and c?
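The computation behind each output pixel can be sketched in NumPy. This is a minimal sketch, assuming the convention used by most deep learning libraries, where the filter is slid over the input without being flipped, together with a stride of 1 and an odd-sized filter. The function name `conv2d_zero_pad` and the example values are hypothetical and are not the ones shown in the figure.

```python
import numpy as np


def conv2d_zero_pad(image, kernel):
    """Slide `kernel` over `image` with stride 1 and zero-padding.

    The filter is not flipped, which matches the convention of most
    deep learning libraries. Assumes an odd-sized filter.
    """
    kh, kw = kernel.shape
    pad_h, pad_w = kh // 2, kw // 2
    # Surround the image with zeros so the output keeps the input size.
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode="constant")
    output = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Element-wise product of the filter and the patch under it.
            patch = padded[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output


# Hypothetical 3x3 input and filter (not the values from the figure).
image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
kernel = np.array([[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]])
result = conv2d_zero_pad(image, kernel)
print(result)
```

For the top-left output pixel in this example, the filter positions that fall outside the image contribute zero, so the value is 1 + 2 + 4 = 7.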
Question B: Output Shapes
Often we do not want to work only with grayscale images. For a color image, an additional channel dimension is introduced to the data. A typical image has 3 channels (red, green and blue). The number of channels can, however, change from layer to layer in the network. It is common practice when designing CNNs to reduce the spatial dimensions (height and width of the image) and increase the number of channels closer to the output side of the network. When designing these types of networks, it is useful to be able to compute the shapes of the input and output of each layer. We will therefore practice this now.
Use the convention of writing channels last and assume that we always zero-pad the input. Let the input image be of shape $(9 \times 9 \times 3)$. We collect all filters into a weight tensor of shape $(5 \times 5 \times 3 \times 10)$.
Assuming that we write the output shape according to $(H \times W \times C)$, what is the number of channels ($C$) of the output if we use stride $(1 \times 1)$?
Question B cont'd
Using the same filter and input, what is the width ($W$) of the output if we use stride $(2 \times 2)$?
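The general rule for output shapes can be written down as a short helper. This is a minimal sketch, assuming channels-last shapes and zero-padding chosen so that a stride of 1 preserves the height and width (often called "same" padding), in which case each spatial dimension becomes the input size divided by the stride, rounded up. The function name `conv_output_shape` and the layer sizes are hypothetical and deliberately differ from the ones in the questions.

```python
import math


def conv_output_shape(input_shape, weight_shape, stride):
    """Output shape (H, W, C) of a zero-padded convolutional layer.

    Assumes channels-last shapes and "same"-style zero-padding, so each
    spatial dimension becomes ceil(input_size / stride).
    """
    in_h, in_w, in_c = input_shape
    k_h, k_w, k_in_c, n_filters = weight_shape
    if in_c != k_in_c:
        raise ValueError("Filter depth must match the input channels.")
    out_h = math.ceil(in_h / stride[0])
    out_w = math.ceil(in_w / stride[1])
    # Each filter produces one output channel.
    return (out_h, out_w, n_filters)


# Hypothetical layer, deliberately different from the one in the question.
print(conv_output_shape((32, 32, 3), (3, 3, 3, 16), stride=(1, 1)))
print(conv_output_shape((32, 32, 3), (3, 3, 3, 16), stride=(2, 2)))
```

Note that the number of output channels depends only on the number of filters, while the height and width depend only on the input size and the stride under this padding assumption.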
Implementing the CNN
Now that we have refreshed our memory about CNNs, it is time to tackle the traffic sign classification problem. For this we will use the German Traffic Sign Recognition Benchmark (GTSRB) dataset. This dataset was originally created for an academic competition in 2011, where researchers competed to design the best model for classifying the images (Stallkamp et al., 2012). Now you will tackle the same challenge! In total, the dataset contains 51 839 images of German traffic signs, divided into 43 different classes.
To construct a CNN we will need more tools than those implemented in the scikit-learn neural network module. We will therefore make use of the Keras deep learning library. It is a high level library that includes functionality to build and train a vast number of different neural networks. For this exercise you will not have to edit the code, but you will have to go through and execute it. While you do not have to delve too deeply into Keras, you might find it useful to read parts of the documentation in order to understand the code you will be using.
Hardware for training neural networks
Training large neural networks, in particular CNNs, requires large amounts of computational power. Modern central processing units (CPUs) have become very fast and include multiple cores. Still, we often want to scale up our models to sizes that make them infeasible to train even when using the fastest CPUs on the market.
Most operations involved in training neural networks are performed on large matrices. These types of operations can be massively parallelized. This insight has led to a widespread practice of using Graphics Processing Units (GPUs) for neural network training. GPUs are designed to perform a large number of operations in parallel, which becomes immensely useful for neural network training.
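To illustrate why these computations parallelize well, the sketch below computes the output of a fully connected layer for a whole batch of inputs, first one sample at a time and then as a single matrix multiplication. The shapes and variable names are hypothetical, and NumPy runs this on the CPU. The point is only that every output element can be computed independently of the others, which is what allows a GPU to compute many of them at once.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
batch = rng.normal(size=(64, 128))    # 64 inputs with 128 features each
weights = rng.normal(size=(128, 32))  # layer with 32 output units
bias = rng.normal(size=32)

# One sample at a time: 64 separate matrix-vector products.
looped = np.stack([x @ weights + bias for x in batch])

# The whole batch at once: a single matrix-matrix product.
batched = batch @ weights + bias

# Both approaches give the same result.
print(np.allclose(looped, batched))
```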
In recent years, special hardware units tailored for deep learning have also been designed. An example of this is the Tensor Processing Unit (TPU) designed by Google.
Deep learning libraries
The neural network module in scikit-learn can be very useful for many standard applications of neural networks. It does however come with a few limitations (as of version 1.0):
- Only typical feed-forward neural networks are implemented. There is no support for more complex architectures, such as CNNs.
- Only CPU training is implemented. This means that we cannot make use of additional computational power from GPUs.
For building more advanced neural networks it is a good idea to look to dedicated deep learning libraries. These all implement both CPU and GPU training and allow for using all kinds of different network architectures. Different libraries focus on different levels of abstraction, targeting audiences ranging from applied engineers to experienced researchers.
Keras is a commonly used library that typically serves as a good starting point for most deep learning projects. While being somewhat more complicated than scikit-learn, it is still a high level library that does not require expertise in the details of neural network design and training. Keras includes the building blocks necessary for applying deep learning to many different kinds of problems and data. It can also easily be extended in cases where the built in functionality is not enough. A typical example of this would be implementing a new loss function, customized for a specific problem.
Underlying Keras is the lower level library TensorFlow. While it is possible to use TensorFlow directly to build and train models, this is in many cases not needed and results in unnecessary complexity. Low level libraries like TensorFlow are however useful when designing very customized models or optimization procedures, which is often the case in a research setting.
Another low level library of similar popularity is PyTorch. Both TensorFlow and PyTorch have their pros and cons, but they offer similar functionality and the choice often comes down to personal preference. There is also a high level library built on top of PyTorch, called PyTorch Lightning.
Many other deep learning libraries exist as well, often designed for specific application areas. The development of these libraries is rapid. Because of this it is important to consider up to date information when choosing a library for a new project.
Getting Started with Colab
Due to the large number of computations needed to train CNNs, it would be quite slow to work with the model directly on this course platform. We will therefore make use of the Google Colaboratory platform (often referred to as just Colab) to train and evaluate our CNN model. Colab is a platform for hosting and executing Python code together with written documentation. A nice feature of Colab is that one can choose to execute code using a GPU. This will speed up the training of our model substantially.
N.B. Running Colab requires a Google account. If you do not have an account and prefer not to sign up for one, it is possible to run the code locally on your own machine. This requires a local installation of Python and Jupyter Notebook. An easy way of getting these utilities is by installing Anaconda.
To get started with the Colab notebook, follow the steps below:
- Open the notebook at this link.
- If you are not signed in to a Google Account, do so by using the “Sign in” button on the top right of the page.
- (Optional, but recommended) If you want to make changes to the code at any point, you first need to save a copy of the notebook. Do this by going to “File $>$ Save a copy in Drive” in the menu bar. This should open a copy of the notebook saved to the Google Drive storage associated with your account. Close the original notebook and continue working in your own copy.
- Before running any code we need to tell Colab to use a GPU. Do this by going to “Runtime $>$ Change runtime type” in the menu bar. In the window that is displayed, make sure that “Hardware accelerator” is set to “GPU”.
At this point, read through the text in Colab and execute each code cell as you reach it. You can execute a cell (i.e. a block of code and/or text) and move to the next cell by pressing Shift+Enter. This is convenient when stepping through the notebook. Answer the questions below when you are done.
Question C: Output Dimension
How many outputs does the model have?
Question D: Number of Parameters
How many parameters does the model have?
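Parameter counts can also be derived by hand from the layer shapes. The following is a minimal sketch, assuming convolutional and fully connected layers with one bias per output. The helper names and the layer sizes are hypothetical and are not those of the model in the notebook.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter in a convolutional layer."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output unit in a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical small network (not the model in the notebook).
total = (
    conv_params(3, 3, 3, 16)   # 3*3*3*16 + 16 = 448
    + dense_params(256, 10)    # 256*10 + 10 = 2570
)
print(total)  # 3018
```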
Question E: Class Predictions
At the end of the Colab notebook there is a small section where you can make predictions using your model on uploaded images. This and the following two questions will give you an image that is not in the GTSRB dataset. Two of the images depict traffic signs and one something completely different. For each question, save the image, upload it to the notebook and use your model to make predictions. Give your answers using the full name of the predicted class, starting with a capital letter.
What class is the following image classified as?
It is classified as:
Question F: Another Sign
What class is the following image classified as?
It is classified as:
Question G: Out of Context
What class is the following image classified as (according to the most probable class)?
It is classified as:
This image is clearly not a traffic sign, so there is no possibility for our model to classify it correctly. Still, the model will happily accept this image and produce a set of predicted probabilities! Looking at these probabilities, it even seems to have some idea of what type of traffic sign is shown in the image. It could thus be argued that our model has learned to classify traffic signs with high accuracy, but without actually having any idea of what a traffic sign really is. As with most famous achievements in the general area of Artificial Intelligence, we have created a model that is very good at a narrow task. However, it has no idea of the context of this task and no “common sense” to object when we ask it to classify something that is not a traffic sign.
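This behaviour follows from how a classifier typically produces its output. The following is a minimal sketch, assuming the model ends in a softmax layer (a common choice for classifiers, not confirmed here for the model in the notebook). Whatever scores the network computes, softmax turns them into probabilities that sum to one, so some class always comes out as the most probable. The scores below are random stand-ins for the output of a network given an arbitrary image.

```python
import numpy as np


def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)


rng = np.random.default_rng(seed=0)
scores = rng.normal(size=43)  # one arbitrary score per GTSRB class
probabilities = softmax(scores)

print(probabilities.sum())            # 1.0 up to rounding error
print(int(np.argmax(probabilities)))  # index of the "predicted" class
```

Nothing in this computation lets the model answer "none of the above", which is why an image of something else entirely still receives a traffic sign label.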
Conclusion
You have now seen how we can build and train a CNN for use in an important real-world task. We could easily imagine this type of model being central to a traffic sign sub-system in an autonomous car. If that were the case, the accuracy of the model would suddenly become of great importance. Even a few misclassifications can be completely unacceptable when a physical vehicle will act on these predictions. This also emphasizes the importance of correctly estimating the error of the model on new, unseen data. Knowing the expected performance of the model would be central in determining whether it can be used in a real-world scenario or not.
As has been hinted at throughout this section, there are many options to explore when it comes to network architectures and hyperparameters. If you want to learn more about neural networks, CNNs and/or image classification, feel free to go back and modify the Colab notebook. Can you design a network that achieves an even higher test accuracy?
Stallkamp, J., Schlipsing, M., Salmen, J. & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32, 323-332.
This webpage contains the course materials for the course TDDE56 Foundations of AI and Machine Learning.
The content is licensed under Creative Commons Attribution 4.0 International.
Copyright © 2022 Linköping University