Foundations of AI and ML

Introduction to Reinforcement Learning

Reinforcement Learning

Welcome to the section on Reinforcement Learning (RL)! In this lecture we introduce what RL is and the Q-learning method.

Lecture slides are available here.

Basics of RL I

Reinforcement learning methods learn through …

Applications of RL

Which of the following is a practical example of reinforcement learning?

Basics of RL II

What are actions in reinforcement learning?

What is a state in RL?

How do you represent the agent state in reinforcement learning?

Basics of RL III

What are the elements of reinforcement learning?

Hands-on Reinforcement Learning

In this exercise, you will experiment with the Q-learning algorithm. The task is to use OpenAI’s Taxi environment and train an agent to pick up passengers and drop them off at their destination using as few moves as possible.
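For reference, tabular Q-learning keeps one value estimate per state–action pair and, after every step, nudges the current estimate toward the observed reward plus the discounted value of the best next action:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

where \alpha is the learning rate and \gamma is the discount factor.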

Environment

There are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When an episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends. States are illustrated as follows:

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
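A minimal sketch of loading and inspecting the environment, assuming the Taxi-v3 version and the classic OpenAI Gym API (newer Gymnasium releases return a (state, info) pair from reset() and require a render_mode):

import gym

env = gym.make("Taxi-v3")   # OpenAI Gym's Taxi environment
state = env.reset()         # initial state, encoded as a single integer
env.render()                # prints a grid like the one above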

Actions

There are 6 discrete deterministic actions that the agent can perform:

  • 0: move south
  • 1: move north
  • 2: move east
  • 3: move west
  • 4: pickup passenger
  • 5: drop off passenger
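These integer codes are what env.step expects. A short sketch of taking one random action, again assuming the classic Gym API (Gymnasium's step also returns a truncated flag):

action = env.action_space.sample()                 # one of the 6 actions, chosen at random
next_state, reward, done, info = env.step(action)
print(next_state, reward, done)                    # a movement step yields a reward of -1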

Observations

There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.

Note that there are 400 states that can actually be reached during an episode. The missing states correspond to situations in which the passenger is at the same location as their destination, as this typically signals the end of an episode.

Four additional states can be observed right after a successful episode, when both the passenger and the taxi are at the destination. This gives a total of 404 reachable discrete states.
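Each observation is a single integer in the range [0, 500). The underlying TaxiEnv exposes a decode helper that unpacks it into its four components; for example:

print(env.observation_space.n)    # 500 discrete states

# unpack a state integer into (taxi_row, taxi_col, passenger_location, destination);
# passenger_location is 0-4 (4 = in the taxi), destination is 0-3
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(taxi_row, taxi_col, passenger_loc, destination)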

Rewards

The reward function is defined as follows:

  • -1 per step, unless another reward is triggered.
  • +20 for delivering the passenger.
  • -10 for executing the “pickup” or “drop-off” action illegally.
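One quick way to see the illegal-action penalty is to reset the environment and immediately attempt a pickup; unless the taxi happens to start on the passenger’s square, the step returns -10:

state = env.reset()                  # classic Gym API
_, reward, done, info = env.step(4)  # action 4: pickup passenger
print(reward)                        # -10 if the pickup was illegal, otherwise -1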

Your Task

The following code works properly as it is and trains an agent in the environment. The performance of the agent, however, is quite bad given the parameters that are used for learning. Adapt the parameters so that the agent performs better.
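A sketch of that kind of training loop is given below, assuming the Taxi-v3 environment and the classic OpenAI Gym API; the parameter values (alpha, gamma, epsilon, n_episodes) are illustrative starting points, not the exact values shipped with the exercise:

import gym
import numpy as np

env = gym.make("Taxi-v3")

# learning parameters -- these are the knobs to tune
alpha = 0.1        # learning rate
gamma = 0.9        # discount factor
epsilon = 0.1      # exploration rate for the epsilon-greedy policy
n_episodes = 10000

# one Q-value per (state, action) pair: a 500 x 6 table
q_table = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(n_episodes):
    state = env.reset()    # classic Gym API; Gymnasium returns (state, info)
    done = False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        next_state, reward, done, info = env.step(action)

        # tabular Q-learning update
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state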

This webpage contains the course materials for the course TDDE56 Foundations of AI and Machine Learning.
The content is licensed under Creative Commons Attribution 4.0 International.
Copyright © 2022 Linköping University