Category
>Machine Learning

What is Inverse Reinforcement learning?

Ayush Singh Rawat
May 18, 2021

Introduction

Inverse reinforcement learning is a lately advanced Machine Learning framework which could resolve the inverse conflict of Reinforcement Learning. Basically, IRL is about studying from humans. Inverse reinforcement learning is the sphere of studying an agent’s objectives, values, or rewards with the aid of using insights of its behavior.

Conceptually, our purpose is to research the reason which could offer better ideas alongside the process. But, like other deep learning or machine learning problems, it's miles difficult to model them explicitly. But we are able to borrow a page from the supervised information in deep learning.

We can fit a reward function with the use of professional demonstrations. Once a reward feature is fitted, we are able to use Policy Gradient, Model-based RL or different RL to locate the ideal policy.

For example, we are able to compute the policy gradient with the use of the reward feature as opposed to sampled rewards. With the policy gradient calculated, we optimize the policy closer to the finest rewards gain.

Inverse Reinforcement Learning: the reward function’s learning

In a traditional RL environment, the goal is to learn a decision-making process in order to generate behavior that maximizes a predefined reward function. Inverse Reinforcement Learning (IRL), as described by Andrew Ng and Stuart Russell in 2000, reverses the problem and instead tries to extract the reward function from an agent's observed behavior.

For example, consider the autonomous driving task. A naive approach would be to create a reward function that captures the desired behavior of a driver, e.g. E.g. stopping at a red light, staying away from the curb, dodging pedestrians, etc.

Unfortunately, this would require a full list of all behaviors we want to consider, as well as a list of weights indicating how important each behavior is (imagine Before, you need to decide exactly how much more important pedestrians are than stop signs).

As part of the IRL, the task is to collect a set of human-generated driving data and extract an approximation of that human's reward function for the task. Of course, this approximation necessarily relates to a simplified driving model.

However, much of the information needed to solve a problem is captured in the approximation of the actual reward function.

As Ng and Russell put it, "the reward function, rather than the guideline, is the most concise, robust, and transferable definition of the task" because it quantifies how good or bad certain actions are. Once we have the right reward function, the problem is finding the right guideline and can be solved using standard reinforcement learning methods.

(Most recommended: Fundamentals to Reinforcement Learning)

For our autonomous car example, we would use human driving data to automatically learn the correct functional weights for the reward. Since the task is fully described by the reward function, we don't even need to know the details of human politics as long as we have the right reward function to optimize.

In the general case, the algorithms that solve the IRL problem can be viewed as a method of using expert knowledge to convert the description of a task into a compact reward function.

The biggest issue with turning a complicated task into a basic reward function is that a single rule can be optimal for a variety of reward functions. That is, when we are looking at expert acts, there are a variety of incentive attributes that the expert might be attempting to optimise.

Guidelines are incomprehensible: For example, for the incentive function, which is zero everywhere, all recommendations are optimal. As a result, this incentive function is still a viable option for solving the IRL dilemma. However, for our purposes, we'd want an incentive function that collects job data and can easily distinguish between preferred and undesirable guidelines.

To solve this, Ng and Russell formulated inverse reinforcement learning as an optimization problem.We want to select a reward feature for which the given Expert Policy is optimal. However, given this limitation, we would also like to select a reward feature that further maximizes certain important properties.

Inverse RL for finite spaces

First, consider the set of optimal guidelines for a Markov Decision Process (MDP) with a finite state space S, a set of actions A, and transition probability matrices Pa, for each action.

Reward Function, source

In their article, Ng and Russell showed that a guideline π given by π (s) ≡a1 is optimal if and only if for all other actions a the reward vector R (which lists the rewards for every possible state) satisfies the condition

(Pa1 - Pa) (I - γPa1) −1R⪰0

The meaning of this theorem is that it shows that we can use linear programming algorithms to efficiently select the "best" reward function for which the guideline is optimal.

In this way, the authors can formulate IRL as a manageable optimization problem in which we try to optimize the following heuristics so that a reward function fits well with the expert's data:

Maximize the difference between the quality of the optimal action and the quality of the next best Action (subject to a limit on the amount of the reward to avoid arbitrarily large differences). We want a reward function that clearly distinguishes the optimal guideline from other possible guidelines.
Minimize the size of the rewards in the reward vector / function. In general, the intuition is that the use of small rewards encourages the simpler reward function, similar to regularization in supervised learning. Ng and Russell choose the L1 norm with an adjustable penalty coefficient, making the reward vector non-zero in only a few states.

Estimated Reward Function, source

(Must check: Top Deep Learning Algorithms)

Inverse Reinforcement Learning through Sampled Trajectories

The IRL algorithm is also defined by Ng and Russell in the following cases: it is not a total optimal criterion; we can only sample the course of the optimal criterion, that is, we only know the condition of the criterion in a small number of episodes, acts, and rewards; and it is not a complete optimal criterion.

Politics, on the other hand, cannot solve the problem by itself. This is a more common occurrence in application cases, particularly when dealing with human expert data.

To solve this problem, we use a sequence of functions to replace the reward vector used for the finite state space with the value of the linear approximation function i of the reward function to obtain the true feature vector ϕ I (s).

These functions can be used to extract important information from a high-dimensional state space (for example, a car's location can be saved over time and its average speed can be stored as a function). The following conditions apply to each characteristic ϕi (s) and weight αi:

R (s) = α1ϕ1 (s) + α2ϕ2 (s) + ⋯ + αdϕd (s).

For each feature, we're expected to find the value that best fits the αi.

With the aid of IRL and an exploration trajectory, the reward function is enhanced by comparing the worth of all policies in a set of k created policies and selecting the best of them. We finish the algorithm by generating random reward function weights and then initialising our candidate. As a result, the algorithm's key moves are:

With the aid of the average accumulated reward from a set of random samples, the value of our optimal guideline must be determined for the initial state V ^ π (s0), as well as the value of each guideline that is created V ^ πi (s0).
A linear programming problem must be used to approximate the reward function R. Set αi to optimise the difference between our best guideline and any of the other k created guidelines.
Exit if there are a sufficient number of iterations. Otherwise, use a regular RL algorithm's knowledge to find the best guideline for R.
Since our approximate reward function is not always equal to the reward function we are aiming for, this guideline may differ from the optimal guideline.
The process is then replicated, with the newly created policy being applied to the current set of k nominee policies.

(Must read: Machine Learning Tutorial)

Apprenticeship learning from a teacher’s strategy

We may specifically learn techniques with success equal to that of the experts in addition to learning the reward mechanism from the experts. If we have an expert policy that is only next to the finest, it'll be very helpful.

In this case, we can use the machine learning algorithm rules for learners developed by Pieter Abbeel and Andrew ng in 2004 as a solution. The algorithm employs MDP and the "teacher's" best approximation approach, allowing it to learn methods that are equal to or better than the teacher's.

This minor exploratory capability seems to be extremely useful in delicate missions such as autonomous helicopter flight. The conventional rules of the reinforcement learning algorithm allow for a random search to begin, nearly causing the helicopter to crash in the first experiment.

In an ideal world, we'll be able to use our experience to create a baseline policy that can be safely developed over time. In any case, the baseline strategy should be far better than the initialization strategy, since this would hasten convergence.

Applications of IRL

Some of the commonly used examples of IRL are:

It helps in simulating highway driving and prevents accidents.
It assists in navigation using aerial images and in parking lots navigation.
It allows path planning to be done for humans and in predicting goals.

Conclusion

We have seen above why the inverse RL is useful in resolving the inverse effect of Reinforcement learning but it also comes with disadvantages of its own, being:

The problem arises as there are many fitting reward functions for most behaviour observations. The solutions come up with numerous degenerate solutions, i.e. assigning zero reward to all states.
The IRL algorithms often consider every observed behavior as optimal. This is a strong assumption and can backfire incase of human demonstrations.

These points show us that even IRL is not mistake prone and can cause some problems in carrying out the right reward function.