
A Recent Shift in the Perception of Reinforcement Learning

The solution might not be where you expect it to be

The latest breed of AI can work things out for itself, without being taught by people

This is the subtitle of a commendable article in The Economist of October 21st, 2017. The occasion is the new AlphaGo Zero program by AI pioneer DeepMind and the corresponding paper in the journal Nature.

AlphaGo first drew wide press attention in 2016, when it beat Mr. Lee, one of the world’s best Go players. This is impressive because Go, a popular game in China, Japan and Korea, was long considered too hard for computers to master. The number of legal board positions is incredibly high, at about 10¹⁷⁰. By comparison, there are an estimated 10⁸⁰ atoms in the observable universe.

The newly minted AlphaGo Zero beat this older AlphaGo in 100 out of 100 games.


The first iteration of AlphaGo started with what most people understand by machine learning, called supervised learning. It learned from some 30 million moves taken from human games, recognized patterns and applied them intelligently in its play (yes, a gross simplification). This is the same method used in run-of-the-mill machine learning applications such as speech and image recognition.

The new AlphaGo Zero uses no previous data but learns from scratch by playing against itself. This method is called reinforcement learning. In the words of The Economist:

The program starts only with the rules of the game and a reward function, which awards it a point for a win and docks a point for a loss. It is then encouraged to experiment, repeatedly playing games against other versions of itself, subject only to the constraint that it must try to maximize its reward by winning as much as possible.
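To make the quoted setup concrete, here is a minimal sketch of self-play with a win/loss reward. It is not AlphaGo Zero's method (which combines a deep neural network with tree search); it is a plain lookup-table learner on a toy game I chose for illustration, in which players alternately remove 1 to 3 stones from a pile and whoever takes the last stone wins. The function names and the settings (number of games, learning rate, exploration rate) are my own assumptions.

```python
import random

def legal_moves(stones):
    # A player may remove 1, 2 or 3 stones, but never more than remain.
    return range(1, min(3, stones) + 1)

def choose(q, stones, rng, epsilon):
    # Mostly play the best move known so far, sometimes experiment.
    moves = list(legal_moves(stones))
    if rng.random() < epsilon:
        return rng.choice(moves)
    return max(moves, key=lambda m: q.get((stones, m), 0.0))

def train(games=20000, max_pile=10, alpha=0.1, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {}  # (stones_left, move) -> estimated reward for the player to move
    for _ in range(games):
        stones = rng.randint(1, max_pile)
        history = [[], []]  # (state, move) pairs played by each side
        player = 0
        while stones > 0:
            move = choose(q, stones, rng, epsilon)
            history[player].append((stones, move))
            stones -= move
            player = 1 - player
        winner = 1 - player  # the side that took the last stone
        for p in (0, 1):
            # The reward function: one point for a win, minus one for a loss.
            reward = 1.0 if p == winner else -1.0
            for key in history[p]:
                old = q.get(key, 0.0)
                q[key] = old + alpha * (reward - old)
    return q

def best_move(q, stones):
    # Greedy move according to the learned table.
    return max(legal_moves(stones), key=lambda m: q.get((stones, m), 0.0))

q = train()
print({n: best_move(q, n) for n in range(1, 8)})
```

The program is given only the rules (`legal_moves`) and the reward. With these settings it should discover the known winning strategy for this toy game, which is to leave the opponent a multiple of four stones whenever possible, without ever seeing a human game.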

This is where, in my opinion, reinforcement learning really leads to a paradigm shift. Its advantages over supervised learning are:

  • You need no clean data set to begin with, which is one of the main pains with supervised learning
  • You don’t impose biases that come from the data set (e.g. if you try to imitate human behavior, you copy human errors and are limited to human knowledge)
  • It can be much faster and more energy-efficient, as it was for AlphaGo Zero, and the process of training the machine can be better automated

To cite The Economist again:

An algorithm that can learn without guidance from people means that machines can be let loose on problems that people do not understand how to solve. Anything that boils down to an intelligent search through an enormous number of possibilities, said Mr Hassabis (ed. DeepMind CEO), could benefit from AlphaGo’s approach.

Our company, GenLots, uses a methodology similar to AlphaGo Zero’s to optimize purchasing for large industrial companies. GenLots has tested the application with a pharmaceutical group: we find the optimal way to order raw materials under various constraints, such as quantity discounts, cost of capital, costs per order and shelf life, in order to minimize the total cost of ownership, which serves as our reward function. This has been shown to save our clients several million annually, a result that would be impossible to achieve with humans alone.
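As an illustration of how a total cost of ownership can act as a reward function, here is a simplified sketch. It is not GenLots' production model: the `Material` fields, the first-in-first-out stock handling and all the numbers are hypothetical, chosen only to show how quantity discounts, costs per order, cost of capital and shelf life can interact in one score.

```python
from dataclasses import dataclass

@dataclass
class Material:
    price_tiers: list      # [(min_quantity, unit_price)], ascending (hypothetical)
    cost_per_order: float  # fixed cost for each order placed
    capital_rate: float    # cost of capital per period, as a share of stock value
    shelf_life: int        # number of periods a delivered unit stays usable

def unit_price(material, quantity):
    # Quantity discount: the highest tier reached sets the price.
    price = material.price_tiers[0][1]
    for min_qty, tier_price in material.price_tiers:
        if quantity >= min_qty:
            price = tier_price
    return price

def total_cost(material, demand, orders):
    # demand[t] and orders[t] are quantities for period t.
    batches = []  # [expiry_period, units_left, unit_price], oldest first
    cost = 0.0
    for t, (need, qty) in enumerate(zip(demand, orders)):
        if qty > 0:
            price = unit_price(material, qty)
            cost += material.cost_per_order + qty * price
            batches.append([t + material.shelf_life, qty, price])
        # Expired stock was paid for but can no longer be used.
        batches = [b for b in batches if b[0] > t]
        for b in batches:  # consume the oldest stock first
            used = min(need, b[1])
            b[1] -= used
            need -= used
        if need > 0:
            return float("inf")  # plan cannot cover demand
        # Capital tied up in the stock carried into the next period.
        cost += material.capital_rate * sum(b[1] * b[2] for b in batches)
    return cost

def reward(material, demand, orders):
    # Lower total cost of ownership means higher reward.
    return -total_cost(material, demand, orders)

material = Material([(1, 10.0), (300, 9.0)], 50.0, 0.01, 3)
demand = [100, 100, 100, 100]
every_period = [100, 100, 100, 100]
one_big_order = [400, 0, 0, 0]
mixed = [300, 0, 0, 100]
for plan in (every_period, one_big_order, mixed):
    print(plan, round(total_cost(material, demand, plan), 2))
```

In this made-up example, ordering every period forgoes the discount, the single large order fails because part of the stock expires before it is needed, and the mixed plan comes out cheapest. A learning agent would search over such plans to maximize `reward`.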

GenLots’ vision is to build the “artificial supply chain brain”, which makes ALL decisions in the supply chain automatically and optimally, including those for which humans have so far had neither the capacity nor the time. As those decisions aren’t made at all today, GenLots doesn’t set out to replace humans, but to optimize the blind spots of a given company.

One of the most frequent questions we get from potential clients is “We don’t have clean data. How do you apply machine learning?”. Our answer, that we don’t need any historical data to learn from, always seems to come as a surprise. This is largely due to media attention, which has been directed almost exclusively at supervised learning and has anchored machine learning to big data in the minds of business people. While supervised learning has its uses, I am incredibly grateful to DeepMind (and The Economist) for shifting public attention to reinforcement learning, an underreported machine learning breakthrough for business applications.

DeepMind cites protein folding, reducing energy consumption and searching for revolutionary new materials as domains where its new findings could apply. Those are all lofty goals, but we should not forget that reinforcement learning can be applied immediately to industrial operations, where it yields concrete and large productivity gains.
