## Introduction

Such a shitty thing as A/B testing is used too often. It is defective by definition: divide users into two groups. Show group A the old-tested site. Show group B the shitty new design with big buttons and slightly edited text. After some time, check the status and see which group clicked on the "Order" button more often. Sounds good, right? But the problem is right in front of you. It's the same dilemma that new drug developers have. During the tests, you only give half of your patients a life-saving drug. The other half get a placebo. Even if your miracle cure works, you still lose the B patients. That sacrifice is made in order to get the data. But you can do things differently!

Hundreds of top minds have been working hard in recent years, but not to cure cancer. Instead, they've been honing algorithms that make you click on more advertisement banners. And it works. Both Google and Microsoft are focusing on using b**о**The mainstream services nowadays ("mainstream services") have a lot of information about their users in order to predict what to show them. It's strange, but in mainstream services now (*n.a.*.: for 2012) there is nothing better than A/B tests, but I hope to remedy this situation by raising awareness of more effective techniques.

Using 20 lines of simple code that you can write today, you can always get better results than with normal A/B testing, sometimes 2-3 times better! This method has several advantages:

- You can test more than two options at once. A,B,C,D,E,F,G etc.
- New options can be added or removed at any time

And one of the most attractive things is that you can set everything up once and forget about it! If you value your time, you don't have the time to go in and see how a particular option behaves and choose the best one. Just let the algorithm work. Those 20 lines of code quickly and automatically find the best option and use it until it's no longer the best option.

## The task of the multi-armed bandits

The terminology of the task of multi-armed bandits came to us from casinos. You have a pile of slot machines in front of you, each with its own lever. You suspect that some machines have a higher winning percentage than others. How do you determine which machine is the best and get the most money for the least amount of tries?

As with many other machine learning techniques: the simplest strategies are hard to improve. No, the more complex techniques need to be kept in mind, but they usually only squeeze out a few hundredths of a percent extra. One strategy that has performed well in practical problems is *epsilon greedy* algorithm. The point is this: we always keep track of how many times we yank the lever and how much profit we get from it. In 10% cases we choose the lever randomly, the remaining 90% times we choose the lever with the highest probability of winning.

```
def choose():
if math.random() < 0.1:
# explore!
# choose random lever in 10% cases.
else:
# exploit!
# for each lever,
# count the probability of winning.
# = number of wins divided by number of tries
# i.e. by the number of times we "pulled the lever".
# choose the lever with the highest chance of winning
# increment the number corresponding to how many times we have pulled the lever
# save the data somewhere like Redis
def reward(choice, amount):
# add the reward to the total amount of reward for the selected lever
```

## Why does it work?

Let's say we choose a color for the "Buy Now" button. Orange, green, or white are available. Initially, we set all three options to have the same chance of winning: one win in one attempt, that is, 100%. It doesn't really matter what initial values we set, the algorithm adapts. So, when we start, our starting data look like this:

Orange | Green | White |
---|---|---|

1/1 = 100% | 1/1=100% | 1/1=100% |

Then a user comes to our site, and we have to show him our button. We choose the first one with the highest probability of winning. The algorithm sees that they are all the same now, so it takes the first one: the orange one. But... the user doesn't click the button.

Orange | Green | White |
---|---|---|

1/2 = 50% | 1/1=100% | 1/1=100% |

Then the second user comes in. We definitely don't show him the orange button, because it now has only a 50% chance of working. So we choose the green one. All the same happens for a few more visits, and we just go through the options, updating our data each time.

Orange | Green | White |
---|---|---|

1/4 = 25% | 1/4=25% | 1/4=25% |

But suddenly someone clicks on the orange button! The browser sends an Ajax request, and we update the resulting table:

Orange | Green | White |
---|---|---|

2/5 = 40% | 1/4=25% | 1/4=25% |

When our designer sees this, he scratches his head thoughtfully: what the fuck?! The orange button sucks! It has a small font size. The green one is obviously better. It's gone! "The greedy algorithm will always choose the orange one.

But let's wait and see what happens if orange isn't really the best option. Since the algorithm thinks the orange button is top, it will always show it now. Until it is no longer the best option.

Orange | Green | White |
---|---|---|

2/9 = 22% | 1/4=25% | 1/4=25% |

After we shed the traffic the best option will be found (if there is one, of course) and it will be shown in 90% cases. Here are the results, based on data from a real site I worked on.

Orange | Green | White |
---|---|---|

114/4071 = 2.8% | 205/6385=3.2% | 59/2264=2.6% |

## What about randomization?

We did not discuss the part about randomization. Randomizing the lever in 10% cases forces the algorithm to explore the available options. It is a tradeoff between looking for something new in the hope that this new thing will work better, and using something that already works well. There are several variants of the epsilon greedy algorithm. In one of them, the algorithm takes 100% of time exploring options in the beginning, and only after it has accumulated enough data does it switch to exploitation. In the other algorithm, you can gradually decrease the percentage of exploration over time. But the epsilon-jaded algorithm I described is a good balance between simplicity and performance. Exploring other algorithms like UCB, Thompson sampling, and those that take context into account is fucking awesome, but optional if you want something that just works.

## Wait a minute, so why isn't everyone using it?

A lot of people find statistics hard to understand. People don't trust things they don't understand, especially they don't trust machine learning algorithms, even if they are simple. Mainstream services and tools don't support these algorithms because then they would have to teach how to use them, teach statistics - and that's hard! There may also be a few typical objections:

- Showing different options at different frequencies will skew your results! (No, you won't. You always have a calculated CTR for each option).
- No adaptation to change! (Your visitors probably don't change. But if you want to make such adaptation just change reward function and multiply old reward value by some factor (forgetting factor).
- This thing won't take out changes of several things that depend on each other at once! (True, but the A/B tests won't take that shit out either.)

## From the interpreter

Even considering the above paragraph, I don't understand at all why the **professional** The tools for traffic arbitrage (in particular, trackers) have not yet implemented the ability to use such algorithms in their work. After all, b**о**Most of them are easy to write and use, and the benefits are obvious and will be very tangible in monetary terms, both for solo arbitrators and for teams of all sizes.

For those interested in the topic there is a great book by O'Reilly Publishers, where a few algorithms of multi-armed bandits are disassembled with examples, here her Russian version.

And with you as of the translator of this article was Yellow Web, pour in the plus, gentlemen!

Уведомление: Maximizing Profits Using the Keitaro Multi-Handed Bandits Algorithm | Yellow Web

BenderHmm, isn't this implemented in trackers? It looks like a polichinelle secret, anyone who fully uses tracker, pours and optimizes - it uses. Roughly speaking, in the beginning 4-5 prelends are tested, as the data is collected, the percentage of traffic to weak streams decreases, leaving one working and a few percent for the tests, in the case of adding new ones - also on them a small percentage. These settings are in any tracker)

Yellow WebNo, there is nothing similar in trackers nowadays. Manual adjustment of % traffic to different links is fine, but why when there are algorithms that can change this same percentage in automatic mode and do it better?