Introduction
Such a crappy thing as A/B testing is used too often. It is defective by definition: divide users into two groups. Show group A an old, proven site. Show Group B the cool new design with big buttons and slightly edited text. After some time, check the stats and see which group clicked the “Order” button more often. Sounds kind of good, right? But the problem is right in front of you. This is the same dilemma faced by drug developers. During tests, you only give half of the patients the life-saving drug. The latter receive a placebo. Even if your miracle drug works, you still lost patients from group B. This sacrifice is made in order to obtain data. But you can do everything differently!
In recent years, hundreds of the best minds have worked hard, but not to cure cancer. Instead, they honed algorithms that make you click on banner ads more often. And it works. Both Google and Microsoft are focusing on usingOmore information about users in order to predict what to show them. It’s strange, but in mainstream services now (approx. translation.: for 2012) there is nothing better than A/B testing, but I hope to improve this situation by raising awareness of more effective techniques.
With 20 lines of simple code that you can write today, you can always get better results than regular A/B testing, sometimes 2-3 times better! This method has several advantages:
- You can test more than two options at once. A,B,C,D,E,F,G, etc.
- New options can be added or removed at any time
And one of the most attractive things is that you can set everything up once and forget about it! If your time is valuable to you, then you stupidly have no time to go in and see how this or that option behaves there, and choose the best one. Just let the algorithm work. These 20 lines of code quickly and automatically find the best option and use it until it is no longer the best.
Multi-armed bandits problem

The terminology for the problem of multi-armed bandits came to us from casinos. There are a bunch of slot machines in front of you, each with its own lever. You suspect that some machines have a higher winning percentage than others. How to determine which machine is the best and get the most money in the least number of attempts?
As with many machine learning techniques, the simplest strategies are difficult to improve. No, more complex techniques need to be kept in mind, but they usually only allow you to squeeze out a few hundredths of a percent extra. One strategy that has worked well in practical problems isepsilon-greedyalgorithm. The point is this: we always keep track of how many times we pulled the lever and how much profit we got from it. In 10% of cases we choose a lever randomly, the remaining 90% of times we choose the lever with the highest probability of winning.
def choose():
if math.random() < 0.1:
# exploring!
# choose a random lever 10% of the time
else:
#exploit!
# for each lever,
# We calculate the probability of winning.
# = number of wins divided by number of attempts
# i.e. by the number of times we “pulled the lever”.
# select the lever with the highest chance of winning
# increase the number responsible for how many times we pulled this lever
# save data somewhere like Redis
def reward(choice, amount):
# add a reward to the total number of rewards for the selected lever
Why does this work?
Let’s say we choose a color for the «Buy Now» button. Available in orange, green or white. Initially, we give all three options the same chance of winning: one win from one attempt, that is, 100%. It is not so important what initial values we set, the algorithm adapts. So when we start, our starting data looks like this:
| Orange | Green | White |
|---|---|---|
| 1/1 = 100% | 1/1=100% | 1/1=100% |
Then a user comes to our site, and we need to show him our button. We choose the first one with the highest probability of winning. The algorithm sees that now they are all the same, so it takes the first one: orange. But… the user does not click on the button.
| Orange | Green | White |
|---|---|---|
| 1/2 = 50% | 1/1=100% | 1/1=100% |
Then the second user comes. We definitely won’t show him the orange button, because now it only has a 50% chance of working. That’s why we choose green. The same thing happens for a few more visits, and we simply go through the options, updating our data each time.
| Orange | Green | White |
|---|---|---|
| 1/4 = 25% | 1/4=25% | 1/4=25% |
But suddenly someone clicks on the orange button! The browser sends an Ajax request, and we update the resulting table:
| Orange | Green | White |
|---|---|---|
| 2/5 = 40% | 1/4=25% | 1/4=25% |
When our designer sees this, he scratches his head in thought: what the fuck??! The orange button sucks! It has a small font size, but the green one is obviously better! Everything is gone! The «greedy» algorithm will now always choose orange!
But let’s wait and see what happens if orange isn’t actually the best option. Since the algorithm thinks that the orange button is at the top, it will always show it now. Until she stops being the best.
| Orange | Green | White |
|---|---|---|
| 2/9 = 22% | 1/4=25% | 1/4=25% |
After we have processed the traffic, the best option will be found (if there is one, of course) and it will be shown in 90% of cases. Here are the results based on data from a real site I worked on.
| Orange | Green | White |
|---|---|---|
| 114/4071 = 2.8% | 205/6385=3.2% | 59/2264=2.6% |
What about randomization?
We haven’t discussed the randomization part. Randomly selecting a lever in 10% of cases forces the algorithm to explore the available options. It’s a trade-off between looking for something new in the hope that the new one will work better, and using something that already works well. There are several variants of the epsilon-greedy algorithm. In one of them, at the beginning, the algorithm spends 100% of the time exploring options, and only after it has accumulated enough data does it switch to exploitation. In another algorithm, you can gradually reduce the percentage of studies over time. But the epsilon-greedy algorithm I described is a good balance between simplicity and performance. Exploring other algorithms like UCB, Thompson sampling, and those that take context into account is awesome, but optional if you want something that just works.
Wait a minute, why isn’t everyone using this?
Statistics are difficult for many people to understand. People don’t trust things they don’t understand, and they especially don’t trust machine learning algorithms, even if they are simple. Mainstream services and tools do not support these algorithms because then they would have to teach them how to use them, teach them statistics — and this is difficult! There may also be several typical objections:
- Showing different options with different frequencies will distort the results! (No, don’t get it wrong. You always have a calculated CTR for each option).
- No adaptation to changes! (Your visitors most likely do not change. But if you want to make such an adaptation, simply change the reward function and multiply the old reward value by some factor (forgetting factor)).
- This thing will not take out changes to several things that depend on each other at once! (True, but A/B tests won’t get rid of such crap either)
From the translator
Even considering the above paragraph, I have absolutely no idea whyprofessionalTools for traffic arbitration (in particular in trackers) have not yet implemented the ability to use such algorithms in their work. After all, it wouldOMost of them are easy to write and use, and the benefits of implementation are obvious and will be very noticeable in monetary terms for both a solo arbitrator and teams of any size.
For those interested in the topic, there is an excellent book published by O’Reilly, where several algorithms for multi-armed bandits are analyzed with examples, hereherRussian-language review.
And with you astranslator of this articlethere was Yellow Web, go ahead, gentlemen!



Уведомление: Максимизируем профит, используя алгоритм "многоруких бандитов" в Кейтаро | Жёлтый веб
Хмм, разве это не реализовано в трекерах? Выглядит как секрет полишинеля, любой, кто полноценно пользуется трекером, льет и оптимизирует — это использует. Грубо говоря в начале тестится 4-5 прелендов, по мере сбора данных процент трафика на слабые потоки уменьшается, остается один рабочий и несколько процентов на тесты, в случае добавления новых — также на них процент небольшой. Эти настройки есть в любом трекере)
Нет, ничего похожего в трекерах нынче нет. Настройки регулирования % трафика на разные ленды вручную — это прекрасно, но зачем, когда есть алгоритмы, которые могут изменять этот самый процент в автоматическом режиме и делать это лучше?