Experimentation @Intuit — Part 2 — Design of Experiments

Anil Madan
QuickBooks Engineering
6 min read · Sep 17, 2018

--

In Part 1 we looked at how Intuit’s culture of design thinking has evolved to embrace rapid online experimentation. In Part 2 we will look into Design of Experiments.

Any experiment is a continuous cycle of design, execution, and analysis. Let's take a closer look at each.

Fig 1: Experimentation lifecycle flywheel

Design

Let's start with some basic concepts in experimentation:

  • Controlled Experiments — A controlled experiment is one in which everything except the factor(s) under study is held constant, so observed differences can be attributed to the change. E.g. a prospective user is shown an alternative experience to see whether it increases sign-ups.
  • Factor (or variable) — A variable that can be changed independently (the cause) to create a dependent response (the effect). Factors take on assigned values, sometimes called levels. E.g. you can change the background color of a web page or the label of a button to measure click-through rates. Background color and button label are the factors in this example, while 'Free Trial' and 'Buy Now' are levels of the button-label factor.
Fig 2: Free Trial and Buy Now Buttons
  • Treatment (or variant) — The control is usually the current system and is considered the "champion", while the treatment is a modification that attempts to improve something: the "challenger". A treatment is characterized by changing the level(s) of one or more factors.
  • Experimental Unit — The physical entity that can be assigned, at random, to a treatment, e.g. a visitor, signed-in user, or customer. In the digital world (web, mobile, etc.) the user is the most common experimental unit, although some experiments are run on prospects or page views.

For example, the following two treatments were run on the QuickBooks home page.

Fig 3: Treatment 1 — all products in one place; Treatment 2 — products shown distinctly
  • Sample — A group of users who are served the same treatment.
  • Overall Evaluation Metric (or Criterion) — The measure, objective, or goal we are trying to achieve; the metric used to compare the response to different treatments.
  • Single-factor testing — A form of testing in which treatments corresponding to the values of a single factor are compared, e.g. is there a Free Trial button on the page? Yes or no?
  • Multi-factorial testing (MVT) — A method of testing in which treatments corresponding to combinations of values across multiple factors are compared, e.g. is there a Free Trial button on the page (yes or no), and where is the button located (top or right rail)? The sketch after this list shows the treatments each approach generates.
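
To make the distinction concrete, here is a minimal sketch in Python, with illustrative factor names that are not tied to any Intuit system:

```python
from itertools import product

# Illustrative factors and their levels for a sign-up page.
factors = {
    "free_trial_button": ["yes", "no"],
    "button_location": ["top", "right_rail"],
}

# Single-factor test: vary one factor, hold everything else at its default.
single_factor = [{"free_trial_button": level} for level in factors["free_trial_button"]]
# -> 2 treatments

# Multi-factorial test (MVT): every combination of levels across all factors.
mvt = [dict(zip(factors, combo)) for combo in product(*factors.values())]
# -> 2 x 2 = 4 treatments
print(single_factor, mvt, sep="\n")
```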

With the basic concepts defined, we conduct experiments at several distinct stages in the product lifecycle, with the goal of continuously learning and iterating.

Fig 4 — Experimentation Type Funnel

The experimentation funnel starts with simulations in our big data systems. These generally help us validate a few hypotheses quickly and, more importantly, discard the ones that don't make sense. Once a minimum viable product is developed we launch an Alpha: a playground opened internally to our employees with the goal of gathering qualitative feedback. After the alpha phase we launch Betas, which are opt-in (in our QuickBooks Labs), again with the goal of gathering qualitative data. Finally, A/B or multivariate tests are the online controlled experiments we conduct to gather quantitative data and test for statistical significance. These are full-blown experiments.

Setup

Experiment

Our experimentation tool has a simple self-service interface to create an experiment by specifying the name, duration (start date and end date), and region, and to set up treatments by specifying the allocations (the percentage of traffic directed to each).

Fig. 5 — Experimentation Tool, Create Experiment Screen
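
For illustration, a created experiment boils down to a definition like the following. The field names are hypothetical, not the platform's actual schema:

```python
# Hypothetical experiment definition mirroring the create-experiment screen.
experiment = {
    "name": "home-page-layout",
    "region": "US",
    "startDate": "2018-09-17",
    "endDate": "2018-10-17",
    "hashId": 7,  # hashing constant, discussed below
    "treatments": [
        {"name": "control",   "allocation": (0, 5)},   # buckets 0-4, 5% of traffic
        {"name": "treatment", "allocation": (5, 10)},  # buckets 5-9, 5% of traffic
    ],
}
```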

Treatment and Factors

The treatment screen captures the name, optionally the factors, and the allocation range. The allocation range determines whether a user experiences the treatment: if userId modulo 100 falls in this range, then the user experiences this treatment.

Fig. 6 — Experimentation Tool, Create Treatment Screen
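
In code, the allocation check is essentially a bucket-range test. Here is a minimal sketch; as described later, the production platform hashes the userId before bucketing:

```python
def in_treatment(user_id: int, alloc_start: int, alloc_end: int) -> bool:
    """True when the user's bucket (userId modulo 100) falls in the
    treatment's allocation range [alloc_start, alloc_end)."""
    return alloc_start <= user_id % 100 < alloc_end

# A (5, 10) allocation range covers buckets 5-9, i.e. 5% of users.
print(in_treatment(user_id=107, alloc_start=5, alloc_end=10))  # True (bucket 7)
```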

Mutually Exclusive Spectrum

Fig. 7 — Mutually Exclusive Experiments

A simple experiment on Sign-Up allocates 5% of traffic each to control and treatment. A user with userId modulo 100 = 25 maps to the control group and is shown the default Sign-Up experience. Another experiment on the Home Page allocates 10% each to control and treatment; another user with userId mod 100 = 45 maps to the Home Page treatment. In the diagram above, Sign-Up and Home Page are set up as two mutually exclusive experiments, i.e. at any given time a user is in at most one of the two experiments.
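
Here is a sketch of how such an exclusive spectrum resolves a user to at most one experiment. The bucket ranges are illustrative, chosen only to match the two example users above:

```python
# Illustrative exclusive spectrum: each experiment owns disjoint slices
# of the 0-99 bucket space, so a user is in at most one experiment.
spectrum = [
    ("sign-up",   "control",   range(25, 30)),  # buckets 25-29 (5%)
    ("sign-up",   "treatment", range(30, 35)),  # buckets 30-34 (5%)
    ("home-page", "control",   range(35, 45)),  # buckets 35-44 (10%)
    ("home-page", "treatment", range(45, 55)),  # buckets 45-54 (10%)
]

def resolve(user_id: int):
    bucket = user_id % 100
    for experiment, variant, buckets in spectrum:
        if bucket in buckets:
            return experiment, variant
    return None  # user is in no experiment

print(resolve(125))  # ('sign-up', 'control')      bucket 25
print(resolve(945))  # ('home-page', 'treatment')  bucket 45
```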

Overlapping Spectrums

Fig. 8 — Interleaving Orthogonal and Exclusive Experiments

While an exclusive spectrum is a good way to separate traffic and avoid collisions, it does not scale to the growing need for experimentation within Intuit. Since at any given time we need to run thousands of experiments, we need a way to create overlapping segments. As such, we need experiments that can run orthogonal to each other, with uniformly random collisions.

In the figure above there are 5 groups of experiments that show the power of exclusive and orthogonal spectrums.

  • Group A — Home Page and Profile are set up orthogonal to each other.
  • Group B — Login, Sign-Up and Settings are set up orthogonal to each other.
  • Group C — is reserved for A/A tests that validate the platform itself. We monitor the statistical framework used for decision making by maintaining a pool of A/As (experiments whose treatment does not introduce any sort of change), which allows us to validate its theoretical properties (e.g. false positive rates or the statistical distribution of metrics); see the sketch after this list.
  • Group D — is exclusively reserved to run an experiment on Reports and Insights.
  • Group E — is used to ramp up the winning treatment of an experiment that has proven successful.
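
As a sketch of what an A/A pool validates: when control and treatment draw from the same distribution, roughly alpha (5%) of experiments should still come out "significant" by chance alone. This toy simulation assumes a standard two-sample t-test via scipy; the platform's actual framework may differ:

```python
import random
from scipy import stats  # assumes scipy is available

random.seed(7)
n_experiments, n_users, alpha = 1000, 500, 0.05

false_positives = 0
for _ in range(n_experiments):
    # Both samples come from the same distribution: a true A/A.
    control   = [random.gauss(0, 1) for _ in range(n_users)]
    treatment = [random.gauss(0, 1) for _ in range(n_users)]
    _, p_value = stats.ttest_ind(control, treatment)
    if p_value < alpha:
        false_positives += 1

# A healthy framework reports ~5% false positives on A/A experiments.
print(false_positives / n_experiments)
```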

Groups A, B, C, and D use an exclusive spectrum: experiments across groups don't collide with each other, as collisions can pollute the results. Inside each group there may be overlapping orthogonal experiments that collide uniformly across control and treatments.

The hashing constant (hashId, in Fig. 5) serves as input to an MD5-based algorithm used to uniquely define the orthogonal plane. As we want to run thousands of concurrent experiments, different hashIds ensure that the randomizations of active experiments are orthogonal to each other. Google, Microsoft Bing, and LinkedIn use similar approaches.
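
A minimal sketch of the idea, assuming a straightforward MD5-then-modulo scheme (the platform's exact algorithm is not described here):

```python
import hashlib

def bucket(user_id: str, hash_id: int, num_buckets: int = 100) -> int:
    """Map a user to a bucket via MD5. Experiments sharing a hashId share
    one randomization plane (and use disjoint allocation ranges), while
    different hashIds produce statistically independent, i.e. orthogonal,
    bucketings of the same users."""
    digest = hashlib.md5(f"{hash_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

# The same user lands in unrelated buckets on different orthogonal planes.
print(bucket("user-42", hash_id=1), bucket("user-42", hash_id=2))
```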

Segmentation

We also allow experimenters to further segment users to target different sub-populations.

Fig. 9 — Segmentation

Built-in Attributes — The platform exposes more than 300 built-in customer attributes to experimenters. They range from static attributes, such as user subscription status, to dynamic attributes, such as a member's last login date. These attributes are either computed daily as part of our data pipelines or stored in real time in our profile infrastructure.

Contextual Attributes — These attributes are only available at runtime, such as the browser type, user country, or mobile device. For example, to target only requests coming from iPhone9, one just needs to inform the platform that an attribute called "deviceType" is to be evaluated at runtime, and target only those requests whose value equals "iPhone9".
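
Conceptually, runtime targeting reduces to matching the request context against a segment's rules. A minimal sketch follows; the rule format is illustrative, not the platform's actual API:

```python
def matches_segment(context: dict, rules: dict) -> bool:
    """True when every rule attribute equals the corresponding value
    in the request context (missing attributes never match)."""
    return all(context.get(attr) == expected for attr, expected in rules.items())

request_context = {"deviceType": "iPhone9", "country": "US"}
print(matches_segment(request_context, {"deviceType": "iPhone9"}))  # True
print(matches_segment(request_context, {"deviceType": "Pixel"}))    # False
```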

Hopefully Part 2 gave you a good sense of how we design and set up experiments, and how the platform scales to support thousands of concurrent experiments. In Part 3 we will look into the execution engine that serves experiments.
