Why Do You Care About What You Measure?

What an experiment allows you to show is the presence of a causal relationship between an independent variable and a dependent variable. In the design process, the system's design is most likely the independent variable (also called a factor or an explanatory variable) of your experiment. If you are comparing two designs in an A/B test, version A and version B are two values (also called levels) that the independent variable takes. The difference between the two designs could be anything, from something as minor as the position and color of text blocks to something more substantial such as the overall layout and the amount of "white space" on a page.

You let the experiment participants use version A and version B of the system prototype and see whether the dependent variable (also called the outcome) changes. In the previous module, we discussed that the dependent variable could be a conversion rate or another usability metric, such as how fast people can interact with the system or how many errors they make while interacting with the prototypes. If there is a meaningful difference between version A and version B of the design, an experiment lets you show that the change in the design causes a change in your dependent variable.
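As a concrete (and entirely hypothetical) illustration, the sketch below shows how each observation in an A/B test records which level of the independent variable a participant was exposed to (version A or B) and the value of the dependent variable (here, task completion time). The participant IDs and numbers are made up for illustration.

```python
# A minimal sketch (hypothetical data): each observation pairs one level of the
# independent variable (the design version a participant used) with the measured
# dependent variable (task completion time in seconds).
observations = [
    {"participant": "P01", "version": "A", "completion_time_s": 41.2},
    {"participant": "P02", "version": "B", "completion_time_s": 35.7},
    {"participant": "P03", "version": "A", "completion_time_s": 44.9},
    {"participant": "P04", "version": "B", "completion_time_s": 38.1},
]

# Group the dependent variable by the level of the independent variable.
times_a = [o["completion_time_s"] for o in observations if o["version"] == "A"]
times_b = [o["completion_time_s"] for o in observations if o["version"] == "B"]

print(f"Mean completion time, version A: {sum(times_a) / len(times_a):.1f} s")
print(f"Mean completion time, version B: {sum(times_b) / len(times_b):.1f} s")
```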

However, a change in the outcome does not directly show that design B (or design A) is "better" (Seltman, pp. 9-11). For example, consider the case where you have two design versions of a system, and you are measuring how fast a person can complete a certain task (i.e., task completion time) using the two designs. If there is indeed a meaningful difference between version A and version B, you will see a significant change in the task completion time between the control and treatment conditions. But what you have shown is that "people can complete the task faster with one design," not that "one design is better." You have to ask yourself, "why do I care about how fast people can interact with the system?" Or, "what makes me say that people being able to complete a task quickly is a hallmark of 'good' software design?"

So, as you design the experiment and choose a dependent variable, think about how you will make a strong claim that changing your design causes a change in the quality of the system's design. (*) Once you can argue that measuring the dependent variable (e.g., task completion time) allows you to quantify what you really want to measure (e.g., quality of design), state that in a short sentence or two; this is similar to what we did in 14. Laboratory Study. You can say something like: “The goal of this experiment is to see if the change in the design of the prototype affects the usability of the system. We measure how fast our participants perform the tasks that we designed to...” Similar to what we discussed in the module on usability testing, design the tasks that your experiment participants will perform. You should design the tasks so that measuring the outcome allows you to argue what you described in the goal.

Threats to Your Experiment

For your A/B test and the subsequent statistical analysis (e.g., NHST) to be trustworthy, the design of your experiment has to be valid. There are two basic types of validity: internal validity and external validity.

Internal validity

Internal validity is the degree to which we can appropriately conclude that the changes in a variable $X$ caused the changes in $Y$ (Seltman, 2014). The main threat to internal validity is confounding variables. When any factor other than the independent variable systematically influences the outcome, the study is not internally valid. Eliminating such confounding factors is called controlling confounding factors. See 14. Laboratory Study.

Sometimes, you encounter factors that are hard to control. To mitigate the effect of uncontrollable potential confounds, we randomize treatment assignment. We employ randomization to ensure that all of the potential confounding variables are equal on average across the treatment groups.
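For illustration, here is one minimal way randomization of treatment assignment might look in code (the participant IDs are hypothetical): shuffling the participant list and splitting it in half gives every participant an equal chance of ending up in either condition, so uncontrolled factors such as prior experience are balanced on average.

```python
import random

# A minimal sketch of random assignment: each participant is assigned to the
# control ("A") or treatment ("B") condition by chance, so that uncontrolled
# factors (e.g., prior experience, time of day) are equal on average across groups.
participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical participant IDs

random.seed(42)               # fixed seed only to make this illustration reproducible
random.shuffle(participants)  # shuffle in place

half = len(participants) // 2
control_group = participants[:half]      # use the current design (version A)
treatment_group = participants[half:]    # use the alternative design (version B)

print("Control (A):  ", control_group)
print("Treatment (B):", treatment_group)
```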

External validity

External validity is synonymous with generalizability (Seltman, 2014). Your study should be designed so that the results of your A/B testing will hold even outside of the study setting. Imagine, for example, you are designing an application for SMU students. To test its usability, you recruit students from SCIS and ask them to participate in an experiment. While the results of the experiment would generalize to the "usability of the application for SCIS students", they may not generalize to the "usability of the application for SMU students as a whole".

Internal validity and external validity are often in a trade-off relationship: increasing one could undermine the other. For example, controlling variables such as environmental factors could improve internal validity but reduce external validity. This is why we often want to triangulate: employ multiple study methods, where one has good external validity and another has good internal validity.

Null Hypothesis Significance Testing

https://youtu.be/bf3egy7TQ2Q

Null hypothesis significance testing (NHST) is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect based on a given observation (Pernet, 2015; Seltman, 2014). Let's say you are evaluating two versions of a software system's design through A/B testing. That is, you are measuring some usability metric (e.g., task completion time, conversion rate for a given task) in a control condition where people use the current design and a treatment condition where people use the alternative design. We use NHST to see if the alternative design makes any difference in usability relative to the current design.

The first step of NHST is to declare a null hypothesis and an alternative hypothesis, which in general look like: $H_0$: "the change in the independent variable has no effect on the outcome," and $H_A$: "the change in the independent variable does affect the outcome."

For example, the metric of your choice may be how long it takes for the users of the two interfaces to complete a given task. Here, the null hypothesis ($H_0$) would be "there is no difference in the task completion time between the two conditions." And the alternative hypothesis ($H_A$) would be "the task completion time in performing the given task differs between the two conditions." (**)
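As a sketch of how such a test might be carried out, the example below runs a two-sample (Welch's) t-test on hypothetical task completion times using SciPy. The t-test is one common choice for comparing the means of two conditions, but the appropriate test depends on your metric and data; all numbers here are made up for illustration.

```python
from scipy import stats

# Hypothetical task completion times (in seconds) from an A/B test.
times_a = [42.1, 39.5, 45.3, 41.0, 44.2, 38.8, 43.5, 40.9]  # control: current design
times_b = [36.4, 38.2, 35.1, 39.0, 34.7, 37.5, 36.9, 38.8]  # treatment: alternative design

# Welch's two-sample t-test: H0 says the mean completion times are equal,
# HA says they differ (two-sided test).
t_stat, p_value = stats.ttest_ind(times_a, times_b, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p is below the chosen significance level (e.g., 0.05), we reject H0 and
# conclude that the design change affects task completion time.
```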