We have discussed methods to evaluate the design of a software system, such as inspection methods that rely on UI design experts' assessments (8. Inspection Methods) and usability tests in which we let users evaluate the system's usability (12. Evaluating Design). What we have not discussed so far is how we expose the design to the evaluators.
The design evaluation methods that we have discussed thus far are classified as observational studies. For example, in the heuristic evaluation with UI design experts, we asked the evaluators to use a system prototype and comment on the design based on Nielsen's ten heuristics. Here, the evaluator had control over which part of the UI design to comment on and which heuristic to apply. We did not intervene by asking the evaluators to compare two (or more) designs. Such evaluation methods are useful and widely used by designers. However, they can be susceptible to evaluators' biases and personal opinions. Furthermore, the lack of a principled procedure to compare two (or more) designs prevents us from discussing the causal relationship between a design difference and its effect on usability.
An experiment is a procedure in which you test the effects of an intervention on one or more outcome measures. In the context of the design process that we have been learning thus far, the intervention is exposing evaluators to two (or more) designs. As for outcome measures, we are often interested in some aspect of system usability. Unlike in observational studies, the decision of which evaluator tests which design is under the experimenter's control (Seltman, p.196). By following the procedure that we describe below and in the future modules, an experiment can let you show the causal relationship between a change in design and its impact on usability. Also, the rigor in the measurement and testing process gives more confidence when you argue that one interface design is better than the other, and it allows you to objectively discuss the benefit of making a design change (Preece et al., pp.484-488).
An experiment that allows you to objectively argue the benefit of one design over the other is powerful. Consider an example case from Kohavi et al. (2008). The story is as follows:
Greg Linden, then an employee of Amazon, suggested adding personalized product recommendations to the check-out page based on what is in the customer's shopping cart. The feature sounded like a good idea; it could cross-sell more items and increase the basket size per person. But a then-Senior Vice President told him to stop the project, as such a feature would distract a customer from checking out (thus reducing the conversion rate). Linden did not listen and conducted a simple experiment, which showed a massive increase in sales due to the personalized recommendations. The data convinced the top executives to deploy the feature officially. The moral of the story is that we should not naively listen to the HiPPO (Highest Paid Person's Opinion), but design based on tangible data.
You could conduct an experiment in the lab or remotely. An experiment conducted remotely on the web is called a web experiment. For web experiments, you can use tools such as web analytics tools or online usability testing tools.
Web analytics is a method of tracking users’ activities on a web page, such as how long people stay on a page, which other sites they came from, what ads they looked at and for how long, and so on. The collected data can be turned into useful behavioral information such as conversion rate—the percentage of users who took a desired action (Preece et al., p.264 and p.514).
Web analytics can be used for A/B testing (sometimes also called split testing or bucket testing), a simple form of randomized experiment that compares two versions of a system. In A/B testing, you recruit a number of people as evaluators and randomly split them into two groups. You then ask one group to use one version of the system (version A) and the other group to use another version (version B). While both groups use the system, you measure metrics such as conversion rate or task completion time through web analytics.
Often, when people conduct an experiment, they are interested in whether making some design change to the system affects some outcome. In such a context, the old (or currently used) system is referred to as version A, and the system with the design change is referred to as version B.
By convention, the condition where a group of people use version A of the system is called a control condition. And the group assigned to this condition is called a control group. On the other hand, the condition in which people use version B of the system is called a treatment condition. And the group assigned to this condition is called a treatment group.
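The random split into a control group and a treatment group can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed procedure: the function name `assign_groups`, the participant IDs, and the fixed seed are all assumptions made for the example.

```python
import random


def assign_groups(participant_ids, seed=0):
    """Randomly split participants into a control and a treatment group.

    The name, the seed, and the half-and-half split are illustrative
    assumptions, not part of any particular A/B testing tool.
    """
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)           # random order removes the experimenter's choice
    half = len(shuffled) // 2
    return {
        "control": shuffled[:half],     # these people use version A
        "treatment": shuffled[half:],   # these people use version B
    }


# Hypothetical participant IDs p01 ... p10
participants = [f"p{i:02d}" for i in range(1, 11)]
groups = assign_groups(participants)
print(groups["control"])
print(groups["treatment"])
```

Because the assignment is decided by a shuffle rather than by the experimenter or the participants, a difference in outcomes between the two groups is more plausibly attributed to the design change itself.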
We mentioned the term conversion rate above. A conversion rate is the percentage of users who take a desired action. For example, you might want to measure what percentage of all the people who visit an e-commerce website buy items there. This is the conversion rate for “completing a purchase transaction”. Discussing the outcome in terms of conversion lets you state the success or goal metric clearly. Of course, you can also measure other usability metrics in your experiment, such as the time a person takes to complete a task.
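As a small worked example, the conversion rate for each version can be computed from two counts: how many people visited and how many of them took the desired action. The function name `conversion_rate` and the visitor and purchase counts below are made up for illustration; they are not data from any real experiment.

```python
def conversion_rate(visitors, conversions):
    """Return the percentage of visitors who took the desired action."""
    if visitors <= 0:
        raise ValueError("visitors must be a positive number")
    return 100.0 * conversions / visitors


# Hypothetical counts for the two conditions
rate_a = conversion_rate(visitors=1000, conversions=30)  # control, version A
rate_b = conversion_rate(visitors=1000, conversions=45)  # treatment, version B

print(f"Version A: {rate_a:.1f}%")  # Version A: 3.0%
print(f"Version B: {rate_b:.1f}%")  # Version B: 4.5%
```

A gap between the two percentages is not conclusive on its own; whether it reflects a real effect of the design change or just chance is a separate question that requires statistical testing.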