ZIPT: Zero-Integration Performance Testing of Mobile App Designs

– Design plays an important
role in the adoption and ultimately the success
of mobile apps today. Designing mobile apps, however,
is a long, complex process. Even a small feature has to go through multiple stages
of design iteration to make it from a feature
request to the hands of users. First, in the ideation
stage, designers brainstorm possible ways to build the feature. In the process, they
often look at examples of how similar features were
implemented in other apps. A few of those ideas generated
in the ideation stage will then make it to
the prototyping stage. For mobile apps this could
be with paper prototypes or high-fidelity interactive
digital prototypes. Then, one of the prototype ideas will make it into the
implementation phase. Here, developers get involved, and they would help implement
the feature in the actual app. Finally, once the feature
has been sufficiently tested, it is released to actual end users. Now throughout these
different stages of design, designers would use multiple techniques to evaluate their designs. Prototypes are evaluated
with techniques such as heuristic evaluation, user testing, or even performance testing. These same techniques can also be used on the implemented app. In addition, platforms such
as usertesting.com, which allow remote
collection of usage videos with voiceover narration, can also be used. Finally, once the app is released, designers can look at usage metrics to understand if the
feature works as expected. And if a sufficient number
of users are available, they can also run A/B tests to
test small design variations and further refine the feature. So this is what the app design
process looks like today. In this work, we present a new way of testing mobile app designs. We call it zero integration
performance testing or ZIPT. ZIPT provides two unique advantages over current testing techniques. First, in the ideation
stage, it allows designers for the first time to understand the performance characteristics
of different examples that they are considering. This helps them make an informed decision about which examples
they want to draw from. Second, in the later stages, ZIPT allows competitive
benchmarking, by which I mean comparing the performance of your design to your competitors’ designs. It does so with significantly lower cost and effort than existing techniques. ZIPT is able to provide these benefits by leveraging the large
number of design ideas present in existing apps. ZIPT can be used on any
publicly available Android app with zero development effort. It currently depends on paid crowd workers to collect interaction data
and qualitative feedback. It then processes this information and surfaces relevant performance
metrics for designers. Let me now show what
using ZIPT looks like. The first step is test creation. To create a test, a
designer will upload an app through ZIPT’s web interface. They also describe a
task that they want users to perform on this app, and
the number of users they want. For example, to test the
store locator feature on the Macy’s app, they could ask users to find the address of the closest store to a given zip code. Designers can also
include feedback questions, such as quantitative questions asking users to rate the difficulty of the app or the task, or qualitative questions asking about issues encountered
in performing the task. Once the designer has
submitted this information, ZIPT launches it on Amazon Mechanical Turk and lets crowd workers attempt the task. Crowd workers see a
streamed version of the app through ZIPT’s web interface,
and the system records detailed interaction
traces in the background as they attempt the task. These traces capture the
different screens they visit and the interactions they
perform on these screens. If you’re interested in learning more about how the streaming
and the capture works, I’ll be talking more about it
during my second talk today in the crowdsourcing session at 3:30. So once these traces are
collected, ZIPT processes them and presents designers
with different aggregate visualizations and
metrics that they can use to understand the performance
of the tasks that they just requested. Now, let me
show you the interface and show you what these
visualizations look like. So here’s the task. The designer here asked users to find the address of
the closest Macy’s store to the zip code given here. This is phrased as a
question so crowd workers will be expected to answer
this at the end of the session. So once enough crowd workers
have attempted this task, the designer can go and look at the different traces
that are generated. The first trace that we show here is from the designer himself. This is the golden trace: it is how the designer thought users would attempt the task. We can see here the
different UIs that the designer went through. And there are five of them. And we see that the
easiest way to do this task is to click on the hamburger
icon on the top left and open the menu, click
on find a store, and so on, and in five interactions, you can find the address
of the nearest store. And we see here that
on the last screen that the correct address should
be 170 O’Farrell Street. We also see that it took the designer about 15 seconds to do this. Now if I go back, all
the other traces here are from the crowd workers. The designer can click on any of these and inspect how different crowd
workers attempted this task. Well, slow internet I guess. So in addition to interacting
with these individual traces, designers have the option of looking at three aggregate views of this data. The first aggregate view
shows them three metrics, the time taken by different
workers to attempt this task, the number of interactions
that they needed, and the completion rate
that is self-reported. So we can see here that
about 95% of the users were able to complete this task on Macy’s. And for most workers it
took less than a minute. On this interface the designers
can also inspect outliers. For example, they can look at users who could not complete the task. Those are shown in red here. Or they can look at users who
took an unusually long time. For example, if we click
on the red user here, we can go to that particular
user’s interaction trace. I’ll let this load. So we see here that this user
performed 25 interactions, took more than two minutes, and still could not complete the task. So the designer can look at
the different UI screens here and try to understand what happened, and what we’ll see here is that instead of using the hamburger menu, the user here was trying
to use the search bar. And unfortunately on Macy’s, the search bar only
searches over products. So you cannot really search for locations with the search bar. So, by looking at these
outliers, the designers can very quickly identify
failure cases like these. Wow, it’s really slow. (laughs) Okay, there we are. Okay, so there are two
other aggregate views that the designer can look at. The one that I just showed
was the aggregate metrics. They can also look at, in
aggregate, all the user ratings. They
can see, for example, that users reported that this app was actually easy for them to use. And they can also look
at all the task answers, and we’ll see that most of the users gave the correct answer to this task. Again, they can look at
some of the outliers, such as the users who
answered this incorrectly. And the final aggregate view that they have of the data is this
flow visualization tool. Here, the different nodes
show the different UIs and the different paths that users took through the app trying
to attempt this task. And in dark gray here,
we show the golden path that the designer had specified. What we see here is that 70% of the users take the first step that
the designer expected which is to click on the hamburger menu. So in this way designers
can get an understanding of how well this task performs by using data from crowd workers. We evaluated ZIPT in two ways. First, we performed a
series of case studies on 15 popular Android apps. We collected between 15
and 50 interaction traces from crowd workers and paid
about 30 cents for each trace. We also conducted interviews
with four designers, where we went over the case studies, and also let them use
ZIPT to explore the data for one of these case studies. I’ll now talk about some of the findings from these case studies. First, I’ll show two examples where we found usability issues in apps. Here, we asked users to pin images to a board in the Pinterest app. When we looked at one of the
outliers, who took a long time, and still couldn’t figure out the task, we found that she tried pinning images by clicking on the white icon that’s at the lower right corner of these images. This unfortunately
performs a visual search, a new feature in Pinterest. Perhaps the user expected
to see a menu there, which is a common interaction
pattern in Android. And the reason this works is that you can very clearly see
what users did in the app. One of our participants,
one of our designers, he said “it seems like there’s
very little that you lose “compared to having an over
the shoulder video camera.” For this example here, we
asked users to add a cookie to their daily log in
a personal food-logging app. And we found that a lot of users dropped off on this particular screen and encountered an error. This happened because, when entering the number of cookies they ate, they had to enter a unit on the right, which they forgot to do. When we showed this data to our designers and let them explore
this data for 10 minutes, they identified multiple
usability issues on the screen. For example, one of them said that, “The screen does not
clearly communicate that “something else was needed.” Designers also generated
multiple solutions to how we could fix
these usability problems. This included pre-selecting
one of the options, restructuring the data entry process, changing the component,
and modifying the layout. In our surveys, designers,
after this experience, believed that they had found at least one usability issue in the app. So this is how ZIPT can be useful for finding usability issues. ZIPT can also be used for
comparative evaluations. Here, we compared the
store locator feature in Macy’s versus Best Buy. In Macy’s there’s one way to
find the location of a store. In Best Buy there are two ways and both of them require fewer
interactions than Macy’s. What we found out though was
that people reported that Best Buy was harder and had
a lower task completion rate. When examining the traces,
we found one possible cause of confusion. Users did not really understand the scope of Best Buy’s search bar, which changes depending
on the selected tab. The user shown here for example,
tried to search for stores on the home tab, which
searches over products. In this example, we asked
users to create a new playlist and add two songs to
it on two popular apps, YouTube Music and Spotify. We found that YouTube Music
had a lower completion rate and that is because users could not create an empty playlist
first and then add songs. Spotify allowed that, as
well as the other way around, where you can go to a
song and then try to add it to a playlist and say that
it’s going to be a new one. It seemed like users
wanted this affordance to be able to create
an empty playlist first and this was something that a lot of users were confused by in Youtube music. One user even tried to use
the app’s health functionality to learn how to create playlists. During our interviews, one
of the designers mentioned that this is something he has
seen in other products before. He said, “People like to
think of containers as things. “You want to be able to
have an empty thing first “to put things in. “It’s how we think of stuff.” So hopefully, these examples
convince you that ZIPT can be useful for finding usability issues as well as performing comparative
evaluation between apps. So in terms of benefits, there are three big benefits of ZIPT that came
up during our discussions. First, ZIPT requires
significantly lower cost and effort to run comparative tests, compared to existing techniques. It is 100x cheaper than using something like usertesting.com. Also, since we collect structured data, you get aggregation for free. The implication of lower cost and effort is that now it’s possible to do comparative testing at scale. The second unique benefit of ZIPT is that it collects both quantitative
and qualitative data. The quantitative data
helps understand what works or does not work. The qualitative data helps understand why. The final benefit that I’ll
talk about is that ZIPT allows learning from
the designs of others. Perhaps the designers at YouTube Music knew that users like creating
empty playlists first in their music apps. Now with ZIPT, everybody else can learn that by using their app. Used in this way, ZIPT encourages testing earlier in the design cycle. Going forward, we hope to create a unified design exploration platform for mobile apps that
combines design search with a ZIPT-like system to run
targeted, inexpensive tests. Imagine what will be possible if we are able to democratize
the design knowledge that remains embedded in individual teams today and make it available for
everyone to learn from. If you’re interested in hearing more about the design search part of this, I’ll talk about it
again in my second talk, which is at 3:30. Also, if you’d like to use
ZIPT for your own work, we have a signup form that you can find at interactionmining.org/zipt. Thank you. (audience claps) – [Female Questioner] I have a question. Let’s see. I know that there’s
multiple ways that you can get somewhere, and your system seems to afford having multiple designer-suggested ways of getting to a goal. – Right. – [Female Questioner] Did you see any time where a designer saw people coming up with another way of
generating new things that they like better? – So actually in our studies
we did not let the designers create the task themselves. So we did the case studies ourselves, and so we found some of those instances, for example, the Spotify
and YouTube Music case, we did not realize that Spotify had two ways of doing this. And then,
when people were confused on YouTube Music and were
trying to do it that way, we realized that Spotify has two ways, and half of the people
are doing it this way and others are doing it that way. So that’s an example
of us discovering that the app supports doing
this task in multiple ways. – [Gonzalo Ramos] Gonzalo
Ramos, Microsoft Research. This is amazing. I’ve been involved in creating apps, and I know how helpful this can be. I’m super-excited about a few directions, you mentioned something
about leveraging the designs for future creation. Do you see what you did applicable to flows that include,
not an already-existing app, but I have a Sketch file or an Adobe Experience
Design project that already has a little bit. It’s like an app. It’s not really an app, but it is an app. – Right – [Gonzalo Ramos] Do you
see that this can be applied to that sort of material, to do analysis and a very tight design loop? – That’s an interesting question, because if I go back to the slide where I showed the design process. So we build a tool in a way that– We are hitting the ideation stage. We are hitting the
implementation and release stage, because we are working with apps here. And what you are talking
about, I think, is the prototyping stage: you have these little prototypes and you want to be able to use something like this there. Right now, we have built the tool for
Android, so not right now. But I think the overall approach that we take for capturing data– and we capture three kinds of data: we take screenshots, we capture Android view hierarchies, which is like a structured
representation of the UI, and we capture all the user interactions. So for some of these prototyping tools, if you’re able to capture
those three kinds of data, then you can use a similar way as us to kind of aggregate the data and show it in this manner to
designers and run tests on them. – [Ariel Weingarten] Ariel
Weingarten, Design Lab at UCSD. Love this, and I was wondering if you checked into– I think it seems all above board– the legality of using
other people’s websites and paying people to use those websites to learn things about their designs. Did you run into anything
interesting there or do you expect that someone
may tap you on the shoulder and just be like hey can you at least funnel back all the
learnings to us please? – Right, that’s an interesting question. We haven’t run into that yet. I know that in practice,
this is something that people do at a smaller scale. Competitive testing is
something people do, but you do it in person. You would download the apps to a phone. You would bring users to the lab, and you would do it at that scale. We are just making it easier
to do it at a larger scale. So in practice that happens. I don’t know the legal implications of studying other people’s apps. I don’t think I’m an expert on that, and I don’t think I
should comment on that. So not really, we haven’t run into that, but we’ll see what the future brings. – [Ariel Weingarten] Cool, thanks.
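The aggregate metrics described in the talk (self-reported completion rate, task time, and interaction count per trace) can be sketched in a few lines of Python; the trace structure and field names below are hypothetical illustrations, not ZIPT’s actual data format.

```python
# Hypothetical sketch of the aggregation ZIPT performs over crowd worker
# traces. Field names and values are illustrative, not ZIPT's real schema.
from statistics import median

# Each trace: one crowd worker's attempt at the task, with the number of
# interactions performed, total time in seconds, and self-reported completion.
traces = [
    {"worker": "w1", "interactions": 5,  "seconds": 15,  "completed": True},
    {"worker": "w2", "interactions": 7,  "seconds": 42,  "completed": True},
    {"worker": "w3", "interactions": 25, "seconds": 130, "completed": False},
]

# Completion rate across all workers (True counts as 1).
completion_rate = sum(t["completed"] for t in traces) / len(traces)

# Median task time, robust to the slow outliers the talk highlights.
median_time = median(t["seconds"] for t in traces)

# Outliers a designer would inspect: workers who failed the task.
outliers = [t["worker"] for t in traces if not t["completed"]]

print(round(completion_rate, 2), median_time, outliers)
# → 0.67 42 ['w3']
```

Flagging failed or unusually slow traces this way mirrors the outlier-inspection workflow shown in the demo, where the designer clicks through to the red (failed) user’s individual trace.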
