Describing UI Screenshots in Natural Language

Caption examples for image ID 62242
XUI Caption:
A list app having a large image element placed at the top part of the screen.
XUI Simple:
A list app with a large image element placed at the top area. You can see a list of elements, typically arranged in rows.
XUI Detailed:
The screenshot is a list screen. It shows a list of elements, typically arranged in rows. You may notice a large image located at the top area.
Microsoft Azure:
A screenshot of a cell phone.
Human:
List of elements about the song you're hearing.

Caption examples for image ID 23622
XUI Caption:
A mediaplayer screen having a large background image element located at the center part of the screen.
XUI Simple:
The interface looks like a mediaplayer app with a large background image component located at the center area of the screen. You can see a music or video playback functionality.
XUI Detailed:
That app must be a mediaplayer screen. It can be noticed a music or video playback functionality. You are likely to see a large background image placed at the center area of the screen.
Microsoft Azure:
Mike Zito talking on a cell phone screen with text.
Human:
It's the screenshot of a mediaplayer where you can see the singer and the image of the album.

Abstract

Being able to describe any user interface (UI) screenshot in natural language can promote understanding of the main purpose of the UI, yet currently it cannot be accomplished with state-of-the-art captioning systems. We introduce XUI, a novel method inspired by the global precedence effect to create informative descriptions of UIs, starting with an overview and then providing fine-grained descriptions about the most salient elements. XUI builds upon computational models for topic classification, visual saliency prediction, and natural language generation. XUI provides descriptions with up to three different granularity levels that, together, describe what is in the interface and what the user can do with it. We found that XUI descriptions are highly readable, are perceived to accurately describe the UI, and score similarly to human-generated UI descriptions. XUI is available as open source software.
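
To make the idea of granularity levels concrete, the following is a minimal sketch of template-based description generation. It is not the XUI implementation: it assumes the UI topic and the most salient element have already been predicted by upstream models, and the names `SalientElement`, `TOPIC_FUNCTIONALITY`, and `describe` are hypothetical. The templates and functionality phrases are modeled on the caption examples shown above.

```python
from dataclasses import dataclass

# Hypothetical lookup table: what the user can do on a screen of a given topic.
# The phrases are taken from the caption examples on this page.
TOPIC_FUNCTIONALITY = {
    "list": "a list of elements, typically arranged in rows",
    "mediaplayer": "a music or video playback functionality",
}


@dataclass
class SalientElement:
    """Most salient UI element, assumed to come from a saliency model."""
    kind: str      # e.g. "image"
    size: str      # e.g. "large"
    position: str  # e.g. "top"


def describe(topic: str, element: SalientElement, level: str = "caption") -> str:
    """Return a description at one of three levels: caption, simple, detailed."""
    where = (f"a {element.size} {element.kind} element "
             f"placed at the {element.position} part of the screen")
    overview = f"A {topic} app having {where}."

    if level == "caption":
        # Overview only: topic plus the most salient element.
        return overview

    if topic not in TOPIC_FUNCTIONALITY:
        raise KeyError(f"no functionality phrase for topic: {topic}")
    what = TOPIC_FUNCTIONALITY[topic]

    if level == "simple":
        # Overview followed by what the user can do.
        return f"{overview} You can see {what}."
    if level == "detailed":
        # Topic first, then functionality, then the salient element.
        return (f"The screenshot is a {topic} screen. It shows {what}. "
                f"You may notice {where}.")
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    element = SalientElement(kind="image", size="large", position="top")
    for level in ("caption", "simple", "detailed"):
        print(describe("list", element, level))
```

Each level starts from the overview and adds detail, which mirrors the overview-first ordering described in the abstract. Fixed templates are used here only for brevity; a real system would likely vary the wording.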

Citation


@Article{Leiva22_xui,
  author  = {Luis A. Leiva and Asutosh Hota and Antti Oulasvirta},
  title   = {Describing UI Screenshots in Natural Language},
  journal = {ACM Transactions on Intelligent Systems and Technology},
  volume  = {14},
  number  = {1},
  year    = {2022},
}
    

Disclaimer

Our software is free for scientific use (licensed under the MIT license). The software must not be distributed without prior permission of the authors. Please contact us if you are planning to use the software for commercial purposes. The authors are not responsible for any implication derived from the use of this software.