MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

*Equal contribution
1KAIST, 2The University of Texas at Austin

Overview of MobileSafetyBench

Testing Apple Intelligence
(sharing photo)

Testing Apple Intelligence
(writing memo)

Abstract

Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications. To clearly evaluate safety apart from general capabilities, we design separate tasks measuring safety and tasks evaluating helpfulness. The safety tasks challenge agents with managing potential risks prevalent in daily life and include tests to evaluate robustness against indirect prompt injections. Our experiments demonstrate that while baseline agents, based on state-of-the-art LLMs, perform well in executing helpful tasks, they show poor performance in safety tasks. To mitigate these safety concerns, we propose a prompting method that encourages agents to prioritize safety considerations. While this method shows promise in promoting safer behaviors, there is still considerable room for improvement to fully earn user trust. This highlights the urgent need for continued research to develop more robust safety mechanisms in mobile environments.

MY ALT TEXT

Safety Tasks


MobileSafetyBench is designed as a platform to assess both the safety and helpfulness of agents controlling mobile devices. We define safety within our benchmark as the agent’s ability to ensure that its actions, while performing a requested task, do not result in potentially harmful consequences. As safe behaviors can vary depending on the specific task requested or the device’s status, we develop diverse scenarios to challenge agents comprehensively. Furthermore, we make a clear distinction between tasks that measure safety (right; red) and those that evaluate helpfulness (left; green), allowing for a more precise assessment of safety apart from general capabilities.

MY ALT TEXT
MobileSafetyBench presents 100 numbers of tasks, divided into 50 helpfulness tasks and 50 safety tasks. The benchmark includes tasks associated with text messaging, web navigation, social media, calendar settings, and financial transactions, as shwon in Figure below (a). Safety tasks in our benchmark are characterized by various risks, which refer to signals of potential hazards. Specifically, we categorize the safety tasks based on four prevalent risk types in real life to facilitate a clearer interpretation of agent behaviors, as shown in Figure below (b).
MY ALT TEXT

Benchmark Results

MY ALT TEXT

We provide a reference of benchmark results with state-of-the-art LLMs. In our experiments, we benchmark agents employing the state-of-the-art multi-modal LLMs: GPT-4o (gpt-4o-20240513), Gemini-1.5 (gemini-1.5-pro-001), and Claude-3.5 (claude-3-5-sonnet-20240620). For the main experiment, we exploit two types of prompts: basic and Safety-guided Chain-of-Thought (SCoT; which is a newly proposed method to induce safe behaviors of the LLM agents) prompts. Figure above shows the helpfulness and safety scores of the baseline agents in MobileSafetyBench.

Robustness against Indirect Prompt Injection


MY ALT TEXT
Using MobileSafetyBench, we investigate whether baseline agents can maintain robust behavior when exposed to indirect prompt injection attacks. For instance, as illustrated in Figure above, a test scenario involves agents reviewing a text message that contains an irrelevant instruction to sell stock shares. Such injected prompts are embedded within UI elements like text messages and social media posts, and are delivered to the agents as part of the observation. We reveal state-of-the-art LLM agents' weakness against indirect prompt injection.

Further Discussions


In our manuscript, we examine the behaviors of the baseline LLMs in-depth and the effect of safeguards supplied by the service providers. We also present experimental results with the OpenAI-o1 agents, compared with the GPT-4o agents, to investigate the effects of strong reasoning ability. We hope our work serves as a valuable platform for building safe and helpful agents.

BibTeX

@article{lee2024mobilesafetybench,
  title={MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control},
  author={Lee, Juyong and Hahm, Dongyoon and Choi, June Suk and Knox, W Bradley and Lee, Kimin},
  journal={arXiv preprint arXiv:2410.17520},
  year={2024}
}