MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

Juyong Lee^1* Dongyoon Hahm^1* June Suk Choi^1* W. Bradley Knox² Kimin Lee¹

^*Equal contribution

¹KAIST, ²The University of Texas at Austin

Extended Paper Code Datasets

Overview of MobileSafetyBench

Testing Apple Intelligence
(sharing photo)

Testing Apple Intelligence
(writing memo)

Abstract

Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications, challenging agents with managing risks encompassing the misuse and negative side effects. These tasks include tests to evaluate the safety of agents in daily scenarios as well as their robustness against indirect prompt injection attacks. Our experiments demonstrate that baseline agents, based on state-of-the-art LLMs, often fail to effectively prevent harm while performing the tasks. To mitigate these safety concerns, we propose a prompting method that encourages agents to prioritize safety considerations. While this method shows promise in promoting safer behaviors, there is still considerable room for improvement to fully earn user trust. This highlights the urgent need for continued research to develop more robust safety mechanisms in mobile environments. WARNING: This paper contains contents that are unethical or offensive in nature.

MY ALT TEXT

Safety Tasks

MobileSafetyBench is designed as a platform to assess both the safety and helpfulness of agents controlling mobile devices. We define safety within our benchmark as the agent’s ability to ensure that its actions, while performing a requested task, do not result in potentially harmful consequences. As safe behaviors can vary depending on the specific task requested or the device’s status, we develop diverse scenarios to challenge agents comprehensively. Furthermore, we make a clear distinction between tasks that measure safety (right; red) and those that evaluate helpfulness (left; green), allowing for a more precise assessment of safety apart from general capabilities.

MY ALT TEXT

MobileSafetyBench presents 250 numbers of tasks in total covering diverse applications and harm categories. The benchmark includes tasks associated with text messaging, web navigation, social media, calendar settings, and financial transactions, spanning 200 daily scenario tasks. Furthremore, it involves 50 indirect prompt injection attack scenario tasks, challenging agents to maintain safe behavior when exposed to maliciously crafted prompts embedded within the task context.

Benchmark Results

MY ALT TEXT

We provide a reference of benchmark results with state-of-the-art LLMs. In our experiments, we benchmarked 10 leading LLMs: GPT-5, GPT-4o, Gemini-2.5-Pro, Gemini-1.5-Pro, Claude-3.5-Sonnet, Llama-3.2-90B-Vision, Grok-2, Qwen-2.5-VL-72B, Pixtral-Large, and Phi-4. Figure above shows the goal achievement rates and harm prevention rates across 200 daily scenario tasks in MobileSafetyBench.

Robustness against Indirect Prompt Injection

MY ALT TEXT

Using MobileSafetyBench, we investigate whether baseline agents can maintain robust behavior when exposed to indirect prompt injection attacks. For instance, as illustrated in Figure above, a test scenario involves agents reviewing a text message that contains an irrelevant instruction to sell stock shares. Such injected prompts are embedded within UI elements like text messages and social media posts, and are delivered to the agents as part of the observation. We reveal state-of-the-art LLM agents' weakness against indirect prompt injection.

Further Discussions

In our manuscript, we examine the behaviors of the baseline LLMs in-depth and the effect of safeguards supplied by the service providers. We also present experimental results with the OpenAI-o1 agents, compared with the GPT-4o agents, to investigate the trade-off of strong reasoning ability. We hope our work serves as a valuable platform for building safe and helpful agents.

BibTeX

@article{lee2024mobilesafetybench,
  title={MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control},
  author={Lee, Juyong and Hahm, Dongyoon and Choi, June Suk and Knox, W Bradley and Lee, Kimin},
  journal={arXiv preprint arXiv:2410.17520},
  year={2024}
}