Author Archives: David Hsu

Prediction for Planning refers to the use of predictive models to anticipate future states of a system, enabling more informed decision-making in robotic planning. In domains like autonomous driving, effective planning depends not just on predicting how the world will evolve but also on understanding how the planner’s own actions influence that evolution.

Accurate state prediction is key to effective robot planning, yet conventional models assume fixed environmental dynamics. In interactive settings, robot actions influence environmental responses, creating a dynamics gap that disrupts planning accuracy. This challenge is especially evident in autonomous driving, where trajectory prediction underpins safe and efficient navigation.

Our research tackles this issue through two works: What Truly Matters in Trajectory Prediction for Autonomous Driving? and State Prediction for Planning: Closing the Interactive Dynamics Gap. These works reveal that prediction models, when used in planning, affect agent behavior in ways conventional evaluation methods miss. The first work highlights how the dynamics gap causes discrepancies between predictor accuracy on fixed datasets and real-world performance, emphasizing the need for task-driven evaluation. The second formalizes the compounding effects of prediction and planning errors and introduces a planner-specific learning objective to mitigate this gap, enabling safer, more robust decision-making.

What Truly Matters:

P. Tran, H. Wu, C. Yu, P. Cai, S. Zheng, and D. Hsu. What truly matters in trajectory prediction for autonomous driving? In Advances in Neural Information Processing Systems, 2023.
PDF

Prediction accuracy vs. driving performance. Our findings show a surprising lack of correlation between conventional prediction metrics (Black Curve) and real-world driving performance (Red Curve). Eight models are evaluated: CV, CA, KNN, S-KNN, HiVT, LaneGCN, LSTM, and S-LSTM.

In autonomous driving systems, trajectory prediction plays a vital role in ensuring safety and facilitating smooth navigation. However, we observe a substantial disparity between the accuracy of predictors on fixed datasets and their driving performance when used in downstream tasks. In this work, we reveal the overlooked significance of the dynamics gap, which plays a dominant role in this disparity. In real-world scenarios, prediction algorithms influence the behavior of autonomous vehicles, which in turn alters the behavior of other agents on the road. This interaction results in predictor-specific dynamics that directly impact prediction results. Because other agents’ responses are predetermined in a dataset, a significant dynamics gap arises between evaluations conducted on fixed datasets and actual driving scenarios. Furthermore, we explore the influence of factors beyond prediction accuracy on the remaining disparity between prediction performance and driving performance. The findings highlight the significance of predictors’ computational efficiency in real-time tasks and its trade-off with prediction accuracy in determining driving performance. In summary, we demonstrate that an interactive, task-driven evaluation protocol for trajectory prediction is crucial to reflect its efficacy for autonomous driving.
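The dynamics gap can be illustrated with a deliberately tiny toy example (the models and numbers below are invented for illustration, not taken from the paper): a predictor that looks perfect on a fixed log can fail badly once the ego vehicle's actions change how other agents behave.

```python
# Toy illustration of the dynamics gap. A logged dataset records another agent
# cruising at constant speed, so a constant-velocity predictor scores zero error
# on the dataset. In closed-loop interaction, however, the agent brakes when the
# ego vehicle cuts in, and the same predictor's error balloons.

def predict_constant_velocity(pos, vel, horizon):
    """Open-loop predictor fit to the fixed dataset: assumes velocity never changes."""
    return [pos + vel * t for t in range(1, horizon + 1)]

def rollout_interactive(pos, vel, ego_gap, horizon):
    """Closed-loop ground truth: the agent slows down when the ego is close ahead."""
    traj = []
    for _ in range(horizon):
        if ego_gap < 10.0:            # ego cut in -> agent brakes
            vel = max(vel - 2.0, 0.0)
        pos += vel
        traj.append(pos)
    return traj

open_loop = predict_constant_velocity(0.0, 5.0, horizon=4)
closed_loop = rollout_interactive(0.0, 5.0, ego_gap=6.0, horizon=4)

dataset_error = 0.0   # on the fixed log, constant velocity is exactly right
deployed_error = sum(abs(p - q) for p, q in zip(open_loop, closed_loop)) / 4
```

The predictor's dataset score says nothing about its closed-loop error, which is the gap a fixed-dataset evaluation protocol cannot see.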

Closing the Interactive Dynamics Gap:

H. Wu, C. Yu, Y. Xu, D. Hsu, and S. Zheng. State prediction for planning: Closing the interactive dynamics gap.

The prediction model trained with our proposed approach produces driving behavior that is more similar to that of a human expert (Right).

More broadly, effective robot planning requires state prediction models that account for interactive dynamics. Conventional prediction models assume fixed environment dynamics. However, in an interactive setting, robot actions may alter environment dynamics; the future system states no longer follow a fixed distribution, leading to a dynamics gap that violates the i.i.d. assumption of state prediction. In this work, we first analyze how prediction and planning errors interleave over time and establish theoretical bounds on their compounding effect. Next, we introduce a planner-specific objective for learning the prediction model. We validate our approach through both simulated and real-world autonomous driving experiments, demonstrating that it significantly improves prediction accuracy, enhances decision robustness, and leads to safer, smoother planning outcomes. One key insight of this work is that state prediction is planner-specific in interactive robotic systems.
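As a rough sketch of what "planner-specific" means here (a hypothetical stand-in, not the paper's actual formulation), one can weight prediction errors by how strongly the downstream planner's decision changes with the predicted state, so that errors near the planner's decision boundaries dominate the training signal:

```python
# Schematic planner-weighted loss (invented for illustration): identical
# prediction errors are penalised differently depending on whether they
# would actually change the planner's action.

def planner_action(state):
    """Toy planner: brake hard near an obstacle, cruise otherwise."""
    return -1.0 if state < 5.0 else 1.0

def planner_sensitivity(state, eps=0.5):
    """Finite-difference proxy for how much the plan changes with the state."""
    return abs(planner_action(state + eps) - planner_action(state - eps)) / (2 * eps)

def planner_weighted_loss(pred, true):
    return sum(planner_sensitivity(t) * (p - t) ** 2 for p, t in zip(pred, true))

# The same 0.8-unit prediction error costs more near the planner's
# decision boundary (state around 5) than far away from it.
near = planner_weighted_loss([5.4], [4.6])    # high sensitivity -> nonzero loss
far = planner_weighted_loss([20.4], [19.6])   # planner indifferent -> zero loss
```

A plain MSE objective would treat both errors identically; a planner-specific objective concentrates model capacity where the plan is actually at stake.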

Imagine a future where distance no longer constrains our ability to manage household tasks. Picture a robot assistant capable of remotely interpreting spoken commands and gestures to check your refrigerator or reheat a meal before you get home. Such a robotic system would fundamentally change the way we interact with our homes, bringing a new level of convenience and efficiency to daily life. In this project, we introduce Robi Butler, a multimodal interaction system that enables seamless communication between remote users and household robots to execute various household tasks. 

Robi Butler:

A. Xiao, N. Janaka, T. Hu, A. Gupta, K. Li, C. Yu, and D. Hsu. Robi Butler: Remote multimodal interactions with household robot assistant. In IEEE Int. Conf. on Robotics & Automation, 2025. 
PDF | Homepage

Robi Butler allows the human user to monitor its environment from a first-person view, issue voice or text commands, and specify target objects through hand-pointing gestures. At its core, a high-level behavior module, powered by Large Language Models (LLMs), interprets multimodal instructions to generate multi-step action plans. Each plan consists of open-vocabulary primitives supported by vision-language models, enabling the robot to process both textual and gestural inputs. Zoom provides a convenient interface for remote interaction between the human user and the robot. The integration of these components allows Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We evaluated the system on various household tasks, demonstrating its ability to execute complex user commands with multimodal inputs. We also conducted a user study to examine how multimodal interaction influences user experiences in remote human-robot interaction. These results suggest that with the advances in robot foundation models, we are moving closer to the reality of remote household robot assistants.
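The two-level structure described above can be sketched in a few lines (a hypothetical stand-in, not Robi Butler's actual API: the primitive names, plan format, and keyword matching are invented, and the real system uses an LLM rather than hand-written rules):

```python
# Toy version of the high-level behavior module: it resolves deictic references
# ("this", "that") using the hand-pointing gesture, then emits a multi-step plan
# of open-vocabulary primitives for the low-level skills to execute.

def interpret(instruction, pointed_object=None):
    """Stand-in for the LLM behavior module; the real system prompts an LLM."""
    deictic = any(w in instruction for w in ("this", "that"))
    target = pointed_object if deictic else None
    if "check" in instruction and "fridge" in instruction:
        return [("navigate", "kitchen"), ("open", "fridge"),
                ("describe", "fridge contents")]
    if "reheat" in instruction and target:
        return [("pick", target), ("place", "microwave"), ("press", "start")]
    return [("ask_user", "Could you rephrase that?")]

# A gesture grounds "this" to a concrete object the camera sees.
plan = interpret("please reheat this before I get home",
                 pointed_object="leftover pasta")
# plan -> [("pick", "leftover pasta"), ("place", "microwave"), ("press", "start")]
```

The point of the intermediate plan is that each primitive is open-vocabulary: the low-level skills ground names like "leftover pasta" with vision-language models rather than a fixed object catalogue.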

Imagine this: you walk into your kitchen and say to your home assistant, “Set up the table for breakfast!” Without any further instructions, your robot lays out the plates, utensils, and food items in ways that meet dining conventions, creating an aesthetically pleasing setting—just as you had in mind.

Instruction: “Set up a table for breakfast.”

For these robots to be truly helpful, they need to understand and act upon natural human instructions without exhaustive programming. This capability would make robots accessible to everyone, not just those with technical expertise, and would greatly enhance their usefulness in everyday life. Our research aims to bridge this gap by developing a system that allows robots to interpret and execute under-specified, natural language instructions for functional object arrangement (FORM).

Set It Up:

Y. Xu, J. Mao, Y. Du, T. Lozano-Perez, L.P. Kaelbling, and D. Hsu. “Set it up!”: Functional object arrangement with compositional generative models. In Proc. Robotics: Science & Systems, 2024.
PDF | Website

Given the underspecified instruction, ‘Set up a table for my breakfast, please!’, SetItUp interprets this concept and proposes the object layout for the breakfast setup.

We present “Set It Up”, a neuro-symbolic framework that learns to specify the goal poses of objects from a few training examples and a structured natural-language task specification. Set It Up uses a grounding graph, composed of abstract spatial relations among objects (e.g., left-of), as its intermediate representation. This decomposes the FORM problem into two stages: (i) predicting this graph among objects and (ii) predicting object poses given the grounding graph. For (i), Set It Up leverages large language models (LLMs) to induce Python programs from a few training examples and a task specification. This program can be executed to generate grounding graphs in novel scenarios. For (ii), Set It Up pre-trains a collection of diffusion models to capture primitive spatial relations and composes them online, according to the grounding graph, to predict object poses.
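The two-stage decomposition can be sketched as follows (a toy illustration: the relation names mirror the paper's examples, but the hand-written program and the geometric pose rules are invented stand-ins for the LLM-induced program and the composed diffusion models):

```python
# Stage (i) stand-in: a program that emits the grounding graph, i.e. a set of
# abstract spatial relations among the objects present on the table.
def breakfast_grounding_program(objects):
    relations = [("centered", "plate")]
    if "fork" in objects:
        relations.append(("left-of", "fork", "plate"))
    if "knife" in objects:
        relations.append(("right-of", "knife", "plate"))
    if "mug" in objects:
        relations.append(("above", "mug", "plate"))
    return relations

# Stage (ii) stand-in: map each relation to a concrete (x, y) pose. The paper
# instead composes pre-trained diffusion models, one per primitive relation.
def solve_poses(relations):
    poses = {}
    for rel in relations:
        if rel[0] == "centered":
            poses[rel[1]] = (0.0, 0.0)
        elif rel[0] == "left-of":
            poses[rel[1]] = (poses[rel[2]][0] - 0.2, poses[rel[2]][1])
        elif rel[0] == "right-of":
            poses[rel[1]] = (poses[rel[2]][0] + 0.2, poses[rel[2]][1])
        elif rel[0] == "above":
            poses[rel[1]] = (poses[rel[2]][0], poses[rel[2]][1] + 0.25)
    return poses

graph = breakfast_grounding_program({"plate", "fork", "mug"})
poses = solve_poses(graph)   # e.g. the fork lands left of the plate
```

Because the program only emits abstract relations, the same grounding graph generalises to novel scenes; the pose solver (diffusion models, in the actual system) handles the continuous geometry.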

People have long imagined intelligent home service robots effortlessly handling daily chores like folding T-shirts or neatly hanging skirts. Yet creating a general-purpose robot capable of manipulating diverse clothes remains a significant challenge. Clothes, inherently deformable, pose difficulties due to their complex and changing shapes. Although recent advances have improved robots’ abilities to fold and flatten clothes, these skills often only apply to specific clothes or tasks. Our research addresses this issue through a novel representation of clothes—semantic keypoints, which capture the essential features of clothes through a sparse set of keypoints with semantic meaning. Integrating semantic keypoints with foundation models, we develop a general-purpose clothes manipulation method, which can be applied to a wide variety of clothes categories and manipulation tasks.

CLASP:

Y. Deng, and D. Hsu. General-purpose clothes manipulation with semantic keypoints. In IEEE Int. Conf. on Robotics & Automation, 2025. 
PDF

Clothes manipulation is a critical capability for home service robots; yet, existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables the robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints—e.g., “right shoulder”, “left sleeve”, etc.—a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothes types in both seen and unseen tasks. Further, experiments with a dual-arm system on four distinct tasks—folding, flattening, hanging, and placing—confirm CLASP’s performance on a real robot.
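A minimal sketch of the two-level hierarchy (the keypoint names follow the paper's examples, but the coordinates, the plan format, and the folding plan itself are invented for illustration):

```python
# Perception stand-in: semantic keypoints extracted from a depth image,
# mapped to 2D positions on the work surface (units are arbitrary here).
keypoints = {
    "left shoulder": (0.10, 0.80),
    "right shoulder": (0.50, 0.80),
    "left hem": (0.10, 0.20),
    "right hem": (0.50, 0.20),
}

# High-level plan a task planner might emit for "fold the T-shirt in half":
# each step names semantic keypoints, never raw coordinates, which is what
# lets an LLM reason about the task symbolically.
plan = [("pick-and-place", "left hem", "left shoulder"),
        ("pick-and-place", "right hem", "right shoulder")]

def execute(plan, keypoints):
    """Low-level stand-in: ground each keypoint name into coordinates for the arm."""
    motions = []
    for action, src, dst in plan:
        motions.append((action, keypoints[src], keypoints[dst]))
    return motions

motions = execute(plan, keypoints)
# first motion: ("pick-and-place", (0.10, 0.20), (0.10, 0.80))
```

The keypoint layer is the bridge: the planner manipulates names with semantic meaning, while the executor only ever sees grounded coordinates.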

Navigation in the open world is an elusive grail of robotics, owing to the sheer diversity of environments, agents and scenarios a robot can encounter. What does a robot need to navigate zero-shot to a goal, when deployed anywhere in the world?

We aim to push robot navigation into the open world by making advances in three key directions. When designing the robot “in the factory”, we focus on improving its ability to generalise to novel scenarios and tasks in the real world, particularly by leveraging progress in foundation models. Yet training data is bounded, and robots will inevitably encounter challenging out-of-distribution scenarios “in the wild”. Managing such scenarios is critical to ensuring the robustness of open-world navigation. To do so, we can exploit priors available to the robot before deployment – e.g., scene-specific priors like floor-plans or language directions – to guide navigation. If the robot still ultimately lands in a failure state, it needs the ability to identify, analyse, and take reasonable actions to handle the failure.

Open Scene Graphs:

J. Loo, Z. Wu, D. Hsu, Open Scene Graphs for Open World Object-goal Navigation
PDF | Video | Website

How can we build robots for open-world semantic navigation tasks, like searching for target objects in novel scenes? While foundation models have the rich knowledge and generalisation needed for these tasks, a suitable scene representation is needed to connect them into a complete robot system. We address this with Open Scene Graphs (OSGs), a topo-semantic representation that stores and organises open-set scene information for these models. OSGs are a generalisation of existing scene graphs to handle diverse indoor environments, using customisable OSG schemas to enable flexible structure and semantics across environments ranging from homes to supermarkets to offices. We integrate foundation models and OSGs into the OSG Navigator system for Open World Object-Goal Navigation, which is capable of searching for open-set objects specified in natural language, while generalising zero-shot across diverse environments and embodiments. Our OSGs enhance reasoning with Large Language Models (LLMs), enabling robust object-goal navigation outperforming existing LLM approaches. Through simulation and real-world experiments, we validate OSG Navigator’s generalisation across varied environments, robots and novel instructions.
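The schema idea can be sketched with a toy data structure (the schema fields, node types, and query method below are invented; the actual OSG format is richer than this):

```python
# Toy OSG: a topo-semantic graph whose structure is parameterised by a schema,
# so the same class covers homes ("rooms joined by doorways") and supermarkets
# ("aisles joined by walkways") without changing the navigation code.

HOME_SCHEMA = {"place_types": ["room"], "connector": "doorway"}
SUPERMARKET_SCHEMA = {"place_types": ["aisle", "section"], "connector": "walkway"}

class OpenSceneGraph:
    """Places as nodes, open-set object labels attached, topology as edges."""
    def __init__(self, schema):
        self.schema = schema
        self.places = {}       # place name -> set of open-vocabulary object labels
        self.edges = set()     # undirected connections between places

    def add_place(self, name, objects=()):
        self.places[name] = set(objects)

    def connect(self, a, b):
        self.edges.add(frozenset((a, b)))

    def where_is(self, obj):
        """Open-vocabulary lookup an LLM could use to pick a search frontier."""
        return [p for p, objs in self.places.items() if obj in objs]

osg = OpenSceneGraph(HOME_SCHEMA)
osg.add_place("kitchen", {"fridge", "kettle"})
osg.add_place("hallway")
osg.connect("kitchen", "hallway")
found = osg.where_is("kettle")   # -> ["kitchen"]
```

Serialising such a graph into text is what lets an LLM reason over the scene ("the kettle was last seen in the kitchen, which connects to the hallway") without any environment-specific retraining.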

Scene Action Maps:

J. Loo and D. Hsu. Scene action maps: Behavioural maps for navigation without metric information. In Proc. IEEE Int. Conf. on Robotics & Automation, 2024.
PDF | Video | Website

Humans are remarkable in their ability to navigate without metric information. We can read abstract 2D maps, such as floor-plans or hand-drawn sketches, and use them to navigate in unseen rich 3D environments, without requiring prior traversals to map out these scenes in detail. We posit that this is enabled by the ability to represent the environment abstractly as interconnected navigational behaviours, e.g., “follow the corridor” or “turn right”, while avoiding detailed, accurate spatial information at the metric level. We introduce the Scene Action Map (SAM), a behavioural topological graph, and propose a learnable map-reading method, which parses a variety of 2D maps into SAMs. Map-reading extracts salient information about navigational behaviours from the overlooked wealth of pre-existing, abstract and inaccurate maps, ranging from floor-plans to sketches. We evaluate the performance of SAMs for navigation by building and deploying a behavioural navigation stack on a quadrupedal robot.
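A behavioural topological graph is easy to sketch concretely (the node names, behaviour labels, and planner below are invented; in the actual system the graph is parsed from 2D maps by a learned map-reader):

```python
# Toy Scene Action Map: edges are navigation behaviours, not metric paths,
# so planning returns a sequence of behaviours to execute rather than poses.
from collections import deque

# (from_node, behaviour, to_node)
sam_edges = [
    ("lobby", "follow-corridor", "junction"),
    ("junction", "turn-right", "lab"),
    ("junction", "turn-left", "office"),
]

def plan_behaviours(edges, start, goal):
    """Breadth-first search over the SAM: shortest behaviour sequence to the goal."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        node, behaviours = frontier.popleft()
        if node == goal:
            return behaviours
        for src, beh, dst in edges:
            if src == node and dst not in visited:
                visited.add(dst)
                frontier.append((dst, behaviours + [beh]))
    return None

plan = plan_behaviours(sam_edges, "lobby", "lab")
# plan -> ["follow-corridor", "turn-right"]
```

No node carries a metric pose: executing the plan only requires low-level controllers for each behaviour, which is why inaccurate floor-plans and sketches suffice as map sources.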