Category Archives: Research

This research examines particle-based object manipulation. We start with Particle-based Object Manipulation (PROMPT), designed for the manipulation of unknown rigid objects. PROMPT leverages a particle-based object representation to bridge visual perception and robot control. We extend this particle representation to deformable objects and introduce DaXBench, a differentiable simulation framework to benchmark deformable object manipulation (DOM) techniques. DaXBench enables deeper exploration and systematic comparison of DOM techniques. Finally, we present Differentiable Particles (DiPac), a new DOM algorithm built on the particle representation. The differentiable particle dynamics aids in the efficient estimation of dynamics parameters and in action optimization. Together, these new tools address the challenge of robot manipulation of unknown deformable objects, paving the way for new progress in the field.

Prompt

S. Chen, X. Ma, Y. Lu, D. Hsu. Ab initio particle-based object manipulation. Proceedings of Robotics: Science and Systems, RSS 2021.
PDF | Code

We present Particle-based Object Manipulation (Prompt), a new approach to robot manipulation of novel objects ab initio, without prior object models or pre-training on a large object data set. The key element of Prompt is a particle-based object representation, in which each particle represents a point in the object, the local geometric, physical, and other features of the point, and also its relation with other particles. Like the model-based analytic approaches to manipulation, the particle representation enables the robot to reason about the object’s geometry and dynamics in order to choose suitable manipulation actions. Like the data-driven approaches, the particle representation is learned online in real-time from visual sensor input, specifically, multi-view RGB images. The particle representation thus connects visual perception with robot control. Prompt combines the benefits of both model-based reasoning and data-driven learning.
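To make the representation concrete, here is a minimal, hypothetical sketch of such a particle set in Python. The class names, feature dimensions, and the center-of-mass helper are illustrative assumptions, not Prompt's actual data structures.

```python
# A hypothetical sketch of a particle-based object representation: each
# particle stores a 3-D point, a local feature vector, and indices of
# related (neighboring) particles.
from dataclasses import dataclass, field
import numpy as np


@dataclass
class Particle:
    position: np.ndarray                            # (3,) point on/in the object
    features: np.ndarray                            # local geometric/physical features
    neighbors: list = field(default_factory=list)   # indices of related particles


@dataclass
class ParticleObject:
    particles: list    # list[Particle], reconstructed online from multi-view RGB

    def center_of_mass(self) -> np.ndarray:
        # Simple geometric reasoning enabled directly by the representation.
        return np.mean([p.position for p in self.particles], axis=0)


# Usage: two particles related to each other.
obj = ParticleObject([
    Particle(np.array([0.0, 0.0, 0.0]), np.zeros(8), [1]),
    Particle(np.array([0.1, 0.0, 0.0]), np.zeros(8), [0]),
])
print(obj.center_of_mass())
```

The point of the sketch is that geometric quantities needed for manipulation reasoning (here, a crude center of mass) fall out of the particle set directly, without a prior object model.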

DaXBench

S. Chen*, Y. Xu*, C. Yu*, L. Li, X. Ma, Z. Xu, D. Hsu. DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics. Proceedings of The Eleventh International Conference on Learning Representations, ICLR 2023 (Oral). (*Co-first author)

PDF | Code | Website

We extend the particle representation to deformable object manipulation with differentiable dynamics. Deformable object manipulation (DOM) is a long-standing challenge in robotics and has attracted significant interest recently. This paper presents DaXBench, a differentiable simulation framework for DOM. While existing work often focuses on a specific type of deformable object, DaXBench supports fluid, rope, cloth, and more; it provides a general-purpose benchmark to evaluate widely different DOM methods, including planning, imitation learning, and reinforcement learning. DaXBench hence serves as a working bench to facilitate research on deformable object manipulation using particles.

DiPac

S. Chen, Y. Xu, C. Yu, L. Li, X. Ma, D. Hsu. Differentiable Particles for Deformable Object Manipulation.

PDF | Code

Furthermore, we use differentiable particles to handle a wide variety of deformable objects. Manipulating such objects, including rope, cloth, or beans, is a significant challenge due to their extensive degrees of freedom and complex non-linear dynamics. This paper introduces Differentiable Particles (DiPac), a novel algorithm for deformable object manipulation. DiPac represents a deformable object as a collection of particles and employs a differentiable particle dynamics simulator to reason about robot manipulation. DiPac uses a single representation – particles – for a diverse array of objects: scattered beans, rope, T-shirts, and so forth. The differentiable dynamics enables DiPac to efficiently (i) estimate the dynamics parameters, effectively reducing the simulation-to-real gap, and (ii) select the best action by backpropagating the gradient along sampled trajectories.
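As a sketch of point (ii), the snippet below optimizes a short action sequence by backpropagating a goal loss through a differentiable particle rollout. The dynamics function is a toy stand-in, not DiPac's simulator; the particle count, step size, and loss are assumptions for illustration.

```python
# A minimal sketch of gradient-based action selection through differentiable
# particle dynamics. The toy dynamics (particles drift toward a pushed point)
# are an assumption for illustration only.
import torch

def toy_step(particles, action, stiffness):
    # particles: (N, 2); action: (2,) push target; stiffness: dynamics parameter.
    return particles + stiffness * (action - particles) * 0.1

def rollout_loss(particles, actions, stiffness, goal):
    for a in actions:
        particles = toy_step(particles, a, stiffness)
    return ((particles - goal) ** 2).mean()

particles = torch.rand(50, 2)
goal = torch.tensor([0.8, 0.8])
stiffness = torch.tensor(0.5)
actions = torch.zeros(5, 2, requires_grad=True)    # a short action sequence

opt = torch.optim.Adam([actions], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = rollout_loss(particles, actions, stiffness, goal)
    loss.backward()          # gradients flow through the particle dynamics
    opt.step()
print(float(loss))
```

The same backward pass, taken with respect to `stiffness` instead of the actions, would fit the dynamics parameter to observed trajectories, which is the idea behind point (i).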

Deep Visual Navigation under Partial Observability

How do humans navigate? We navigate almost exclusively by visual sensing, together with coarse floor plans. To reach a destination, we demonstrate a diverse set of skills, such as obstacle avoidance, and we use tools, such as buses and elevators, to traverse different locations. All of this is not yet possible for robots. To get closer to human-level navigation performance, we propose a controller that is capable of learning complex policies from data.


DECISION: Deep rEcurrent Controller for vISual navigatiON

B. Ai, W. Gao, Vinay, and D. Hsu. Deep Visual Navigation under Partial Observability. In International Conference on Robotics and Automation (ICRA), 2022. [paper][code][video]


The key lies in designing the structural components that allow the controller to learn the desired capabilities. Here we identify three challenges:

(i) Partial observability: The robot may not see blind-spot objects, or is unable to detect features of interest, e.g., the intention of a moving pedestrian.

(ii) Multimodal behaviors: Human navigation behaviors are multimodal in nature, and the behaviors are dependent on both local environments and the high-level navigation objective.

(iii) Visual complexity: Due to the high dimensionality of raw pixels, different scenes could appear dramatically different across environments, which makes traditional model-based approaches brittle.


To address these challenges, we propose two key structural designs:

(i) Multi-scale temporal modeling: We use spatial memory modules to capture both low-level motions and high-level temporal semantics that are useful to the control. The rich history information can compensate for the partial observations.

(ii) Multimodal actions: We extend the idea of Mixture Density Networks (MDNs) to temporal reasoning. Specifically, we use independent memory modules for different modes to preserve the distinction of modes; a minimal sketch of this idea is shown below.
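The sketch below illustrates a multimodal action head with an independent recurrent memory per mode. The layer sizes, number of modes, and the mode-weight head are illustrative assumptions, not the actual DECISION architecture.

```python
# A minimal sketch of multimodal actions with per-mode memories, in the spirit
# of extending MDNs to temporal reasoning.
import torch
import torch.nn as nn

class MultimodalActionHead(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, num_modes=3, action_dim=2):
        super().__init__()
        # One recurrent memory per mode keeps the modes from collapsing into one.
        self.memories = nn.ModuleList(
            [nn.GRUCell(feat_dim, hidden) for _ in range(num_modes)])
        self.action_heads = nn.ModuleList(
            [nn.Linear(hidden, action_dim) for _ in range(num_modes)])
        self.weight_head = nn.Linear(hidden * num_modes, num_modes)

    def forward(self, feat, hiddens):
        # feat: (B, feat_dim) image features; hiddens: one (B, hidden) state per mode.
        new_hiddens = [mem(feat, h) for mem, h in zip(self.memories, hiddens)]
        actions = torch.stack(
            [head(h) for head, h in zip(self.action_heads, new_hiddens)], dim=1)
        weights = torch.softmax(
            self.weight_head(torch.cat(new_hiddens, dim=-1)), dim=-1)
        return actions, weights, new_hiddens    # (B, M, A), (B, M), updated memories

head = MultimodalActionHead()
h = [torch.zeros(1, 64) for _ in range(3)]
actions, weights, h = head(torch.randn(1, 128), h)   # act with the highest-weight mode
```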


We collected a real-world human demonstration dataset consisting of 410K timesteps and trained the controller end-to-end. Our DECISION controller significantly outperforms CNNs and LSTMs.

The controller was first deployed on our Boston Dynamics Spot robot in April 2022. It has navigated autonomously for more than 150 km at the time of writing and has been incrementally improved over time, as an integral component of the GoAnywhere@NUS project.

One feature of this work is that all experimental results are obtained in the real world. Here is our demo video, showing our Spot robot traversing many different locations on our university campus.

Our goal is to exploit the potential of the particle representation to enable robots to handle general manipulation tasks, including rigid objects, deformable objects, and liquids.

Ab Initio Particle-based Object Manipulation

S. Chen, X. Ma, Y. Lu, D. Hsu. Ab Initio Particle-based Object Manipulation. In Robotics: Science and Systems, RSS, 2021. [PDF][Code]

This paper presents Particle-based Object Manipulation (Prompt), a new approach to robot manipulation of novel objects ab initio, without prior object models or pre-training on a large object data set. The key element of Prompt is a particle-based object representation, in which each particle represents a point in the object, the local geometric, physical, and other features of the point, and also its relation with other particles. Like the model-based analytic approaches to manipulation, the particle representation enables the robot to reason about the object’s geometry and dynamics in order to choose suitable manipulation actions. Like the data-driven approaches, the particle representation is learned online in real-time from visual sensor input, specifically, multi-view RGB images. The particle representation thus connects visual perception with robot control. Prompt combines the benefits of both model-based reasoning and data-driven learning. We show empirically that Prompt successfully handles a variety of everyday objects, some of which are transparent. It handles various manipulation tasks, including grasping, pushing, etc. Our experiments also show that Prompt outperforms a state-of-the-art data-driven grasping method on daily objects, even though it does not use any offline training data.


Real-world robots often face a stochastic and partially observable environment. POMDP planning offers a principled approach to handling such uncertainties. However, POMDPs are also well known for their computational complexity, which grows exponentially with the problem scale and the planning horizon, namely, the “curse of dimensionality” and the “curse of history”. We aim to solve large-scale POMDP planning efficiently and scale up to complex real-world tasks. The keys to our solution are: (i) one can leverage massive parallelization and powerful hardware to mitigate the “curse of dimensionality”, and (ii) one can integrate POMDP planning with learning to overcome the “curse of history”.

A perfect example of a large-scale real-world task is crowd-driving: driving among an unregulated crowd of heterogeneous traffic agents. In crowd-driving, the robot vehicle must contend with a large-scale and highly interactive environment. Such complex environments require sophisticated long-term planning to achieve human-level performance, yet real-time constraints make such planning difficult. Shown below is a 3-minute talk introducing the crowd-driving problem and discussing how it can be modeled and solved using a combination of planning and learning. The video also summarizes the core technical approaches introduced in this post:

HyP-DESPOT

Panpan Cai, Yuanfu Luo, David Hsu, and Wee Sun Lee.  HyP-DESPOT: A Hybrid Parallel Algorithm for Online Planning under Uncertainty. Int. J. Robotics Research, 40(2–3), 2021.
PDF | code

Hybrid Parallel DESPOT (HyP-DESPOT) is a massively parallel belief tree search algorithm that leverages both CPU and GPU parallelization in order to achieve real-time planning performance for complex tasks with large state, action, and observation spaces. On multi-core CPUs, HyP-DESPOT performs parallel DESPOT tree search by simultaneously traversing multiple independent paths; on the GPU, it performs parallel Monte Carlo simulations at the leaf nodes of the search tree. HyP-DESPOT provably converges in finite time under moderate conditions and guarantees near-optimality of the solution. In practice, HyP-DESPOT speeds up online planning by up to a factor of several hundred in several challenging robotic tasks.
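To illustrate just the parallel-rollout idea in isolation, the toy sketch below evaluates Monte Carlo rollouts for several leaf nodes concurrently on CPU cores. The random-walk problem, rollout policy, and discounting are made up for illustration; the actual HyP-DESPOT parallelizes the tree search itself on multi-core CPUs and runs batched simulations on the GPU.

```python
# A toy sketch of parallel Monte Carlo rollouts at the leaf nodes of a search
# tree, using a made-up random-walk problem.
import random
from concurrent.futures import ProcessPoolExecutor

def rollout(args):
    state, depth, gamma = args
    total, discount = 0.0, 1.0
    for _ in range(depth):
        reward = -abs(state)                  # toy reward: stay near the origin
        total += discount * reward
        state += random.choice([-1, 1])       # toy stochastic transition
        discount *= gamma
    return total

if __name__ == "__main__":
    leaf_states = [0, 2, -1, 3]               # states at the leaf nodes of the tree
    with ProcessPoolExecutor() as pool:
        values = list(pool.map(rollout, [(s, 20, 0.95) for s in leaf_states]))
    print(values)                             # estimates used to guide the search
```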

LeTS-Drive

Panpan Cai, Yuanfu Luo, Aseem Saxena, David Hsu, and Wee Sun Lee. LeTS-Drive: Driving in a Crowd by Learning from Tree Search. In Proc. Robotics: Science & Systems, 2019.

LeTS-Drive is a crowd-driving algorithm that integrates online POMDP planning with deep learning. The core idea is to constrain belief tree search to short-term futures and use learning for long-term futures. It consists of two phases. In the offline phase, we learn a policy and the corresponding value function by imitating the belief tree search expert. In the online phase, LeTS-Drive uses the learned policy and value function to inform and guide the online belief tree search. LeTS-Drive leverages the robustness of planning and the runtime efficiency of learning to enhance the performance of both. By integrating planning and learning, LeTS-Drive outperforms either planning or imitation learning alone and develops sophisticated crowd-driving skills.
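The snippet below sketches, in a generic way, how a learned policy prior can guide action selection inside tree search, with a learned value function summarizing long-term futures at the search horizon. It is a conceptual illustration only; LeTS-Drive builds on DESPOT-style belief tree search, not on this code.

```python
# A generic PUCT-style selection rule: the learned policy prior biases
# exploration, while search statistics provide the exploitation term.
import math
from dataclasses import dataclass

@dataclass
class Child:
    visits: int = 0
    q_value: float = 0.0     # backed-up value; leaves are initialized by a learned value net

def select_action(children, policy_prior, c_puct=1.0):
    # children: dict action -> Child; policy_prior: dict action -> probability.
    total = sum(c.visits for c in children.values()) + 1
    def score(action, child):
        u = c_puct * policy_prior[action] * math.sqrt(total) / (1 + child.visits)
        return child.q_value + u
    return max(children.items(), key=lambda kv: score(*kv))[0]

# Usage with made-up statistics: the learned prior breaks the tie toward "accelerate".
children = {"accelerate": Child(3, 0.5), "brake": Child(3, 0.5)}
prior = {"accelerate": 0.7, "brake": 0.3}
print(select_action(children, prior))
```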

LeTS-Drive-Auto

Panpan Cai and David Hsu. Think Locally, Learn Globally. To be uploaded to arXiv.

LeTS-Drive-Auto advances LeTS-Drive by integrating planning and learning in a closed loop. Similar to LeTS-Drive, it learns a policy network and a value network as representations of global priors. In contrast, LeTS-Drive-Auto builds mutual communication between the planner and the learner by 1) guiding the online planner using the learned global priors and 2) learning the global priors from data sent back by the guided planner. LeTS-Drive-Auto is a new reinforcement learning algorithm in which the planner serves as the policy improvement operator and evolves together with the learner. By integrating planning and learning in a closed loop, LeTS-Drive-Auto achieves superior driving performance in crowded urban environments in simulation, outperforming either planning or learning alone, and far exceeding the capability of open-loop integration (LeTS-Drive).

We aim to model, predict, and simulate the motion of unregulated crowds of traffic participants. We take a geometric approach and model traffic agent motion as a constrained optimization problem in the velocity space of agents. The objective of the problem is to maximize the cooperative navigation efficiency for all agents, while the constraints are induced by collision avoidance, vehicle kinematics, and the road context. We have developed two motion models for two typical scenarios: crowds of pedestrians in open spaces and heterogeneous traffic crowds on urban roads. Making use of these models, a planning system can perform more accurate long-term predictions and thus enhance the planning performance. We have further released an open-source simulator that uses these models to simulate crowd-driving in real-world urban maps. We envision the simulator to facilitate training, testing, and development of crowd-driving algorithms.

PORCA

Yuanfu Luo, Panpan Cai, Aniket Bera, David Hsu, Wee Sun Lee, and Dinesh Manocha. PORCA: Modeling and planning for autonomous driving among many pedestrians. IEEE Robotics and Automation Letters, 2018. [PDF]

This paper presents a planning system for autonomous driving among many pedestrians. A key ingredient of our approach is PORCA, a pedestrian motion prediction model that accounts for both a pedestrian’s global navigation intention and local interactions with the vehicle and other pedestrians. Since the autonomous vehicle does not know the pedestrian’s intention a priori, it requires a planning algorithm that reasons about the uncertainty over pedestrian intentions. Our planning system combines a POMDP algorithm with the pedestrian motion model and runs in near real-time. Experiments show that it enables a robot vehicle to drive safely, efficiently, and smoothly among a dense crowd of pedestrians.

GAMMA

Yuanfu Luo, Panpan Cai, David Hsu, and Wee Sun Lee. GAMMA: A general agent motion prediction model for autonomous driving. arXiv, 2019. [PDF][Code]

Urban environments usually contain mixed traffic of heterogeneous agents such as pedestrians, bicycles, cars, and buses. In such environments, traffic motion prediction becomes extremely challenging because of the diverse dynamics and geometry of traffic agents, complex road conditions, and intensive interactions among the agents. In this paper, we propose GAMMA, a general agent motion prediction model for autonomous driving that can predict the motion of heterogeneous traffic agents with different kinematics, geometry, human inner states, and so on. GAMMA formalizes motion prediction as geometric optimization in the velocity space and integrates physical constraints and human inner states into this unified framework. Our results show that GAMMA significantly outperforms both geometric and learning-based approaches on diverse real-world datasets.
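The toy sketch below conveys the geometric idea: among sampled candidate velocities, keep those that satisfy a collision-avoidance constraint over a short horizon and pick the one closest to the agent's preferred velocity. GAMMA solves a richer constrained optimization with kinematic and road-context constraints; the circular agents, sampling grid, and horizon here are simplifying assumptions.

```python
# A toy sketch of motion prediction as constrained optimization in velocity
# space, with collision avoidance as the constraint and navigation efficiency
# as the objective.
import numpy as np

def predict_velocity(pos, pref_vel, others, radius=0.5, v_max=1.5, horizon=2.0):
    # others: list of (position, velocity) pairs of nearby agents, 2-D arrays.
    angles = np.linspace(0, 2 * np.pi, 72, endpoint=False)
    speeds = np.linspace(0, v_max, 8)
    best, best_cost = np.zeros(2), np.inf
    for a in angles:
        for s in speeds:
            v = s * np.array([np.cos(a), np.sin(a)])
            # Constraint: keep at least 2*radius separation over the horizon.
            ok = all(
                np.min([np.linalg.norm((pos + t * v) - (p + t * pv))
                        for t in np.linspace(0, horizon, 10)]) > 2 * radius
                for p, pv in others)
            cost = np.linalg.norm(v - pref_vel)   # navigation-efficiency objective
            if ok and cost < best_cost:
                best, best_cost = v, cost
    return best

# Usage: the agent prefers to go straight but an oncoming agent forces a sidestep.
v = predict_velocity(np.zeros(2), np.array([1.0, 0.0]),
                     [(np.array([2.0, 0.0]), np.array([-1.0, 0.0]))])
print(v)
```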

STAR

Cunjun Yu*, Xiao Ma*, Jiawei Ren, Haiyu Zhao, Shuai Yi. Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction. In Proc. European Conf. on Computer Vision, 2020. [PDF][Code][Project]

Understanding crowd motion dynamics is critical to real-world applications, e.g., surveillance systems and autonomous driving. This is challenging because it requires effectively modeling the socially aware crowd spatial interactions and complex temporal dependencies. We believe attention is the most important factor for trajectory prediction. In this paper, we present STAR, a Spatio-Temporal grAph tRansformer framework, which tackles trajectory prediction using only attention mechanisms. STAR models intra-graph crowd interaction by TGConv, a novel Transformer-based graph convolution mechanism. The inter-graph temporal dependencies are modeled by separate temporal Transformers. STAR captures complex spatio-temporal interactions by interleaving spatial and temporal Transformers. To calibrate the temporal prediction for the long-lasting effect of disappeared pedestrians, we introduce a read-writable external memory module, which is consistently updated by the temporal Transformer. We show that with only attention mechanisms, STAR achieves state-of-the-art performance on five commonly used real-world pedestrian prediction datasets.

SUMMIT

Panpan Cai*, Yiyuan Lee*, Yuanfu Luo, David Hsu. SUMMIT: A Simulator for Urban Driving in Massive Mixed Traffic. In Proc. IEEE Int. Conf. on Robotics & Automation, 2020. [PDF][Code]

This paper presents SUMMIT, a high-fidelity simulator that facilitates the development and testing of crowd-driving algorithms. By leveraging the open-source OpenStreetMap map database and the GAMMA motion model developed in our earlier work, SUMMIT simulates dense, unregulated urban traffic for heterogeneous agents at any worldwide location that OpenStreetMap supports. SUMMIT is built as an extension of CARLA and inherits from it the physics and visual realism for autonomous driving simulation. SUMMIT supports a wide range of applications, including perception, vehicle control and planning, and end-to-end learning. We provide real-world benchmark scenes to show that SUMMIT generates complex, realistic traffic behaviors in challenging crowd-driving settings. The simulator also comes with a context-aware POMDP planner as a driving expert and a reference for future crowd-driving algorithms.


We aim at robust reinforcement learning (RL) for tasks with complex partial observations. While existing RL algorithms have achieved great success in simulated environments, such as Atari games, Go, and even Dota, generalizing them to realistic environments with complex partial observations remains challenging. The key to our approach is to learn from the partial observations a robust and compact latent state representation. Specifically, we handle partial observability by combining the particle filter algorithm with recurrent neural networks. We tackle complex observations through efficient discriminative model learning, which focuses on the observation information required for action selection rather than the entire high-dimensional observation space.

Particle Filter Recurrent Neural Networks

X. Ma, P. Karkus, D. Hsu, and W.S. Lee. Particle filter recurrent neural networks. In Proc. AAAI Conf. on Artificial Intelligence, 2020. [PDF][Code]

Recurrent neural networks (RNNs) have been extraordinarily successful for prediction with sequential data. To tackle highly variable and noisy real-world data, we introduce Particle Filter Recurrent Neural Networks (PF-RNNs), a new RNN family that explicitly models uncertainty in its internal structure: while an RNN relies on a long, deterministic latent state vector, a PF-RNN maintains a latent state distribution, approximated as a set of particles. For effective learning, we provide a fully differentiable particle filter algorithm that updates the PF-RNN latent state distribution according to Bayes' rule. Experiments demonstrate that the proposed PF-RNNs outperform the corresponding standard gated RNNs on a synthetic robot localization dataset and 10 real-world sequence prediction datasets for text classification, stock price prediction, etc.
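The sketch below shows one differentiable particle-filter step in this spirit: particles are moved by a learned stochastic transition, reweighted by a learned observation likelihood following Bayes' rule, and soft-resampled so that gradients keep flowing through the update. The network sizes and the soft-resampling mixture are illustrative assumptions, not the exact PF-RNN cell (which uses gated updates).

```python
# A minimal differentiable particle-filter step: stochastic transition,
# Bayes reweighting with a learned likelihood, and soft resampling.
import torch
import torch.nn as nn

class PFStep(nn.Module):
    def __init__(self, state_dim=16, obs_dim=32, alpha=0.5):
        super().__init__()
        self.trans = nn.Linear(state_dim + state_dim, state_dim)   # state + noise -> state
        self.obs_logit = nn.Linear(state_dim + obs_dim, 1)         # learned log-likelihood
        self.alpha = alpha

    def forward(self, particles, log_w, obs):
        # particles: (K, state_dim); log_w: (K,) log weights; obs: (obs_dim,)
        K = particles.shape[0]
        noise = torch.randn_like(particles)
        particles = torch.tanh(self.trans(torch.cat([particles, noise], dim=-1)))
        obs_rep = obs.expand(K, -1)
        log_lik = self.obs_logit(torch.cat([particles, obs_rep], dim=-1)).squeeze(-1)
        log_w = torch.log_softmax(log_w + log_lik, dim=0)           # Bayes update
        # Soft resampling: sample from a mixture of the posterior and a uniform,
        # then correct with importance weights so the step stays differentiable.
        probs = self.alpha * log_w.exp() + (1 - self.alpha) / K
        idx = torch.multinomial(probs, K, replacement=True)
        new_w = log_w.exp()[idx] / probs[idx]
        return particles[idx], torch.log(new_w / new_w.sum())

step = PFStep()
p = torch.randn(30, 16)
w = torch.log(torch.ones(30) / 30)
p, w = step(p, w, torch.randn(32))
```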

Discriminative Particle Filter Reinforcement Learning

X. Ma, P. Karkus, D. Hsu, W.S. Lee, and N. Ye. Discriminative particle filter reinforcement learning for complex partial observations. In Proc. Int. Conf. on Learning Representations, 2020. [PDF][Project][Code]

Real-world decision making often requires reasoning in a partially observable environment using information obtained from complex visual observations — major challenges for deep reinforcement learning. In this paper, we introduce the Discriminative Particle Filter Reinforcement Learning (DPFRL), a reinforcement learning method that encodes a particle filter structure with learned discriminative transition and observation models in a neural network. The particle filter structure allows for reasoning with partial observations, and discriminative parameterization allows modeling only the information in the complex observations that are relevant for decision making. In experiments, we show that in most cases DPFRL outperforms state-of-the-art POMDP RL models in Flickering Atari Games, an existing POMDP RL benchmark, as well as in Natural Flickering Atari Games, a new, more challenging POMDP RL benchmark that we introduce. We also show that DPFRL performs well when applied to a visual navigation domain with real-world data.

Contrastive Variational Reinforcement Learning

X. Ma, S. Chen, D. Hsu and W.S. Lee. Contrastive variational reinforcement learning for complex observations. In Proc. 4th Conf. on Robot Learning, 2020. [PDF][Project][Code]

Deep reinforcement learning (DRL) has achieved significant success in various robot tasks: manipulation, navigation, etc. However, complex visual observations in natural environments remain a major challenge. This paper presents Contrastive Variational Reinforcement Learning (CVRL), a model-based method that tackles complex visual observations in DRL. CVRL learns a contrastive variational model by maximizing the mutual information between latent states and observations discriminatively, through contrastive learning. It avoids modeling the complex observation space unnecessarily, as the commonly used generative observation model often does, and is significantly more robust. CVRL achieves comparable performance with state-of-the-art model-based DRL methods on standard Mujoco tasks. It significantly outperforms them on Natural Mujoco tasks and a robot box-pushing task with complex observations, e.g., dynamic shadows.
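The snippet below sketches the kind of contrastive objective involved: a bilinear critic scores latent-state/observation pairs, and an InfoNCE-style cross-entropy over in-batch negatives maximizes a lower bound on their mutual information, with no pixel reconstruction. The critic form and dimensions are assumptions for illustration, not CVRL's exact architecture.

```python
# A minimal InfoNCE-style contrastive objective between latent states and
# observation embeddings, as an alternative to a generative observation model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCritic(nn.Module):
    def __init__(self, z_dim=32, e_dim=64):
        super().__init__()
        self.W = nn.Parameter(torch.randn(z_dim, e_dim) * 0.02)   # bilinear critic

    def loss(self, z, e):
        # z: (B, z_dim) latent states; e: (B, e_dim) embeddings of the matching
        # observations. Other observations in the batch serve as negatives.
        logits = z @ self.W @ e.t()                  # (B, B) pair scores
        labels = torch.arange(z.shape[0])            # positives on the diagonal
        return F.cross_entropy(logits, labels)

critic = ContrastiveCritic()
z, e = torch.randn(8, 32), torch.randn(8, 64)
print(critic.loss(z, e))    # minimizing this maximizes an MI lower bound
```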

Natural Mujoco Tasks


Object manipulation in unstructured environments such as homes and offices requires decision making under uncertainty. We want to investigate how we can do general, fast and robust object manipulation under uncertainty in a principled manner.

Learning To Grasp Under Uncertainty Using POMDPs

N. P. Garg, D. Hsu, and W. S. Lee. Learning To Grasp Under Uncertainty Using POMDPs. In Proc. IEEE Int. Conf. on Robotics & Automation, 2019.

Robust object grasping under uncertainty is an essential capability of service robots. Many existing approaches rely on far-field sensors, such as cameras, to compute a grasp pose and then perform an open-loop grasp after placing the gripper at that pose. This often fails as a result of sensing or environment uncertainty. This paper presents a principled, general, and efficient approach to adaptive grasping, using both tactile and visual sensing as feedback. We first model adaptive grasping as a partially observable Markov decision process (POMDP), which handles uncertainty naturally. We solve the POMDP for objects sampled from a set, in order to generate data for learning. Finally, we train a grasp policy, represented as a deep recurrent neural network (RNN), in simulation through imitation learning. By combining model-based POMDP planning and imitation learning, the proposed approach achieves robustness under uncertainty, generalization over many objects, and fast execution. In particular, we show that modeling only a small sample of objects enables us to learn a robust strategy to grasp previously unseen objects of varying shapes and to recover from failure over multiple steps. Experiments on the G3DB object dataset in simulation and on a smaller object set with a real robot indicate promising results.

Push-Net: Deep Planar Pushing for Objects with Unknown Physical Properties

J.K. Li, D. Hsu, and W.S. Lee. Push-Net: Deep planar pushing for objects with unknown physical properties. In Proc. Robotics: Science & Systems, 2018.
PDF | code

We introduce Push-Net, a deep recurrent neural network model, which enables a robot to push objects of unknown physical properties for re-positioning and re-orientation, using only visual camera images as input. The unknown physical properties are a major challenge for pushing. Push-Net overcomes the challenge by tracking a history of push interactions with an LSTM module and by training with an auxiliary objective that estimates the object’s center of mass. We trained Push-Net entirely in simulation and tested it extensively on many different objects in both simulation and on two real robots, a Fetch arm and a Kinova MICO arm. Experiments suggest that Push-Net is robust and efficient. It achieved over 97% success rate in simulation on average and succeeded in all real robot experiments with a small number of pushes.
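The sketch below illustrates the two ingredients just described: an LSTM that tracks the history of push interactions and an auxiliary head that estimates the object's center of mass, trained alongside the push prediction. The mask encoder, dimensions, and output heads are illustrative assumptions, not Push-Net's actual architecture.

```python
# A minimal sketch of an LSTM over push history with an auxiliary
# center-of-mass prediction head.
import torch
import torch.nn as nn

class PushNetSketch(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, action_dim)   # push selection scores
        self.com_head = nn.Linear(hidden, 2)                # auxiliary: center of mass

    def forward(self, mask_history):
        # mask_history: (B, T, 64, 64) object masks over the push history.
        B, T = mask_history.shape[:2]
        feats = self.encoder(mask_history.reshape(B * T, 1, 64, 64)).reshape(B, T, -1)
        out, _ = self.lstm(feats)
        last = out[:, -1]
        return self.action_head(last), self.com_head(last)

net = PushNetSketch()
actions, com = net(torch.rand(2, 5, 64, 64))
# Training would combine a push-prediction loss with an auxiliary MSE loss on `com`.
```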


Differentiable Algorithm Network (DAN) aims to combine the strengths of model-based algorithmic reasoning and model-free deep learning for robust robot decision-making under uncertainty. Key to our approach is a unified neural network policy representation, encoding both a learned system model and an algorithm that solves the model. The network is fully differentiable and can be trained end-to-end, circumventing the difficulties of direct model learning in a partially observable setting. In contrast with conventional deep neural networks, our network representation imposes model and algorithmic priors on the neural network architecture for improved generalization of the learned policy.

QMDP-Net

P. Karkus, D. Hsu, and W. S. Lee. QMDP-Net: Deep Learning for Planning Under Partial Observability. In Advances in Neural Information Processing Systems, NeurIPS, 2017.
PDF | code

QMDP-net employs algorithm priors on a neural network for planning under partial observability. The network encodes a learned POMDP model together with QMDP, a simple, approximate POMDP planner, thus embedding the solution structure of planning in a network learning architecture. We train a QMDP-net on different tasks so that it can generalize to new ones in the parameterized task set, and “transfer” to other similar tasks beyond the set. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.
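For reference, the snippet below is a plain implementation of the QMDP approximation that the network encodes as an algorithmic prior: value iteration on the underlying MDP, followed by belief-weighted Q-values for action selection. It runs on a small made-up problem with explicit matrices, whereas QMDP-net realizes the same computation with learned, convolutional modules.

```python
# Plain QMDP on a tiny made-up POMDP: value iteration, then belief-weighted Q.
import numpy as np

def qmdp_action(T, R, belief, gamma=0.95, iters=100):
    # T: (A, S, S) transition probabilities; R: (A, S) rewards; belief: (S,)
    A, S, _ = T.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * T @ V        # (A, S) one-step lookahead
        V = Q.max(axis=0)            # value iteration backup
    return int(np.argmax(Q @ belief))  # belief-weighted Q-values pick the action

# Toy 2-state, 2-action problem with an uncertain state estimate.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, -1.0], [0.0, 0.0]])
print(qmdp_action(T, R, belief=np.array([0.6, 0.4])))
```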

Navigation Network (Nav-Net)

P. Karkus, D. Hsu, and W. S. Lee. Integrating Algorithmic Planning and Deep Learning for Partially Observable Navigation. In MLPC Workshop, International Conference on Robotics and Automation, 2018.

This work extends the QMDP-net, and encodes all components of a larger robotic system in a single neural network: state estimation, planning, and control. We apply the idea to a challenging partially observable navigation task: a robot must navigate to a goal in a previously unseen 3-D environment without knowing its initial location, and instead relying on a 2-D floor map and visual observations from an onboard camera.

Particle Filter Network (PF-Net)

P. Karkus, D. Hsu, and W. S. Lee. Particle Filter Networks with Application to Visual Localization. In Conference on Robot Learning, CoRL, 2018.
PDF | code

In this work, we encode the Particle Filter algorithm in a differentiable neural network. PF-net enables end-to-end model learning, which trains the model in the context of a specific algorithm, resulting in improved performance, compared with conventional model-learning methods. We apply PF-net to visual robot localization. The robot must localize in rich 3-D environments, using only a schematic 2-D floor map. PF-net learns effective models that generalize to new, unseen environments. It can also incorporate semantic labels on the floor map.

Read more on PF-nets combined with state representation learning for sequence prediction and RL here; and map representation learning here.

Differentiable Algorithm Network (DAN)

P. Karkus, X. Ma, D. Hsu, L. P. Kaelbling, W. S. Lee, and T. Lozano-Pérez. Differentiable Algorithm Networks for Composable Robot Learning. In Robotics: Science and Systems, RSS, 2019. Nominated for the Best Student Paper Award and the Best Systems Paper Award.

DANs compose modules of differentiable robot algorithms and associated models into a single neural network that is trained end-to-end from data. From a model-free policy learning perspective, the algorithms in DAN act as a structured prior. From a model-based RL perspective, instead of training models to match the underlying system dynamics, DAN trains models end-to-end to optimize the overall task objective by backpropagating gradients through the algorithms. The benefit of task-oriented learning is that models and algorithms can adapt to and compensate for each other's imperfections. We illustrate the DAN methodology using differentiable modules for visual perception, state filtering, planning, and local control in the context of a partially observable visual navigation task in 3-D environments.

How can a delivery robot navigate reliably in a new office building, with only a schematic floor map? To tackle this challenge, we introduce Intention-Net (iNet), a two-level hierarchical navigation architecture, which integrates model-based path planning and model-free deep learning.

Intention-Net

W. Gao, D. Hsu, W. Lee, S. Shen, and K. Subramanian. Intention-Net: Integrating planning and deep learning for goal-directed autonomous navigation. In S. Levine, V. Vanhoucke, and K. Goldberg, editors, Conference on Robot Learning, volume 78 of Proc. Machine Learning Research, pages 185–194, 2017.

iNet is a two-level hierarchical architecture for visual navigation. It mimics human navigation in a principled way by integrating high-level planning on a crude global map with low-level neural-network motion control. At the high level, a path planner uses a crude map, e.g., a 2-D floor map, to compute a path from the robot’s current location to the final destination. The planned path provides “intentions” to the local motion controller. At the low level, a neural-network motion controller is trained end-to-end to provide robust local navigation. Given an intention, it maps images from a single monocular camera directly to robot control. The “intention” thus provides the communication interface between global path planning and local neural-network motion control:

  • Discretized local move (DLM) intention. We assume that in most cases discretized navigation instructions are sufficient for navigation; for example, turning left at the next junction is enough for a human driver in the real world. We mimic the same procedure by introducing four discretized intentions.
  • Local path and environment (LPE) intention. The DLM intention relies on pre-defined parameters. To alleviate this issue, we design a map-based intention that encodes all the navigation information from the high-level planner. A minimal sketch of the two-level design is shown below.
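The sketch below shows the shape of the two-level design: a high-level planner on a crude map yields an intention, and a low-level network maps the camera image plus that intention to a control command. The grid-path heuristic, network sizes, and the four DLM labels are placeholders, not the actual Intention-Net implementation.

```python
# A minimal sketch of the two-level hierarchy: planner -> intention -> controller.
import torch
import torch.nn as nn

DLM_INTENTIONS = ["go_straight", "turn_left", "turn_right", "stop"]

def high_level_intention(path):
    # path: list of (row, col) waypoints from a planner on the floor map.
    # Placeholder rule: derive the next discrete instruction from the turn direction.
    if len(path) < 3:
        return "stop"
    (r0, c0), (r1, c1), (r2, c2) = path[0], path[1], path[2]
    turn = (r1 - r0) * (c2 - c1) - (c1 - c0) * (r2 - r1)   # sign of the turn
    return "go_straight" if turn == 0 else ("turn_left" if turn > 0 else "turn_right")

class LowLevelController(nn.Module):
    def __init__(self, num_intentions=4):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
        self.head = nn.Linear(64 + num_intentions, 2)        # outputs (v, omega)

    def forward(self, image, intention_idx):
        one_hot = nn.functional.one_hot(intention_idx, 4).float()
        return self.head(torch.cat([self.vision(image), one_hot], dim=-1))

# Usage: the planner's path yields an intention; the controller maps image + intention to control.
intention = high_level_intention([(0, 0), (0, 1), (1, 1)])
ctrl = LowLevelController()
cmd = ctrl(torch.rand(1, 3, 32, 32), torch.tensor([DLM_INTENTIONS.index(intention)]))
```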

Deep rEcurrent Controller for vISual navigatiON (DECISION)

B. Ai, W. Gao, Vinay, and D. Hsu. Deep Visual Navigation under Partial Observability. In IEEE Int. Conf.  on Robotics & Automation, 2022.
PDF | code

The iNet navigation controller faces three challenges: 

  • Visual complexity. Due to the high dimensionality of raw pixels, different scenes could appear dramatically different across environments, which makes traditional model-based approaches brittle.
  • Partial observability. The robot may not see blind-spot objects, or is unable to detect features of interest, e.g., the intention of a moving pedestrian.
  • Multimodal behaviors. Human navigation behaviors are multimodal in nature, and the behaviors are dependent on both local environments and the high-level navigation objective.

To overcome the challenges, DECISION exploits two ideas in designing the neural network structure:

  • Multi-scale temporal modeling. We use spatial memory modules to capture both low-level motions and high-level temporal semantics that are useful to the control. The rich history information can compensate for the partial observations.
  • Multimodal actions. We extend the idea of Mixture Density Networks (MDNs) to temporal reasoning. Specifically, we use independent memory modules for different modes to preserve the distinction of modes.

We collected a real-world human demonstration dataset consisting of 410K timesteps and trained the controller end-to-end. Our DECISION controller significantly outperforms CNNs and LSTMs.

DECISION was first deployed on our Boston Dynamics Spot robot in April 2022. The robot has been navigating autonomously for more than 150 km at the time of writing.

Human language provides a powerful and natural interface for humans to communicate with robots. We aim to develop a robot system that follows natural language instructions to interact with the physical world.

Interactive Visual Grounding

M. Shridhar and D. Hsu. Interactive visual grounding of referring expressions for human-robot interaction. In Proc. Robotics: Science & Systems, 2018.

INGRESS is a robot system that follows human natural language instructions to pick and place everyday objects. The core issue here is the grounding of referring expressions: inferring objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disambiguate referring expressions interactively. To achieve this, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expression, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred object. The same neural networks are used for both grounding and question generation for disambiguation.