Category Archives: Research

This research examines particle-based object manipulation. We start with Particle-based Object Manipulation (PROMPT), designed for the manipulation of unknown rigid objects. PROMPT leverages a particle-based object representation to bridge visual perception and robot control. We extend this particle representation to deformable objects and introduce DaXBench, a differentiable simulation framework to benchmark deformable object manipulation (DOM) techniques. DaXBench enables deeper exploration and systematic comparison of DOM techniques. Finally, we present Differentiable Particles (DiPac), a new DOM algorithm built on the particle representation. The differentiable particle dynamics aids in the efficient estimation of dynamics parameters and in action optimization. Together, these new tools address the challenge of robot manipulation of unknown deformable objects, paving the way for new progress in the field.

Prompt

S. Chen, X. Ma, Y. Lu, D. Hsu. Ab initio particle-based object manipulation. Proceedings of Robotics: Science and Systems, RSS 2021.
PDF | Code

We present Particle-based Object Manipulation (Prompt), a new approach to robot manipulation of novel objects ab initio, without prior object models or pre-training on a large object data set. The key element of Prompt is a particle-based object representation, in which each particle represents a point in the object, the local geometric, physical, and other features of the point, and also its relation with other particles. Like the model-based analytic approaches to manipulation, the particle representation enables the robot to reason about the object’s geometry and dynamics in order to choose suitable manipulation actions. Like the data-driven approaches, the particle representation is learned online in real-time from visual sensor input, specifically, multi-view RGB images. The particle representation thus connects visual perception with robot control. Prompt combines the benefits of both model-based reasoning and data-driven learning.
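To make the representation concrete, here is a minimal, hypothetical sketch of such a particle set in Python. The class names, feature dimensions, and the center-of-mass helper are illustrative assumptions, not Prompt's actual data structures.

```python
# A hypothetical sketch of a particle-based object representation: each
# particle stores a 3-D point, a local feature vector, and indices of
# related (neighboring) particles.
from dataclasses import dataclass, field
import numpy as np


@dataclass
class Particle:
    position: np.ndarray                            # (3,) point on/in the object
    features: np.ndarray                            # local geometric/physical features
    neighbors: list = field(default_factory=list)   # indices of related particles


@dataclass
class ParticleObject:
    particles: list    # list[Particle], reconstructed online from multi-view RGB

    def center_of_mass(self) -> np.ndarray:
        # Simple geometric reasoning enabled directly by the representation.
        return np.mean([p.position for p in self.particles], axis=0)


# Usage: two particles related to each other.
obj = ParticleObject([
    Particle(np.array([0.0, 0.0, 0.0]), np.zeros(8), [1]),
    Particle(np.array([0.1, 0.0, 0.0]), np.zeros(8), [0]),
])
print(obj.center_of_mass())
```

The point of the sketch is that geometric quantities needed for manipulation reasoning (here, a crude center of mass) fall out of the particle set directly, without a prior object model.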

DaXBench

S. Chen*, Y. Xu*, C. Yu*, L. Li, X. Ma, Z. Xu, D. Hsu. DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics. Proceedings of The Eleventh International Conference on Learning Representations, ICLR 2023 (Oral). (*Co-first author)

PDF | Code | Website

We extend the particle representation to deformable object manipulation with differentiable dynamics. Deformable object manipulation (DOM) is a long-standing challenge in robotics and has attracted significant interest recently. This paper presents DaXBench, a differentiable simulation framework for DOM. While existing work often focuses on a specific type of deformable object, DaXBench supports fluid, rope, cloth, and more; it provides a general-purpose benchmark to evaluate widely different DOM methods, including planning, imitation learning, and reinforcement learning. DaXBench hence serves as a working bench to facilitate research on deformable object manipulation using particles.

DiPac

S. Chen, Y. Xu, C. Yu, L. Li, X. Ma, D. Hsu. Differentiable Particles for Deformable Object Manipulation.

PDF | Code

Furthermore, we use differentiable particles to handle a wide variety of deformable objects. Manipulating such objects, including rope, cloth, or beans, is a significant challenge due to their extensive degrees of freedom and complex non-linear dynamics. This paper introduces Differentiable Particles (DiPac), a novel algorithm for deformable object manipulation. DiPac represents a deformable object as a collection of particles and employs a differentiable particle dynamics simulator to reason about robot manipulation. DiPac uses a single representation – particles – for a diverse array of objects: scattered beans, rope, T-shirts, and so forth. The differentiable dynamics enables DiPac to efficiently (i) estimate the dynamics parameters, effectively reducing the simulation-to-real gap, and (ii) select the best action by backpropagating the gradient along sampled trajectories.
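As a sketch of point (ii), the snippet below optimizes a short action sequence by backpropagating a goal loss through a differentiable particle rollout. The dynamics function is a toy stand-in, not DiPac's simulator; the particle count, step size, and loss are assumptions for illustration.

```python
# A minimal sketch of gradient-based action selection through differentiable
# particle dynamics. The toy dynamics (particles drift toward a pushed point)
# are an assumption for illustration only.
import torch

def toy_step(particles, action, stiffness):
    # particles: (N, 2); action: (2,) push target; stiffness: dynamics parameter.
    return particles + stiffness * (action - particles) * 0.1

def rollout_loss(particles, actions, stiffness, goal):
    for a in actions:
        particles = toy_step(particles, a, stiffness)
    return ((particles - goal) ** 2).mean()

particles = torch.rand(50, 2)
goal = torch.tensor([0.8, 0.8])
stiffness = torch.tensor(0.5)
actions = torch.zeros(5, 2, requires_grad=True)    # a short action sequence

opt = torch.optim.Adam([actions], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = rollout_loss(particles, actions, stiffness, goal)
    loss.backward()          # gradients flow through the particle dynamics
    opt.step()
print(float(loss))
```

The same backward pass, taken with respect to `stiffness` instead of the actions, would fit the dynamics parameter to observed trajectories, which is the idea behind point (i).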

Deep Visual Navigation under Partial Observability

How do humans navigate? We navigate almost exclusively by visual sensing, together with coarse floor plans. To reach a destination, we demonstrate a diverse set of skills, such as obstacle avoidance, and we use tools, such as buses and elevators, to traverse different locations. All of this is not yet possible for robots. To get closer to human-level navigation performance, we propose a controller that is capable of learning complex policies from data.


DECISION: Deep rEcurrent Controller for vISual navigatiON

B. Ai, W. Gao, Vinay, and D. Hsu. Deep Visual Navigation under Partial Observability. In International Conference on Robotics and Automation (ICRA), 2022. [paper][code][video]


The key lies in designing the structural components that allow the controller to learn the desired capabilities. Here we identify three challenges:

(i) Partial observability: The robot may not see blind-spot objects, or is unable to detect features of interest, e.g., the intention of a moving pedestrian.

(ii) Multimodal behaviors: Human navigation behaviors are multimodal in nature, and the behaviors are dependent on both local environments and the high-level navigation objective.

(iii) Visual complexity: Due to the high dimensionality of raw pixels, different scenes could appear dramatically different across environments, which makes traditional model-based approaches brittle.


To address these challenges, we propose two key structural designs:

(i) Multi-scale temporal modeling: We use spatial memory modules to capture both low-level motions and high-level temporal semantics that are useful to the control. The rich history information can compensate for the partial observations.

(ii) Multimodal actions: We extend the idea of Mixture Density Networks (MDNs) to temporal reasoning. Specifically, we use independent memory modules for different modes to preserve the distinction of modes; a minimal sketch of this idea is shown below.
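The sketch below illustrates a multimodal action head with an independent recurrent memory per mode. The layer sizes, number of modes, and the mode-weight head are illustrative assumptions, not the actual DECISION architecture.

```python
# A minimal sketch of multimodal actions with per-mode memories, in the spirit
# of extending MDNs to temporal reasoning.
import torch
import torch.nn as nn

class MultimodalActionHead(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, num_modes=3, action_dim=2):
        super().__init__()
        # One recurrent memory per mode keeps the modes from collapsing into one.
        self.memories = nn.ModuleList(
            [nn.GRUCell(feat_dim, hidden) for _ in range(num_modes)])
        self.action_heads = nn.ModuleList(
            [nn.Linear(hidden, action_dim) for _ in range(num_modes)])
        self.weight_head = nn.Linear(hidden * num_modes, num_modes)

    def forward(self, feat, hiddens):
        # feat: (B, feat_dim) image features; hiddens: one (B, hidden) state per mode.
        new_hiddens = [mem(feat, h) for mem, h in zip(self.memories, hiddens)]
        actions = torch.stack(
            [head(h) for head, h in zip(self.action_heads, new_hiddens)], dim=1)
        weights = torch.softmax(
            self.weight_head(torch.cat(new_hiddens, dim=-1)), dim=-1)
        return actions, weights, new_hiddens    # (B, M, A), (B, M), updated memories

head = MultimodalActionHead()
h = [torch.zeros(1, 64) for _ in range(3)]
actions, weights, h = head(torch.randn(1, 128), h)   # act with the highest-weight mode
```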


We collected a real-world human demonstration dataset consisting of 410K timesteps and trained the controller end-to-end. Our DECISION controller significantly outperforms CNNs and LSTMs.

The controller was first deployed on our Boston Dynamics Spot robot in April 2022. It has navigated autonomously for more than 150 km at the time of writing and has been incrementally improved over time, as an integral component of the GoAnywhere@NUS project.

One feature of this work is that all experimental results are obtained in the real world. Here is our demo video, showing our Spot robot traversing many different locations on our university campus.

Our goal is to exploit the potential of the particle representation to enable robots to handle general manipulation tasks, including rigid objects, deformable objects, and liquids.

Ab Initio Particle-based Object Manipulation

S. Chen, X. Ma, Y. Lu, D. Hsu. Ab Initio Particle-based Object Manipulation. In Robotics: Science and Systems, RSS, 2021. [PDF][Code]

This paper presents Particle-based Object Manipulation (Prompt), a new approach to robot manipulation of novel objects ab initio, without prior object models or pre-training on a large object data set. The key element of Prompt is a particle-based object representation, in which each particle represents a point in the object, the local geometric, physical, and other features of the point, and also its relation with other particles. Like the model-based analytic approaches to manipulation, the particle representation enables the robot to reason about the object’s geometry and dynamics in order to choose suitable manipulation actions. Like the data-driven approaches, the particle representation is learned online in real-time from visual sensor input, specifically, multi-view RGB images. The particle representation thus connects visual perception with robot control. Prompt combines the benefits of both model-based reasoning and data-driven learning. We show empirically that Prompt successfully handles a variety of everyday objects, some of which are transparent. It handles various manipulation tasks, including grasping, pushing, etc. Our experiments also show that Prompt outperforms a state-of-the-art data-driven grasping method on daily objects, even though it does not use any offline training data.


Real-world robots often face a stochastic and partially observable environment. POMDP planning offers a principled approach to handling such uncertainties. However, POMDPs are also well known for their computational complexity, which grows exponentially with the problem scale and the planning horizon, namely, the “curse of dimensionality” and the “curse of history”. We aim to solve large-scale POMDP planning efficiently and scale up to complex real-world tasks. The keys to our solution are: (i) one can leverage massive parallelization and powerful hardware to mitigate the “curse of dimensionality”, and (ii) one can integrate POMDP planning with learning to overcome the “curse of history”.

A perfect example of a large-scale real-world task is crowd-driving: driving among an unregulated crowd of heterogeneous traffic agents. In crowd-driving, the robot vehicle must contend with a large-scale and highly interactive environment. Such complex environments require sophisticated long-term planning to achieve human-level performance, yet real-time constraints make such planning difficult. Shown below is a 3-minute talk introducing the crowd-driving problem and discussing how it can be modeled and solved using a combination of planning and learning. The video also summarizes the core technical approaches introduced in this post:

HyP-DESPOT

Panpan Cai, Yuanfu Luo, David Hsu, and Wee Sun Lee.  HyP-DESPOT: A Hybrid Parallel Algorithm for Online Planning under Uncertainty. Int. J. Robotics Research, 40(2–3), 2021.
PDF | code

Hybrid Parallel DESPOT (HyP-DESPOT) is a massively parallel belief tree search algorithm that leverages both CPU and GPU parallelization in order to achieve real-time planning performance for complex tasks with large state, action, and observation spaces. On multi-core CPUs, HyP-DESPOT performs parallel DESPOT tree search by simultaneously traversing multiple independent paths; on the GPU, it performs parallel Monte Carlo simulations at the leaf nodes of the search tree. HyP-DESPOT provably converges in finite time under moderate conditions and guarantees near-optimality of the solution. In practice, HyP-DESPOT speeds up online planning by up to a factor of several hundred in several challenging robotic tasks.
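To illustrate just the parallel-rollout idea in isolation, the toy sketch below evaluates Monte Carlo rollouts for several leaf nodes concurrently on CPU cores. The random-walk problem, rollout policy, and discounting are made up for illustration; the actual HyP-DESPOT parallelizes the tree search itself on multi-core CPUs and runs batched simulations on the GPU.

```python
# A toy sketch of parallel Monte Carlo rollouts at the leaf nodes of a search
# tree, using a made-up random-walk problem.
import random
from concurrent.futures import ProcessPoolExecutor

def rollout(args):
    state, depth, gamma = args
    total, discount = 0.0, 1.0
    for _ in range(depth):
        reward = -abs(state)                  # toy reward: stay near the origin
        total += discount * reward
        state += random.choice([-1, 1])       # toy stochastic transition
        discount *= gamma
    return total

if __name__ == "__main__":
    leaf_states = [0, 2, -1, 3]               # states at the leaf nodes of the tree
    with ProcessPoolExecutor() as pool:
        values = list(pool.map(rollout, [(s, 20, 0.95) for s in leaf_states]))
    print(values)                             # estimates used to guide the search
```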

LeTS-Drive

Panpan Cai, Yuanfu Luo, Aseem Saxena, David Hsu, and Wee Sun Lee. LeTS-Drive: Driving in a Crowd by Learning from Tree Search. In Proc. Robotics: Science & Systems, 2019.

LeTS-Drive is a crowd-driving algorithm that integrates online POMDP planning with deep learning. The core idea is to constrain belief tree search to short-term futures and use learning for long-term futures. It consists of two phases. In the offline phase, we learn a policy and the corresponding value function by imitating the belief tree search expert. In the online phase, LeTS-Drive uses the learned policy and value function to inform and guide the online belief tree search. LeTS-Drive leverages the robustness of planning and the runtime efficiency of learning to enhance the performance of both. By integrating planning and learning, LeTS-Drive outperforms either planning or imitation learning alone and develops sophisticated crowd-driving skills.
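The snippet below sketches, in a generic way, how a learned policy prior can guide action selection inside tree search, with a learned value function summarizing long-term futures at the search horizon. It is a conceptual illustration only; LeTS-Drive builds on DESPOT-style belief tree search, not on this code.

```python
# A generic PUCT-style selection rule: the learned policy prior biases
# exploration, while search statistics provide the exploitation term.
import math
from dataclasses import dataclass

@dataclass
class Child:
    visits: int = 0
    q_value: float = 0.0     # backed-up value; leaves are initialized by a learned value net

def select_action(children, policy_prior, c_puct=1.0):
    # children: dict action -> Child; policy_prior: dict action -> probability.
    total = sum(c.visits for c in children.values()) + 1
    def score(action, child):
        u = c_puct * policy_prior[action] * math.sqrt(total) / (1 + child.visits)
        return child.q_value + u
    return max(children.items(), key=lambda kv: score(*kv))[0]

# Usage with made-up statistics: the learned prior breaks the tie toward "accelerate".
children = {"accelerate": Child(3, 0.5), "brake": Child(3, 0.5)}
prior = {"accelerate": 0.7, "brake": 0.3}
print(select_action(children, prior))
```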

LeTS-Drive-Auto

Panpan Cai and David Hsu. Think Locally, Learn Globally. To be uploaded to arXiv.

LeTS-Drive-Auto advances LeTS-Drive by integrating planning and learning in a closed loop. Similar to LeTS-Drive, it learns a policy network and a value network as representations of global priors. In contrast, LeTS-Drive-Auto builds mutual communication between the planner and the learner by 1) guiding the online planner using the learned global priors and 2) learning the global priors from data sent back by the guided planner. LeTS-Drive-Auto is a new reinforcement learning algorithm in which the planner serves as the policy improvement operator and evolves together with the learner. By integrating planning and learning in a closed loop, LeTS-Drive-Auto achieves superior driving performance in crowded urban environments in simulation, outperforming either planning or learning alone, and far exceeding the capability of open-loop integration (LeTS-Drive).

We aim to model, predict, and simulate the motion of unregulated crowds of traffic participants. We take a geometric approach and model traffic agent motion as a constrained optimization problem in the velocity space of agents. The objective of the problem is to maximize the cooperative navigation efficiency for all agents, while the constraints are induced by collision avoidance, vehicle kinematics, and the road context. We have developed two motion models for two typical scenarios: crowds of pedestrians in open spaces and heterogeneous traffic crowds on urban roads. Making use of these models, a planning system can perform more accurate long-term predictions and thus enhance the planning performance. We have further released an open-source simulator that uses these models to simulate crowd-driving in real-world urban maps. We envision the simulator to facilitate training, testing, and development of crowd-driving algorithms.

PORCA

Yuanfu Luo, Panpan Cai, Aniket Bera, David Hsu, Wee Sun Lee, and Dinesh Manocha. PORCA: Modeling and planning for autonomous driving among many pedestrians. IEEE Robotics and Automation Letters, 2018. [PDF]

This paper presents a planning system for autonomous driving among many pedestrians. A key ingredient of our approach is PORCA, a pedestrian motion prediction model that accounts for both a pedestrian’s global navigation intention and local interactions with the vehicle and other pedestrians. Since the autonomous vehicle does not know the pedestrian’s intention a priori, it requires a planning algorithm that reasons about the uncertainty over pedestrian intentions. Our planning system combines a POMDP algorithm with the pedestrian motion model and runs in near real-time. Experiments show that it enables a robot vehicle to drive safely, efficiently, and smoothly among a dense crowd of pedestrians.

GAMMA

Yuanfu Luo, Panpan Cai, David Hsu, and Wee Sun Lee. GAMMA: A general agent motion prediction model for autonomous driving. arXiv, 2019. [PDF][Code]

Urban environments usually contain mixed traffic of heterogeneous agents such as pedestrians, bicycles, cars, and buses. In such environments, traffic motion prediction becomes extremely challenging because of the diverse dynamics and geometry of traffic agents, complex road conditions, and intensive interactions among the agents. In this paper, we propose GAMMA, a general agent motion prediction model for autonomous driving that can predict the motion of heterogeneous traffic agents with different kinematics, geometry, human inner states, and so on. GAMMA formalizes motion prediction as geometric optimization in the velocity space and integrates physical constraints and human inner states into this unified framework. Our results show that GAMMA significantly outperforms both geometric and learning-based approaches on diverse real-world datasets.
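The toy sketch below conveys the geometric idea: among sampled candidate velocities, keep those that satisfy a collision-avoidance constraint over a short horizon and pick the one closest to the agent's preferred velocity. GAMMA solves a richer constrained optimization with kinematic and road-context constraints; the circular agents, sampling grid, and horizon here are simplifying assumptions.

```python
# A toy sketch of motion prediction as constrained optimization in velocity
# space, with collision avoidance as the constraint and navigation efficiency
# as the objective.
import numpy as np

def predict_velocity(pos, pref_vel, others, radius=0.5, v_max=1.5, horizon=2.0):
    # others: list of (position, velocity) pairs of nearby agents, 2-D arrays.
    angles = np.linspace(0, 2 * np.pi, 72, endpoint=False)
    speeds = np.linspace(0, v_max, 8)
    best, best_cost = np.zeros(2), np.inf
    for a in angles:
        for s in speeds:
            v = s * np.array([np.cos(a), np.sin(a)])
            # Constraint: keep at least 2*radius separation over the horizon.
            ok = all(
                np.min([np.linalg.norm((pos + t * v) - (p + t * pv))
                        for t in np.linspace(0, horizon, 10)]) > 2 * radius
                for p, pv in others)
            cost = np.linalg.norm(v - pref_vel)   # navigation-efficiency objective
            if ok and cost < best_cost:
                best, best_cost = v, cost
    return best

# Usage: the agent prefers to go straight but an oncoming agent forces a sidestep.
v = predict_velocity(np.zeros(2), np.array([1.0, 0.0]),
                     [(np.array([2.0, 0.0]), np.array([-1.0, 0.0]))])
print(v)
```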

STAR

Cunjun Yu*, Xiao Ma*, Jiawei Ren, Haiyu Zhao, Shuai Yi. Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction. In Proc. European Conf. on Computer Vision, 2020. [PDF][Code][Project]

Understanding crowd motion dynamics is critical to real-world applications, e.g., surveillance systems and autonomous driving. This is challenging because it requires effectively modeling the socially aware crowd spatial interactions and complex temporal dependencies. We believe attention is the most important factor for trajectory prediction. In this paper, we present STAR, a Spatio-Temporal grAph tRansformer framework, which tackles trajectory prediction using only attention mechanisms. STAR models intra-graph crowd interaction by TGConv, a novel Transformer-based graph convolution mechanism. The inter-graph temporal dependencies are modeled by separate temporal Transformers. STAR captures complex spatio-temporal interactions by interleaving spatial and temporal Transformers. To calibrate the temporal prediction for the long-lasting effect of disappeared pedestrians, we introduce a read-writable external memory module, which is consistently updated by the temporal Transformer. We show that with only attention mechanisms, STAR achieves state-of-the-art performance on five commonly used real-world pedestrian prediction datasets.

SUMMIT

Panpan Cai*, Yiyuan Lee*, Yuanfu Luo, David Hsu. SUMMIT: A Simulator for Urban Driving in Massive Mixed Traffic. In Proc. IEEE Int. Conf. on Robotics & Automation, 2020. [PDF][Code]

This paper presents SUMMIT, a high-fidelity simulator that facilitates the development and testing of crowd-driving algorithms. By leveraging the open-source OpenStreetMap map database and the GAMMA motion model developed in our earlier work, SUMMIT simulates dense, unregulated urban traffic for heterogeneous agents at any worldwide location that OpenStreetMap supports. SUMMIT is built as an extension of CARLA and inherits from it the physics and visual realism for autonomous driving simulation. SUMMIT supports a wide range of applications, including perception, vehicle control and planning, and end-to-end learning. We provide real-world benchmark scenes to show that SUMMIT generates complex, realistic traffic behaviors in challenging crowd-driving settings. The simulator also comes with a context-aware POMDP planner as a driving expert and a reference for future crowd-driving algorithms.


We aim at robust reinforcement learning (RL) for tasks with complex partial observations. While existing RL algorithms have achieved great success in simulated environments, such as Atari games, Go, and even Dota, generalizing them to realistic environments with complex partial observations remains challenging. The key to our approach is to learn from the partial observations a robust and compact latent state representation. Specifically, we handle partial observability by combining the particle filter algorithm with recurrent neural networks. We tackle complex observations through efficient discriminative model learning, which focuses on the observation information required for action selection rather than the entire high-dimensional observation space.

Particle Filter Recurrent Neural Networks

X. Ma, P. Karkus, D. Hsu, and W.S. Lee. Particle filter recurrent neural networks. In Proc. AAAI Conf. on Artificial Intelligence, 2020. [PDF][Code]

Recurrent neural networks (RNNs) have been extraordinarily successful for prediction with sequential data. To tackle highly variable and noisy real-world data, we introduce Particle Filter Recurrent Neural Networks (PF-RNNs), a new RNN family that explicitly models uncertainty in its internal structure: while an RNN relies on a long, deterministic latent state vector, a PF-RNN maintains a latent state distribution, approximated as a set of particles. For effective learning, we provide a fully differentiable particle filter algorithm that updates the PF-RNN latent state distribution according to Bayes' rule. Experiments demonstrate that the proposed PF-RNNs outperform the corresponding standard gated RNNs on a synthetic robot localization dataset and 10 real-world sequence prediction datasets for text classification, stock price prediction, etc.
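The sketch below shows one differentiable particle-filter step in this spirit: particles are moved by a learned stochastic transition, reweighted by a learned observation likelihood following Bayes' rule, and soft-resampled so that gradients keep flowing through the update. The network sizes and the soft-resampling mixture are illustrative assumptions, not the exact PF-RNN cell (which uses gated updates).

```python
# A minimal differentiable particle-filter step: stochastic transition,
# Bayes reweighting with a learned likelihood, and soft resampling.
import torch
import torch.nn as nn

class PFStep(nn.Module):
    def __init__(self, state_dim=16, obs_dim=32, alpha=0.5):
        super().__init__()
        self.trans = nn.Linear(state_dim + state_dim, state_dim)   # state + noise -> state
        self.obs_logit = nn.Linear(state_dim + obs_dim, 1)         # learned log-likelihood
        self.alpha = alpha

    def forward(self, particles, log_w, obs):
        # particles: (K, state_dim); log_w: (K,) log weights; obs: (obs_dim,)
        K = particles.shape[0]
        noise = torch.randn_like(particles)
        particles = torch.tanh(self.trans(torch.cat([particles, noise], dim=-1)))
        obs_rep = obs.expand(K, -1)
        log_lik = self.obs_logit(torch.cat([particles, obs_rep], dim=-1)).squeeze(-1)
        log_w = torch.log_softmax(log_w + log_lik, dim=0)           # Bayes update
        # Soft resampling: sample from a mixture of the posterior and a uniform,
        # then correct with importance weights so the step stays differentiable.
        probs = self.alpha * log_w.exp() + (1 - self.alpha) / K
        idx = torch.multinomial(probs, K, replacement=True)
        new_w = log_w.exp()[idx] / probs[idx]
        return particles[idx], torch.log(new_w / new_w.sum())

step = PFStep()
p = torch.randn(30, 16)
w = torch.log(torch.ones(30) / 30)
p, w = step(p, w, torch.randn(32))
```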

Discriminative Particle Filter Reinforcement Learning

X. Ma, P. Karkus, D. Hsu, W.S. Lee, and N. Ye. Discriminative particle filter reinforcement learning for complex partial observations. In Proc. Int. Conf. on Learning Representations, 2020. [PDF][Project][Code]

Real-world decision making often requires reasoning in a partially observable environment using information obtained from complex visual observations — major challenges for deep reinforcement learning. In this paper, we introduce the Discriminative Particle Filter Reinforcement Learning (DPFRL), a reinforcement learning method that encodes a particle filter structure with learned discriminative transition and observation models in a neural network. The particle filter structure allows for reasoning with partial observations, and discriminative parameterization allows modeling only the information in the complex observations that are relevant for decision making. In experiments, we show that in most cases DPFRL outperforms state-of-the-art POMDP RL models in Flickering Atari Games, an existing POMDP RL benchmark, as well as in Natural Flickering Atari Games, a new, more challenging POMDP RL benchmark that we introduce. We also show that DPFRL performs well when applied to a visual navigation domain with real-world data.

Contrastive Variational Reinforcement Learning

X. Ma, S. Chen, D. Hsu and W.S. Lee. Contrastive variational reinforcement learning for complex observations. In Proc. 4th Conf. on Robot Learning, 2020. [PDF][Project][Code]

Deep reinforcement learning (DRL) has achieved significant success in various robot tasks: manipulation, navigation, etc. However, complex visual observations in natural environments remain a major challenge. This paper presents Contrastive Variational Reinforcement Learning (CVRL), a model-based method that tackles complex visual observations in DRL. CVRL learns a contrastive variational model by maximizing the mutual information between latent states and observations discriminatively, through contrastive learning. It avoids modeling the complex observation space unnecessarily, as the commonly used generative observation model often does, and is significantly more robust. CVRL achieves comparable performance with state-of-the-art model-based DRL methods on standard Mujoco tasks. It significantly outperforms them on Natural Mujoco tasks and a robot box-pushing task with complex observations, e.g., dynamic shadows.
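The snippet below sketches the kind of contrastive objective involved: a bilinear critic scores latent-state/observation pairs, and an InfoNCE-style cross-entropy over in-batch negatives maximizes a lower bound on their mutual information, with no pixel reconstruction. The critic form and dimensions are assumptions for illustration, not CVRL's exact architecture.

```python
# A minimal InfoNCE-style contrastive objective between latent states and
# observation embeddings, as an alternative to a generative observation model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCritic(nn.Module):
    def __init__(self, z_dim=32, e_dim=64):
        super().__init__()
        self.W = nn.Parameter(torch.randn(z_dim, e_dim) * 0.02)   # bilinear critic

    def loss(self, z, e):
        # z: (B, z_dim) latent states; e: (B, e_dim) embeddings of the matching
        # observations. Other observations in the batch serve as negatives.
        logits = z @ self.W @ e.t()                  # (B, B) pair scores
        labels = torch.arange(z.shape[0])            # positives on the diagonal
        return F.cross_entropy(logits, labels)

critic = ContrastiveCritic()
z, e = torch.randn(8, 32), torch.randn(8, 64)
print(critic.loss(z, e))    # minimizing this maximizes an MI lower bound
```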

Natural Mujoco Tasks


Object manipulation in unstructured environments such as homes and offices requires decision making under uncertainty. We want to investigate how we can do general, fast and robust object manipulation under uncertainty in a principled manner.

Learning To Grasp Under Uncertainty Using POMDPs

N. P. Garg, D. Hsu, and W. S. Lee. Learning To Grasp Under Uncertainty Using POMDPs. In Proc. IEEE Int. Conf. on Robotics & Automation, 2019.

Robust object grasping under uncertainty is an essential capability of service robots. Many existing approaches rely on far-field sensors, such as cameras, to compute a grasp pose and then perform an open-loop grasp after placing the gripper at that pose. This often fails as a result of sensing or environment uncertainty. This paper presents a principled, general, and efficient approach to adaptive grasping, using both tactile and visual sensing as feedback. We first model adaptive grasping as a partially observable Markov decision process (POMDP), which handles uncertainty naturally. We solve the POMDP for objects sampled from a set, in order to generate data for learning. Finally, we train a grasp policy, represented as a deep recurrent neural network (RNN), in simulation through imitation learning. By combining model-based POMDP planning and imitation learning, the proposed approach achieves robustness under uncertainty, generalization over many objects, and fast execution. In particular, we show that modeling only a small sample of objects enables us to learn a robust strategy to grasp previously unseen objects of varying shapes and to recover from failure over multiple steps. Experiments on the G3DB object dataset in simulation and on a smaller object set with a real robot indicate promising results.

Push-Net: Deep Planar Pushing for Objects with Unknown Physical Properties

J.K. Li, D. Hsu, and W.S. Lee. Push-Net: Deep planar pushing for objects with unknown physical properties. In Proc. Robotics: Science & Systems, 2018.
PDF | code

We introduce Push-Net, a deep recurrent neural network model, which enables a robot to push objects of unknown physical properties for re-positioning and re-orientation, using only visual camera images as input. The unknown physical properties are a major challenge for pushing. Push-Net overcomes the challenge by tracking a history of push interactions with an LSTM module and by training with an auxiliary objective that estimates the object’s center of mass. We trained Push-Net entirely in simulation and tested it extensively on many different objects in both simulation and on two real robots, a Fetch arm and a Kinova MICO arm. Experiments suggest that Push-Net is robust and efficient. It achieved over 97% success rate in simulation on average and succeeded in all real robot experiments with a small number of pushes.
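The sketch below illustrates the two ingredients just described: an LSTM that tracks the history of push interactions and an auxiliary head that estimates the object's center of mass, trained alongside the push prediction. The mask encoder, dimensions, and output heads are illustrative assumptions, not Push-Net's actual architecture.

```python
# A minimal sketch of an LSTM over push history with an auxiliary
# center-of-mass prediction head.
import torch
import torch.nn as nn

class PushNetSketch(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, action_dim)   # push selection scores
        self.com_head = nn.Linear(hidden, 2)                # auxiliary: center of mass

    def forward(self, mask_history):
        # mask_history: (B, T, 64, 64) object masks over the push history.
        B, T = mask_history.shape[:2]
        feats = self.encoder(mask_history.reshape(B * T, 1, 64, 64)).reshape(B, T, -1)
        out, _ = self.lstm(feats)
        last = out[:, -1]
        return self.action_head(last), self.com_head(last)

net = PushNetSketch()
actions, com = net(torch.rand(2, 5, 64, 64))
# Training would combine a push-prediction loss with an auxiliary MSE loss on `com`.
```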


Differentiable Algorithm Network (DAN) aims to combine the strengths of model-based algorithmic reasoning and model-free deep learning for robust robot decision-making under uncertainty. Key to our approach is a unified neural network policy representation, encoding both a learned system model and an algorithm that solves the model. The network is fully differentiable and can be trained end-to-end, circumventing the difficulties of direct model learning in a partially observable setting. In contrast with conventional deep neural networks, our network representation imposes model and algorithmic priors on the neural network architecture for improved generalization of the learned policy.

QMDP-Net

P. Karkus, D. Hsu, and W. S. Lee. QMDP-Net: Deep Learning for Planning Under Partial Observability. In Advances in Neural Information Processing Systems, NeurIPS, 2017.
PDF | code

QMDP-net employs algorithm priors on a neural network for planning under partial observability. The network encodes a learned POMDP model together with QMDP, a simple, approximate POMDP planner, thus embedding the solution structure of planning in a network learning architecture. We train a QMDP-net on different tasks so that it can generalize to new ones in the parameterized task set, and “transfer” to other similar tasks beyond the set. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.
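For reference, the snippet below is a plain implementation of the QMDP approximation that the network encodes as an algorithmic prior: value iteration on the underlying MDP, followed by belief-weighted Q-values for action selection. It runs on a small made-up problem with explicit matrices, whereas QMDP-net realizes the same computation with learned, convolutional modules.

```python
# Plain QMDP on a tiny made-up POMDP: value iteration, then belief-weighted Q.
import numpy as np

def qmdp_action(T, R, belief, gamma=0.95, iters=100):
    # T: (A, S, S) transition probabilities; R: (A, S) rewards; belief: (S,)
    A, S, _ = T.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * T @ V        # (A, S) one-step lookahead
        V = Q.max(axis=0)            # value iteration backup
    return int(np.argmax(Q @ belief))  # belief-weighted Q-values pick the action

# Toy 2-state, 2-action problem with an uncertain state estimate.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, -1.0], [0.0, 0.0]])
print(qmdp_action(T, R, belief=np.array([0.6, 0.4])))
```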

Navigation Network (Nav-Net)

P. Karkus, D. Hsu, and W. S. Lee. Integrating Algorithmic Planning and Deep Learning for Partially Observable Navigation. In MLPC Workshop, International Conference on Robotics and Automation, 2018.

This work extends the QMDP-net, and encodes all components of a larger robotic system in a single neural network: state estimation, planning, and control. We apply the idea to a challenging partially observable navigation task: a robot must navigate to a goal in a previously unseen 3-D environment without knowing its initial location, and instead relying on a 2-D floor map and visual observations from an onboard camera.

Particle Filter Network (PF-Net)

P. Karkus, D. Hsu, and W. S. Lee. Particle Filter Networks with Application to Visual Localization. In Conference on Robot Learning, CoRL, 2018.
PDF | code

In this work, we encode the Particle Filter algorithm in a differentiable neural network. PF-net enables end-to-end model learning, which trains the model in the context of a specific algorithm, resulting in improved performance, compared with conventional model-learning methods. We apply PF-net to visual robot localization. The robot must localize in rich 3-D environments, using only a schematic 2-D floor map. PF-net learns effective models that generalize to new, unseen environments. It can also incorporate semantic labels on the floor map.

Read more on PF-nets combined with state representation learning for sequence prediction and RL here; and map representation learning here.

Differentiable Algorithm Network (DAN)

P. Karkus, X. Ma, D. Hsu, L. P. Kaelbling, W. S. Lee, and T. Lozano-Pérez. Differentiable Algorithm Networks for Composable Robot Learning. In Robotics: Science and Systems, RSS, 2019. Nominated for the Best Student Paper Award and the Best Systems Paper Award.

DANs compose modules of differentiable robot algorithms and associated models into a single neural network that is trained end-to-end from data. From a model-free policy learning perspective, the algorithms in DAN act as a structured prior. From a model-based RL perspective, instead of training models to match the underlying system dynamics, DAN trains models end-to-end to optimize the overall task objective by backpropagating gradients through the algorithms. The benefit of task-oriented learning is that models and algorithms can adapt to and compensate for each other's imperfections. We illustrate the DAN methodology using differentiable modules for visual perception, state filtering, planning, and local control in the context of a partially observable visual navigation task in 3-D environments.

How can a delivery robot navigate reliably in a new office building, with only a schematic floor map? To tackle this challenge, we introduce Intention-Net (iNet), a two-level hierarchical navigation architecture, which integrates model-based path planning and model-free deep learning.

Intention-Net

W. Gao, D. Hsu, W. Lee, S. Shen, and K. Subramanian. Intention-Net: Integrating planning and deep learning for goal-directed autonomous navigation. In S. Levine, V. Vanhoucke, and K. Goldberg, editors, Conference on Robot Learning, volume 78 of Proc. Machine Learning Research, pages 185–194, 2017.

iNet is a two-level hierarchical architecture for visual navigation. It mimics human navigation in a principled way by integrating high-level planning on a crude global map with low-level neural-network motion control. At the high level, a path planner uses a crude map, e.g., a 2-D floor map, to compute a path from the robot’s current location to the final destination. The planned path provides “intentions” to the local motion controller. At the low level, a neural-network motion controller is trained end-to-end to provide robust local navigation. Given an intention, it maps images from a single monocular camera directly to robot control. The “intention” thus provides the communication interface between global path planning and local neural-network motion control:

  • Discretized local move (DLM) intention. We assume that in most cases discretized navigation instructions are sufficient for navigation; for example, turning left at the next junction is enough for a human driver in the real world. We mimic the same procedure by introducing four discretized intentions.
  • Local path and environment (LPE) intention. The DLM intention relies on pre-defined parameters. To alleviate this issue, we design a map-based intention that encodes all the navigation information from the high-level planner. A minimal sketch of the two-level design is shown below.
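The sketch below shows the shape of the two-level design: a high-level planner on a crude map yields an intention, and a low-level network maps the camera image plus that intention to a control command. The grid-path heuristic, network sizes, and the four DLM labels are placeholders, not the actual Intention-Net implementation.

```python
# A minimal sketch of the two-level hierarchy: planner -> intention -> controller.
import torch
import torch.nn as nn

DLM_INTENTIONS = ["go_straight", "turn_left", "turn_right", "stop"]

def high_level_intention(path):
    # path: list of (row, col) waypoints from a planner on the floor map.
    # Placeholder rule: derive the next discrete instruction from the turn direction.
    if len(path) < 3:
        return "stop"
    (r0, c0), (r1, c1), (r2, c2) = path[0], path[1], path[2]
    turn = (r1 - r0) * (c2 - c1) - (c1 - c0) * (r2 - r1)   # sign of the turn
    return "go_straight" if turn == 0 else ("turn_left" if turn > 0 else "turn_right")

class LowLevelController(nn.Module):
    def __init__(self, num_intentions=4):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
        self.head = nn.Linear(64 + num_intentions, 2)        # outputs (v, omega)

    def forward(self, image, intention_idx):
        one_hot = nn.functional.one_hot(intention_idx, 4).float()
        return self.head(torch.cat([self.vision(image), one_hot], dim=-1))

# Usage: the planner's path yields an intention; the controller maps image + intention to control.
intention = high_level_intention([(0, 0), (0, 1), (1, 1)])
ctrl = LowLevelController()
cmd = ctrl(torch.rand(1, 3, 32, 32), torch.tensor([DLM_INTENTIONS.index(intention)]))
```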

Deep rEcurrent Controller for vISual navigatiON (DECISION)

B. Ai, W. Gao, Vinay, and D. Hsu. Deep Visual Navigation under Partial Observability. In IEEE Int. Conf.  on Robotics & Automation, 2022.
PDF | code

The iNet navigation controller faces three challenges: 

  • Visual complexity. Due to the high dimensionality of raw pixels, different scenes could appear dramatically different across environments, which makes traditional model-based approaches brittle.
  • Partial observability. The robot may not see blind-spot objects, or is unable to detect features of interest, e.g., the intention of a moving pedestrian.
  • Multimodal behaviors. Human navigation behaviors are multimodal in nature, and the behaviors are dependent on both local environments and the high-level navigation objective.

To overcome the challenges, DECISION exploits two ideas in designing the neural network structure:

  • Multi-scale temporal modeling. We use spatial memory modules to capture both low-level motions and high-level temporal semantics that are useful to the control. The rich history information can compensate for the partial observations.
  • Multimodal actions. We extend the idea of Mixture Density Networks (MDNs) to temporal reasoning. Specifically, we use independent memory modules for different modes to preserve the distinction of modes.

We collected a real-world human demonstration dataset consisting of 410K timesteps and trained the controller end-to-end. Our DECISION controller significantly outperforms CNNs and LSTMs.

DECISION was first deployed on our Boston Dynamics Spot robot in April 2022. The robot has been navigating autonomously for more than 150 km at the time of writing.

Human language provides a powerful and natural interface for humans to communicate with robots. We aim to develop a robot system that follows natural language instructions to interact with the physical world.

Interactive Visual Grounding

M. Shridhar and D. Hsu. Interactive visual grounding of referring expressions for human-robot interaction. In Proc. Robotics: Science & Systems, 2018.

INGRESS is a robot system that follows human natural language instructions to pick and place everyday objects. The core issue here is the grounding of referring expressions: inferring objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disambiguate referring expressions interactively. To achieve this, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expression, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred object. The same neural networks are used for both grounding and question generation for disambiguation.