Author Archives: Xiao Ma

Object manipulation in unstructured environments like homes and offices requires decision making under uncertainty. We want to investigate how we can do general, fast and robust object manipulation under uncertainty in a principled manner.

Learning To Grasp Under Uncertainty Using POMDPs

N. P. Garg, D. Hsu, and W. S. Lee. Learning To Grasp Under Uncertainty Using POMDPs. In Proc. IEEE Int. Conf. on Robotics & Automation 2019.

Robust object grasping under uncertainty is an essential capability of service robots. Many existing approaches rely on far-field sensors, such as cameras, to compute a grasp pose, then execute an open-loop grasp after moving the gripper to that pose. This often fails as a result of sensing or environment uncertainty. This paper presents a principled, general, and efficient approach to adaptive grasping that uses both tactile and visual sensing as feedback. We first model adaptive grasping as a partially observable Markov decision process (POMDP), which handles uncertainty naturally. We then solve the POMDP for objects sampled from a set in order to generate data for learning. Finally, we train a grasp policy, represented as a deep recurrent neural network (RNN), in simulation through imitation learning. By combining model-based POMDP planning and imitation learning, the proposed approach achieves robustness under uncertainty, generalization over many objects, and fast execution. In particular, we show that modeling only a small sample of objects enables us to learn a robust strategy for grasping previously unseen objects of varying shapes and to recover from failure over multiple steps. Experiments on the G3DB object dataset in simulation and on a smaller object set with a real robot show promising results.
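The mechanism by which a POMDP "handles uncertainty naturally" is Bayesian belief updating over the hidden state. As a minimal illustration (with invented numbers, not the paper's model), here is a single belief update over a discrete set of object offsets after one tactile observation:

```python
import numpy as np

# Minimal illustration of the belief update at the heart of a POMDP
# (invented numbers, not the paper's model). The robot is uncertain about
# the object's lateral offset relative to the gripper; a single tactile
# observation sharpens the belief via Bayes' rule.
offsets = np.arange(5)                       # candidate object offsets (cells)
belief = np.full(5, 0.2)                     # uniform prior over offsets

# Hypothetical observation model P(obs | offset): "contact on the left
# finger" is most likely when the object sits far to the left.
obs_likelihood = np.array([0.6, 0.25, 0.1, 0.04, 0.01])

posterior = belief * obs_likelihood          # Bayes' rule: prior x likelihood
posterior /= posterior.sum()                 # normalize to a distribution
```

A POMDP planner chooses actions by their long-term effect on such beliefs, which is what allows the grasp policy to recover from failure over multiple steps.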

Push-Net: Deep Planar Pushing for Objects with Unknown Physical Properties

J.K. Li, D. Hsu, and W.S. Lee. Push-Net: Deep planar pushing for objects with unknown physical properties. In Proc. Robotics: Science & Systems, 2018.

We introduce Push-Net, a deep recurrent neural network model, which enables a robot to push objects of unknown physical properties for re-positioning and re-orientation, using only visual camera images as input. Unknown physical properties are a major challenge for pushing. Push-Net overcomes the challenge by tracking a history of push interactions with an LSTM module and by training with an auxiliary objective that estimates an object’s center of mass. We trained Push-Net entirely in simulation and tested it extensively on many different objects in both simulation and on two real robots, a Fetch arm and a Kinova MICO arm. Experiments suggest that Push-Net is robust and efficient. It achieved over 97% success rate in simulation on average and succeeded in all real robot experiments with a small number of pushes.
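A toy 1-D model (not Push-Net's learned dynamics) illustrates why a history of pushes reveals an object's center of mass: in a quasi-static push, the object's rotation grows with the lever arm between the contact point and the unknown COM, so a few remembered pushes suffice to regress it.

```python
import numpy as np

# Toy 1-D illustration (not Push-Net's learned model) of why push history
# reveals the center of mass (COM): the object's rotation grows with the
# lever arm between the contact point and the unknown COM.
true_com = 0.3
contacts = np.array([-0.5, 0.0, 0.5, 1.0])     # push contact points
rotations = 2.0 * (contacts - true_com)        # simplified physics: theta = k (p - com)

# Least-squares fit of theta = k p + b, then com = -b / k.
A = np.stack([contacts, np.ones_like(contacts)], axis=1)
k, b = np.linalg.lstsq(A, rotations, rcond=None)[0]
est_com = -b / k                               # recovers the true COM
```

Push-Net's LSTM plays the role of this accumulated history, and its auxiliary COM objective plays the role of the regression, but both are learned from simulated push data rather than hand-derived.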

The code for Push-Net is available on GitHub.


[Thesis 2019] Act to See and See to Act: A Robotic System for Object Retrieval in Clutter — Li Juekun

Object retrieval in clutter is a major challenge for robots, stemming from incomplete knowledge of the environment. A robot has imperfect sensing due to occlusion among objects. At the same time, it must physically interact with objects of unknown physical properties.

We hypothesize that humans adopt the strategy of Act to See and See to Act to retrieve objects in clutter. We may rearrange (Act) objects to better understand (See) the scene, which in turn guides us to select better actions (Act) towards achieving the goal. This thesis adopts the same strategy to enable a robotic system to robustly and efficiently retrieve objects in clutter under uncertainty in sensing due to occlusion and uncertainty in control due to objects’ unknown physical properties, such as the center of mass.

To alleviate uncertainty in sensing, we formulate the problem of object search in clutter as a Partially Observable Markov Decision Process (POMDP) with large state, action, and observation spaces. Using insights into the spatial constraints of the problem, we improve the state-of-the-art POMDP solver, DEterminized Sparse Partially Observable Tree (DESPOT), to solve the POMDP efficiently. Through experiments in simulation, we show that the proposed planner selects actions that remove occlusion and reveal the target object efficiently. We further conclude that POMDP planning is effective for problems that require multi-step lookahead search.

To handle uncertainty in control, we devise Push-Net, a deep recurrent neural network, which enables a robot to push an object from one configuration to another robustly and efficiently. Capturing the history of push interactions enables Push-Net to push novel objects robustly. We perform experiments in simulation and on a real robot, and show that embedding physical understanding of objects (their center of mass) in Push-Net helps select more effective push actions.

Finally, we improve both the POMDP planner and Push-Net and integrate them into a real robotic system. We evaluate the system on a set of challenging scenarios. The results demonstrate that the proposed system retrieves the target object robustly and efficiently in clutter. The success of the system is attributed to 1) the ability to handle perceptual uncertainty due to occlusion; 2) the ability to push objects of unknown physical properties in clutter; and 3) the ability to perform multi-step lookahead planning for efficient object search in complex environments.


We aim to combine the strengths of model-based algorithmic reasoning and model-free deep learning for robust robot decision-making under uncertainty. Key to our approach is a unified neural network policy representation, encoding both a learned system model and an algorithm that solves the model. The network is fully differentiable and can be trained end-to-end, circumventing the difficulties of direct model learning in a partially observable setting. In contrast with conventional deep neural networks, our representation imposes model and algorithmic priors on the network architecture, improving the generalization of the learned policy.


P. Karkus, D. Hsu, and W. S. Lee. QMDP-Net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, 2017.

QMDP-net employs algorithm priors on a neural network for planning under partial observability. The network encodes a learned POMDP model together with QMDP, a simple, approximate POMDP planner, thus embedding the solution structure of planning in a network learning architecture. We train a QMDP-net on different tasks so that it can generalize to new ones in the parameterized task set, and “transfer” to other similar tasks beyond the set. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.
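For reference, the QMDP approximation that the network encodes is simple: solve the underlying fully observable MDP for Q(s, a) by value iteration, then score each action against the current belief, Q(b, a) = Σ_s b(s) Q(s, a). A minimal NumPy sketch on a hypothetical two-state, two-action MDP:

```python
import numpy as np

# Minimal QMDP sketch on a hypothetical 2-state, 2-action MDP. QMDP first
# solves the fully observable MDP for Q(s, a) by value iteration, then
# scores each action against the belief: Q(b, a) = sum_s b(s) Q(s, a).
T = np.array([[[1.0, 0.0], [0.0, 1.0]],    # T[a, s, s']: action 0 stays put
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1 swaps the two states
R = np.array([[1.0, 0.0],                  # R[a, s]: action 0 pays off in state 0
              [0.0, 1.0]])                 # action 1 pays off in state 1
gamma = 0.9

V = np.zeros(2)
for _ in range(200):                       # value iteration on the MDP
    Q = R + gamma * (T @ V)                # Q[a, s]
    V = Q.max(axis=0)

belief = np.array([0.7, 0.3])              # belief: probably in state 0
q_belief = Q @ belief                      # Q(b, a) for each action
best_action = int(np.argmax(q_belief))     # picks action 0 here
```

In QMDP-net the tables T and R are not given but learned, and the value-iteration loop is unrolled as convolutional network layers, which is why end-to-end training can yield policies that outperform QMDP run on a hand-specified model.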

Navigation Network (Nav-Net)

P. Karkus, D. Hsu, and W. S. Lee. Integrating Algorithmic Planning and Deep Learning for Partially Observable Navigation. In MLPC Workshop, International Conference on Robotics and Automation, 2018.

This work extends the QMDP-net and encodes all components of a larger robotic system in a single neural network: state estimation, planning, and control. We apply the idea to a challenging partially observable navigation task: a robot must navigate to a goal in a previously unseen 3-D environment without knowing its initial location, relying instead on a 2-D floor map and visual observations from an onboard camera.

Particle Filter Network (PF-Net)

P. Karkus, D. Hsu, and W. S. Lee. Particle Filter Networks with Application to Visual Localization. arXiv preprint arXiv:1805.08975, 2018.

In this work, we encode the Particle Filter algorithm in a differentiable neural network. PF-net enables end-to-end model learning, which trains the model in the context of a specific algorithm, resulting in improved performance, compared with conventional model-learning methods. We apply PF-net to visual robot localization. The robot must localize in rich 3-D environments, using only a schematic 2-D floor map. PF-net learns effective models that generalize to new, unseen environments. It can also incorporate semantic labels on the floor map.
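The algorithm that PF-net encodes is the standard particle filter loop: predict, weight, resample. A plain NumPy sketch of one step on a hypothetical 1-D localization problem (all models invented for illustration):

```python
import numpy as np

# One step of a standard particle filter (predict, weight, resample) in
# plain NumPy, on a hypothetical 1-D localization problem. PF-net realizes
# these same steps as differentiable network layers so the transition and
# observation models can be trained end-to-end.
rng = np.random.default_rng(0)
n = 1000
particles = rng.uniform(0.0, 10.0, size=n)   # prior: position unknown in [0, 10]
weights = np.full(n, 1.0 / n)

# Predict: commanded motion of +1.0 with Gaussian motion noise.
particles = particles + 1.0 + rng.normal(0.0, 0.1, size=n)

# Update: a noisy range sensor reads z = 5.0 (Gaussian observation model).
z, sigma = 5.0, 0.5
weights *= np.exp(-0.5 * ((particles - z) / sigma) ** 2)
weights /= weights.sum()

# Resample particles in proportion to their weights.
particles = particles[rng.choice(n, size=n, p=weights)]
estimate = particles.mean()                  # posterior mean, close to z
```

The resampling step above is not differentiable as written; making it (approximately) differentiable is one of the technical points PF-net addresses, alongside learning the observation model from images rather than assuming a Gaussian.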

How can a delivery robot navigate reliably to a destination in a new office building, with minimal prior information? To tackle this challenge, this paper introduces a two-level hierarchical method, which integrates model-free deep learning and model-based path planning. At the low level, a neural-network motion controller, called the intention-net, is trained end-to-end to provide robust local navigation. Intention-net maps images from a single monocular camera and given “intentions” directly to robot control. At the high level, a path planner uses a crude map, e.g., a 2-D floor plan, to compute a path from the robot’s current location to the goal. The planned path provides intentions to the intention-net. Preliminary experiments suggest that the learned motion controller is robust against perceptual uncertainty and, when integrated with a path planner, generalizes effectively to new environments and goals.


W. Gao, D. Hsu, W. Lee, S. Shen, and K. Subramanian. Intention-Net: Integrating planning and deep learning for goal-directed autonomous navigation. In S. Levine and V. V. and K. Goldberg, editors, Conference on Robot Learning, volume 78 of Proc. Machine Learning Research, pages 185–194. 2017.

We study the problem of visual navigation in new environments. Humans can easily navigate an arbitrary environment given a crude floor plan and reliable local collision-free motion. We mimic human navigation in a principled way by integrating high-level planning on the crude global map, i.e., the floor plan, with a low-level neural-network motion controller. We design the “intention” as the interface between the high-level path planner and the low-level neural-network motion controller.

Two intentions:

We design two kinds of intentions, parsed from the high-level path planner.

Discretized local move (DLM) intention: We assume that in most cases discretized navigation instructions suffice. For example, “turn left at the next junction” is enough for a human driver in the real world. We mimic the same procedure by introducing four discretized intentions.

Local path and environment (LPE) intention: The DLM intention is ad hoc and relies on pre-defined parameters. To alleviate this issue, we design a map-based intention that encodes all the navigation information from the high-level planner.
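As an illustration of the DLM interface, one could map the bearing from the robot's pose to the next waypoint on the planned path into one of four discrete intentions. This parser is hypothetical; the exact intention set and thresholds used by Intention-Net may differ.

```python
import math

# Hypothetical sketch of a DLM-style intention parser (the paper's exact
# intention set and thresholds may differ): map the bearing from the robot's
# pose to the next waypoint on the planned path into one of four discrete
# intentions.
def dlm_intention(robot_xy, robot_heading, waypoint_xy, reached_goal=False):
    if reached_goal:
        return "stop"
    dx = waypoint_xy[0] - robot_xy[0]
    dy = waypoint_xy[1] - robot_xy[1]
    bearing = math.atan2(dy, dx) - robot_heading
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap to [-pi, pi]
    if abs(bearing) < math.pi / 6:        # roughly ahead: keep going
        return "straight"
    return "left" if bearing > 0 else "right"
```

The hand-tuned threshold here is exactly the kind of pre-defined parameter that motivates the LPE intention, which passes a rendered local map to the controller instead.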


Our goal is to develop a formal computational framework of trust, supported by experimental evidence and predictive models of human behaviors, to enable automated decision-making for fluid collaboration between humans and robots.


M. Chen, S. Nikolaidis, H. Soh, D. Hsu, and S. Srinivasa. Planning with trust for human-robot collaboration. In Proc. ACM/IEEE Int. Conf. on Human-Robot Interaction, 2018.

Trust is essential for human-robot collaboration and user adoption of autonomous systems, such as robot assistants. This paper introduces a computational model which integrates trust into robot decision-making. Specifically, we learn from data a partially observable Markov decision process (POMDP) with human trust as a latent variable. The trust-POMDP model provides a principled approach for the robot to (i) infer the trust of a human teammate through interaction, (ii) reason about the effect of its own actions on human behaviors, and (iii) choose actions that maximize team performance over the long term. We validated the model through human subject experiments on a table-clearing task in simulation (201 participants) and with a real robot (20 participants). The results show that the trust-POMDP improves human-robot team performance in this task. They further suggest that maximizing trust in itself may not improve team performance.
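The central modeling idea can be sketched with an invented observation model: trust is a latent state, so the robot maintains a belief over discrete trust levels and updates it from observed human responses, e.g., whether the human intervenes after a robot action.

```python
import numpy as np

# Sketch of the trust-as-latent-state idea (observation model invented for
# illustration, not learned from the paper's data). The robot keeps a
# belief over discrete trust levels and updates it from whether the human
# intervenes after a robot action.
trust_levels = ["low", "medium", "high"]
belief = np.array([1 / 3, 1 / 3, 1 / 3])

# Hypothetical P(no intervention | trust): more trust, fewer interventions.
p_no_intervene = np.array([0.2, 0.6, 0.9])

def update(belief, intervened):
    likelihood = (1 - p_no_intervene) if intervened else p_no_intervene
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = update(belief, intervened=False)    # the human let the robot proceed
# Belief shifts toward "high" trust; the trust-POMDP planner chooses each
# action by its long-term effect on both task reward and this belief.
```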

Natural language provides a powerful interface for humans to communicate with robots. We aim to develop a robot system that follows natural language instructions to interact with the physical world.

Interactive Visual Grounding

M. Shridhar and D. Hsu. Interactive visual grounding of referring expressions for human-robot interaction. In Proc. Robotics: Science & Systems, 2018.

INGRESS is a robot system that follows human natural language instructions to pick and place everyday objects. The core issue here is the grounding of referring expressions: inferring objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disambiguate referring expressions interactively. To achieve these, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expression, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred object. The same neural networks are used for both grounding and question generation for disambiguation.

Compared with handheld cameras widely used today, a camera mounted on a flying drone affords the user much greater freedom in finding the point of view for a perfect photo shot. In the future, many people may take along compact flying cameras and use their touchscreen mobile devices as viewfinders to take photos. Our goal is to explore the user interaction design and system implementation issues for a flying camera, which leverages the autonomous flying capability of a drone-mounted camera for a great photo-taking experience.


Z. Lan, M. Shridhar, D. Hsu, and S. Zhao. XPose: Reinventing user interaction with flying cameras. In Proc. Robotics: Science & Systems, 2017.

XPose is a new touch-based interactive system for photo taking, designed to take advantage of the autonomous flying capability of a drone-mounted camera. It enables the user to interact with photos directly and focus on taking photos instead of piloting the drone. XPose introduces a two-stage eXplore-and-comPose approach to photo taking in static scenes. In the first stage, the user explores the “photo space” through predefined interaction modes: Orbit, Pano, and Zigzag. Under each mode, the camera visits many points of view (POVs) and takes exploratory photos through autonomous drone flying. In the second stage, the user restores a selected POV with the help of a gallery preview and uses direct manipulation gestures to refine the POV and compose a final photo.

Perspective-2-Point (P2P)

Z. Lan, D. Hsu, and G. Lee. Solving the perspective-2-point problem for flying-camera photo composition. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, 2018.

This work addresses a common situation in photo taking: the viewpoint search problem for composing a photo with two objects of interest. We model it as a Perspective-2-Point (P2P) problem, which is under-constrained and does not determine the six degrees-of-freedom camera pose uniquely. By incorporating the user’s composition requirements and minimizing the camera’s flying distance, we form a constrained nonlinear optimization problem and solve it in closed form.
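A quick constraint count shows why the problem is under-constrained (standard pinhole-camera notation; a sketch, not the paper's exact formulation). Each object point must project to its desired image location:

```latex
u_i \simeq K \, [\, R \mid t \,] \, X_i , \qquad i = 1, 2 ,
```

where $K$ is the camera intrinsics and $(R, t)$ the 6-DoF camera pose. Each projection fixes two image coordinates, so the two points impose four constraints, leaving a two-parameter family of feasible poses; the composition requirements and the flying-distance objective resolve the remaining freedom.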