OpenAI teaches a robotic hand to solve a Rubik's cube

Robots with truly humanlike dexterity are far from becoming reality, but progress accelerated by AI has brought us closer to achieving this vision than ever before. In a research paper published in September, a team of scientists at Google detailed their tests with a robotic hand that enabled it to rotate Baoding balls with minimal training data. And at a computer vision conference in June, MIT researchers presented their work on an AI model capable of predicting the tactility of physical things from snippets of visual data alone.

Now, OpenAI -- the San Francisco-based AI research firm cofounded by Elon Musk and others, with backing from luminaries like LinkedIn cofounder Reid Hoffman and former Y Combinator president Sam Altman -- says it's on the cusp of meeting something of a grand challenge in robotics and AI systems: solving a Rubik's cube. Unlike breakthroughs achieved by teams at the University of California, Irvine and elsewhere, which leveraged machines tailor-built to manipulate Rubik's cubes with speed, the approach devised by OpenAI researchers uses a five-fingered humanoid hand guided by an AI model with 13,000 years of cumulative experience -- on the same order of magnitude as the 40,000 years used by OpenAI's Dota-playing bot.

It builds on experiments conducted earlier this year by Tencent and the Chinese University of Hong Kong, which involved a dexterous human-sized manipulator and a software framework comprising an AI-driven cube solver (which identified optimal Rubik's cube move sequences) and a cube operator (which controlled up to five fingers). That team reported that the combination of model-based and model-free planning and manipulation led to an average success rate of 90.3% over the course of 1,400 trials, and up to 95.2% when the AI components were trained for a further 30,000 episodes.

But Tencent's work was performed strictly in simulation -- specifically in Roboti's Multi-Joint dynamics with Contact (MuJoCo), a physics engine designed for research and development in robotics and biomechanics. Scientists at OpenAI similarly trained their AI model in simulation, but they managed to transfer it to a real-world robotic hand without sacrificing accuracy or robustness.

"The reason [we were] excited to work on the Rubik's Cube task is that it really requires human-level dexterity," said Matthias Plappert, a technical staff member on OpenAI's robotics team. "It's a [highly] complicated task in the sense that you need to really precisely control your fingers in order to rotate the [cube] faces. [We] wanted to see how far [we could] push this approach that we had developed initially, for last year's release."

Setup

As the OpenAI researchers explain in a paper detailing their work, solving a Rubik's cube using only simulated data is much more difficult than, say, manipulating a block, due to the amount of precision required and the complexity of estimating the cube's pose. Rubik's cubes -- which consist of 26 cubelets connected via joints and springs -- have a minimum of six internal degrees of freedom. Each of the six faces of the cube can be rotated, allowing the Rubik's cube to be scrambled, and a cube is considered solved only if all six faces have been returned to a single color each.

The team's solution pairs an algorithm they call automatic domain randomization (ADR), which automatically generates a distribution of environments for training a reinforcement learning model, with a vision-based pose estimator, a module that estimates an object's state (in this case, that of a Rubik's cube) from camera images. By way of refresher, reinforcement learning spurs AI systems in the direction of desirable goals using repeated rewards and punishments.

The researchers deployed ADR with a cube-scrambling technique that applied around 20 moves to a solved Rubik's cube to scramble it, in accordance with official World Cube Association guidelines. And they broke down the unscrambling task into subgoals like rotation, which corresponded to rotating a single cube face by 90 degrees clockwise or counter-clockwise, and a flip, which involved moving a different cube face to the top. (Because rotating the top face was generally simpler than rotating other faces, they combined a flip and a top face rotation along with other subgoals in sequence.) Concerning the actual solving of the Rubik's cube, they used existing software libraries like the Kociemba solver, which produces solution sequences of subgoals.
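The scramble-and-subgoal scheme described above can be sketched roughly as follows. This is an illustration only, not OpenAI's code: the function names, the face letters, and the rule of never turning the same face twice in a row are assumptions, and a random-move scramble only approximates a WCA-style one.

```python
import random

FACES = ["U", "D", "L", "R", "F", "B"]  # assumed face letters (up, down, left, ...)
TOP = "U"                                # face assumed to start on top


def make_scramble(n_moves=20, seed=None):
    """Return n_moves (face, direction) pairs; direction is +1 (clockwise) or -1."""
    rng = random.Random(seed)
    moves = []
    while len(moves) < n_moves:
        face = rng.choice(FACES)
        # Skip a move on the same face as the previous one, since it could undo it.
        if moves and moves[-1][0] == face:
            continue
        moves.append((face, rng.choice([+1, -1])))
    return moves


def to_subgoals(moves, top=TOP):
    """Expand face moves into subgoals: flip the face to the top if it is not
    already there, then rotate the top face by 90 degrees."""
    subgoals = []
    for face, direction in moves:
        if face != top:
            subgoals.append(("flip", face))
            top = face
        subgoals.append(("rotate_top", 90 * direction))
    return subgoals


if __name__ == "__main__":
    for subgoal in to_subgoals(make_scramble(seed=0)):
        print(subgoal)
```

In this sketch, every face turn that is not already on top costs one extra flip, which mirrors the article's point that rotating the top face is the easier primitive.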

Hardware

The robotic hand tasked with manipulating the Rubik's cube was the Shadow Dexterous E Series Hand, which has middle and ring fingers with three actuated joints and one underactuated joint each, a little finger and thumb with five actuated joints each, and a wrist with two actuated joints. It's been a core part of OpenAI's robotic platform for years, and it was paired with three cameras for vision-based pose estimation and PhaseSpace's motion capture system, and housed in a cage on casters containing computers.

The team notes that they worked with the Shadow Robot Company, the robotic hand's manufacturer, to improve the robustness and reliability of some of the robot's components. Specifically, they increased the hand's grip when it interacts with an object and reduced tendon stress, and they tweaked the software stack that interfaces with the hand to minimize torque limits.

As for the Rubik's cube, it wasn't your average model. Rather, it was Xiaomi's Giiker cube, which packs Bluetooth and motion sensors that sense orientation. Off-the-shelf Giiker cube models have a face angle resolution of 90 degrees, but the team modified theirs to achieve a tracking accuracy of approximately 5 degrees.

Simulation

Like the Tencent team, the OpenAI researchers tapped MuJoCo to simulate the environment, hand and all, along with ORRB, a remote rendering backend built on top of the game engine Unity to render images for training the vision-based pose estimator. The simulated Rubik's cube consisted of 26 1.9-centimeter cubelets, six with a single hinge joint and 20 with three hinge joints for an effective 66 degrees of freedom. This allowed it to represent 43 quintillion fully aligned cube configurations in all, as well as all intermediate states between those configurations.
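Both figures in the paragraph above can be checked with a few lines of arithmetic. The joint counts come from the article; the state-count formula is the standard combinatorial one for a 3x3x3 cube, and the variable names are only illustrative.

```python
from math import factorial

# Degrees of freedom of the simulated cube, from the cubelet counts quoted above.
single_hinge_cubelets = 6    # one hinge joint each
triple_hinge_cubelets = 20   # three hinge joints each
degrees_of_freedom = single_hinge_cubelets * 1 + triple_hinge_cubelets * 3

# Standard count of reachable, fully aligned states of a 3x3x3 cube:
# corner permutations * corner orientations * edge permutations * edge orientations,
# halved because corner and edge permutation parities must match.
aligned_states = factorial(8) * 3**7 * factorial(12) * 2**11 // 2

print(degrees_of_freedom)     # 66
print(f"{aligned_states:,}")  # 43,252,003,274,489,856,000
```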

The AI policy directing the Shadow Hand had to contend with the base level of pressure exerted by the cubelets and the joints within the Rubik's cube, as well as the behaviors unique to the Giiker cube. For instance, applying force to a single cubelet was generally enough to rotate a face, as the force was propagated between neighboring elements via contact forces. And although the cube had six faces (as all Rubik's cubes do), not all of them could be rotated simultaneously; perpendicular faces were locked into place barring angles small enough to allow the faces to snap into their aligned states.

That's where ADR comes in. As the researchers explain, it's a technique that generates a distribution over simulated environments by randomizing certain aspects over time (for example, the cube's visual appearance or the hand's dynamics). The initial distribution is concentrated on a single environment, but it gradually expands to synthesize data that can be used to evaluate any model's performance.

Effectively, ADR-trained models adjust their behavior to accomplish goals by implementing learning algorithms internally, which the team hypothesizes occurs when the distributions are so large that the models can't memorize special-purpose solutions (due to their finite capacity). ADR continues the training cycle as long as the models' accuracies don't dip below a predefined threshold.
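The expand-while-performing-well idea can be sketched for a single randomized parameter. This is a deliberately simplified illustration, not the algorithm from the paper: the class and function names, the two thresholds, the step size, and the choice to move only the upper bound are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ParamRange:
    low: float
    high: float


def adr_update(rng, boundary_performance, step=0.1,
               expand_above=0.8, shrink_below=0.4):
    """Return a new range for one parameter.

    boundary_performance is the model's success rate (0..1) when the
    parameter is pinned to the current upper bound. Good performance widens
    the range; poor performance narrows it, but never below the lower bound.
    """
    if boundary_performance >= expand_above:
        return ParamRange(rng.low, rng.high + step)
    if boundary_performance < shrink_below:
        return ParamRange(rng.low, max(rng.low, rng.high - step))
    return rng


if __name__ == "__main__":
    # Start concentrated on a single environment (low == high), then expand.
    friction = ParamRange(1.0, 1.0)
    for perf in [0.9, 0.85, 0.6, 0.3]:
        friction = adr_update(friction, perf)
        print(friction)
```

Starting with `low == high` corresponds to the initial distribution being concentrated on a single environment; the range then grows only as fast as the model keeps up.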

So what's randomized in each environment, exactly? For one, simulator physics like geometry, friction, and gravity, as well as custom physical robot effects not modeled by the simulator (like motor backlash). That's in addition to visual elements like lighting conditions; camera positions and angles; materials and appearances of objects; the texture of the background; and even post-processing effects on rendered images.
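Drawing one simulated environment from such ranges might look like the sketch below. The parameter names and numeric ranges are hypothetical placeholders chosen for illustration; they are not the values the team used.

```python
import random

# Hypothetical (low, high) ranges for a few of the randomized quantities
# named above. Real ranges would be produced by the randomization procedure.
RANGES = {
    "friction_scale": (0.7, 1.3),
    "gravity_scale": (0.9, 1.1),
    "cube_size_scale": (0.95, 1.05),
    "motor_backlash": (0.0, 0.02),
    "light_intensity": (0.5, 1.5),
    "camera_yaw_offset_deg": (-3.0, 3.0),
}


def sample_environment(ranges, rng):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


env = sample_environment(RANGES, random.Random(0))
for name, value in env.items():
    print(f"{name}: {value:.3f}")
```

Each training episode would use a fresh draw, so the model never sees exactly the same physics or visuals twice.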

"[That's] one of the key advantages in this approach -- as soon as you figure out how to train these models in simulation, you can have effectively endless data," Plappert said. "And then, once you figure out how to transfer the models to the robot, you can utilize them in the real world."

Rewards

Reinforcement learning involves rewards, as mentioned, and the OpenAI team defined three for this experiment: (1) the difference between the previous and the current distance of the system from the goal; (2) a reward whenever a goal was achieved; and (3) a penalty whenever the hand dropped a Rubik's cube. Random goals were generated during training, and a training episode was considered finished whenever the AI model achieved 50 consecutive successes, timed out when trying to reach the next goal, or dropped the cube.
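The three reward terms and the stopping rule translate into code fairly directly. In the sketch below, the function names and the bonus and penalty magnitudes are assumed placeholders; only the structure (distance improvement, goal bonus, drop penalty, 50-success cutoff) comes from the description above.

```python
def step_reward(prev_distance, curr_distance, goal_reached, dropped,
                goal_bonus=5.0, drop_penalty=-20.0):
    """Combine the three reward terms for one timestep.

    goal_bonus and drop_penalty are illustrative values, not confirmed ones.
    """
    reward = prev_distance - curr_distance  # (1) progress toward the goal
    if goal_reached:
        reward += goal_bonus                # (2) bonus for achieving a goal
    if dropped:
        reward += drop_penalty              # (3) penalty for dropping the cube
    return reward


def episode_done(consecutive_successes, timed_out, dropped, max_successes=50):
    """Stop after 50 consecutive successes, a timeout, or a drop."""
    return consecutive_successes >= max_successes or timed_out or dropped


if __name__ == "__main__":
    print(step_reward(1.0, 0.5, goal_reached=False, dropped=False))  # 0.5
    print(step_reward(0.5, 0.0, goal_reached=True, dropped=False))   # 5.5
    print(episode_done(50, timed_out=False, dropped=False))          # True
```

The first term is dense, so the policy gets a learning signal on every step rather than only when a goal is reached.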

The researchers used Rapid for training, a framework consisting of a set of rollout workers and optimizer nodes that perform synchronous gradient descent (an essential step in machine learning) across a fleet of graphics cards. As the rollout workers gained experience, they informed the optimizer nodes, and another set of workers compared the trained AI models to reference agents.
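For readers unfamiliar with the term, synchronous gradient descent means every optimizer computes a gradient from the same copy of the parameters, the gradients are averaged, and one shared update is applied. The toy example below shows that pattern on a one-parameter model; it is not Rapid, and the data, function names, and learning rate are made up for illustration.

```python
def shard_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, lr=0.05):
    """Each worker computes a gradient from the same w; the average is
    applied once, so all workers stay in lockstep."""
    grads = [shard_gradient(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


# Toy data split across two "workers"; every point satisfies y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards)

print(round(w, 6))  # 2.0
```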

In total, 64 Nvidia V100 graphics cards and 920 worker machines with 32 processor cores each were used to optimize the model for months while the researchers fiddled with variables like simulation fidelity, the ADR algorithm, hyperparameter tuning, and even the network architecture. The optimizer nodes alone used eight V100 cards and 64 processor cores, while the nodes responsible for rendering the images used to train the vision-based pose estimator tapped a single Nvidia V100 graphics card and eight processor cores.

From vision and from the Giiker cube's built-in sensors, the state estimator eventually learned to estimate all six face angles plus the Rubik's cube's position and orientation. The team notes that vision alone wasn't sufficient without modifying the cube, owing to the rotational symmetry of its stickers, but they hope to develop a recurrent model in the future that's capable of sussing out the cube's state strictly from camera footage.

Real-world transfer

The team next attempted to roll out the trained AI models to the real-world Shadow Hand. On the real robot, they evaluated four policies: one trained for roughly two weeks with manually tuned randomizations, one trained with ADR for about two weeks, and two trained and updated continuously with ADR for months. In 10 trials per policy, they started from a solved Rubik's cube and tasked the hand with moving it into a fair scramble.

For each trial, they defined two thresholds: applying at least half of the fair scramble successfully (i.e., 22 successes) and applying the full fair scramble successfully (i.e., 43 successes). The best-performing model achieved 26.8 successes on average over 10 trials, which worked out to a 60% half success rate and a 20% full success rate. The next-best model achieved 17.8 successes on average, or 30% half successes and 10% full successes.
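To make the relationship between the average and the two rates concrete, the sketch below computes all three from a list of per-trial success counts. The counts are invented for illustration and are not the experiment's per-trial data; they were chosen only so that the aggregates match the best model's reported figures. The function name is likewise an assumption.

```python
def summarize_trials(successes, half=22, full=43):
    """Return the mean success count and the fraction of trials that reach
    the half-scramble and full-scramble thresholds."""
    n = len(successes)
    return {
        "mean": sum(successes) / n,
        "half_rate": sum(s >= half for s in successes) / n,
        "full_rate": sum(s >= full for s in successes) / n,
    }


# Invented per-trial counts (10 trials, each between 0 and 43 successes).
trials = [43, 43, 40, 30, 25, 24, 20, 18, 15, 10]

summary = summarize_trials(trials)
print(summary)  # {'mean': 26.8, 'half_rate': 0.6, 'full_rate': 0.2}
```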

That might not seem particularly impressive, but notably, all models developed never-before-seen techniques to recover from perturbations, such as when several of the robot's fingers were tied together; when the hand was wearing a leather glove; when a blanket partially occluded the Rubik's cube; and when the cube was disturbed with a plush giraffe and a pen. When the robot occasionally rotated an incorrect face, the best AI models recovered by rotating the face back. And when the hand attempted a face rotation but the cube slipped, resulting in a rotation of the entire cube as opposed to a specific face, the model rearranged its grasp and tried again until it eventually succeeded.

"[The] algorithm that we used here is exactly the same algorithm that we also used to train our [other] robots. [This particular] method is very general in the sense it can be applied to all sorts of problems you can think of, maybe even beyond manipulation," Plappert said. "While we focused on the Rubik's cube task, ultimately, robotics is interesting in the context of the kind of ... systems that can be applied to many tests."

The ultimate goal is generalizability, said OpenAI robotics team research scientist Lilian Weng, which fits in with OpenAI's stated mission: building "safe" human-level AI in a range of domains. Most experts believe that's a long way off where robotics are concerned -- some of the most sophisticated models today, like Aeolus, take minutes to complete tasks like picking up objects and putting them in bins. But Weng, Plappert, and colleagues believe their work is an important step toward highly robust, truly autonomous machines capable of completing virtually any task.

"Eventually, someday, we would like to make [artificial intelligence] deliver certain values to reality -- like a robot that can help alert people [to things] or do very dangerous work in that requires ... interacting the real world," said Weng. "[That']s essentially what [we're] trying to build."