## Summary

Here we illustrate the performance of MuJoCo and compare it to alternative simulators. The details can be found in the sections below. The take-away points in each section can be summarized as follows:

Performance of MuJoCo 2.0

MuJoCo 2.0 has code improvements making it 20% to 80% faster than MuJoCo 1.5 depending on the model. On an Intel i9-7900X processor, all-core performance is 300,000 samples per second on a 3D humanoid model, and 7,000 samples per second on a system with 1000 particles.

Convergence of different constraint solvers

MuJoCo has a direct Newton solver with second-order convergence, in addition to Projected Gauss Seidel and Conjugate Gradient solvers with first-order convergence. In more challenging models, Newton can compute contact forces accurate to numerical precision in 5 iterations, while Gauss Seidel (used in every other engine) can take thousands of iterations to do the same.

Comparison to Flex

On a 3D humanoid model, MuJoCo on CPU is substantially faster than Flex om GPU with the GPU fully saturated, although the comparison is indirect and the setting is not identical. On particle simulations Flex will most likely win, while intermediate cases involving flexible objects remain to be compared. MuJoCo 2.0 allows full interaction between flexible objects and multi-joint dynamics.

Importance of model design

Benchmarks using poorly designed MuJoCo models can create the impression of lower performance, but this can be resolved by designing the models better. MuJoCo 2.0 has an optimizing model compiler which can improve the model design automatically, yielding 50% speedup in some cases. But it cannot always do it, just like a C++ compiler cannot re-write poorly written code.

Comparison to SD/FAST

On multi-joint systems without contacts or other constraints, MuJoCo performs similarly to SD/FAST, even though SD/FAST generates model-specific C code and has more limited functionality.

Comparison to Bullet, Havoc, ODE and PhysX

Compared to gaming engines, MuJoCo is both faster and more accurate in multi-joint dynamics simulations relevant to robotics and control. In simulations with many floating bodies, MuJoCo 2.0 has code improvements yielding similar speed as gaming engines, and scaling linearly with the number of bodies.

## Performance of MuJoCo 2.0

MuJoCo 2.0 has multiple code optimizations which increase simulation speed compared to MuJoCo 1.5, by 20% for smaller models and 40% to 80% for larger models. It can also simulate composite flexible objects, modeled as large collections of rigid bodies coupled with soft constraints (where speed is particularly important). Here we provide test results obtained with the code sample 'speedtest' on several models in the software distribution.

The table below shows size statistics for each model, and number of evaluations per second in one thread and in 20 threads. The latter is more representative of applications that rely on parallel sampling. We tested on a i9-7900X processor, Windows 10, no over-clocking. This processor has 10 physical cores, and we have observed that maximum throughput is achieved when the number of parallel simulations equals the number of logical cores. The simulation time step in all cases is 5 ms. The last column shows the speedup of MuJoCo 2.0 compared to MuJoCo 1.5. The large speedup on the ellipsoid model is because the new soft objects exposed a particular inefficiency in the collision pruning mechanism, which was fixed in MuJoCo 2.0.

model | bodies | DOFs | constraints | evals / s 1 thread |
evals / s 20 threads |
speedup in version 2.0 |

humanoid | 14 | 27 | 56 | 28,100 | 300,800 | 20 % |

humanoid100 | 114 | 627 | 1041 | 1,600 | 17,100 | 38 % |

hammock | 113 | 312 | 264 | 3,340 | 40,400 | 83 % |

particle | 1002 | 3000 | 998 | 880 | 7,150 | 52 % |

grid2 | 83 | 243 | 475 | 1,800 | 19,350 | 30 % |

ellipsoid | 213 | 216 | 670 | 3,050 | 31,600 | 675 % |

We see that performance with all cores fully loaded is roughly 10x compared to single-thread performance, which is to be expected but also shows that there are no bottlenecks in scaling to multiple threads. Larger models of course take longer to simulate. The plot below shows single-thread performance as a function of model size, where size is defined as (#bodies + #DOFs + #constraints). We have plotted performance as CPU time per step (the inverse of the above numbers) to make the point that scaling is roughly linear. This is because MuJoCo exploits sparsity.

## Convergence of different constraint solvers

MuJoCo is based on a unique formulation of contact dynamics which can be solved by a direct Newton method. Its quadratic convergence enables it to compute contact forces accurate to numerical precision, usually in less than 5 iterations even without warm-start. Of course the per-iteration cost is higher than first-order methods such as Gauss-Seidel, but the trade-off is favorable in most models, which is why it is the default solver. For situations where lower accuracy has to be accepted in the name of speed, MuJoCo also offers a Projected Gauss-Seidel (PGS) method similar to other engines, and a Conjugate Gradient (CG) method unique to MuJoCo.

Here we illustrate the convergence of the three solvers, on the humanoid model used earlier and on a more challenging model which is a Jenga tower. It is actually possible to play Jenga in MuJoCo. This requires the blocks to be slightly asymmetric, otherwise friction makes the entire tower fall whenever a block is pulled. The asymmetry causes an irregular pattern of active contacts (see transparent image below) which in turn makes it difficult to compute contact forces accurately.

The following plots are obtained from MuJoCo's built-in profiler. Warm-starts are disabled in all cases. The red curve is the reduction in residual at each iteration of the solver. The green and blue curves are the residual gradient and the residual slope in linesearch; these are not available in PGS.

Newton | Conjugate Gradient | Projected Gauss Seidel | |

Humanoid | |||

Jenga |

We see the quadratic convergence of Newton and the linear convergence of CG and PGS. CG has higher slope than PGS and converges faster, with similar per-iteration cost. Note the poor performance of PGS on the Jenga test. Even after 1000 iterations (not shown) the change-in-residual of PGS is still 1e-4, while CG reaches the tolerance of 1e-15 in about 300 iterations, and Newton reaches it in 5 iterations. We have observed similar behavior in models where heavy bodies and light bodies interact through contact; this occurs for example when the fingertips of a robotic hand try to grasp a heavy object.

## Comparison to Flex

Flex is a new simulator from NVidia. It represents the system as a collection of particles and imposes soft constraints among them, similar in spirit to the way MuJoCo 2.0 represents flexible objects. In fact MuJoCo has always supported soft constraints so it could have simulated flexible objects 8 years earlier, but those models are too big to write by hand, and we did not implement an automated model generator earlier. Flex also has a separate subsystem for rigid body dynamics. As explained in the Overview chapter of the documentation, our previous attempts to port MuJoCo to GPUs were not competitive with the CPU version. So it is interesting to compare to Flex. In a recent presentation from NVidia at SIMPAR 2018 it was shown that Flex can simulate around 1000 orange MuJoCo humanoids in realtime on a GTX 1080. We will use this number as a reference for an indirect comparison.

How many humanoids can MuJoCo simulate on a CPU? The NVidia tests were done in the context of locomotion where the number of contacts is smaller compared to lying on the floor, but at the same time the solution changes so warm-starting is less useful. Therefore we added an elastic string to keep our humanoid upright, and made the string anchor point move around to simulate dynamic conditions. We set the simulation timestep to 20ms. This is large, however a key advantage of MuJoCo's contact model is that it can be used with large timesteps even in tasks that require accuracy; see Comparison to Bullet, Havoc, ODE and PhysX below. Flex uses a timestep of around 8ms in this case; whether it can be made larger remains to be seen.

On the same i9-7900X processor as before, we measured 33,800 evaluations per second in a single thread and 387,700 in 20 threads. Given the 20ms simulation timestep, this means that a single core of an Intel CPU can simulate 675 humanoids in realtime, and all 10 cores can simulate 7755 humanoids in realtime. This is substantially faster than Flex on the GPU, even though the GPU has roughly double the transistor count and power consumption. We should emphasize here that the comparison we are making is indirect. The timestep is different, as already discussed. Another difference is that the Flex example involves a large scene with many humanoids, which rarely contact each other but nevertheless contact pruning becomes more expensive. On the other hand, it is possible that large scenes are needed to mask latencies - in which case the GPU would be trading latency for collision overhead. The latter point remains to be clarified, but what we do know is that MuJoCo users can obtain maximum performance without having to create large scenes.

What about particle simulations? We have no doubt that the GPU will win by a large margin in that case. There is a continuum of physical systems, from decoupled systems where GPUs win, to strongly coupled systems where CPUs win. Somewhere in the middle of this continuum lies a gray zone which is just beginning to be explored in physics simulation. Systems in that zone have large numbers of elements coupled with soft yet relatively strong constraints, which may or may not be tractable with the first-order Gauss-Seidel methods used on GPUs (note that MuJoCo has a direct Newton method which is essentially exact). The flexible objects introduced in Flex as well as MuJoCo 2.0 likely live in that zone, and so a side-by-side comparison on such models will be very interesting but remains to be done.

There are several additional considerations regarding CPU vs. GPU, orthogonal to the comparison between MuJoCo and Flex. First, computational research often relies on cloud computing where GPU instances remain expensive. Second, using GPUs involves PCI latencies and added software complexity. Third, during development users usually need a single simulation to run as fast as possible - so as to enable interactive design of models and algorithms. Computing resources are expensive, but salaries are even more expensive.

## Importance of model design

MuJoCo has a powerful modeling language, with many element types that can interact in complex ways. This enables users to create elaborate models and customize them as desired, but can also lead to models that do not map well to the MuJoCo framework and have reduced performance or unexpected behavior. The online documentation contains a lot of information on how to design good models.

A common mistake is to create dummy bodies for every joint, sensor or spatial location of interest. A MuJoCo body is not the same thing as a body in a gaming engine. Instead, it is a container for other elements. To help users avoid this mistake, MuJoCo 2.0 introduced the "fusestatic" compiler optimization where bodies that cannot move relative to their parent are fused with the parent. We illustrate the effect on the Anymal model used at ETH Zurich (thanks to Jemin Hwangbo for providing the URDF).

We import this URDF in MuJoCo 2.0 with and without the fusestatic optimization, and test simulator performance. Without fusestatic, we obtain 182,000 samples per second (single thread). With fusestatic the result is 279,000 samples per second. So fusing the dummy bodies (without changing the physics in any way) results in 53% speedup. If we add a floating joint at the base, we get 231,000 samples per second. This is slightly faster than the results for RaiSim from ETH Zurich in their energy test.

The above mentioned test suite makes other claims about MuJoCo, some of which are accurate - in particular the absence of proper restitution in earlier versions (now present in MuJoCo 2.0). Others are based on arbitrary definitions of accuracy - for example using penetration to compare a soft and a hard model (of course the soft model allows penetrations, that is what it is designed to do, and indeed many objects in the world are somewhat soft). Yet others are based on poorly designed MuJoCo models. For example, balls rolling on a horizontal platform will jump a bit if you make the platform slip on the ground (what this buys you is a convex, smooth and invertible contact model with benefits not found elsewhere). But instead of making the platform slip, one can model the platform as being mounted on a sliding joint, which is the desired effect anyway. Then the contact jumps disappear and the balls roll perfectly, as shown in the video below. Furthermore they stop when the platform stops - which is a known property of this physical system and can be used for accuracy testing. Another example is using MuJoCo's noslip solver to make a quadruped model stand. The noslip solver makes MuJoCo similar to traditional engines based on LCP formulations and Gauss-Seidel methods, and destroys some of the unique properties of MuJoCo's contact model, and is rarely needed. Instead one can use the impratio option and elliptic friction cones to reduce slip, to the point where any remaining slip becomes irrelevant in practice. This is illustrated in the video below, together with the new restitution functionality.

## Comparison to SD/FAST

Data from: MuJoCo: A physics engine for model-based control

E. Todorov, T. Erez, and Y. Tassa (2012). *International Conference on Intelligent Robots and Systems*

SD/FAST is a cross-compiler that generates model-specific C code for simulating multi-joint dynamics. Even though it has not been updated in nearly 20 years, it has remained the gold standard in many ways. This is partly because model-specific C code is hard to compete with, and partly because at the time people used to calculate the exact number of floating point operations and optimize numerical performance aggressively. This is why when we began working on MuJoCo we did not aim to outperform SD/FAST, but rather to come close to it in terms of multi-joint dynamics while adding contact dynamics and other important features.

We compared MuJoCo and SD/FAST on four model systems: isolated bodies, a chain, a hand, and a humanoid model. To our pleasant surprise, the outcome was a tie: MuJoCo was faster on some tests while SD/FAST was faster on others. The table below shows the number of evaluations per second, rounded to 1000, on one core of a now-outdated Intel processor (average over X9650, X5570, i7 940, X5860).

Model (DOF) | SD/FAST eval/s | MuJoCo eval/s |

Isolated (34) | 122,000 | 151,000 |

Chain (31) | 128,000 | 103,000 |

Hand (32) | 89,000 | 101,000 |

Humanoid (29) | 130,000 | 123,000 |

Each evaluation was performed at a randomly-generated state (same for both engines), and included computation of the joint-space inertia matrix and the vector of Coriolis, centripetal and gravitational forces. Once these quantities are available, physics simulation comes down to solving a linear system. Note that we did not use the O(N) solver in SD/FAST, because that solver does not compute the inertia matrix which is essential for simulating contact dynamics. Instead we used the solver which implements similar algorithms to those used in MuJoCo.

## Comparison to Bullet, Havoc, ODE and PhysX

Data from: Simulation tools for model-based robotics: Comparison of Bullet, Havok, MuJoCo, ODE and PhysX

T. Erez, Y. Tassa and E. Todorov (2015). *International Conference on Robotics and Automation*

Bullet, Havoc, ODE and PhysX are among the most popular engines used today for gaming, animation and multi-body dynamics simulation. They come from the tradition of the now-retired MathEngine, and represent the system in Cartesian coordinates (with 6 DOFs per body) while joints are represented as equality constraints imposed numerically. This is natural for systems involving many floating bodies. But for robotic and biomechanical systems where the DOF-to-body ratio is typically close to 1, a joint coordinate representation (as in MuJoCo and SD/FAST) is more efficient computationally, and also more accurate - because the joints cannot dislocate and thus the engine does not have to fix constraint violations at each step.

Our empirical results are consistent with the above reasoning. For multi-joint systems relevant to robotics and biomechanics MuJoCo is the fastest, while for systems composed of many floating bodies MuJoCo is the slowest and ODE is the fastest. Simulation speed however is not the most relevant measure of performance. Speed can always be increased by using a larger timestep, but this comes at the expense of accuracy and stability. So the more relevant measure is the speed-accuracy trade-off of each engine. We define accuracy by running the simulation with a very small timestep (0.016 ms) where all engines should be very accurate, and then measuring the deviations from the engine-specific reference as the timestep increases. We find that even for systems that are better suited for gaming engines, MuJoCo is the fastest at a fixed accuracy level and the most accurate at a fixed speed level. This is because it can use much larger timesteps without loosing accuracy.

Note that the above measure of accuracy is with respect to an engine-specific reference and not an analytical benchmark. This is because we want to have a measure that can be applied to complex systems where analytical results are not available. As a result, an engine can simulate "physics" that deviate from reality without being penalized. For example, both Havoc and PhysX ignore Coriolis forces. But since they do that for all timesteps, our accuracy measure (which really reflects self-consistency) does not penalize them.

The figure below shows four test systems we simulated in each engine (left column), raw speed (middle column), and speed-accuracy trade-off (right column). MuJoCo has two flavors depending on the integrator being used: semi-implicit Euler vs. 4th-order Runge Kutta. For one of the test systems (planar chain) we were also able to use the new version of Bullet (labeled BulletMB) which operates in joint coordinates; for the other systems it seemed to need more work before it can be used reliably. PhysX also has a version aimed at simulating multi-joint systems, but it does not use joint coordinates; instead it relies on the underlying Cartesian representation and has similar behavior as the standard version.

Some of the speed-accuracy curves are incomplete because when a given engine diverges at a given timestep we cannot obtain a measurement. The hand grasping an object was the most challenging test system, and is also very relevant to robotics. To measure the maximum timestep at which each engine could simulate this system without dropping the object, we started at timestep 1/64 ms and increased it by factor of 2 until the object was dropped. The table below shows the last successful timestep for each engine. Note the order-of-magnitude difference between MuJoCo and the best-performing gaming engine in this test (PhysX).

Engine | Max timestep |

Bullet | 0.031 ms |

MuJoCo | 16 ms |

ODE | 0.25 ms |

PhysX | 2 ms |