OpticStudio Zemax on Modern Hardware

For those considering hardware upgrades to get more out of Zemax OpticStudio, we share below what we learned from our recent hardware upgrade.

System

Our hardware upgrade included the following key components:

OpticStudio Resource Utilization

Zemax OpticStudio allocates resources, including memory and cores/threads, automatically, and this allocation can greatly affect performance. While the resource-allocation algorithms are understandably not disclosed, Zemax told us that “Zemax OpticStudio is designed to…automatically determine the optimum number of threads to launch for any given calculation, including during optimization. OpticStudio is always trying to use your resources as efficiently as possible (and this may mean using less than 100% of the available cores).” Whether or not all cores are used “…depends largely on the machine hardware, the .ZMX file itself, the number of variables, and the number and complexity of Merit Function Operands.” This makes predicting the magnitude of improvement challenging.

One hardware limitation can be available memory: “…[M]emory limitations…[may]…prevent or…slow down the system if all cores were utilized. There is overhead in multi-threading; for each core used in optimization, OpticStudio has to copy over and store the optical system in memory. If you have a memory-intensive system (some complex CAD objects, a high-density grid sag surface, etc.) then it might be slow to create this copy, or it is possible that there's simply not enough system memory for each of the cores.” Testing for this limitation is easy (see below). In our experience, the Zemax recommendation of at least 2 GB of RAM per core is sufficient for all but corner cases.
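The 2 GB-per-core rule of thumb reduces to simple arithmetic. As a sketch, the helper below (the function name and defaults are ours, not Zemax's) estimates how many optimization cores a given amount of RAM can comfortably support:

```python
GB = 1024 ** 3

def max_optimization_cores(total_ram_bytes, ram_per_core_bytes=2 * GB):
    """Rule-of-thumb cap on optimization cores, given available RAM.

    Zemax suggests at least 2 GB of RAM per core; memory-hungry systems
    (complex CAD objects, high-density grid sag surfaces) may need more,
    since OpticStudio copies the optical system into memory per core.
    """
    return max(1, total_ram_bytes // ram_per_core_bytes)

# Example: a 64-thread workstation with 128 GB of RAM
print(max_optimization_cores(128 * GB))                             # -> 64
# A memory-hungry file might effectively need, say, 8 GB per core:
print(max_optimization_cores(128 * GB, ram_per_core_bytes=8 * GB))  # -> 16
```

If the estimate comes out below your core count, the memory copy per core, not CPU, is likely the bottleneck.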

Another potential limitation is, almost paradoxically, the simplicity of the merit function operands. “…[T]here is overhead in launching threads. If the Merit Function is simple and easy to compute (i.e. a Gaussian Quadrature Wavefront MF might be a few dozen or hundred ray traces), then it might be more efficient to simply run the optimization on a single core or just a couple cores.”
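The threading-overhead point is easy to reproduce outside of OpticStudio. A minimal Python sketch (a trivial stand-in computation, not a real ray trace) compares a plain loop against a thread pool when each task is cheap:

```python
import concurrent.futures
import time

def trace_ray(i):
    # Stand-in for a very cheap "ray trace": trivial arithmetic.
    return i * i

N = 10_000

# Serial: one thread, no dispatch overhead.
t0 = time.perf_counter()
serial = [trace_ray(i) for i in range(N)]
t_serial = time.perf_counter() - t0

# Threaded: each tiny task pays thread-pool dispatch overhead.
t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
    threaded = list(ex.map(trace_ray, range(N)))
t_threaded = time.perf_counter() - t0

print(f"serial: {t_serial:.4f}s, threaded: {t_threaded:.4f}s")
assert serial == threaded  # same results either way
```

On a typical machine the threaded version loses for work this cheap, which is the same economics behind OpticStudio sometimes choosing one or two cores for a simple merit function.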

We were also told that “…OpticStudio will only use as many cores as you have variables assigned…” This may have been true in the past, but based on our testing it is generally not the case as of OpticStudio 20.3.2.

Resource utilization also depends on the optimization algorithm used. “During optimization using the damped least squares algorithm, all threads will be used when evaluating the response of the optical system to changes in each system variable. This response function is used to evaluate the gradient of the merit function in solution space, allowing OpticStudio to take the ‘next step’ in the optimization. However, an important part of determining that ‘next step’ involves the calculation of a matrix determinant, and this calculation cannot be threaded. Depending on the system complexity and the number of variables defined, this single-threaded calculation may take a non-negligible amount of time.”

Particularly important for hammer optimization, “…oftentimes, a more important effect in system optimizations is that the optical system must be updated and the merit function must be subsequently evaluated after that ‘next step’ has been taken. While any one operand in the merit function can be calculated using multiple CPUs – if appropriate – each operand in the merit function is currently evaluated in sequential order…[and thus]…cannot be done in parallel (or ‘threaded’) – we do not use one CPU to evaluate operand n while another CPU is simultaneously evaluating operand n+1. This is because there can be operand dependencies in the user’s Merit Function (i.e. OPGT, OPLT). This is often the reason why the CPU usage will intermittently go from 100% down to 1 core during optimization…”

Keep in mind, the resource allocation algorithm continues to change: “Over the past few years, there have been a couple of releases in which there were improvements to the algorithm to decide how many threads to launch for optimization. For this reason, you might observe differences in CPU usage for the same file (and MF, variables) between versions. However, it should be the case that optimization in the newer versions is more efficient and runs more quickly than older iterations.”

Observations

With those details in mind, a few observations emerged using our particular hardware:

  1. Hammer optimization uses all cores allocated to it in most, but not all, cases we’ve seen. Unsurprisingly, using all available threads (64 in our case) leaves the system only intermittently responsive; we do not recommend it. Leaving 2-4 threads unused typically keeps the system responsive while retaining most of the multithreading benefit.

  2. Local optimization, whether damped least squares or orthogonal descent, usually runs single-threaded. The practical consequence is that, for time efficiency, it now pays to shift more of the effort from local optimization toward hammer optimization than it did on our previous hardware.

  3. For some file configurations, we observe that hammer optimization uses far fewer than the available threads. Because the cause is hard to determine, we recommend experimenting with the variable count and the merit function operand structure whenever not all allocated cores are used. Minor changes can increase thread usage and get you to results faster, particularly for long optimization runs.

  4. Otherwise, the combination of a fast single-threaded clock speed and many cores lets us run optimizations in hours instead of days, literally. In addition, previously impractically slow optimizations (e.g. some designs using Physical Optics Propagation operands) are now feasible.
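Observation 1 above amounts to a simple allocation policy. A sketch (the function name and the default spare count are ours):

```python
import os

def optimization_threads(spare=2):
    """Threads to allocate to a hammer run: all logical cores minus a few
    left free so the desktop stays responsive (we leave 2-4 spare)."""
    total = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, total - spare)

# On our 64-thread machine this yields 62 by default, or 60 with spare=4.
print(optimization_threads())
```

The `max(1, ...)` floor simply guards against machines with fewer cores than the requested spare count.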

Unfortunately, assessing the value a given hardware upgrade will provide is a matter of testing. Provided the hardware is available, this is easily done by watching thread and memory utilization in real time: in Windows 10, open Task Manager → Performance tab and click “CPU” and “Memory”, respectively (see below). Without the hardware in hand, one has little choice but to take on the risk.

[Image: Task Manager.png – Task Manager Performance tab showing CPU and memory utilization]