Are You Still Using Real Data to Train Your AI?

Nvidia’s Rev Lebaredian says synthetic data can make AI systems better and maybe even more ethical

A computer-generated street scene shows cars that are outlined in boxes and a red-colored pedestrian crossing the road.

Nvidia argues that synthetic data is vital to the training of self-driving cars.

Nvidia

It may be counterintuitive. But some argue that the key to training AI systems that must work in messy real-world environments, such as self-driving cars and warehouse robots, is not, in fact, real-world data. Instead, some say, synthetic data is what will unlock the true potential of AI. Synthetic data is generated instead of collected, and the consultancy Gartner has estimated that 60 percent of data used to train AI systems will be synthetic. But its use is controversial, as questions remain about whether synthetic data can accurately mirror real-world data and prepare AI systems for real-world situations.

Nvidia has embraced the synthetic data trend, and is striving to be a leader in the young industry. In November, Nvidia founder and CEO Jensen Huang announced the launch of the Omniverse Replicator, which Nvidia describes as “an engine for generating synthetic data with ground truth for training AI networks.” To find out what that means, IEEE Spectrum spoke with Rev Lebaredian, vice president of simulation technology and Omniverse engineering at Nvidia.

The Omniverse Replicator is described as “a powerful synthetic data generation engine that produces physically simulated synthetic data for training neural networks.” Can you explain what that means, and especially what you mean by “physically simulated”?

Rev Lebaredian: Video games are essentially simulations of fantastic worlds. There are attempts to make the physics of games somewhat realistic: When you blow up a wall or a building, it crumbles. But for the most part, games aren’t trying to be truly physically accurate, because that’s computationally very expensive. So it’s always about: What approximations are you willing to do in order to make it tractable as a computing problem? A video game typically has to run on a small computer, like a console or even on a phone. So you have those severe constraints. The other thing with games is that they’re fantasy worlds and they’re meant to be fun, so real-world physics and accuracy is not necessarily a great thing.

With Omniverse, our goal is to do something that really hasn’t been done before in real-time world simulators. We’re trying to make a physically accurate simulation of the world. And when we say physically accurate, we mean all aspects of physics that are relevant. How things look in the physical world is the physics of how light interacts with matter, so we simulate that. We simulate how atoms interact with each other with rigid-body physics, soft-body physics, fluid dynamics, and whatever else is relevant. Because we believe that if you can simulate the real world closely enough, then you gain superpowers.

What kind of superpowers?

Lebaredian: First, you get teleportation. If I can take this room around me and represent it in a virtual world, now I can move my camera around in that world and teleport to any location. I can even put on a VR headset and feel like I’m inside it. And if I can synchronize the state of the real world with the virtual one, then there’s really no difference. I might have sensors on Mars that ingest the real world and send over a copy of that info to Earth in real time, or 8 minutes later or however long it takes light to travel from Mars. If I can reconstruct that world virtually and immerse myself in it, then effectively it’s like I’m teleporting to Mars 8 minutes ago.

And given some initial conditions about the state of the world, if you can simulate accurately enough, then you can potentially predict the future. Say I have the state of the world right now in this room and I’m holding this phone up. I can simulate what happens the moment I let go and it falls—and if my simulation is close enough, then I can predict how this phone is going to fall and hit the ground. What’s really cool about that is you can change the initial conditions and do some experiments. You can say, What can alternate futures look like? What if I reconfigure my factory or make different decisions about how I manipulate things in my environment? What would these different futures look like? And that allows you to do optimizations. You can find the best future.

Okay, so that’s what you’re trying to build with Omniverse. How does all this help with AI?

Lebaredian: In this new era of AI, developing advanced software is no longer something that just a grad student with a laptop can do. It requires serious investment. All the most advanced algorithms that mankind will develop in the future are going to be trained by systems that require a lot of data. That’s why people say data is the new oil. And it seems like the big tech companies that collect data have a natural advantage. But the truth is that for most of the AI that we’re going to create in the future, none of the data we have collected is that useful.

I noticed it when we did a demo for [the conference] SIGGRAPH 2017. We had a robot that could play dominoes, and we had multiple AI models that we had to train. One of the basic ones was a computer-vision model that could detect the dominoes that were on the table, tell you their orientation, and then tell you how many pips were on each domino: one, five, six, or whatever.

Surely Google would have all the image data you need to train such an AI.

Lebaredian: You can search Google images and you’ll find lots of pictures of dominoes, but what you’ll find is, first of all, none of them are labeled. A human has to label what each domino is and the side of each domino, and that’s a whole bunch of manual labor. But even if you get past the labeling, you’ll find that the images don’t have much diversity. We needed our algorithm to be robust to different lighting conditions because we were going to train it in our lab, but then take it to the show floor at SIGGRAPH. The cameras and sensors we used might also change, so the conditions around those could be different. We wanted the algorithm to work with any type of dominoes, whether they’re plastic or wood or whatever material. So even for this really simple thing, the necessary data just didn’t exist. If we were to go collect that data, we’d have to buy dozens or maybe hundreds of different domino sets, set up different lighting conditions and different sensors and all of that. So, back then, we quickly coded up in a game engine a random domino generator that randomized all of that stuff. And overnight we trained a model that could do this robustly, and it worked in the convention center with different cameras.

That’s one simple case. For something more complex like self-driving cars or autonomous machines, the amount of data that we need, and the accuracy and diversity of that data, is just impossible to get from the real world. There’s really no way around it. Without physically accurate simulation to generate the data we need for these AIs, there’s no way we’re going to progress.
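The random domino generator described above can be sketched roughly as follows. This is a hypothetical illustration, not Nvidia's code: the field names, the material list, and every numeric range are assumptions, and a real generator would also render an image for each sample. The point it shows is that each randomized scene comes with its labels (pip counts and orientation) already attached, so no human labeling is needed.

```python
import random
from dataclasses import dataclass

# Hypothetical material list; none of these values come from Nvidia.
MATERIALS = ("plastic", "wood", "metal")


@dataclass
class DominoSample:
    pips_left: int          # ground-truth label: pips on one half
    pips_right: int         # ground-truth label: pips on the other half
    rotation_deg: float     # ground-truth label: orientation on the table
    material: str
    light_intensity: float  # arbitrary units
    camera_fov_deg: float


def random_domino(rng: random.Random) -> DominoSample:
    """Draw one randomized scene description together with its labels."""
    return DominoSample(
        pips_left=rng.randint(0, 6),
        pips_right=rng.randint(0, 6),
        rotation_deg=rng.uniform(0.0, 360.0),
        material=rng.choice(MATERIALS),
        light_intensity=rng.uniform(0.2, 2.0),
        camera_fov_deg=rng.uniform(40.0, 90.0),
    )


def generate_dataset(n: int, seed: int = 0) -> list:
    """Generate n samples; a fixed seed makes the data set reproducible."""
    rng = random.Random(seed)
    return [random_domino(rng) for _ in range(n)]


dataset = generate_dataset(1000)
print(dataset[0])
```

Widening a range or adding a material is a one-line change here, whereas collecting the equivalent real photographs would mean buying more domino sets and reshooting.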

With Omniverse Replicator, are customers getting a one-size-fits-all synthetic data generator? Or are you tailoring it for different industries?

Lebaredian: What we’re building with Omniverse is a very general development platform that anyone can take and customize for their particular needs. Out of the box you get multiple renderers, which are simulators of the physics of light and matter. You get a spectrum of them that let you trade off accuracy for speed.

We have a bunch of ways to bring in 3D data as inputs to Omniverse Replicator to generate the data that you need. For pretty much everything that’s man-made these days, there’s a 3D virtual representation of it somewhere. If you’re designing a car, a phone, a building, a bridge, or whatever, you use a CAD tool. The problem is that all these tools speak different languages. The data is in different formats. It’s very hard to combine them and build a scene that has all those constituent parts.

With Omniverse, we’ve gone through the trouble of trying to connect all of these existing tools and harmonizing them. We built Omniverse on top of a system called Universal Scene Description (USD) that was originally developed by Pixar and later open-sourced. We think USD is to virtual worlds as HTML is to Web pages: It’s a common way to describe things. We built a lot of tools around USD to let users transform the data, modify it, randomize things. But the source data can come from virtually anywhere because we have connectors to all the different tools that are relevant.

Can you give me an example of an industry that would use Replicator to make synthetic data for AI training?

Lebaredian: We’ve shown the example of autonomous vehicles. There’s a lot of money going into figuring out how to make vehicles drive themselves, and synthetic data is becoming a major part of training the AI systems. We’ve already done some specialization within Omniverse Replicator for this domain: We have big outdoor worlds with roads and lanes and cars and pedestrians and street signs and all that kind of stuff.

We’ve also done some specialization for robotics. But if we don’t support your domain out of the box, since it’s a tool kit, you can take it and do what you like with it. People have many paths to bring in their own 3D data or get data to construct virtual worlds. There are libraries and third-party 3D asset providers out there.

For an autonomous vehicle company, an advantage of generating synthetic data is that it could train its vehicles on dangerous conditions, right? It can put in snow and ice, hard turns, that kind of thing?

Lebaredian: They can change day and night conditions and position pedestrians and animals in dangerous situations that you wouldn’t want to construct in the real world. We don’t want to put humans or animals in perilous situations in real life, but I sure do want my autonomous vehicle to know how to react to these types of fringe situations. So if we can train them in the virtual world where it’s safe first, we get the best of both worlds.

So this synthetic data can be used in AI training as “ground truth data” with built-in labels that are superaccurate. But is that the best training strategy? These AI systems often need to operate in the world with incomplete and imperfect information.

Lebaredian: It’s good for the training part. The way most AI is created today is through a type of learning called supervised learning. In the example of a neural network that can tell the difference between a cat and a dog, you first train it on pictures of cats and dogs that are labeled: This is a cat and this is a dog. It learns from those examples. Then you go apply that network on new images that aren’t labeled, and it will tell you what each one is.

For example, in autonomous vehicles you want your car to know, by looking through its sensors at the world, the relative 3D positions of all of the cars and pedestrians around it. But it’s just getting a 2D image that’s nothing but pixels; there’s no information about it. So if you’re going to train a network to infer that 3D information, you first have to draw a box around things in 2D and then you have to tell it, ‘Here’s how far away it is based on the particular lens that was used with that sensor.’ But if we synthesize the data in Omniverse, we have all of that 3D information at full physical accuracy. We can provide exact labeling without the errors that a human would introduce into the system. So the resulting neural network that we train is going to be smarter and more accurate.
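To make the labeling point concrete, here is a minimal sketch of how a simulator that already knows an object's 3D position could emit an exact 2D box and distance label. It assumes an idealized pinhole camera with no lens distortion; the function names, the focal length, and the image center are hypothetical values chosen for illustration, not anything taken from Omniverse.

```python
import math


def project_point(point, focal_px, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to pixels."""
    x, y, z = point
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box given its center and size."""
    px, py, pz = center
    w, h, d = size
    return [
        (px + sx * w / 2, py + sy * h / 2, pz + sz * d / 2)
        for sx in (-1, 1)
        for sy in (-1, 1)
        for sz in (-1, 1)
    ]


def label_object(center, size, focal_px=1000.0, cx=640.0, cy=360.0):
    """Return the exact 2D bounding box and distance for a known 3D box."""
    pixels = [project_point(p, focal_px, cx, cy) for p in box_corners(center, size)]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return {
        "bbox_2d": (min(us), min(vs), max(us), max(vs)),
        "distance_m": math.sqrt(sum(c * c for c in center)),
    }


# A 2 m cube centered 10 m straight ahead of the camera.
label = label_object(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0))
print(label)
```

With real footage, a person would have to draw that box by hand and estimate the distance; in a synthetic scene both follow directly from data the simulator already holds.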

Is overfitting a problem in this context? Is there a danger that a system trained with synthetic data would perform well on synthetic data, but fail in the real world?

Lebaredian: Synthetic data is actually a great way to solve for the overfitting problem, because it’s much easier for us to provide a diverse data set. If we’re training a network to recognize people’s facial expressions, but we only train it on Caucasian males, then we’ve overfit to Caucasian males and it will fail when you give it more diverse subjects. Synthetic data doesn’t make that worse. But with synthetic data it’s easier for us to create diversity of data. If I’m generating images of humans and I have a synthetic data generator, that allows me to change the configurations of people’s faces, their skin tone, eye color, hairstyle, and all of those things.

It seems like synthetic data could help with the big problem of algorithmic bias, since one of the sources of algorithmic bias is bias in data sets used to train AI systems. Can we use synthetic data to train AIs in the unbiased world that we would prefer to live in, as opposed to the world we actually live in?

Lebaredian: We’re synthesizing the worlds that our AIs are born in. They are born inside a computer and they’re just trained on whatever data we give them. So we can construct ideal worlds with the diversity that we want, and our AIs can be better for it. By the time they’re done, they are more intelligent than anybody we have out here in the real world. And when we put them in the real world, they behave better than they would have if they were only trained on what they see out here.

So what are the pitfalls to using synthetic data? Is it susceptible to adversarial attacks?

Lebaredian: Adversarial attacks, similar to overfitting problems, are not something that’s unique to synthetic data versus any other kind of data. The solution is to just have more data and better data.

The problem with synthetic data is that generating good synthetic data is hard. It requires you having a great simulator like Omniverse and one that is physically accurate so it can match the real world well enough. If we create a synthetic data generator that makes images that look like cartoons, that’s not going to be good enough. You wouldn’t want to put a robot that only knows how to interpret cartoon worlds in a hospital where it’s going to work with the elderly and children. That would be a scary thing to do. You need your simulator to be as physically accurate as possible to make use of this. But it is an extremely difficult problem.

The Conversation (1)
William Stewart, 19 Feb, 2022

While the use of synthetic data may be of utility in some relatively simple data sets, it is still very immature in many areas of more complex data, such as precision medicine. Attempting to create synthetic data across a person's electronic health record, multi-omic data (genomic, transcriptomic, proteomic, etc.), wearables (e.g., Fitbit), and other sources has a very long way to go. And most forms of synthetic data would need to be refreshed frequently anyway due to concept drift and data drift. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8276128/

The Femtojoule Promise of Analog AI

To cut power by orders of magnitude, do your processing with analog circuits


Machine learning and artificial intelligence (AI) have already penetrated so deeply into our life and work that you might have forgotten what interactions with machines used to be like. We used to ask only for precise quantitative answers to questions conveyed with numeric keypads, spreadsheets, or programming languages: "What is the square root of 10?" "At this rate of interest, what will be my gain over the next five years?"

But in the past 10 years, we've become accustomed to machines that can answer the kind of qualitative, fuzzy questions we'd only ever asked of other people: "Will I like this movie?" "How does traffic look today?" "Was that transaction fraudulent?"

Deep neural networks (DNNs), systems that learn how to respond to new queries when they're trained with the right answers to very similar queries, have enabled these new capabilities. DNNs are the primary driver behind the rapidly growing global market for AI hardware, software, and services, valued at US $327.5 billion this year and expected to pass $500 billion in 2024, according to the International Data Corporation.

Convolutional neural networks first fueled this revolution by providing superhuman image-recognition capabilities. In the last decade, new DNN models for natural-language processing, speech recognition, reinforcement learning, and recommendation systems have enabled many other commercial applications.

But it's not just the number of applications that's growing. The size of the networks and the data they need are growing, too. DNNs are inherently scalable: They provide more reliable answers as they get bigger and as you train them with more data. But doing so comes at a cost. The number of computing operations needed to train the best DNN models grew 1 billionfold between 2010 and 2018, meaning a huge increase in energy consumption. And while each use of an already-trained DNN model on new data, termed inference, requires much less computing, and therefore less energy, than the training itself, the sheer volume of such inference calculations is enormous and increasing. If it's to continue to change people's lives, AI is going to have to get more efficient.

We think changing from digital to analog computation might be what's needed. Using nonvolatile memory devices and two fundamental physical laws of electrical engineering, simple circuits can implement a version of deep learning's most basic calculations that requires mere thousandths of a trillionth of a joule (a femtojoule). There's a great deal of engineering to do before this tech can take on complex AIs, but we've already made great strides and mapped out a path forward.

AI’s Fundamental Function

Column of three yellow dots connected to a blue dot with an arrow pointing to an output.

The most basic computation in an artificial neural network is called multiply and accumulate. The outputs of artificial neurons [left, yellow] are multiplied by the weight values connecting them to the next neuron [center, light blue]. That neuron sums its inputs and applies an output function. In analog AI, the multiply function is performed by Ohm's Law, where the neuron's output voltage is multiplied by the conductance representing the weight value. The summation at the neuron is done by Kirchhoff's Current Law, which simply adds all the currents entering a single node.

The biggest time and energy costs in most computers occur when lots of data has to move between external memory and computational resources such as CPUs and GPUs. This is the "von Neumann bottleneck," named after the classic computer architecture that separates memory and logic. One way to greatly reduce the power needed for deep learning is to avoid moving the data—to do the computation out where the data is stored.

DNNs are composed of layers of artificial neurons. Each layer of neurons drives the output of those in the next layer according to a pair of values—the neuron's "activation" and the synaptic "weight" of the connection to the next neuron.

Most DNN computation is made up of what are called vector-matrix-multiply (VMM) operations—in which a vector (a one-dimensional array of numbers) is multiplied by a two-dimensional array. At the circuit level these are composed of many multiply-accumulate (MAC) operations. For each downstream neuron, all the upstream activations must be multiplied by the corresponding weights, and these contributions are then summed.
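As a plain software reference for what follows, a vector-matrix multiply can be written as one multiply-accumulate loop per downstream neuron. The sketch below is illustrative only; the function names and the convention that `weight_matrix[i][j]` connects upstream neuron `i` to downstream neuron `j` are assumptions made for this example.

```python
def mac(activations, weights):
    """Multiply-accumulate: sum of activation * weight for one neuron."""
    acc = 0.0
    for a, w in zip(activations, weights):
        acc += a * w
    return acc


def vmm(activations, weight_matrix):
    """Vector-matrix multiply built from one MAC per downstream neuron."""
    n_out = len(weight_matrix[0])
    return [
        mac(activations, [row[j] for row in weight_matrix])
        for j in range(n_out)
    ]


x = [1.0, 2.0, 3.0]            # three upstream activations
W = [[0.5, -1.0],              # weights from upstream neuron 0
     [0.25, 0.0],              # weights from upstream neuron 1
     [1.0, 2.0]]               # weights from upstream neuron 2
y = vmm(x, W)
print(y)  # [4.0, 5.0]
```

A digital processor performs these multiplications and additions in sequence or in limited batches, fetching each weight from memory as it goes.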

Most useful neural networks are too large to be stored within a processor's internal memory, so weights must be brought in from external memory as each layer of the network is computed, each time subjecting the calculations to the dreaded von Neumann bottleneck. This leads digital compute hardware to favor DNNs that move fewer weights in from memory and then aggressively reuse these weights.

A radical new approach to energy-efficient DNN hardware occurred to us at IBM Research back in 2014. Together with other investigators, we had been working on crossbar arrays of nonvolatile memory (NVM) devices. Crossbar arrays are constructs where devices, memory cells for example, are built in the vertical space between two perpendicular sets of horizontal conductors, the so-called bitlines and the wordlines. We realized that, with a few slight adaptations, our memory systems would be ideal for DNN computations, particularly those for which existing weight-reuse tricks work poorly. We refer to this opportunity as "analog AI," although other researchers doing similar work also use terms like "processing-in-memory" or "compute-in-memory."

There are several varieties of NVM, and each stores data differently. But data is retrieved from all of them by measuring the device's resistance (or, equivalently, its inverse—conductance). Magnetoresistive RAM (MRAM) uses electron spins, and flash memory uses trapped charge. Resistive RAM (RRAM) devices store data by creating and later disrupting conductive filamentary defects within a tiny metal-insulator-metal device. Phase-change memory (PCM) uses heat to induce rapid and reversible transitions between a high-conductivity crystalline phase and a low-conductivity amorphous phase.

Flash, RRAM, and PCM offer the low- and high-resistance states needed for conventional digital data storage, plus the intermediate resistances needed for analog AI. But only RRAM and PCM can be readily placed in a crossbar array built in the wiring above silicon transistors in high-performance logic, to minimize the distance between memory and logic.

We organize these NVM memory cells in a two-dimensional array, or "tile." Included on the tile are transistors or other devices that control the reading and writing of the NVM devices. For memory applications, a read voltage addressed to one row (the wordline) creates currents proportional to each NVM device's conductance (the inverse of its resistance) that can be detected on the columns (the bitlines) at the edge of the array, retrieving the stored data.

To make such a tile part of a DNN, each row is driven with a voltage for a duration that encodes the activation value of one upstream neuron. Each NVM device along the row encodes one synaptic weight with its conductance. The resulting read current is effectively performing, through Ohm's Law (in this case expressed as "current equals voltage times conductance"), the multiplication of excitation and weight. The individual currents on each bitline then add together according to Kirchhoff's Current Law. The charge generated by those currents is integrated over time on a capacitor, producing the result of the MAC operation.
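The paragraph above can be mimicked numerically. The sketch below is an idealized model under stated assumptions: a fixed read voltage, perfectly linear devices, no wire resistance, and no noise. The voltage and conductance values are made up for illustration and do not describe any real chip.

```python
READ_VOLTAGE = 0.2  # volts; an assumed value for illustration


def bitline_charges(durations_s, conductances_s, read_voltage=READ_VOLTAGE):
    """Charge collected on each bitline capacitor of an idealized tile.

    durations_s[i]       pulse duration on row i (encodes the activation)
    conductances_s[i][j] conductance at row i, column j (encodes the weight)
    """
    n_cols = len(conductances_s[0])
    charges = [0.0] * n_cols
    for duration, row in zip(durations_s, conductances_s):
        for j, g in enumerate(row):
            current = read_voltage * g        # Ohm's Law: I = V * G
            charges[j] += current * duration  # currents add on the bitline
                                              # and integrate on the capacitor
    return charges


durations = [10e-9, 20e-9]                 # seconds
conductances = [[1e-6, 2e-6],              # siemens
                [3e-6, 4e-6]]
charges = bitline_charges(durations, conductances)
print(charges)  # coulombs; proportional to the MAC result for each column
```

Dividing each charge by the read voltage gives the sum of duration times conductance, which is the multiply-accumulate result in the units chosen for activations and weights.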

These same analog in-memory summation techniques can also be performed using flash and even SRAM cells, which can be made to store multiple bits but not analog conductances. But we can't use Ohm's Law for the multiplication step. Instead, we use a technique that can accommodate the one- or two-bit dynamic range of these memory devices. However, this technique is highly sensitive to noise, so we at IBM have stuck to analog AI based on PCM and RRAM.

Unlike conductances, DNN weights and activations can be either positive or negative. To implement signed weights, we use a pair of current paths—one adding charge to the capacitor, the other subtracting. To implement signed excitations, we allow each row of devices to swap which of these paths it connects with, as needed.
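One way to picture the signed arithmetic is the sketch below. It assumes weights normalized to the range -1 to 1 and a simple encoding in which only one device of each pair is nonzero; the maximum conductance and the function names are hypothetical. Real designs may split a weight across the pair differently.

```python
G_MAX = 25e-6  # siemens; assumed maximum device conductance


def encode_weight(w):
    """Map a weight in [-1, 1] to a (g_plus, g_minus) conductance pair."""
    if not -1.0 <= w <= 1.0:
        raise ValueError("weight must lie in [-1, 1]")
    return (max(w, 0.0) * G_MAX, max(-w, 0.0) * G_MAX)


def signed_mac(activations, weights):
    """MAC with signed weights and signed activations.

    Positive contributions add charge, negative ones subtract it. A row
    with a negative activation swaps which path each device feeds.
    """
    charge_add = 0.0
    charge_subtract = 0.0
    for a, w in zip(activations, weights):
        g_plus, g_minus = encode_weight(w)
        if a < 0:
            g_plus, g_minus = g_minus, g_plus  # swap the two current paths
        charge_add += abs(a) * g_plus
        charge_subtract += abs(a) * g_minus
    return (charge_add - charge_subtract) / G_MAX


result = signed_mac([1.0, -2.0, 0.5], [0.5, 0.25, -1.0])
print(result)  # 0.5 - 0.5 - 0.5, approximately -0.5
```

Every conductance in the sketch stays nonnegative, as a physical device requires; the signs live entirely in the choice of path.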

Nonvolatile Memories for Analog AI

Layers of colored lines with red dots in the 2nd row which is blue.

Phase-change memory's conductance is set by the transition between a crystalline and an amorphous state in a chalcogenide glass.

Three rows of color with the middle tan layer having white dots labeled “Vacancy.”

In resistive RAM, conductance depends on the creation and destruction of conductive filaments in an insulator.

Rows of colors with red dots.

Flash memory stores data as charge trapped in a "floating gate." The presence or absence of that charge modifies conductances across the device.

Rows of colors with plus and minus icons and a label that says “Electrochemical RAM”

Electrochemical RAM acts like a miniature battery. Pulses of voltage on a gate electrode modulate the conductance between the other two terminals by the exchange of ions through a solid electrolyte.

With each column performing one MAC operation, the tile does an entire vector-matrix multiplication in parallel. For a tile with 1,024 × 1,024 weights, this is 1 million MACs at once.

In systems we've designed, we expect that all these calculations can take as little as 32 nanoseconds. Because each MAC performs a computation equivalent to that of two digital operations (one multiply followed by one add), performing these 1 million analog MACs every 32 nanoseconds represents 65 trillion operations per second.

We've built tiles that manage this feat using just 36 femtojoules of energy per operation, the equivalent of 28 trillion operations per joule. Our latest tile designs reduce this figure to less than 10 fJ, making them 100 times as efficient as commercially available hardware and 10 times better than the system-level energy efficiency of the latest custom digital accelerators, even those that aggressively sacrifice precision for energy efficiency.

It's been important for us to make this per-tile energy efficiency high, because a full system consumes energy on other tasks as well, such as moving activation values and supporting digital circuitry.
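The throughput and efficiency figures quoted above follow from simple arithmetic, reproduced here as a quick check. All inputs are the numbers stated in the text; only the variable names are invented.

```python
ROWS = 1024
COLS = 1024
MACS_PER_TILE = ROWS * COLS   # about 1 million weights, one MAC each
OPS_PER_MAC = 2               # one multiply plus one add
TILE_TIME_S = 32e-9           # 32 nanoseconds per vector-matrix multiply
ENERGY_PER_OP_J = 36e-15      # 36 femtojoules per operation

ops_per_second = MACS_PER_TILE * OPS_PER_MAC / TILE_TIME_S
ops_per_joule = 1.0 / ENERGY_PER_OP_J

print(f"{ops_per_second / 1e12:.1f} trillion operations per second")  # 65.5
print(f"{ops_per_joule / 1e12:.1f} trillion operations per joule")    # 27.8
```

By the same arithmetic, an energy below 10 fJ per operation corresponds to more than 100 trillion operations per joule.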

There are significant challenges to overcome for this analog-AI approach to really take off. First, deep neural networks, by definition, have multiple layers. To cascade multiple layers, we must process the VMM tile's output through an artificial neuron's activation (a nonlinear function) and convey it to the next tile. The nonlinearity could potentially be performed with analog circuits and the results communicated in the duration form needed for the next layer, but most networks require other operations beyond a simple cascade of VMMs. That means we need efficient analog-to-digital conversion (ADC) and modest amounts of parallel digital compute between the tiles. Novel, high-efficiency ADCs can help keep these circuits from affecting the overall efficiency too much. Recently, we unveiled a high-performance PCM-based tile using a new kind of ADC that helped the tile achieve better than 10 trillion operations per second per watt.

A second challenge, which has to do with the behavior of NVM devices, is more troublesome. Digital DNNs have proven accurate even when their weights are described with fairly low-precision numbers. The 32-bit floating-point numbers that CPUs often calculate with are overkill for DNNs, which usually work just fine and with less energy when using 8-bit floating-point values or even 4-bit integers. This provides hope for analog computation, so long as we can maintain a similar precision.

Given the importance of conductance precision, writing conductance values to NVM devices to represent weights in an analog neural network needs to be done slowly and carefully. Compared with traditional memories, such as SRAM and DRAM, PCM and RRAM are already slower to program and wear out after fewer programming cycles. Fortunately, for inference, weights don't need to be frequently reprogrammed. So analog AI can use time-consuming write-verification techniques to boost the precision of programming RRAM and PCM devices without any concern about wearing the devices out.

That boost is much needed because nonvolatile memories have an inherent level of programming noise. RRAM's conductivity depends on the movement of just a few atoms to form filaments. PCM's conductivity depends on the random formation of grains in the polycrystalline material. In both, this randomness poses challenges for writing, verifying, and reading values. Further, in most NVMs, conductances change with temperature and with time, as the amorphous phase structure in a PCM device drifts, or the filament in an RRAM relaxes, or the trapped charge in a flash memory cell leaks away.
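The write-verification idea mentioned above can be sketched as a loop that applies a programming pulse, reads the device back, and repeats until the stored value is close enough. The device model here is invented for illustration: each pulse is assumed to close about half of the remaining gap, with random error proportional to the size of the step. Real PCM and RRAM devices behave in more complicated ways.

```python
import random


def write_verify(target, rng, tolerance=0.01, max_pulses=100, noise_frac=0.05):
    """Program a normalized conductance by repeated pulse-and-read cycles.

    Returns the final conductance and the number of pulses used.
    """
    g = 0.0
    pulses = 0
    while abs(target - g) > tolerance and pulses < max_pulses:
        error = target - g                      # verify step: read and compare
        step = 0.5 * error                      # assumed: pulse closes half the gap
        noise = rng.gauss(0.0, noise_frac * abs(error))
        g += step + noise                       # noisy programming pulse
        pulses += 1
    return g, pulses


final_g, pulses_used = write_verify(0.7, random.Random(0))
print(final_g, pulses_used)
```

Each write takes several pulses rather than one, which is the time cost the text refers to. That cost is acceptable for inference because the weights are written once and then read many times.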

There are some ways to finesse this problem. Significant improvements in weight programming can be obtained by using two conductance pairs. Here, one pair holds most of the signal, while the other pair is used to correct for programming errors on the main pair. Noise is reduced because it gets averaged out across more devices.

We tested this approach recently in a multitile PCM-based chip, using both one and two conductance pairs per weight. With it, we demonstrated excellent accuracy on several DNNs, even on a recurrent neural network, a type that's typically sensitive to weight programming errors.
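A toy model can show why a second conductance pair helps. In this sketch, the main pair is written first, the leftover error is measured, and a second pair is programmed to cancel that error. The assumption that the correction pair is read out at a reduced gain, along with the noise level and all names, is hypothetical and chosen only to make the effect visible; it is not necessarily how the chip described above weights its two pairs.

```python
import math
import random

NOISE_STD = 0.02   # assumed programming noise per write, in weight units
MINOR_GAIN = 0.25  # assumed: the correction pair is read out at reduced gain


def program(target, rng):
    """One noisy write: the stored value misses the target by a random error."""
    return target + rng.gauss(0.0, NOISE_STD)


def write_one_pair(target, rng):
    return program(target, rng)


def write_two_pairs(target, rng):
    major = program(target, rng)                 # holds most of the signal
    residual = target - major                    # error measured after the write
    minor = program(residual / MINOR_GAIN, rng)  # second pair stores the correction
    return major + MINOR_GAIN * minor            # effective weight seen by the MAC


def rms_error(write_fn, trials=5000, seed=0):
    """Root-mean-square weight error over many random target weights."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        target = rng.uniform(-1.0, 1.0)
        total += (write_fn(target, rng) - target) ** 2
    return math.sqrt(total / trials)


print("one pair: ", rms_error(write_one_pair))
print("two pairs:", rms_error(write_two_pairs))
```

In this model the remaining error is the correction pair's own write noise scaled down by the gain, so the two-pair weight lands closer to its target than the one-pair weight.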

Vector-Matrix Multiplication with Analog AI

Column of colored dots connected by blue lines.

Vector-matrix multiplication (VMM) is the core of a neural network's computing [top]; it is a collection of multiply-and-accumulate processes. Here the activations of artificial neurons [yellow] are multiplied by the weights of their connections [light blue] to the next layer of neurons [green].

Rows of white square with yellow, blue and green dots around the outside.

For analog AI, VMM is performed on a crossbar array tile [center]. At each cross point, a nonvolatile memory cell encodes the weight as conductance. The neurons' activations are encoded as the duration of a voltage pulse. Ohm's Law dictates that the current along each crossbar column is equal to this voltage times the conductance. Capacitors [not shown] at the bottom of the tile sum up these currents. A neural network's multiple layers are represented by converting the output of one tile into the voltage duration pulses needed as the input to the next tile [right].

Different techniques can help ameliorate noise in reading and drift effects. But because drift is predictable, perhaps the simplest is to amplify the signal during a read with a time-dependent gain that can offset much of the error. Another approach is to use the same techniques that have been developed to train DNNs for low-precision digital inference. These adjust the neural-network model to match the noise limitations of the underlying hardware.
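The time-dependent gain can be illustrated with a simple drift model. The sketch assumes conductance decays as a power law in elapsed time, a commonly used empirical description of PCM drift, and assumes the drift exponent is known exactly. The exponent, reference time, and function names are illustrative assumptions rather than values from the authors' hardware.

```python
DRIFT_EXPONENT = 0.05  # assumed drift exponent
T0_S = 1.0             # assumed reference time after programming, in seconds


def drifted_conductance(g0, t_s, nu=DRIFT_EXPONENT):
    """Conductance at time t_s, modeled as g0 * (t / t0) ** (-nu)."""
    return g0 * (t_s / T0_S) ** (-nu)


def compensation_gain(t_s, nu=DRIFT_EXPONENT):
    """Read-time gain that undoes the modeled drift."""
    return (t_s / T0_S) ** nu


g0 = 10e-6               # siemens, as programmed
one_day_s = 86400.0
raw = drifted_conductance(g0, one_day_s)
corrected = raw * compensation_gain(one_day_s)
print(raw, corrected)
```

In practice the exponent varies from device to device, so a single gain per tile removes most of the error but not all of it.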

As we mentioned, networks are becoming larger. In a digital system, if the network doesn't fit on your accelerator, you bring in the weights for each layer of the DNN from external memory chips. But NVM's writing limitations make that a poor decision. Instead, multiple analog AI chips should be ganged together, with each passing the intermediate results of a partial network from one chip to the next. This scheme incurs some additional communication latency and energy, but it's far less of a penalty than moving the weights themselves.

Until now, we've only been talking about inference—where an already-trained neural network acts on novel data. But there are also opportunities for analog AI to help train DNNs.

DNNs are trained using the backpropagation algorithm. This combines the usual forward inference operation with two other important steps—error backpropagation and weight update. Error backpropagation is like running inference in reverse, moving from the last layer of the network back to the first layer; weight update then combines information from the original forward inference run with these backpropagated errors to adjust the network weights in a way that makes the model more accurate.

The Tiki-Taka Solution

Analog AI can reduce the power consumption of training neural networks, but because of some inherent characteristics of the nonvolatile memories involved, there are some complications. Nonvolatile memories, such as phase-change memory and resistive RAM, are inherently noisy. What's more, their behavior is asymmetric. That is, at most points on their conductance curve, the same value of voltage will produce a different change in conductance depending on the voltage's polarity.

One solution we came up with, the Tiki-Taka algorithm, is a modification to backpropagation training. Crucially, it is significantly more robust to noise and asymmetric behavior in the NVM conductance. This algorithm depends on RRAM devices constructed to conduct in both directions. Each device is initialized to its symmetry point, the spot on its conductance curve where the conductance increase and decrease for a given voltage are exactly balanced. In Tiki-Taka, the symmetry-point-balanced NVM devices are involved in weight updates to train the network. Periodically, their conductance values are programmed onto a second set of devices, and the training devices are returned to their natural symmetry point. This allows the neural network to train to high accuracy, even in the presence of noise and asymmetry that would completely disrupt the conventional backpropagation algorithm.

The backpropagation step can be done in place on the tiles but in the opposite manner of inferencing—applying voltages to the columns and integrating current along rows. Weight update is then performed by driving the rows with the original activation data from the forward inference, while driving the columns with the error signals produced during backpropagation.
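The three operations on a single tile can be sketched as plain arithmetic on a conductance matrix. This is a minimal illustration, assuming one linear layer, a squared-error loss, and ideal devices; the function names and the learning rate are hypothetical. Rows carry the inputs and columns carry the outputs, matching the description above.

```python
def forward(g, x):
    """Inference: drive the rows with x, sum the current down each column."""
    n_rows, n_cols = len(g), len(g[0])
    return [sum(g[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)]


def backward(g, delta):
    """Backpropagation: drive the columns with delta, sum the current along each row."""
    return [sum(g_ij * d for g_ij, d in zip(row, delta)) for row in g]


def update(g, x, delta, lr):
    """Weight update: rows carry x, columns carry delta; cell (i, j) sees x[i] * delta[j]."""
    for i, row in enumerate(g):
        for j in range(len(row)):
            row[j] -= lr * x[i] * delta[j]


g = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.0]]  # 3 rows (inputs) x 2 columns (outputs)
x = [1.0, 2.0, -1.0]
target = [0.5, 0.0]

y = forward(g, x)
delta = [yi - ti for yi, ti in zip(y, target)]   # error at the outputs
upstream = backward(g, delta)                    # error passed to the previous layer
error_before = sum(d * d for d in delta)

update(g, x, delta, lr=0.1)
y_after = forward(g, x)
error_after = sum((yi - ti) ** 2 for yi, ti in zip(y_after, target))
```

The same array serves all three steps: reading it one way gives the forward product, reading it the other way gives the transposed product, and the update is an outer product applied at every crosspoint at once.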

Training involves numerous small weight increases and decreases that must cancel out properly. That's difficult for two reasons. First, recall that NVM devices wear out with too much programming. Second, the same voltage pulse applied with opposite polarity to an NVM may not change the cell's conductance by the same amount; its response is asymmetric. But symmetric behavior is critical for backpropagation to produce accurate networks. The challenge grows because the conductance changes needed for training are close in magnitude to the inherent randomness of the materials in the NVMs.
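A toy device model shows why asymmetry is so damaging. The sketch below assumes a "soft bounds" response, in which each pulse moves the weight by a fraction of the remaining distance to the bound it is heading toward; the model, the weight range of −1 to 1, and the step size are illustrative assumptions rather than measured device behavior.

```python
def pulse(w, direction, step=0.1):
    """One programming pulse on an assumed soft-bounds device with range [-1, 1]."""
    if direction > 0:
        return w + step * (1.0 - w)   # increases shrink near the upper bound
    return w - step * (1.0 + w)       # decreases shrink near the lower bound


def alternate(w, pairs, step=0.1):
    """Apply `pairs` up/down pulse pairs that ideally would cancel exactly."""
    for _ in range(pairs):
        w = pulse(w, +1, step)
        w = pulse(w, -1, step)
    return w


start = 0.6
after_one_pair = alternate(start, 1)    # an ideal device would return to 0.6
after_many = alternate(start, 200)      # the stored weight is gradually erased
```

In this model an up pulse followed by a down pulse does not return the weight to its starting value. Repeated pairs pull it toward the neighborhood of the symmetry point (w = 0 here), so information stored in the weight leaks away even though the net requested update is zero.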

There are several approaches that can help here. For example, there are various ways to aggregate weight updates across multiple training examples, and then transfer these updates onto NVM devices periodically during training. A novel algorithm we developed at IBM, called Tiki-Taka, uses such techniques to train DNNs successfully even with highly asymmetric RRAM devices. Finally, we are developing a device called electrochemical random-access memory (ECRAM) that can offer not just symmetric but highly linear and gradual conductance updates.
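One way to aggregate updates can be sketched as follows. This is a hedged illustration, not the Tiki-Taka algorithm: it assumes a digital accumulator that collects small updates and programs the device only when a whole device-sized step has built up. The function name, learning rate, and step size are hypothetical.

```python
def train_with_accumulation(gradients, lr=0.125, device_step=0.5):
    """Accumulate small updates digitally; pulse the NVM only in whole device steps."""
    weight = 0.0        # value stored on the NVM device
    accumulator = 0.0   # pending update, kept in digital memory
    writes = 0          # programming pulses actually applied
    for grad in gradients:
        accumulator -= lr * grad
        pulses = int(accumulator / device_step)  # truncates toward zero
        if pulses != 0:
            weight += pulses * device_step
            accumulator -= pulses * device_step
            writes += abs(pulses)
    return weight, accumulator, writes


# 100 gradients that cancel in pairs, followed by 20 that push in one direction.
gradients = [1.0, -1.0] * 50 + [-1.0] * 20
weight, accumulator, writes = train_with_accumulation(gradients)
naive_writes = len(gradients)   # one programming event per training example
```

The canceling updates never reach the device, so it is programmed far less often than once per example. That eases the wear problem and keeps each programming event well above the device's noise floor.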

The success of analog AI will depend on achieving high density, high throughput, low latency, and high energy efficiency—simultaneously. Density depends on how tightly the NVMs can be integrated into the wiring above a chip's transistors. Energy efficiency at the level of the tiles will be limited by the circuitry used for analog-to-digital conversion.

But even as these factors improve and as more and more tiles are linked together, Amdahl's Law, an argument about the limits of parallel computing, will pose new challenges to optimizing system energy efficiency. Previously unimportant aspects such as data communication and the residual digital computing needed between tiles will consume a growing share of the energy budget, leading to a gap between the peak energy efficiency of the tile itself and the sustained energy efficiency of the overall analog-AI system. Of course, that's a problem that eventually arises for every AI accelerator, analog or digital.
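The effect can be seen by applying the standard form of Amdahl's Law to an energy budget. In this sketch the 90 percent share attributed to the tiles is a made-up number for illustration, and `amdahl_gain` is a hypothetical helper name.

```python
def amdahl_gain(improved_fraction, factor):
    """Overall gain when only `improved_fraction` of the baseline cost improves by `factor`."""
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / factor)


tile_share = 0.9   # hypothetical share of baseline energy spent inside the tiles

for factor in (10, 100, 1000):
    overall = amdahl_gain(tile_share, factor)
    print(f"tiles {factor}x more efficient -> system {overall:.2f}x more efficient")

# Ceiling set by the part that does not improve (communication, digital logic).
ceiling = 1.0 / (1.0 - tile_share)
```

However efficient the tiles become, the system-level gain in this example stays below 10x, because the remaining 10 percent of the energy is spent outside the tiles.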

The path forward is necessarily different from digital AI accelerators. Digital approaches can bring precision down until accuracy falters. But analog AI must first increase the signal-to-noise ratio (SNR) of the internal analog modules until it is high enough to demonstrate accuracy equivalent to that of digital systems. Any subsequent SNR improvements can then be applied toward increasing density and energy efficiency.

These are exciting problems, and it will take the coordinated efforts of materials scientists, device experts, circuit designers, system architects, and DNN experts to solve them. There is a strong and continuing need for more energy-efficient AI acceleration, and a shortage of attractive alternatives for meeting that need. Given the wide variety of potential memory devices and implementation paths, it is quite likely that some degree of analog computation will find its way into future AI accelerators.

This article appears in the December 2021 print issue as "Ohm's Law + Kirchhoff's Current Law = Better AI."
