AI at Autodesk, an interview with Mike Haley

Head of R&D on Autodesk’s history and future with AI

Mike Haley, Senior vice president at Autodesk, is the head of R&D.

I have embarked on a project to categorize levels of AI, much in the same way that SAE categorizes autonomous vehicles, from Level 0 to 5, with 5 being the most automated and where the vehicle has no steering wheel or pedals. I wanted to talk to key people in the design and engineering software industry to see if a similar categorization of levels would be helpful.

With AI fever running high, not a day goes by without one company or another announcing an AI implementation. Many of them claim superiority over their competition, each in its own way saying AI has already been implemented and more is soon to come. The public, on the other hand, swings between excessive optimism, as suggested by “Hey ChatGPT, finish this building,” and outright pessimism about losing jobs and, in the worst case, lives. Think of Skynet in “The Terminator.”

It is hard to sort out the claims without having clearly defined levels in place. Such levels would not only help editors but might also reassure designers and engineers who are afraid of losing their jobs, by showing them the levels AI has not yet attained, levels at which they are still needed and useful.

We start with Mike Haley, who is in charge of Autodesk Research. Haley’s most important directive over the last year may have been AI. ChatGPT did that. And Autodesk has turned up the pressure. CEO Andrew Anagnost declared Autodesk the industry leader in generative AI at the company’s last quarterly call.

Here is a transcript of my conversation with Mike Haley.

RT: Hi, Mike. How long have you been at Autodesk?

MH: August this year, it will be 23 years.

You went with Carl Bass to Buzzsaw, right?

Yes, that is when I joined, in 2000. I was the architect. We were the crazy people doing cloud stuff in San Francisco back in 2000 that everybody thought would go away after a while.

But the cloud was the product. Buzzsaw was just ahead of its time. Now everybody’s on the cloud.

Isn’t it amazing?

You’re again on the forefront of technology. AI is the most exciting place to be now. We want to know what Autodesk is up to with AI. And we have very high expectations.

My original background was in computer graphics. I studied in the 80s when computer graphics was having its heyday. I thought I was never going to see that kind of energy in technology again. But we are seeing it even greater now with AI. I’m loving it. It’s an electrifying time. The possibilities are endless.

I’m trying to classify the amount of AI that’s in design or ought to be in design. CAD, by definition, is computer-aided design. It’s always had some intelligence. Let’s say we are starting at level zero or one with CAD the way it is now, and at the highest level, CAD can design a car or building. So here we are, somewhere in between. I’m trying to distinguish, in my own way, what level of AI we’re at.

I don’t disagree with that kind of framework. I think it’s a useful framework. There’s quite a lot of nuance around it that is important to know. But to your immediate question, ChatGPT brought AI into common awareness. It has driven levels of investment and activity and, most importantly, levels of acceptance. Adoption that might have taken 5 years for a new technology happened in a matter of months. Technically, what’s under the hood of ChatGPT is unbelievably impressive. If you were an AI researcher working in this field, it would have looked like a natural progression of what was already happening. Most AI researchers, and I’ll include myself in that category, didn’t fully realize how quickly it was going to become useful. Sometimes, it’s really hard to know when you are working directly with computer-aided design. At what level does “aided” actually become useful? At what level is it annoying? There’s a point where it unlocks the next level of creativity.

AI has really been in ongoing development since 2011. There have been three big things that have flipped technology on its side. The first was the notion of unsupervised or self-supervised learning. When we started with AI in 2011, all training was based on supervised learning, which means that if I’m going to train a neural network to understand a cat, I need to give it lots of pictures, label them and tell it every single time: that’s a cat, or that’s a dog.

That was a lot of humans actually tagging, correct?

Yes, exactly. That was manual labeling. When you do that, you’ve limited your ability to do things. There are only so many people you’re going to find to do that. You can’t scale. You’re going to introduce bias that reflects the bias of the people. If you really want to start training very large models, the larger the model (more precisely, the more neurons inside it), the more data you need to train it on. There’s a direct relationship between the two. If you want a more intelligent model with more capacity, you’re going to have to find more data. Around 2015, we bumped into that limitation: we could build bigger models, but supervised training wasn’t able to handle it.

That was the beginning of unsupervised training?

Yes, or, as it is sometimes called, self-supervised training. For example, self-supervised training means taking an image, removing something from it, and training an algorithm to figure out what is missing, like filling in the missing pixels. That’s self-supervised because you have the original image, and you took something out of it. Or take a sentence, remove some of the words, and train the model to fill in the missing words. That doesn’t require a whole bunch of people. You have the original sentence and you’ve got the sentence with words removed. It’s like de-noising a signal. It’s understanding the missing pieces. If you do that on a massive scale, these models learn unbelievably well. That was the first big thing that happened. The second big thing was the invention of the transformer. That’s the ‘T’ in ChatGPT. ChatGPT stands for chat generative pretrained transformer, and the transformer is the architecture inside the neural network. It is based on a principle from an earlier paper called “Pointer Networks”: instead of training a neural network to simply predict an output from an input, you train the network to figure out what to pay attention to. Imagine I gave you a sentence with a couple of missing words. As a human being, you would pay more attention to some words in that sentence than to others. Words like ‘then’ and ‘and’ would tell you little. But if you saw a reference to, say, a bridge, that would give you context. That is what transformers do. They learn what to pay attention to. By doing that, you can take the power of a massively trained network and apply it at just the right places in the input. Before, the networks would look at every pixel, every word, every character evenly, not understanding which was important and which was not. By learning what’s important, they become much more powerful.
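Editor’s sketch: the two ideas Haley describes here can be illustrated in a few lines of Python. This is a toy under stated assumptions, not how ChatGPT or any Autodesk system is built. The function names, the `[MASK]` token, and the hand-picked relevance scores are all hypothetical. A real transformer learns its scores from data rather than having them typed in. The first function builds a training pair from an unlabeled sentence by hiding words. The second turns relevance scores into attention weights with a softmax.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token for a hidden word


def make_masked_example(sentence, mask_fraction=0.3, seed=0):
    """Turn one unlabeled sentence into an (input, targets) training pair.

    No human labeling is needed: the hidden words themselves are the answers.
    """
    rng = random.Random(seed)
    words = sentence.split()
    n_masked = max(1, round(len(words) * mask_fraction))
    positions = set(rng.sample(range(len(words)), n_masked))
    masked = [MASK if i in positions else w for i, w in enumerate(words)]
    targets = {i: words[i] for i in sorted(positions)}
    return masked, targets


def attention_weights(scores):
    """Softmax: convert raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Self-supervised pair built from a single sentence.
masked, targets = make_masked_example("the bridge spans the river near the old town")
print(masked)
print(targets)

# Hand-picked (assumed) scores: "bridge" carries context, "and"/"then" do not.
words = ["the", "bridge", "spans", "and", "then"]
weights = attention_weights([0.1, 3.0, 1.5, 0.1, 0.1])
for word, weight in zip(words, weights):
    print(f"{word:>7}: {weight:.2f}")
```

Running it prints the sentence with some words hidden, the words a model would be trained to recover, and a set of weights in which “bridge” receives the largest share.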

Another example of context is Google knowing where you are located when you ask it for news. It will pay attention to where you are located.

Yes. That’s a higher level of the same sort of thing. These algorithms start at the character level, but as you add the different layers, you can get to something like asking Google a question. The input to that question is going to include: Where are you in the day? What’s your past history of questions? What document do you have open right now? Things like that. All that provides context. It helps the algorithm produce something which is much more relevant and accurate to what you need. That’s transformers. The third and final thing is the notion of multimodal models. Multimodal refers to the types of digital data that the model can take as input and produce as output. We’re seeing that with the new DALL-E and Midjourney. You input text and you get an image. These models have learned a correlation between language and pixels. They understand that relationship. It turns out that any digital information can have that kind of understanding. You can train a network to connect language to audio, for example. You can connect animations to music. You can connect anything that has even a remote correlation. You can train these networks now, but it’s not a trivial exercise. It’s very complex to actually do multimodal. There’s a lot to do with signal frequencies and alignment and all sorts of things. Lots of people have been working on that for about the last 8 years. What you’re seeing now with things like Stable Diffusion and Midjourney and Runway and others is people who have combined two technologies. That’s just unsupervised learning plus transformers and hey, we have multimodal.

Now, let’s take human beings. Imagine a baby blindfolded and isolated from birth. It never gets to see anything. It never gets to touch anything. It can only hear, so it learns about the world through language. It would probably learn something, but what it learned would be a poor representation of the world. It wouldn’t know colors or textures. But a normal baby is learning through every modality that it can. One of the reasons we as humans are very good at reasoning about the world is that we learn through every modality that our body gives us. This is true of these networks. If you train a language model, you’ll get a pretty good language model at the end of the day. But if you train a language model together with imagery and together with other things, it’s going to become a lot more intelligent.

A great example of this is from Google Maps. A long time ago, Google Maps was based purely on geospatial information. They mined city- and state-level geospatial databases, flat roads, found stop signs, etc. Sure, it’s going to have gaps and errors. Maybe a stop street or stop sign is missing. Then they started doing Street View imagery. Their cars were driving around capturing images of everything. In those images, there’s a stop sign. Suddenly, their AI was able to learn. From that, it was able to fill in incomplete geospatial data. It could determine that there’s a high probability of a stop sign because the image that it has with the exact GPS coordinate had a stop sign in it. That’s an example where you had one mode of data, the geospatial data, and you were able to bring in another mode of data, the Street View imagery, and mash them together to come up with a better map.
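Editor’s sketch: the map example can be illustrated with a toy that combines two modes of data. It is not Google’s method. The function names, the 15-meter matching radius, the 0.8 confidence threshold and the coordinates are all assumptions made for illustration. A stop sign detected in imagery is added to the map only if the detection is confident and no known sign already sits nearby.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def distance_m(a, b):
    """Approximate ground distance in meters between two (lat, lon) points.

    Uses the equirectangular approximation, which is adequate at street scale.
    """
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return EARTH_RADIUS_M * math.hypot(x, y)


def fuse_stop_signs(map_signs, image_detections,
                    max_distance_m=15.0, min_confidence=0.8):
    """Fill gaps in the map data with confident detections from imagery."""
    fused = list(map_signs)
    for lat, lon, confidence in image_detections:
        if confidence < min_confidence:
            continue  # too uncertain to trust
        known = any(distance_m((lat, lon), sign) <= max_distance_m
                    for sign in fused)
        if not known:
            fused.append((lat, lon))  # the database was missing this sign
    return fused


# Hypothetical data: one sign in the database, three detections from imagery.
map_signs = [(37.7750, -122.4190)]
image_detections = [
    (37.77501, -122.41901, 0.95),  # same sign as the database entry
    (37.7760, -122.4200, 0.90),    # confident, and missing from the database
    (37.7770, -122.4210, 0.40),    # low confidence, ignored
]
print(fuse_stop_signs(map_signs, image_detections))
```

With this made-up data, only the second detection is added: the first matches a sign the map already has, and the third falls below the confidence threshold.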

To be continued…