Deep learning has been at the forefront of the so-called AI revolution for years now, and many people believed that it would take us to the world of the technological singularity. Many companies talked big in 2014, 2015, and 2016 when technologies such as AlphaGo were pushing new boundaries. For example, Tesla announced that its fully self-driving cars were very close, even selling that option to customers, to be enabled later via a software update.

We are now in the middle of 2018 and things have changed. Not on the surface yet — the NIPS conference is still oversold, corporate PR still has AI all over its press releases, Elon Musk still keeps promising self-driving cars, and Google keeps pushing Andrew Ng’s line that AI is bigger than electricity. But this narrative is beginning to crack. And as I predicted, the place where the cracks in AI are most visible is autonomous driving — an actual application of the technology in the real world.

The dust settled on deep learning

When the ImageNet visual recognition challenge was effectively solved (note: this does not mean that vision is solved) over the period of 2012 to 2017, many prominent researchers, including Yann LeCun, Andrew Ng, Fei-Fei Li, and the typically quiet Geoff Hinton, were actively giving press interviews and promoting the field on social media. The general tone was that we were at the forefront of a gigantic revolution and from then on things would only accelerate. Well, years have passed and their Twitter feeds have become less active, as exemplified by Ng’s numbers:

  • 2013: 0.413 tweets per day
  • 2014: 0.605 tweets per day
  • 2015: 0.320 tweets per day
  • 2016: 0.802 tweets per day
  • 2017: 0.668 tweets per day
  • 2018: 0.263 tweets per day (as of May 24)

Perhaps this is because Ng’s grand claims now face more scrutiny from the community, as illustrated by the thread below:

The sentiment has declined quite considerably. We’re seeing far fewer tweets praising deep learning as the ultimate algorithm, and the papers are becoming less “revolutionary” and much more “evolutionary.” DeepMind hasn’t shown anything breathtaking since AlphaGo Zero, and even that wasn’t that exciting, given the obscene amount of compute necessary and its applicability to games only (see Moravec’s paradox). OpenAI has been rather quiet, with its last media burst being the Dota 2 playing agent, which I suppose was meant to create as much buzz as AlphaGo but fizzled out rather quickly. In fact, articles began showing up claiming that even Google does not know what to do with DeepMind, as its results are apparently not as practical as expected.

As for the prominent researchers, they’ve generally been touring around, meeting with government officials in Canada or France to secure their future grants. Yann LeCun even stepped down (rather symbolically) from head of research at Facebook to the chief AI scientist role. This gradual shift from rich, big corporations to government-sponsored institutes suggests to me that corporate interest in this kind of research is slowly winding down. Again, these are all early signs: nothing spoken out loud, just tipped through body language.

Deep learning (does not) scale

One of the key slogans repeated about deep learning is that it scales almost effortlessly. In 2012, AlexNet had about 60 million parameters; we probably now have models with at least a thousand times that number, right? Well, we probably do — the question is, Are these things a thousand times as capable? Or even a hundred times as capable? A study by OpenAI comes in handy here:

In terms of applications for vision, we see that VGG and ResNets saturated somewhat around one order of magnitude of compute resources applied, and the number of parameters has actually fallen. Xception, a variation of Google’s Inception architecture, only slightly outperforms Inception on ImageNet, which arguably means it outperforms everyone, because essentially AlexNet solved ImageNet. So at 100 times more compute than AlexNet, we saturated architectures in terms of image classification. Neural machine translation is a big effort by all the big web search players, and no wonder it takes all the compute it can take.

The latest three points on that graph, interestingly, show reinforcement learning projects applied to games, from DeepMind and OpenAI. AlphaGo Zero and the slightly more general AlphaZero in particular take a ridiculous amount of compute, but are not applicable to real-world applications because much of that compute is needed to simulate and generate the data these data-hungry models need.

OK, so we can now train AlexNet in minutes rather than days, but can we train a thousand-times bigger AlexNet in days and get qualitatively better results? Apparently not.

This graph, which was meant to show how well deep learning scales, actually indicates the exact opposite. We can’t just scale up AlexNet and get proportionally better results. We have to fiddle with specific architectures, and additional compute effectively does not buy much without an order of magnitude more data samples, which in practice are only available in simulated game environments.

Self-driving crashes

By far the biggest blow to deep learning fame is the domain of self-driving vehicles. Initially, some thought that end-to-end deep learning could somehow solve this problem, a premise particularly heavily promoted by Nvidia. I don’t think there is a single person on Earth who still believes that, though I could be wrong.

Looking at last year’s California DMV disengagement reports, Nvidia-equipped cars could not drive ten miles without a disengagement. In a separate post, I discuss the general state of that development and compare it to human driver safety, which (spoiler alert) is not looking good.

Since 2016 there have been several Tesla Autopilot incidents, some of which were fatal. Arguably, Tesla Autopilot should not be confused with self-driving, but at least at the core, it relies on the same kind of technology. As of today, even leaving aside occasional spectacular errors, it still cannot stop at an intersection, recognize a traffic light, or even navigate through a roundabout. That last video is from March 2018, several months after the promised coast-to-coast Tesla autonomous drive that did not happen (the rumor is the company could not get it to work without about 30 disengagements).

Several months ago, in February 2018, when asked on a conference call about the coast-to-coast drive, Elon Musk said:

We could have done the coast-to-coast drive, but it would have required too much specialized code to effectively game it or make it somewhat brittle and that it would work for one particular route, but not the general solution. So I think we would be able to repeat it, but if it’s just not any other route, which is not really a true solution. (…)

I am pretty excited about how much progress we’re making on the neural net front. And it’s a little — it’s also one of those things that’s kind of exponential where the progress doesn’t seem — it doesn’t seem like much progress, it doesn’t seem like much progress, and suddenly wow. It will feel like, Well, this is a lame driver, lame driver. Like OK, that’s a pretty good driver. Like “Holy cow! This driver’s good.” It’ll be like that.

Well, looking at the graph above from OpenAI, I am not seeing that exponential progress. Neither is it visible in miles before disengagement for pretty much any big player in this field. In essence, the above statement should be interpreted as: “We currently don’t have the technology that could safely drive us coast to coast, though we could have faked it if we really wanted to (maybe). We deeply hope that some sort of exponential jump in capabilities of neural networks will soon happen and save us from disgrace and massive lawsuits.”

But by far the biggest pin punching through the AI bubble was the accident in which an Uber self-driving car killed a pedestrian in Arizona. In the preliminary report by the NTSB, we can read some astonishing statements:

Aside from the general system design failure apparent in this report, it is striking that the system spent long seconds trying to decide what exactly it was seeing in front of it (a pedestrian, bike, vehicle, or whatever else) rather than making the only logical decision in these circumstances, which was to make sure not to hit it.

There are several reasons for this. First, people will often verbalize their decisions post factum. So a human will typically say, “I saw a cyclist, therefore, I veered to the left to avoid him.” A huge amount of psychophysical literature suggests a quite different explanation: “A human saw something that was very quickly interpreted as an obstacle by fast perceptual loops of his nervous system, and he performed a rapid action to avoid it, long seconds later realizing what happened and providing a verbal explanation.”

There are many decisions we make every day that are not verbalized, and driving includes many of them. Verbalization is costly and takes time, and reality often does not provide that time. These mechanisms have evolved over a billion years to keep us safe, and the driving context (although modern) makes use of many such reflexes. And since these reflexes have not evolved specifically for driving, they may induce mistakes. A knee-jerk reaction to a wasp buzzing in a car may have caused many crashes and deaths. But our general understanding of 3D space and speed, and our ability to predict the behavior of agents and of physical objects crossing our path, are primitive skills that were just as useful 100 million years ago as they are today, and they’ve been honed by evolution.

But because most of these things are not easily verbalized, they are hard to measure, and consequently, we don’t optimize our machine learning systems on these aspects at all (see my earlier post for benchmark proposals that would address some of these capabilities). Now, this would speak in favor of Nvidia’s end-to-end approach (learn an image-to-action mapping, skipping any verbalization), and in some ways, this is the right way to do it. The problem is that the input space is incredibly high dimensional, while the action space is very low dimensional. Hence the “amount” of “label” (readout) is extremely small compared to the amount of information coming in.
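To put rough numbers on that imbalance, here is a minimal back-of-the-envelope sketch. The frame size and the two-value action vector (steering and throttle) are illustrative assumptions of mine, not figures from Nvidia's or anyone else's system, and the helper name signal_ratio is hypothetical.

```python
# Back-of-the-envelope comparison of input values vs. supervised output
# values per frame for an end-to-end image-to-action driving model.
# All sizes are illustrative assumptions, not taken from any real system.


def signal_ratio(width: int, height: int, channels: int, action_dims: int) -> float:
    """Return how many input values arrive per supervised output value."""
    input_dims = width * height * channels
    return input_dims / action_dims


# Hypothetical setup: one small RGB camera frame and two control outputs
# (steering angle and throttle).
WIDTH, HEIGHT, CHANNELS = 200, 66, 3
ACTION_DIMS = 2

input_dims = WIDTH * HEIGHT * CHANNELS
ratio = signal_ratio(WIDTH, HEIGHT, CHANNELS, ACTION_DIMS)

print(f"input values per frame: {input_dims}")
print(f"label values per frame: {ACTION_DIMS}")
print(f"inputs per label value: {ratio:.0f}")
```

Even with this deliberately small frame, each frame carries tens of thousands of input values but only two numbers of supervision, which is the mismatch described above.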

In such a situation, it is easy to learn spurious relations, as exemplified by adversarial examples in deep learning. A different paradigm is needed, and I postulate prediction of the entire perceptual input along with the action as a first step to make a system able to extract the semantics of the world, rather than spurious correlations. Read more about my first proposed architecture, called the Predictive Vision Model.
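For intuition on why high-dimensional inputs are so fragile, the toy sketch below flips the decision of a linear classifier by nudging every input value a tiny amount along the sign of its weight. This is the same idea as the fast gradient sign method, applied to a linear model. The classifier, its random weights, and the 10,000-value input are all assumptions made for illustration; a real deep network is not linear, so treat this as a simplified picture rather than a description of any actual system.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

DIMS = 10_000  # assumed input size, roughly a small image

# Hypothetical linear classifier: a positive score means "class A",
# a negative score means "class B". Weights and input are random stand-ins.
weights = [random.uniform(-1.0, 1.0) for _ in range(DIMS)]
x = [random.uniform(0.0, 1.0) for _ in range(DIMS)]


def score(w, v):
    """Weighted sum of the input values."""
    return sum(wi * vi for wi, vi in zip(w, v))


def sign(a):
    """Return -1, 0, or 1 depending on the sign of a."""
    return (a > 0) - (a < 0)


original = score(weights, x)

# Moving every input value by eps along sign(w_i) shifts the score by
# eps * sum(|w_i|). Use twice the eps needed to reach zero, so the score
# lands on the other side of the decision boundary.
l1_norm = sum(abs(wi) for wi in weights)
eps = 2.0 * abs(original) / l1_norm

push = -sign(original)  # direction that moves the score toward the other class
x_adv = [xi + push * eps * sign(wi) for xi, wi in zip(x, weights)]
adversarial = score(weights, x_adv)

print(f"per-value change: {eps:.4f} (original inputs range over 0..1)")
print(f"score before:     {original:+.3f}")
print(f"score after:      {adversarial:+.3f}")
```

Because the shift in the score is eps times the sum of the absolute weights, it grows with the number of input dimensions, so the per-value change needed to flip the decision shrinks as the input gets larger. For simplicity, the perturbed values are not clipped back into the 0..1 range.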

In fact, if there is anything at all we learned from the outburst of deep learning, it is that (10k+ dimensional) image space has enough spurious patterns in it that they actually generalize across many images and give the impression that our classifiers actually understand what they are seeing. Nothing could be further from the truth, as admitted even by the top researchers who are heavily invested in this field. In fact, Yann LeCun has warned about overexcitement and an AI winter for a while, and even Geoffrey Hinton, the father of the current outburst of backpropagation, said in an Axios interview that this is likely all a dead end and we need to start over. At this point, though, the hype is so strong that nobody will listen, even to the founding fathers of the field.

I should mention that more top-tier researchers are recognizing the hubris and have the courage to openly call it out. One of the most active in that space is Gary Marcus. Although I don’t agree with everything that Marcus proposes in terms of AI, we certainly agree that it is not yet as powerful as the propaganda claims. In fact, it is not even close. For those who missed it, in “Deep learning: A critical appraisal” and “In defense of skepticism about deep learning,” he meticulously deconstructs the deep learning hype. I respect Marcus a lot; he behaves like a real scientist should, while most so-called “deep learning stars” just behave like cheap celebrities.

Conclusion

Predicting the AI winter is like predicting a stock market crash: It’s impossible to tell precisely when it will happen, but it’s almost certain that it will happen at some point. Much like before a stock market crash, there are signs of the impending collapse, but the narrative is so strong that it is very easy to ignore them, even if they are in plain sight. In my opinion, signs already show a huge decline in deep learning (and probably in AI in general, as this term has been abused ad nauseam), yet they are hidden from the majority by an increasingly intense narrative. How “deep” will that winter be? I have no idea. What will come next? I have no idea. But I’m pretty positive it is coming, perhaps sooner rather than later.

This story originally appeared on Piekniewski’s blog. Copyright 2018.

Filip Piekniewski is a researcher working on computer vision and AI.