Introduction
So Pete told you all about the chip that we've designed that runs neural networks in the car. My team is responsible for training these neural networks, and that includes all of the data collection from the fleet, the neural network training, and then some of the deployment onto that chip.

So what do the neural networks do, exactly, in the car? What we are seeing here is a stream of videos from across the vehicle. These are eight cameras that send us videos, and these neural networks are looking at those videos, processing them, and making predictions about what they're seeing. Some of the things you're seeing on this visualization are lane line markings, other objects, the distances to those objects, what we call drivable space (shown in blue), which is where the car is allowed to go, and a lot of other predictions like traffic lights, traffic signs, and so on.

My talk will go roughly in three stages. First, I'm going to give you a short primer on neural networks, how they work, and how they're trained. I need to do this because I need to explain, in the second part, why it is such a big deal that we have the fleet, why it's so important, and why it's a key enabling factor to really training these neural networks and making them work effectively on the roads. In the third stage I'll talk about vision and lidar, and how we can estimate depth from vision alone.

The core problem these networks are solving in the car is visual recognition. For you and me this is a very simple problem: you can look at these four images and see that they contain a cello, a boat, an iguana, or scissors. This is simple and effortless for us. It is not the case for computers, and the reason is that these images are, to a computer, really just a massive grid of pixels, and at each pixel you have a brightness value. So instead of just seeing an image, a computer really gets a million numbers in a grid that tell you the brightness values at all the positions. The matrix, if you will. (It really is the matrix.) Yeah.

So we have to go from that grid of pixels and brightness values to high-level concepts like "iguana." As you might imagine, this iguana has a certain pattern of brightness values, but iguanas can take on many appearances: different poses, different brightness conditions, different backgrounds, different crops of that iguana. We have to be robust across all those conditions and understand that all those different brightness patterns actually correspond to iguanas.

The reason you and I are very good at this is that we have a massive neural network inside our heads processing those images. Light hits the retina and travels to the back of your brain, to the visual cortex, and the visual cortex consists of many neurons that are wired together and do all the pattern recognition on top of those images. Over roughly the last five years, the state-of-the-art approaches to processing images with computers have also started to use neural networks — in this case, artificial neural networks. These artificial neural networks (this is just a cartoon diagram) are a very rough mathematical approximation to your visual cortex: we really do have neurons and they are connected together. Here I'm only showing a few neurons in four layers, but a typical neural network will have tens to hundreds of millions of neurons, and each neuron will have on the order of a thousand connections, so these are really large pieces of almost simulated tissue.
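As a rough illustration of what such an artificial network looks like in code, here is a minimal sketch in PyTorch. The layer sizes, the 1000-class output, and the input resolution are illustrative assumptions, not the architecture actually used in the car:

```python
import torch
import torch.nn as nn

# A tiny fully connected classifier: a grid of brightness values in, class scores out.
# Layer sizes and the 1000-class output are illustrative only.
class TinyClassifier(nn.Module):
    def __init__(self, num_pixels=224 * 224 * 3, num_classes=1000):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_pixels, 512),   # each "neuron" is a weighted sum of its inputs
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),  # one output score per label (e.g. "iguana")
        )

    def forward(self, image):
        # Flatten the grid of pixel brightness values into one long vector of numbers.
        return self.layers(image.flatten(start_dim=1))

# The connection strengths start out random, so the predictions start out random too.
model = TinyClassifier()
scores = model(torch.rand(1, 3, 224, 224))   # a fake "image" of random pixels
probs = torch.softmax(scores, dim=1)         # e.g. P(iguana), P(boat), ...
```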
What we can then do is take those neural networks and show them images. For example, I can feed my iguana into this neural network, and the network will make predictions about what it's seeing. In the beginning, these neural networks are initialized completely randomly: the connection strengths between all those neurons are random, and therefore the predictions of the network are also going to be random. It might think you're actually looking at a boat right now, and that it's very unlikely this is an iguana.

During the training process, we know that this is actually an iguana — we have a label — so what we're basically saying is: we'd like the probability of "iguana" to be larger for this image, and the probability of all the other things to go down. Then there's a mathematical process called backpropagation, and stochastic gradient descent, that allows us to backpropagate that signal through those connections and update every one of those connections just a little amount. Once the update is complete, the probability of "iguana" for this image will go up a little bit — it might become 14 percent — and the probability of the other things will go down.

Of course, we don't just do this for a single image. We have entire large data sets that are labeled: lots of images — typically you might have millions of images and thousands of labels, or something like that — and you do forward and backward passes over and over again. You show the computer an image, it has an opinion, you tell it the correct answer, and it tunes itself a little bit. You repeat this millions of times, and sometimes you show the same image to the computer hundreds of times as well. Network training will typically take on the order of a few hours or a few days, depending on how big a network you're training. That's the process of training a neural network.

Now, there's something very unintuitive about the way neural networks work that I really have to get into, and that is that they really do require a lot of these examples, and they really do start from scratch — they know nothing — and it's hard to wrap your head around this. As an example, here's a cute dog. You probably may not know the breed of this dog, but the correct answer is that this is a Japanese Spaniel. All of us look at this and think, okay, I get it, I understand roughly what this Japanese Spaniel looks like, and if I show you a few more images of other dogs you can probably pick out the other Japanese Spaniels. In particular, those three look like Japanese Spaniels and the other ones do not. You can do this very quickly, and you need one example. Computers do not work like this: they need a ton of data of Japanese Spaniels. This is a grid of Japanese Spaniels — you need thousands of examples showing them in different poses, different brightness conditions, different backgrounds, different crops. You really need to teach the computer, from all these different angles, what this Japanese Spaniel looks like; it really requires all that data to get it to work, otherwise the computer can't pick up on that pattern automatically.
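The training loop described here — forward pass, compare against the label, backpropagate, and nudge every connection slightly — can be sketched in a few lines. This is a generic minimal sketch, with placeholder model and data loader, not the actual training system:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=0.01):
    # Stochastic gradient descent nudges every connection a small amount per step.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()   # "raise P(correct label), lower the others"
    for _ in range(epochs):                  # the same images are seen many times
        for images, labels in loader:        # labels: e.g. 0 = iguana, 1 = boat, ...
            scores = model(images)           # forward pass: the network's opinion
            loss = loss_fn(scores, labels)   # how wrong that opinion was
            optimizer.zero_grad()
            loss.backward()                  # backpropagate the error signal
            optimizer.step()                 # update every connection a little
    return model
```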
So what does all this imply for the setting of self-driving? Of course, we don't care about dog breeds too much — maybe we will at some point — but for now we really care about lane line markings, objects, where they are, where we can drive, and so on. The way we do this is: we don't have labels like "iguana" for images, but we do have images from the fleet like this, and we're interested in, for example, lane line markings. A human typically goes into an image and, using a mouse, annotates the lane line markings. Here's an example of an annotation that a human could create — a label for this image saying this is what you should be seeing: these are the lane markings.

Then we can go to the fleet and ask for more images. If you do a naive job of this and just ask for images at random, the fleet might respond with images like this — typically going forward on some highway. You might just get a random collection like this, and we would annotate all that data. Now, if you're not careful and you only annotate a random distribution of this data, your network will pick up on that random distribution and work only in that regime. If you show it a slightly different example — here is an image where the road is actually curving and it's a bit more of a residential neighborhood — the network might make a prediction that is incorrect. It might say, well, I've seen lots of highways where the lanes just go forward, so here's a possible prediction — and of course this is very incorrect. But the neural network really can't be blamed: it does not know whether the tree on the left matters or not, it does not know whether the car on the right matters or not for the lane line, it does not know whether the buildings in the background matter or not. It really starts completely from scratch. You and I know the truth: none of those things matter. What actually matters is that there are a few white lane line markings over there converging toward a vanishing point, and the fact that they curl a little bit should pull the prediction. Except there's no mechanism by which we can just tell the neural network, "hey, those lane markings actually matter." The only tool in the toolbox we have is labeled data.

So what we do is take images like this where the network fails, and we label them correctly — in this case we would curl the lane to the right — and then we feed lots of images like this to the neural net. Over time the neural net will basically pick up on the pattern that those other things don't matter, only what those lane line markings do, and it learns to predict the correct lane.

What's really critical is not just the scale of the data set — we don't just want millions of images — we actually need to do a really good job of covering the possible space of things the car might encounter on the roads. We need to teach the computer how to handle scenarios where it's night and wet, with all these different specular reflections — as you might imagine, the brightness patterns in these images will look very different. We have to teach the computer how to deal with shadows, how to deal with forks in the road, how to deal with large objects that might be taking up most of the image, how to deal with tunnels, or how to deal with construction sites. And in all these cases there is, again, no explicit mechanism to tell the network what to do. We only have massive amounts of data: we want to source all those images, we want to annotate the correct lanes, and the network will pick up on the patterns.
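One simple way to make rare conditions (night and wet, tunnels, forks, construction) show up often enough during training is to oversample them. This is a generic sketch of that idea, assuming a hypothetical per-image scenario tag; it is not a description of how the actual training set is balanced:

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, scenario_tags, batch_size=32):
    """Oversample rare scenarios (e.g. 'night_wet', 'tunnel', 'fork') so the
    network sees them about as often as the common daytime-highway case.
    `scenario_tags` is a hypothetical per-image tag list, one per dataset item."""
    counts = Counter(scenario_tags)
    weights = [1.0 / counts[tag] for tag in scenario_tags]   # rare tag -> high weight
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```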
Now, large and varied data sets basically make these networks work very well. This is not just a finding for us here at Tesla; it's a ubiquitous finding across the entire industry. Experiments and research from Google, Facebook, Baidu, and Alphabet's DeepMind all show similar plots: neural networks really love data, and love scale and variety. As you add more data, these neural networks start to work better and you get higher accuracies for free — more data just makes them work better.

Now, a number of people have pointed out that we could potentially use simulation to achieve the scale of these data sets, and we're in charge of a lot of the conditions there, so maybe we can achieve some variety in the simulator. That was also brought up in the questions just before this. At Tesla — this is actually a screenshot of our own simulator — we use simulation extensively. We use it to develop and evaluate the software, and we even use it for training, quite successfully. But really, when it comes to training data for neural networks, there is no substitute for real data. Simulations have a lot of trouble modeling appearance, physics, and the behaviors of all the agents around you.

Here are some examples to really drive that point across. The real world throws a lot of crazy stuff at you. In this case, for example, we have very complicated environments with snow, with trees, with wind. We have various visual artifacts that are hard to simulate. We have complicated construction sites, bushes, and plastic bags that can blow around in the wind. A complicated construction site might feature lots of people, kids, animals, all mixed in, and simulating how those things interact and flow through that construction zone might actually be completely intractable. It's not about the movement of any one pedestrian in there; it's about how they respond to each other, how the cars respond to each other, and how they respond to you driving in that setting. All of that is really tricky to simulate — it's almost like you have to solve the self-driving problem just to simulate the other cars in your simulation. So it's really complicated. We have dogs, exotic animals, and in some cases it's not even that you can't simulate it — you can't even come up with it. For example, I didn't know you could have a truck on a truck on a truck like that, but in the real world you find this, and you find lots of other things that are very hard to even come up with. So really, the variety I'm seeing in the data coming from the fleet is just crazy with respect to what we have in the simulator — and we have a really good simulator.

Yeah — I mean, with simulation you're fundamentally grading your own homework. If you know you're going to simulate it, okay, you can definitely solve for it, but as Andrej is saying, you don't know what you don't know. The world is very weird and has millions of corner cases, and if somebody could produce a self-driving simulation that accurately matches reality, that in itself would be a monumental achievement of human capability. They can't. There's no way.
So I think the three points that I've really tried to drive home until now are: to get neural networks to work well, you require three essentials. You require a large data set, a varied data set, and a real data set. If you have those, you can actually train your networks and make them work very well. And why is Tesla in such a unique and interesting position to get all three of these essentials right? The answer, of course, is the fleet. We can really source data from it and make our neural network systems work extremely well.

Let me take you through a concrete example — making the object detector work better — to give you a sense of how we develop these networks, how we iterate on them, and how we actually get them to work over time. Object detection is something we care a lot about: we'd like to put bounding boxes around, say, the cars and the objects here, because we need to track them and understand how they might move around. Again, we might ask human annotators to give us annotations for these; humans go in and tell you that those patterns over there are cars and bicycles and so on, and you can train your neural network on this. But if you're not careful, the network will make mispredictions in some cases. As an example, if we stumble on a car like this that has a bike on the back of it, the neural network — when I joined — would actually create two detections: a car detection and a bicycle detection. And that's kind of correct, because I guess both of those objects actually exist, but for the purposes of the controller and the planner downstream, you really don't want to deal with the possibility that this bicycle can go with the car. The truth is that the bike is attached to the car, so in terms of objects on the road there's a single object, a single car. What you'd like to do is annotate lots of those images as just a single car.

The process we go through internally in the team is that we take this image, or a few images that show this pattern, and we have a mechanism — a machine learning mechanism — by which we can ask the fleet to source us examples that look like that, and the fleet might respond with images that contain those patterns. As an example, these six images might come from the fleet; they all contain bikes on the backs of cars. We go in and annotate all of those as just a single car, and then the performance of the detector improves: the network internally understands that when a bike is attached to a car, it's really just a single car, and it can learn that given enough examples. That's how we fixed that problem.

I'll mention — since I talk quite a bit about sourcing data from the fleet — that we've designed this from the beginning with privacy in mind, and all the data that we use for training is anonymized. And the fleet doesn't just respond with bicycles on the backs of cars; we look for lots of things, all the time. For example, we look for boats, and the fleet can respond with boats. We look for construction sites, and the fleet can send us lots of construction sites from across the world. We look for even slightly rarer cases: finding debris on the road is pretty important to us, and these are examples of images that have streamed to us from the fleet showing tires, cones, plastic bags, and things like that. If we can source these at scale, we can annotate them correctly, and the neural network will learn how to deal with them in the world.
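The talk mentions a machine-learning mechanism for asking the fleet for "more examples that look like these." A common way to build such a mechanism is nearest-neighbor search in an embedding space; the sketch below shows that general idea and is an assumption on my part, not a description of Tesla's actual system:

```python
import torch

def find_similar(seed_embeddings, candidate_embeddings, top_k=100):
    """Given embeddings of a few seed images (e.g. bikes on the backs of cars),
    rank candidate images by cosine similarity to the seeds and return the
    indices of the closest matches, which could then be annotated."""
    seeds = torch.nn.functional.normalize(seed_embeddings, dim=1)
    cands = torch.nn.functional.normalize(candidate_embeddings, dim=1)
    # Similarity of every candidate to its best-matching seed image.
    sims = (cands @ seeds.T).max(dim=1).values
    return sims.topk(min(top_k, sims.numel())).indices
```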
Here's another example: animals — of course also a very rare occurrence — but we want the neural network to really understand what's going on here, that these are animals, and to deal with them correctly.

So, to summarize, the process by which we iterate on neural network predictions looks something like this. We start with a seed data set that was potentially sourced at random; we annotate that data set; we train neural networks on it and put that in the car. Then we have mechanisms by which we notice inaccuracies in the car, when the detector may be misbehaving. For example, if we detect that the neural network is uncertain, or if there's a driver intervention, or any of those settings, we can create this trigger infrastructure that sends us data for those inaccuracies. So if we don't perform very well on lane line detection in tunnels, we can notice that there's a problem in tunnels; that image enters our unit tests, so we can verify that we're actually fixing the problem over time. But now, to fix this inaccuracy, you need to source many more examples that look like that, so we ask the fleet to please send us many more tunnels, we label all those tunnels correctly, we incorporate that into the training set, we retrain the network, redeploy, and iterate the cycle over and over again. We refer to this iterative process by which we improve these predictions as the data engine: iteratively deploying something — potentially in shadow mode — sourcing inaccuracies, and incorporating them into the training set, over and over again. We do this for basically all the predictions of these neural networks.

Now, so far I've talked about a lot of explicit labeling. Like I mentioned, we ask people to annotate data, and this is an expensive process in time and money — these annotations can be very expensive to obtain. What I also want to talk about is really utilizing the power of the fleet: you don't want to go through the human annotation bottleneck; you want the data to just stream in and be annotated automatically, and we have multiple mechanisms by which we can do this. One example project we recently worked on is the detection of cut-ins: you're driving down the highway, someone is on your left or right, and they cut in front of you, into your lane. Here's a video showing Autopilot detecting that this car is intruding into our lane. Of course, we'd like to detect a cut-in as early as possible, so the way we approach this problem is: we don't write explicit code for "is the left blinker on, is the right blinker on, track the cuboid over time and see if it's moving horizontally." We actually use a fleet learning approach. The way this works is that we ask the fleet to please send us data whenever they see a car transition from the right lane to the center lane, or from the left to the center, and then we rewind time backwards and can automatically annotate that, hey, that car will cut in front of you in 1.3 seconds. Then we can use that for training the neural net, and the neural net will automatically pick up on a lot of these patterns — for example, that the car is typically angled a certain way, that it's moving toward you, maybe that the blinker is on. All of that happens internally, inside the neural net, just from these examples.
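The "rewind time" idea for automatic labeling can be sketched very simply: given a log of which lane each tracked vehicle occupied at each timestamp, label the earlier frames of any vehicle that later moved into the ego lane. This is a minimal sketch with hypothetical field names, not the production labeling pipeline:

```python
def label_cut_ins(track, horizon=2.0):
    """`track` is a time-ordered list of (timestamp, lane) observations for one
    tracked vehicle, where lane == 0 means the ego lane (field names hypothetical).
    Returns (timestamp, seconds_until_cut_in) labels produced automatically,
    with no human annotation, by looking backwards from the moment of the cut-in."""
    labels = []
    for i, (t, lane) in enumerate(track):
        if lane == 0:
            continue  # already in our lane, nothing to predict
        for t_future, lane_future in track[i + 1:]:
            if lane_future == 0 and t_future - t <= horizon:
                labels.append((t, t_future - t))   # e.g. "cuts in, in 1.3 seconds"
                break
            if t_future - t > horizon:
                break
    return labels
```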
So we ask the fleet to automatically send us all this data — we can get half a million or so images, all annotated for cut-ins — and then we train the network. We then took this cut-in network and deployed it to the fleet, but we don't turn it on yet; we run it in shadow mode. In shadow mode the network is always making predictions — "I think this vehicle is going to cut in, from the way it looks this vehicle is going to cut in" — and then we look for mispredictions. As an example, this is a clip we had from shadow mode of the cut-in network. It's kind of hard to see, but the network thought that the vehicle right ahead of us, on the right, was going to cut in — you can sort of see that it's slightly flirting with the lane line, encroaching a little bit — and the network got excited and thought that was going to be a cut-in, that the vehicle would actually end up in our lane. That turned out to be incorrect: the vehicle did not actually do that.

So what we do now is just turn the data engine: the network ran in shadow mode, it made predictions, there were some false positives — we got over-excited sometimes — and some false negatives — sometimes we missed a cut-in when it actually happened. All of those create triggers that stream to us and get incorporated, essentially for free — no humans are harmed in the process of labeling this data — into our training set. We retrain the network and redeploy it in shadow mode. We can spin this cycle a few times, and we always look at the false positives and negatives coming from the fleet. Once we're happy with the false positive to false negative ratio, we actually flip the bit and let the car actually control to that network. You may have noticed we shipped one of our first versions of this cut-in detector approximately three months ago, so if you've noticed that the car is much better at detecting cut-ins, that's fleet learning operating at scale. Yes, it actually works quite nicely. So that's fleet learning: no humans were harmed in the process; it's just a lot of neural network training based on data, a lot of shadow mode, and looking at those results.

Essentially, everyone's training the network all the time — that's what it amounts to. Whether Autopilot is on or off, the network is being trained; every mile that's driven, for the cars with Hardware 2 or above, is training the network.

Yeah. Another interesting way we use this, in the scheme of fleet learning — the other project I'll talk about — is path prediction. While you are driving the car, what you're actually doing is annotating the data, because you are steering the wheel; you're telling us how to traverse different environments. What we're looking at here is a person in the fleet who took a left through an intersection, and we have the full video from all the cameras, and we know the path this person took because of the GPS, the inertial measurement unit, the wheel angle, and the wheel ticks. We put all that together, and we understand the path this person took through this environment, and then of course we can use that as supervision for the network.
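The idea of turning driving itself into labels can be illustrated with a toy integration of speed and yaw rate into a path. This is a simplified bicycle-model-style sketch under my own assumptions (constant time step, ego frame at the start), not the actual sensor-fusion pipeline combining GPS, IMU, wheel angle, and wheel ticks:

```python
import numpy as np

def odometry_to_path(speeds, yaw_rates, dt=0.05):
    """Integrate speed and yaw rate (e.g. from wheel ticks and steering/IMU)
    into an (x, y) path in the ego frame at t=0. This recovered path can serve
    as a 'free' training label for a path-prediction network: the human driver
    annotated it simply by driving."""
    x, y, heading = 0.0, 0.0, 0.0
    path = [(x, y)]
    for v, w in zip(speeds, yaw_rates):
        heading += w * dt
        x += v * np.cos(heading) * dt
        y += v * np.sin(heading) * dt
        path.append((x, y))
    return np.array(path)
```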
So we just source a lot of this from the fleet, we train a neural network on those trajectories, and then the neural network predicts paths just from that data. What this is typically referred to is imitation learning: we're taking human trajectories from the real world and just trying to imitate how people drive in the real world. We can also apply the same data engine crank to all of this and make it work better over time.

Here's an example of path prediction going through a kind of complicated environment. What you're seeing here is a video, and we are overlaying the predictions of the network: this is the path the network would follow, in green. Yeah — the crazy thing is that the network is predicting paths it can't even see, with incredibly high accuracy. It can't see around the corner, but it's saying the probability of that curve is extremely high, so that's the path — and it nails it. You will see this in the cars today: we're going to turn on augmented vision, so you can see the lane lines and the path predictions of the car overlaid on the video. Yeah, there's actually a lot more going on under the hood than you can even tell. It's kind of scary, to be honest. Yeah.

And of course there are a lot of details I'm skipping over: you might not want to annotate all the drivers — you might want to imitate only the better drivers — and there are many technical ways we actually slice and dice that data. But the interesting thing here is that this prediction is actually a 3D prediction that we project back into the image. The path forward is a three-dimensional thing that we're just rendering in 2D, so we know about the slope of the ground from all of this, and that's actually extremely valuable for driving. And that prediction is actually live in the fleet today, by the way. If you're in a cloverleaf on the highway: until maybe five months ago or so, your car would not be able to do the cloverleaf; now it can. That's path prediction running live on your cars — we shipped this a while ago. And today you're going to get to experience this for traversing intersections: a large component of how we go through intersections in your drives today is all sourced from path prediction, from automatic labels.

So what I've talked about so far are really the three key components of how we iterate on the predictions of the network and make them work over time. You require a large, varied, and real data set. We can really achieve that here at Tesla, and we do it through the scale of the fleet, the data engine — shipping things in shadow mode and iterating that cycle — and potentially even using fleet learning, where no human annotators are harmed in the process and the data is used automatically. And we can really do that at scale.

In the next section of my talk I'm going to talk specifically about depth perception using vision only. You might be familiar with the fact that there are at least two kinds of sensors in a car: one is vision — cameras, just getting pixels — and the other is lidar, which a lot of companies also use, and lidar gives you point measurements of distance around you. Now, one thing I'd like to point out first of all is that you all came here — you drove here, many of you — and you used your neural net and vision. You were not shooting lasers out of your eyes, and you still ended up here. (We might have it.) So clearly the human neural net derives distance, and all the measurements, and the 3D understanding of the world, just from vision.
It actually uses multiple cues to do so. I'll briefly go over some of them, just to give you a sense of roughly what's going on inside. As an example, we have two forward-facing eyes, so you get two independent measurements at every single time step of the world ahead of you, and your brain stitches this information together to arrive at some depth estimate, because you can triangulate any point across those two viewpoints. A lot of animals instead have eyes positioned on the sides of their heads, so they have very little overlap in their visual fields; they will typically use structure from motion: the idea is that they bob their heads, and because of the movement they get multiple observations of the world and can again triangulate depths. And even with one eye closed and completely motionless, you still have some sense of depth perception: if you did this, I don't think you would fail to notice whether I came two meters toward you or moved a hundred meters back, and that's because there are a lot of very strong monocular cues that your brain also takes into account. This is an example of a pretty common visual illusion: these two blue bars are identical, but your brain, the way it stitches up the scene, just expects one of them to be larger than the other because of the vanishing lines of the image. So your brain does a lot of this automatically, and artificial neural nets can as well.

Let me give you three examples of how you can arrive at depth perception from vision alone: a classical approach, and two that rely on neural networks. Here's a video — I think this is San Francisco — from a Tesla. This is what our cameras are sensing; I'm only showing the main camera, but all eight cameras of the Autopilot are turned on. If you just have this six-second clip, what you can do is stitch up this environment in 3D using multi-view stereo techniques. (Oops — this is supposed to be a video... there we go.) This is the 3D reconstruction of those six seconds of the car driving through that path, and you can see that this information is very well recoverable from just videos, roughly through a process of triangulation — as I mentioned, multi-view stereo. We've applied similar techniques, in slightly sparser and more approximate form, in the car as well. So it's remarkable: all that information is really there in the sensor, and it's just a matter of extracting it.

The other project I want to briefly talk about: as I mentioned, neural networks are very powerful visual recognition engines, and if you want them to predict depth, then you need to, for example, provide labels of depth, and they can actually do that extremely well — there's nothing limiting the network from predicting monocular depth except labeled data. One example project we've actually looked at internally: we use the forward-facing radar, shown in blue, which is looking out and measuring the depths of objects, and we use that radar to annotate what vision is seeing — the bounding boxes that come out of the neural networks. So instead of human annotators telling you "this car in this bounding box is roughly 25 meters away," you can annotate that data much better using sensors. You use sensor annotation: radar is quite good at measuring that distance, you can annotate with it, train your network on it, and given enough data the neural network becomes very good at predicting those patterns.
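The sensor-annotation idea — using radar range as the depth label for a vision bounding box — can be sketched as a simple association step. The matching rule and data layout below are my simplifying assumptions, not the actual implementation:

```python
def radar_depth_labels(boxes, radar_points):
    """`boxes` are vision detections [(x1, y1, x2, y2), ...] in image coordinates;
    `radar_points` are [(u, v, range_m), ...] radar returns projected into the image
    (projection assumed done upstream). Each box gets the range of the closest radar
    return that lands inside it, giving a depth label with no human in the loop."""
    labels = {}
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        hits = [r for (u, v, r) in radar_points if x1 <= u <= x2 and y1 <= v <= y2]
        if hits:
            labels[i] = min(hits)   # nearest return inside the box
    return labels
```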
So here's an example of those predictions. The circles show radar objects, and the cuboids coming out here are purely from vision — the cuboids are just coming out of vision, and the depth of those cuboids is learned through sensor annotation from the radar. If this is working well, you'd see that the circles in the top-down view agree with the cuboids — and they do, because neural networks are very competent at predicting depth. They can internally learn the different sizes of vehicles — they know how big those vehicles are — and you can derive depth from that quite accurately.

The last mechanism I'll talk about, very briefly, is slightly fancier and a bit more technical, but it's an approach covered by a few papers over roughly the last year or two: it's called self-supervision. In a lot of these papers, you only feed raw videos into the neural networks, with no labels whatsoever, and you can still get the networks to learn depth. It's a little bit technical, so I can't go into the full details, but the idea is that the neural network predicts depth at every single frame of the video, and there are no explicit targets that the network is supposed to regress to with labels; instead, the objective for the network is to be consistent over time. Whatever depth you predict should be consistent over the duration of that video, and the only way to be consistent is to be right, so the neural network automatically predicts the correct depth for all the pixels. We've reproduced some of these results internally, and this also works quite well.
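A very compressed sketch of the self-supervision objective, in the spirit of the published monocular-depth papers rather than any particular implementation: predict per-pixel depth, use it (together with estimated camera motion) to re-render a neighboring frame into the current viewpoint, and penalize photometric disagreement. The `warp` function is assumed to exist and be differentiable; networks and frames are placeholders:

```python
import torch

def self_supervised_depth_loss(depth_net, pose_net, warp, frame_t, frame_t1):
    """One training step's objective for self-supervised depth (no labels at all).
    `warp` is an assumed differentiable function that re-renders frame_t1 into
    frame_t's viewpoint given per-pixel depth and the relative camera motion."""
    depth = depth_net(frame_t)                        # per-pixel depth for frame t
    motion = pose_net(frame_t, frame_t1)              # estimated ego-motion t -> t+1
    frame_t_rebuilt = warp(frame_t1, depth, motion)   # reproject the next frame
    # The only way the rebuilt frame matches the real one across the whole video
    # is for the predicted depth (and motion) to be correct.
    return torch.abs(frame_t - frame_t_rebuilt).mean()
```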
So, in summary: people drive with vision only — no lasers are involved — and this seems to work quite well. The point I'd like to make is that visual recognition, very powerful visual recognition, is absolutely necessary for autonomy. It's not a nice-to-have: we must have neural networks that actually, really understand the environment around you, and lidar points are a much less information-rich representation. Vision really understands the full details; just a few points here and there carry much less information. As an example, on the left: is that a plastic bag or is that a tire? Lidar might just give you a few points on it, but vision can tell you which of those two it is, and that impacts your control. Is that person who's glancing slightly backwards trying to merge into your lane on their bike, or are they just going forward? In construction sites, what do those signs say — how should I behave in this world? The entire infrastructure we have built up for roads is designed for human visual consumption: all the signs, all the traffic lights, everything is designed for vision, and that's where all that information is, so you need that ability. Is that person distracted and on their phone — are they going to walk into your lane? The answers to all these questions are only found in vision, and they are necessary for Level 4 and Level 5 autonomy. That is the capability we are developing at Tesla, and it's done through a combination of large-scale neural network training, the data engine, getting that to work over time, and using the power of the fleet. In this sense, lidar is really a shortcut: it sidesteps the fundamental problem, the important problem of visual recognition, that is necessary for autonomy. It gives a false sense of progress and is ultimately a crutch. It does give really fast demos, though.

So if I were to summarize my entire talk in one slide, it would be this. All of autonomy depends on this, because you want Level 4 and Level 5 systems that can handle all the possible situations in 99.99...% of the cases, and chasing some of those last few nines is going to be very tricky and very difficult, and it's going to require a very powerful visual system. I'm showing you some images of what you might encounter in any one slice of those nines. In the beginning you just have very simple cars going forward; then those cars start to look a little bit funny; then maybe you have bikes on cars; then maybe you have cars on cars; and then you start to get into really rare events, like cars turned over, or even cars airborne. We see a lot of things coming from the fleet, and we see them at a really good rate compared to all of our competitors. The rate of progress at which you can actually address these problems, iterate on the software, and really feed the neural networks the right data is proportional to how often you encounter these situations in the wild — and we encounter them significantly more frequently than anyone else, which is why we're going to do extremely well. Thank you.