Introduction

I would like to talk a bit more about what Tesla has been up to for the last few months. First, since we are here at the CVPR workshop on autonomous driving, I think I'm preaching to the choir a little bit, but I'd like to start with a few slides on why we're doing all of this. Unfortunately, we're in pretty bad shape when it comes to transportation in society. The basic issue is that we have these metallic objects traveling incredibly quickly, with very high kinetic energy, and we are putting meat in the control loop of the system. This is quite undesirable. Fundamentally, it comes down to the fact that people are not very good at driving: they get into a lot of accidents, they don't want to drive, and in terms of economics we are tying up people in transportation. Of course we'd like to automate transportation and really reap the benefits of that automation in society. So I think we are all on the same page that it should be possible to replace the meat computer with a silicon computer and get a lot of benefits in terms of safety, convenience, and economics. In particular, silicon computers have significantly lower latencies, they have 360-degree situational awareness, they are fully attentive and never check their Instagram, and they alleviate all of the issues I just described.

I think this is not super shocking, and a lot of people have been looking forward to this kind of future. This is a frame from I, Robot, a really good movie if you haven't seen it. Here Will Smith is about to drive the car manually, and this other person is shocked that this is going to happen — this is ridiculous, why would you actually have a human driving the car? I think this is not very far from the truth. The movie takes place in 2035, and I actually think that's a pessimistic prediction.

What I think is unique about Tesla and our approach to autonomy is that we take a very incremental approach to this problem. In particular, we already have customers with the Autopilot package, we have millions of cars, and the Autopilot software is always running, providing active safety features and of course the Autopilot functionality. So we provide additional safety and convenience to our customers today, while the team also works on full self-driving capability.

I'd like to very quickly speak to the value that we provide today, just to give you some examples of what the team is up to. Here's an automatic emergency braking scenario. The driver is proceeding through an intersection and you can see trouble coming: a pedestrian creeps up from the side, the car sees the pedestrian, object detection kicks in, we slam on the brakes, and we avert the collision. Here's an example of a traffic control warning. This person is probably not paying attention, potentially on the phone, and is not braking for the traffic lights ahead. But we see that these are relevant traffic lights, that they are red, and we beep, and the person immediately starts slowing and probably does not actually enter the intersection. These are examples of pedal misapplication mitigation, PMM. Here a person is unparking from their parking spot; they are trying to turn, they mess up, and they accidentally floor it — right there. The system kicks in, sees the pedestrians ahead of us, slams on the brakes, and averts what would have been a very gnarly situation.
Here's one last scenario I can show briefly. This person is trying to park; they turn to the right and intend to press the brake, but they actually floor it. The system again sees that there is no driving path forward, slams on the brakes, and averts the situation — in this case the person is not going to be swimming in this river imminently. So these are some examples of the value that we provide today in this incremental-autonomy fashion.

But of course the team is working primarily on the FSD functionality, which is our full self-driving suite. We have the FSD beta product in the hands of about 2,000 customers. This is an example from one of those customers driving around San Francisco — these people post videos on YouTube, so you can check out a lot of them. What we're showing on the instrument cluster are all of our predictions: you're seeing some road edges, some lanes, some objects, and the car is navigating autonomously around this San Francisco environment. We of course drive this extensively as engineers, and it's fairly routine for us to have zero-intervention drives in sparsely populated areas like Palo Alto and so on. We definitely struggle a lot more in very adversarial environments like San Francisco, as a lot of people working in autonomy know all too well. This particular drive ends up being fairly long, and it is zero interventions in this case.

One thing I'd like to point out whenever I show a video of a Tesla driving autonomously in these city environments is: you've seen this before, and you've seen it for a decade or more. Here's a Waymo taking a left at an intersection — this is actually a pretty old video, I believe. You've been seeing stuff like this for a very long time, so what is the big deal, why is this impressive? The important thing to realize, and what I always like to stress, is that even though these two scenarios look the same — a car taking a left at an intersection — under the hood, in terms of the scalability of the system, things are incredibly different.

In particular, a lot of the competing approaches in the industry take a lidar-plus-HD-map approach. The idea is that you put an expensive sensor, a lidar, on top of the car; it gives you range finding around the vehicle in 360 degrees, a point cloud. Then you have to pre-map the environment with the lidar sensor: you create a high-definition map, insert all of the lanes and how they connect, and all of the traffic lights and where they are. At test time you are simply localizing to that map to drive around. The approach we take is vision-based, primarily. Everything happens for the first time, in the car, based on the videos from the eight cameras that surround the car. We come to an intersection for the very first time and we have to figure out where the lanes are, how they connect, where the traffic lights are, which ones are relevant, and which traffic lights control which lanes — everything is happening at that moment, on that car, and we don't have much high-definition prior information at all.
This is actually a significantly more scalable approach, because our product is at the scale of millions of customers on Earth, and at that scale it is quite impractical to collect, build, and maintain these high-definition lidar maps — it would be incredibly expensive to keep that infrastructure up to date. So we take the vision-based approach, which of course is much more difficult, because you actually have to get neural networks that function incredibly well based on the videos. But once you get that to work, it's a general vision system that can in principle be deployed anywhere on Earth. That's really the problem we are solving.

In terms of the cartoon diagram of the typical sensing suite of an autonomous vehicle, these are the sensors you would typically see. As I mentioned, we do not use high-definition maps and we do not use lidar; we only use video cameras. In fact, the vision system we have been building over the last few years has been getting so good that it is leaving a lot of the other sensors in the dust. The cameras are doing most of the heavy lifting in terms of the perception you see in the car, and it has gotten to the point where we are able to start removing some of the other sensors, because they are becoming crutches that you don't really need anymore. So three weeks ago we started to ship cars that have no radar at all: we deleted the radar, and we are driving on vision alone in these cars.

The reason we are doing this, I think, is well expressed by Elon in a tweet: "When radar and vision disagree, which one do you believe? Vision has much more precision, so better to double down on vision than do sensor fusion." What he's referring to is that vision is getting to the point where it is, say, 100x better than radar. Once one sensor dominates the other by that much, the other sensor stops really contributing; it actually holds you back and mostly contributes noise to the overall system. So we are really doubling down on the vision-only approach, and in this talk what I would primarily like to talk about is how we achieved vision-only control without radar, how we have released this — quite successfully so far — to the fleet, and what it took for us to do that.

Here I wanted to show a video of what the input to the system is. We have eight cameras around the car, these are fairly high-definition cameras, and we get 36 frames a second. You can see that you can understand a lot about the environment from this input: it's extremely information-rich, and you're getting a huge number of constraints — roughly eight million bits per second — on the state of the surroundings. It is incredibly information-rich compared to any other sensor you might want to have in the car. This is the one we're doubling down on, because this is ultimately where all the difficult scene-interpretation challenges lie. So we prefer to focus all of our infrastructure and development on this, and we're not wasting people on a radar stack and a sensor fusion stack — we only have one team, the vision team.
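As a back-of-the-envelope way to see the "doubling down" argument (my framing, not Tesla's, and it assumes idealized independent Gaussian noise on both sensors): with inverse-variance fusion of a vision estimate and a radar estimate, the weight given to radar collapses once vision is much more precise, so even optimal fusion barely improves on vision alone — and any misspecification, such as a spurious radar track, quickly makes fusion worse.

```latex
% Idealized inverse-variance fusion of two independent, unbiased estimates
\hat{x} = \frac{x_{\text{vis}}/\sigma_{\text{vis}}^2 + x_{\text{rad}}/\sigma_{\text{rad}}^2}
               {1/\sigma_{\text{vis}}^2 + 1/\sigma_{\text{rad}}^2},
\qquad
w_{\text{rad}} = \frac{\sigma_{\text{vis}}^2}{\sigma_{\text{vis}}^2 + \sigma_{\text{rad}}^2}.

% If vision is ~100x better in variance (\sigma_{\text{rad}}^2 = 100\,\sigma_{\text{vis}}^2):
w_{\text{rad}} = \frac{1}{1 + 100} \approx 1\%,
\qquad
\operatorname{Var}(\hat{x}) \approx 0.99\,\sigma_{\text{vis}}^2 .
```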
Now, the reason people are typically squeamish about using vision — the challenge that is almost always pointed out — is that people are not certain whether neural networks are able to do range finding, or depth estimation, of these objects. Obviously humans drive around with vision, so clearly our own neural net is able to process visual input and understand the depth and velocity of all the objects around us. The big question is whether our synthetic neural networks can do the same. The answer for us internally, over the last few months that we've worked on this, is an unequivocal yes, and I'd like to talk about that.

In particular, what the radar really gives you is a fairly accurate depth and velocity measurement — a pretty direct measurement of the car ahead of you in very nominal situations. Here I'm showing a random image on the left, and you are seeing the depth, velocity, and acceleration reported by the radar stack for this car. You can see that the velocity is a bit wiggly; this is some kind of stop-and-go scenario. When the radar has a good lock on the car in front of you, it gives incredibly good depth and velocity. The problem with radar is that once in a while, at random, it will give you a dumb measurement, and you will not know when that is, so in our experience it is incredibly hard to fuse with vision. It might suddenly report a spurious stationary object because of a manhole or a bridge, crossing objects are not very well tracked, oncoming objects are not very well tracked, and it ends up contributing noise. I'll go into more detail in a bit.

For our purposes, what we want to do is match the quality of these predictions using vision alone. So the question is: how can you get a neural net to predict depth, velocity, and acceleration directly, and do it with a fidelity matching that of radar? The approach we're going to take, of course, is to treat this as a supervised learning problem: we need a massive data set of depth, velocity, and acceleration for a lot of cars, and we're going to train a large enough neural network and do a very good job of it. That's the standard, vanilla approach, and it actually works really well if you do it properly.

So here's what it took. The first component you need is an incredibly good data set. When I talk about a good data set, I believe it has to have three critical properties. It has to be large — millions of examples. It has to be clean — in this case we need a very clean source of depth, velocity, and acceleration for all of these cars. And it has to be diverse — we're not just talking about a single driving clip of going forward on a highway; we really have to get into the edge cases and mine all the difficult scenarios, and I'm going to show you some examples. When you have a large, clean, diverse data set and you train a large enough neural network on it, what I've seen in practice is, in the words of Ilya Sutskever, "success is guaranteed." So let me show you some examples of that.

First of all, how are we going to achieve this data set? Of course we need to collect training data. The typical approach might be to use humans to annotate the cars around us in three dimensions. What we found actually works really well is an auto-labeling approach. It's not purely humans annotating cars: it's an offline tracker, as we call it, an auto-labeling process for collecting data at the scale that is necessary.
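Before getting into the auto-labeling details, here is a minimal sketch of the supervised setup just described — regressing per-object depth, velocity, and acceleration from short video clips. This is my own illustration, not Tesla's code; the module names, tensor shapes, and loss choice are placeholders under that assumption.

```python
# Minimal sketch (my own, not Tesla's code): regress depth, velocity and
# acceleration from short video clips as a plain supervised learning problem.
import torch
import torch.nn as nn

class VisionRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny stand-in for a real multi-camera video backbone.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # input: (B, 3, T, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        # Three regression targets per clip: depth, velocity, acceleration.
        self.head = nn.Linear(16, 3)

    def forward(self, clips):
        return self.head(self.backbone(clips))

model = VisionRegressor()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.SmoothL1Loss()  # robust to the occasional noisy auto-label

# Fake batch: 4 clips of 8 frames at 3x64x64, with auto-labeled targets.
clips = torch.randn(4, 3, 8, 64, 64)
targets = torch.randn(4, 3)  # (depth, velocity, acceleration)

pred = model(clips)
loss = loss_fn(pred, targets)
loss.backward()
optimizer.step()
print(f"train loss: {loss.item():.4f}")
```

The point of the sketch is only that nothing exotic is needed on the learning side: the difficulty is entirely in getting the large, clean, diverse labels.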
Again, we need millions of hard examples, and this is where the scale comes from: the data is not labeled purely by humans — humans are involved — but it is labeled automatically. Here's an example of some automatic labels we were able to derive for cars on the highway. The way you can do this is that because you're offline and just trying to annotate a clip, you have a large number of advantages that you don't have at test time, under strict latency requirements in the car. You can take your time to fully figure out exactly where all the objects are. You can use neural networks that are extremely heavy and not deployable for various reasons. You can use the benefit of hindsight, because you know the future, not just the past. You can use all kinds of expensive offline optimization and tracking techniques. You can use extra sensors — in this case, for example, radar was actually one of the sensors we used for the auto-labeling. And there's a massive difference between using radar at test time and using it in the offline tracker: at test time, radar can report a stationary track immediately in front of you and you have about 20 milliseconds to decide whether you're going to brake; offline, you have the benefit of hindsight, so you can do a much better job of calmly fusing if you really want to. In addition, you can of course involve humans for cleaning, verification, editing, and so on. We found that this was a massive lever that allowed us to reach the required scale and label quality.

Getting the diversity was also a massive fight. Here are some examples of really tricky scenarios. In this one — I don't actually know exactly what this is — a car drops a bunch of debris on us, and we maintain a consistent track for the label. If you have millions of labels like this, a powerful enough neural net will end up learning to persist tracks through these kinds of scenarios. Here's another example: there's a car in front of us and, I'm not 100% sure what happens here, but some kind of dust cloud develops and briefly occludes the car. In the auto-labeling tool we are able to persist the track, because we saw the car before and we saw it after, so we can stitch it up and use it as training data for the neural net. Here's one more adversarial scenario, from heavy snow — again, we can auto-label this fine and create a large collection of samples.
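As an illustration of the kind of hindsight the offline tracker gets to exploit — this is a toy sketch of mine, not Tesla's offline tracker — a track that is briefly lost to occlusion can be stitched back together and smoothed once the whole clip is available, which an online tracker under latency constraints cannot do.

```python
# Toy "hindsight" auto-labeling sketch (mine, not Tesla's offline tracker):
# per-frame depth measurements for one tracked car, with a short occlusion gap.
import numpy as np

depth = np.array([40.0, 39.2, 38.5, np.nan, np.nan, np.nan, 35.9, 35.1, 34.4])

frames = np.arange(len(depth))
valid = ~np.isnan(depth)

# 1) Stitch the track: fill the occluded frames using observations on both sides.
stitched = np.interp(frames, frames[valid], depth[valid])

# 2) Smooth with a small centered moving average (uses future *and* past frames).
kernel = np.ones(3) / 3.0
smoothed = np.convolve(np.pad(stitched, 1, mode="edge"), kernel, mode="valid")

print("stitched:", np.round(stitched, 2))
print("smoothed:", np.round(smoothed, 2))
```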
As we were working on this over a duration of roughly four months, really focusing a lot of the team on this problem of achieving really good depth, velocity, and acceleration, we ended up manually developing 221 triggers that we used to source data from our customer fleet. Here are just some examples of those 221 triggers used to collect all of these diverse scenarios. For example, we have shadow mode, where we deploy a neural network that is pretty good at predicting depth and velocity and run it silently in the cars of our customers. It's not actually connected to control — what's driving is the legacy stack — but we're running the new depth and velocity measurements and looking at, for example, whether they agree or disagree with the legacy stack or with the radar. We're also looking for other sources of disagreement: bounding box jitter, detection jitter, the main and the narrow camera disagreeing, cases where we predict a harshly decelerating object but the driver doesn't seem to mind — all kinds of disagreements between different neural network signals. There's a long list, and it took us a while to perfect these triggers; all of them require iteration, where you look at what's coming back, tune your trigger, and source data from these scenarios.

So over the last four months we've run quite an extensive data engine, doing seven loops around it. At the top right is where you begin: you have some seed data set, you train your neural network on it, and you deploy the network in customer cars in shadow mode, where it silently makes predictions. Then you need mechanisms for surfacing inaccuracies of the neural net: you look at its predictions and use one of the triggers I described to find scenarios where the network is probably misbehaving. Some of those clips go into unit tests, so that even if we're failing on them right now, we make sure we pass later. In addition, those examples are auto-labeled and incorporated into the training set, and as an asynchronous process we're also continuously cleaning the current training set. We spin this loop over and over again until the network becomes incredibly good. In total we've done seven rounds of shadow mode for this release. We've accumulated one million extremely hard, diverse clips — these are videos, so think roughly 10-second clips at 36 fps — with about six billion objects labeled cleanly for depth and velocity, taking up roughly 1.5 petabytes of storage. So that gives us a really good data set.
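The loop just described can be summarized in pseudocode. This is a minimal sketch of the data-engine idea in Python — the trigger predicate, thresholds, and all function and dataset names are hypothetical illustrations of mine, not Tesla's internal tooling.

```python
# Minimal sketch of the data-engine loop (my illustration, not Tesla's tooling).
# A "trigger" is a predicate over telemetry from a shadow-mode clip that flags
# clips worth pulling back, auto-labeling, and adding to the training set.

def vision_radar_disagreement(clip) -> bool:
    """Example trigger: flag clips where vision-only depth for the lead car
    disagrees with the radar depth by more than some threshold."""
    return abs(clip["vision_depth"] - clip["radar_depth"]) > 5.0  # meters

TRIGGERS = [vision_radar_disagreement]  # ...in practice, a long list of these

def data_engine_round(train_set, unit_tests, fleet_clips, train, auto_label):
    # 1) Train on the current data set and (conceptually) deploy in shadow mode.
    model = train(train_set)

    # 2) Source inaccuracies: keep clips that fire at least one trigger.
    flagged = [c for c in fleet_clips if any(t(c) for t in TRIGGERS)]

    # 3) Some flagged clips become regression/unit tests; all get auto-labeled
    #    and folded back into the training set for the next round.
    unit_tests.extend(flagged[:10])
    train_set.extend(auto_label(c) for c in flagged)
    return model, train_set, unit_tests
```

Each call of `data_engine_round` would correspond to one of the seven rounds of shadow mode described above.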
Of course, the data set by itself does not suffice. We also have an incredibly good AI team designing the neural network architecture — basically the layout of the synthetic visual cortex — to efficiently process this information. Our architecture roughly looks like this: images come in from the multiple cameras at the top, and all of them are processed by an image extractor, a backbone — think ResNet-style. Then there's a multi-camera fusion stage that fuses the information from all eight views; this is a kind of transformer that we use to fuse the information. We fuse first across all the cameras and then across time, and the temporal fusion is done either by a transformer, a recurrent neural network, or three-dimensional convolutions — we've experimented with a lot of fusion strategies here to get this to work really well. After the fusion, we have a branching structure that doesn't just consist of heads: we've expanded it over the last year or so, so that heads branch into trunks, which branch into terminals. There's a lot of branching structure, and the reason you want it is that there is a huge number of outputs you're interested in and you can't afford a separate neural network for every individual output — you have to amortize the forward pass for efficient inference at test time, so there's a lot of feature sharing.

The other nice benefit of the branching structure is that it decouples all of these signals at the terminals. If I'm working on, say, velocity for a particular object type, I have a small piece of the neural network that I can fine-tune without touching any of the other signals, so I can work in isolation to some extent and actually get something working pretty well. The iteration scheme is that a lot of people are fine-tuning, and once in a while we do a full end-to-end training of the backbone. It's a very interesting mechanism, because we have a team of roughly 20 people training neural networks full time, all cooperating on a single neural net, and the workflow by which you do that is pretty fascinating and continues to be a challenge to design efficiently.

So we have a neural network architecture and we have a data set. Now, training these neural networks — as I mentioned, this is a 1.5-petabyte data set — requires a huge amount of compute, so I wanted to briefly give a plug to the insane supercomputer that we are building and using. For us, computer vision is the bread and butter of what we do and what enables Autopilot, and for that to work really well you need a massive data set, which we get from the fleet, but you also need to train massive neural nets and experiment a lot. So we've invested heavily in compute. We have a cluster that we're just building out with 720 nodes, each with 8x A100 GPUs, the 80 GB version. This is a massive supercomputer — I believe that in terms of FLOPS it's roughly the number-five supercomputer in the world. We have 10 petabytes of hot-tier NVMe storage, and it's also incredibly fast storage, I believe 1.6 terabytes per second — one of the world's fastest file systems. We also have a very efficient fabric connecting all of this, because if you're doing distributed training across nodes you need your gradients to be synchronized very efficiently, and of course we're reading all of these videos from the file system, which requires a really fat pipe as well. So this is a pretty incredible supercomputer. This is a GPU cluster; we're currently working on Project Dojo, which will take this to the next level, but I'm not ready to reveal any more details about that at this point.

I would also like to briefly plug the supercomputing team. They've been growing a lot, so if high-performance computing for this application — training these crazy neural networks — excites you, we would definitely appreciate more help; please reach out to the supercomputing team at Tesla to help us build these clusters. On the left here are the compute nodes, and on the right is actually a network switch — the wires here are kind of like the white matter of this synthetic cortex, I guess.
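To make the multi-camera fusion and branching head structure described a few paragraphs back a bit more concrete, here is a heavily simplified sketch of my own — not Tesla's code. It omits the temporal fusion and the trunk/terminal levels of the hierarchy; all module names, dimensions, and output heads are placeholders.

```python
# Heavily simplified sketch (mine, not Tesla's network) of the structure above:
# per-camera backbone -> transformer fusion across cameras -> shared features
# -> branching task heads that share one amortized forward pass.
import torch
import torch.nn as nn

class MultiCamNet(nn.Module):
    def __init__(self, num_cams=8, dim=64):
        super().__init__()
        # Shared per-camera image extractor (stand-in for a ResNet-style backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                      # one feature vector per camera
        )
        # Multi-camera fusion: a small transformer over the 8 camera tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Branching structure: lightweight heads share the fused features, so one
        # forward pass serves many outputs and each head can be fine-tuned
        # without touching the others.
        self.heads = nn.ModuleDict({
            "depth_velocity": nn.Linear(dim, 2),
            "traffic_lights": nn.Linear(dim, 4),
            "road_edges": nn.Linear(dim, 8),
        })

    def forward(self, images):                 # images: (B, num_cams, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1)).view(b, n, -1)  # (B, N, dim)
        fused = self.fusion(feats).mean(dim=1)                      # (B, dim)
        return {name: head(fused) for name, head in self.heads.items()}

out = MultiCamNet()(torch.randn(2, 8, 3, 128, 128))
print({k: tuple(v.shape) for k, v in out.items()})
```

Fine-tuning a single signal in this kind of layout amounts to freezing everything except one entry of `heads`, which is what makes it possible for many people to work on one network in parallel.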
The other thing I wanted to mention briefly is that this effort is incredibly vertically integrated within the AI team. As I showed you, we own the vehicle and the sensing, we source our own data, we annotate our own data, and we train on our on-prem cluster. Then we deploy all of the neural networks that we train on our in-house-developed chip. We have the FSD computer, which has two SoCs — two FSD chips — each with our own custom NPU (neural processing unit) at roughly 36 TOPS. These chips are specifically designed for the neural networks we want to run for FSD applications. So everything is very vertically integrated in the team, and I think that's pretty incredible, because you get to co-design and engineer all the layers of the stack. There's no third party holding you back; you're fully in charge of your own destiny, which I think is incredibly unique and very exciting. We then have a deployment pipeline where we take these neural networks, do a lot of the typical graph optimization, fusion, quantization, and threshold calibration, and deploy them on our chip, and these networks then run in the customers' cars.
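Tesla's deployment pipeline targets their own NPU, which I obviously can't reproduce here. As a loose, hedged analogy for the "quantization and threshold calibration" step only, here is what eager-mode post-training static quantization with a calibration pass looks like in stock PyTorch; the toy model and the random calibration data are placeholders of mine.

```python
# Loose analogy (not Tesla's pipeline): post-training static quantization in
# PyTorch. Observers record activation ranges during a calibration pass, then
# the model is converted to int8 weights and activations.
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()     # float -> int8 boundary
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 3)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)        # insert observers

for _ in range(32):                                  # calibration pass:
    prepared(torch.randn(8, 16))                     # observers record ranges

quantized = torch.quantization.convert(prepared)     # int8 model
print(quantized(torch.randn(1, 16)).shape)
```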
Now I'd like to show some qualitative examples of the results — the depth and velocity predictions we're able to achieve by putting all these pieces together and training these networks at scale. In the first example, this is track testing with an engineering car: we asked the car in front to slam on the brakes as hard as it possibly can, so this is very harsh braking, even though it doesn't look like it in the video. On the right you can see the outputs from the legacy stack, which had radar-vision fusion, in orange, and from the new, vision-only stack in blue. In the orange legacy stack you can see these track drops while the car was braking really harshly. The braking was so harsh that the radar stack ended up not associating the car, dropping the track and reinitializing it over and over — it's as if the vehicle disappeared and reappeared something like six times during the braking — and this created a bunch of artifacts. The new stack in blue is not subject to this behavior at all: it just gives a clean signal. In fact, I believe there's no smoothing on the blue signal; this is the raw depth and velocity coming out of the neural net, the final network that we released about three weeks ago, and you can see it's fairly smooth. Of course, you could go into the radar stack and adjust the hyperparameters of the tracker — why is it dropping tracks, and so on — but then you're spending engineering effort and focus on a stack that isn't really barking up the right tree. It's better to just focus on vision and make it work really well, and we see that it is much more robust when you train it at scale.

Here's another, fairly infamous example: slowdowns when there are cars going below a bridge. The issue here is that radar does not have much vertical resolution, so it reports a stationary object in front of you but cannot tell whether it's a stationary car or the bridge. Radar thinks there might be something stationary ahead and is essentially looking for something in vision to tell it that it might be correct, and then we create a stationary target and brake. In this case the legacy vision predictions were already producing depth and velocity, but because we were using radar, the vision inaccuracies were masked: the bar was only radar association, not actual driving, so the depth and velocity were not held to a high enough standard. What happens is that vision, for a few frames, reports a slightly too negative velocity for the car, it gets associated to the stationary object, the stack decides that must be the stationary thing, and you brake. The new stack, of course, is much cleaner and does not show this at all — there are no slowdowns in this case, because we just get the correct depth and velocity, and vision obviously has the vertical resolution to differentiate a bridge from a car and to tell whether the car is slowing or not. Again, you could go into the vision stack, the radar stack, or the sensor fusion stack, and with improved depth and velocity you could change the fusion strategy, but you'd just be doing dead work — this signal is so good by itself, why would you do that? So in this setting we've improved the situation quite a lot.

Here's one last example: a stationary-vehicle approach, again in a track-testing environment. This is a test we would run where we are approaching a vehicle and hoping to stop. What you see in orange, the legacy stack, is that it takes us quite a bit of time to start slowing. What's happening is that the radar is very trigger-happy: it sees false stationary objects everywhere — everything that sticks out is a stationary target — and radar by itself doesn't know what actually is a stationary car and what isn't, so it's waiting for vision to associate with it. If vision is not held to a high enough bar, it's noisy and contributes error, and the sensor fusion stack picks it up too late. Again, you could fix all of that, even though it's a fairly gross system with a lot of if statements, because the sensor fusion is complicated and the error modes of vision and radar are quite different. But when we work with vision alone and take out the radar, vision recognizes this object very early, gives the correct depth and velocity, and there are no issues — we get an initial slowdown much earlier, and we've really simplified the stack a lot as well.

Speaking briefly to the release and validation: we of course validated this extensively before shipping it to customers. Just some example numbers for the validation itself: we hand-picked six thousand clips across about 70 categories of scenarios — harsh braking, crossing vehicles, different vehicle types, environments, and so on — and we run these tests on all the commits of the build, as well as periodically every day, as we try to drive up performance on these clips. We also use simulation extensively for validation. We've used simulation for training as well — I haven't gotten into that, but we've had some successes there — though at this stage it's primarily for validation, again providing diversity of scenarios and making sure that the stack is performing correctly.
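A hedged sketch of what a clip-based regression harness of that shape could look like — the categories, clip names, and `evaluate` function are placeholders of mine, not Tesla's CI.

```python
# Toy sketch of a clip regression harness (mine, not Tesla's CI): hand-picked
# clips grouped by scenario category, evaluated on every build and reported
# per category so regressions are easy to localize.
from collections import defaultdict

VALIDATION_SET = [
    # (category, clip_id) -- placeholders; the real set is ~6,000 clips / ~70 categories
    ("harsh_braking", "clip_0001"),
    ("crossing_vehicle", "clip_0002"),
    ("stationary_approach", "clip_0003"),
]

def run_validation(build, evaluate):
    """evaluate(build, clip_id) -> True if the stack behaves correctly on the clip."""
    per_category = defaultdict(lambda: [0, 0])   # category -> [passed, total]
    for category, clip_id in VALIDATION_SET:
        passed, total = per_category[category]
        per_category[category] = [passed + int(evaluate(build, clip_id)), total + 1]
    return {cat: p / t for cat, (p, t) in per_category.items()}

# Usage sketch: pass_rates = run_validation("build_1234", evaluate=my_replay_eval)
```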
We've done a lot of track testing — I've shown you some examples of that — we've driven this extensively in the QA fleet, and we've also deployed it in shadow mode and seen that the stack performs quite well. In particular, for automatic emergency braking we actually see higher precision and recall compared to the legacy fusion stack. Having seen all this, we released it. We've accumulated about 15 million miles so far, 1.7 million of those on Autopilot, and so far there have been no crashes. Of course we're running at a massive scale here, so we do expect some incidents at some point — the legacy stack has a crash roughly every five million miles or so, I believe — but the improvements to the vision stack are not stopping, so we're very confident that we're barking up the right tree and that we can get this to work incredibly well.

I also want to briefly give a shout-out to auto-labeling as an incredibly powerful method for sourcing training data. We're applying auto-labeling to all of the FSD tasks, not just the depth and velocity of the car in front of us. Here's an auto-labeler for pedestrians, and you can see that the tracks are very smooth; this again is a completely offline process. We apply it not just to objects but also to a lot of the static environment. Here's an example of a 3D reconstruction: we see a clip, and this is a vision-only estimate of the depths of all the points. Of course, you don't actually want to deploy these measurements directly into the car, because they are too raw. What you want to do is, again, auto-label, and really think about what information you actually need at test time in the car — as close as possible to the action. You want to do all of this work at training time: figure out exactly what happened in the clip, reconstruct the entire environment, and do the post-processing at training and labeling time, and then figure out what tasks you really need at test time — the thing that happens in the car, subject to very strict latency requirements, where you want to do as little processing as possible. So we see this as a very powerful lever for acquiring data sets.

In summary, what I've tried to argue and give you a sense of is that, in our experience, vision alone is perfectly capable of depth sensing — it is an incredibly rich sensor in terms of information bandwidth. Doing this, and matching radar performance on depth and velocity, is incredibly hard. I believe it requires the fleet, because the data set we were able to build was critical to all of these performance improvements, and if you do not have the fleet I'm not 100% sure how you can source all the difficult and diverse scenarios that we did — I believe that was critical to getting this to work. So it's hard: it requires massive networks, a supercomputer, a data engine, and the fleet. But all of these components are coming together in a vertically integrated fashion at Tesla AI, and I believe this makes us uniquely positioned in the industry — we are barking up the right tree and we have all the puzzle pieces to make this work.
So if you are excited about the leaps that we're taking, and the networks and the work that we're doing, then I would encourage you to please apply and join the team and help us make this a reality. You can go to the Tesla Autopilot AI page on tesla.com — it's a very simple process: you upload a resume, you add a blurb about some of the impressive things you've done, and that comes directly to us. We'd love to work with you to make this a reality. Thank you.

Thanks for this very interesting talk. We actually got quite a few questions, but since we are already over time I will limit it to a few — we're at that nice point at the end of the workshop, so you're in the slightly rough spot of having to answer more questions than maybe the other speakers. As a first one, I'd like to go into the auto-labeling — or actually the thing that is really necessary for the auto-labeling, namely the definition of these triggers. Did you ever try to investigate whether you can automatically generate those triggers?

That's an interesting question. The triggers are designed based on what we're seeing, what's coming back in the telemetry from the cars. We see clips where maybe we didn't brake but the person did, or vice versa — all kinds of disagreements with human driving and with the legacy stack — and we look at those. We typically try to have signals of varying generality. For example, a vision-radar disagreement is a very general trigger: it can source all kinds of disagreements. And then there are very specific ones. For instance, we actually struggled for a while when entering and exiting tunnels, because there is a lot of brightness variation, and we had to specifically design triggers for that because we were not catching it with a high enough frequency. For any of these situations you need to make sure they're well represented in the training set. For us this was a fairly manual process, but we do have a team basically dedicated to doing it full time, and they did it for four months. How you would automate this is, I think, very tricky, because you can have these general triggers, but I don't think they would correctly capture the error modes. It would be really hard, for example, to automatically derive a trigger for entering and exiting tunnels — that's something semantic that you, as a person, have to intuit is a challenge for whatever reason, and then go after specifically. So it's not clear how that would work, but I think it's a really interesting idea.

Yes, I thought the question sounded very interesting — and not easy to answer. Maybe another one, which showed up in a couple of forms: in your vision-only approach, could there be other sensors — for example thermal cameras, which could help in low-light situations, or even radar, which could give you some very specific information? I think there was an example of radar signals bouncing under the car in front of you, so you still get a signal about the car one ahead, things like that.
Do you think vision alone is really enough, or could some special sensors give you a certain edge?

I think it's an interesting idea, but I would say: clearly people use vision, the visible spectrum, and they make it work. There is an infinity of sensors with varying economic costs, and vision in the visible light spectrum is an incredibly cheap sensor, so the economics are very appealing — you can actually manufacture and include it at scale. It's a very cheap and nice sensor, and we basically have proof from humans that it is sufficient. I would also say that vision has all the information needed for driving, so in my mind vision is necessary and sufficient. That's what we're primarily focusing on. You could go crazy with a lot of sensors, but we're definitely currently just doubling down on vision alone.

Thanks a lot. I think we are now already more than five minutes over time, so I'll let you go and not ask other evil lidar-versus-camera questions. With this, I would actually like to give some concluding remarks on the workshop — or rather, move over to the last part.

Before that, I would really like to thank the whole team — without them, this would never have been possible.