# End-to-End Learning
# 1. Why End-to-End Learning

Self-driving capability is commonly divided into six levels, from L0 to L5. Today's self-driving cars have reached roughly L3, which still requires some driver participation and cannot realize complete autonomy. We believe that end-to-end learning points the way to a higher level of self-driving and can replace some of the technical redundancy in current self-driving solutions. Inspired by the paper "Variational End-to-End Navigation and Localization", we designed and reproduced the network presented in the paper and obtained some useful results.

### 1.1 Traditional Self-Driving Solutions

In a traditional self-driving solution, the vehicle's trajectory is determined by a pipeline of separate techniques such as lane detection, obstacle recognition, and a dependency on high-precision maps. The required hardware and the cost of computing are usually too high to afford.

What is more, when the required references are blurred or lost, the driver has to take over control of the vehicle to avoid following a wrong route. For higher levels of self-driving such as L4 or L5, these defects must be overcome so that vehicles can make correct decisions whatever the road condition is. Therefore, for the next stage of development, we want to remove these limitations and provide a purely vision-based solution. This brings us to end-to-end learning.

### 1.2 What Is End-to-End Learning

According to Wikipedia:

> "End-to-end learning process is a type of Deep_learning process in which all of the parameters are trained jointly, rather than step by step. Furthermore, just like in the case of Deep_learning process, in end-to-end learning process the machine uses previously gained human input, in order to execute its task."

As the name suggests, end-to-end learning means that, given captured data as input, the model is expected to produce a control command directly, without a complicated intermediate pipeline. In the self-driving field, end-to-end learning is a purely visual solution: it accepts image data from cameras and outputs a direct control command such as speed or steering, the way a human driver does. Concretely, for our project we use road curvature as the final output.

### 1.3 Feasibility

Since end-to-end learning is presented as an entirely new way to perform self-driving, there may be doubts about its feasibility. Because this kind of solution extracts information only from raw images, an intuitive question is: how can we be sure the network has really learned what matters? Nvidia's PilotNet has addressed this concern well. PilotNet is an end-to-end learning network, and its authors explored the question by shifting the references that actually influence the behavior of the vehicle, such as lane markings. The details and results of the experiment are shown below.

<center>![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf04)</center>
<center>Figure 1.1: Images used in experiments to show the effect of image shifts on steering angle</center>

----------

<center>![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf07)</center>
<center>Figure 1.2: Plots of PilotNet steering output as a function of pixel shift in the input image</center>

----------

As can be seen, they applied displacements to the salient objects, to the background, and to the whole image, using inverse-r as the measurement metric. After displacing the salient objects or the whole image, the output changed dramatically, while displacing only the background changed it very little. From this we can conclude that the end-to-end pipeline is aware of what it is learning and of which factors contribute most to the final result.

Their final conclusion was: "We further provide evidence that the salient objects identified by this method are correct. The results substantially contribute to our understanding of what PilotNet learns.", which supports the feasibility of end-to-end learning.
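To make this kind of displacement probe concrete, here is an illustrative sketch (not PilotNet's actual code): it shifts a horizontal band of an input image by a few pixels and measures how much a trained image-to-steering model changes its prediction. `model`, the band coordinates, and the shift values are hypothetical placeholders.

```python
# Illustrative sensitivity probe in the spirit of the PilotNet experiment
# (not their actual code): shift a horizontal band of the input image and see
# how much the predicted steering changes. `model` is any trained
# image -> steering regressor; band coordinates are hypothetical.
import cv2
import numpy as np

def shift_band(img, y0, y1, dx):
    """Return a copy of img with rows y0:y1 shifted horizontally by dx pixels."""
    out = img.copy()
    band = img[y0:y1]
    M = np.float32([[1, 0, dx], [0, 1, 0]])  # pure horizontal translation
    out[y0:y1] = cv2.warpAffine(band, M, (band.shape[1], band.shape[0]))
    return out

def steering_deltas(model, img, y0, y1, shifts=(-20, -10, 10, 20)):
    """Change in predicted steering for each pixel shift of the chosen band."""
    base = float(model.predict(img[None], verbose=0).squeeze())
    return [(dx, float(model.predict(shift_band(img, y0, y1, dx)[None],
                                     verbose=0).squeeze()) - base)
            for dx in shifts]
```

If the model has learned the right cues, shifting the band containing lane markings should move the prediction much more than shifting a background band, mirroring the PilotNet result.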
# 2. Architecture of the Model

In general, our model is a multiple-input, multiple-output model. Concretely, one input sample consists of five images captured at the same time: the front-left camera, the front camera, the front-right camera, an unrouted map, and a routed map. For the output, the model generates two pieces of information: a probabilistic control output and a deterministic control output. The paper states: "We refer to steering command interchangeably as the road curvature: the actual steering angle requires reasoning about road slip and control plant parameters that change between vehicles, making it less suitable for our purpose. Finally, the parameters (i.e. weight, mean, and variance) of the GMM's i-th component are denoted by ($\phi$ , $\mu$ , $\sigma^2$ ), which represents the steering control in the absence of a given route."

In practical terms, it is assumed that there are always three possible routes at any moment while driving, corresponding to a left turn, going straight, and a right turn. The weight component represents the probability of choosing each road, the mean component represents the average curvature each road contributes to the final result, and the variance component reflects the dispersion around that mean.

The whole network is shown below:

![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf08)
<center>Fig. 2.1: Model architecture. Source: Amini A., et al., Variational End-to-End Navigation and Localization</center>

----------

### 2.1 Camera Models

For the camera models, as mentioned above, the model accepts a set of three pictures. After a series of convolution operations (without pooling), each resulting tensor is flattened into an n*1 tensor; these are concatenated with one another, and the intermediate result is passed to the deeper part of the network for further processing.

### 2.2 Map Models

One of the most novel parts of the paper is introducing the map as training data, which enables localization from heading and position information. In our project, we used OpenStreetMap (OSM) as the map engine onto which the trajectory is projected. Concretely, we used osmnx to crop the map and cv2 to rotate the cropped map so that it presents a first-person perspective. After collecting the processed map images, we feed them into a CNN. The unrouted map is fed into the first part of the model and concatenated with the camera models' tensor to form a new one-dimensional tensor containing the merged information.

### 2.3 Variational GMM

For this part, we use a Gaussian Mixture Model (GMM) with K = 3 modes to describe the possible steering control commands, and we penalize the $L_{1/2}$ norm of the weights to discourage extra components. Instead of adding a self-defined GMM layer, we use a fully-connected layer to simulate the GMM by changing the loss function. From the perspective of a single component, there are three outputs: weight, mean, and variance. We implemented this by setting up an n*3 dense layer and reading the parameters out by position, as sketched below.
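The following is a minimal Keras sketch of such an n*3 head, assuming the fused camera/map feature tensor is called `features`; the names, sizes, and reshaping convention here are illustrative rather than our exact code. The raw values are arranged so that each row holds the (weight logit, mean, log standard deviation) triple of one component, and the loss in Section 2.4 then interprets them.

```python
# A minimal sketch of the n*3 dense head described above, assuming Keras and a
# fused camera/map feature tensor named `features` (names/sizes are illustrative).
from tensorflow.keras import layers

K = 3  # mixture components: left turn, straight, right turn

def gmm_head(features):
    """Emit K*3 raw values, read out by position as (weight logit, mean, log std)."""
    raw = layers.Dense(K * 3, name="gmm_params")(features)
    # Row i of the reshaped tensor holds the parameter triple of component i.
    return layers.Reshape((K, 3), name="gmm_params_by_mode")(raw)
```

Keeping the head as a plain dense layer and pushing the probabilistic constraints (weights summing to one, variances staying positive) into the loss is what "simulating the GMM by changing the loss function" means here.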
### 2.4 Loss Function

Through the variational model, the network predicts the parameters of the GMM. $\phi_i, \mu_i, \sigma_i^2$ denote the weight, mean, and variance of the $i$-th Gaussian component of the GMM. $I, M, \theta_s, \theta_p$ denote the camera images, the unrouted and routed maps, the probabilistic steering command, and the deterministic steering command, respectively.

$$
Loss = \mathcal{L}(f_s(I,M,\theta_p), \theta_s)+ \left\lVert\phi\right\rVert_{p}+\sum_{i}\psi_s(\sigma_i)+(f_D(I,M,\theta_p)-\theta_s)^2
$$

where

$$
\begin{align}
\mathcal{L}(f_s(I,M,\theta_p), \theta_s) &= \sum\theta_s\log P(\theta_s|\theta_p,I,M)\\
&= \sum(\theta_s\log\sum_i\phi_i (f_s)_i(\theta_s))\\
&= \sum(\theta_s\log\sum_i\phi_i \mathcal{N}(\mu_i,\sigma_i))\\
\end{align}
$$

is the negative log-likelihood of the steering command under the GMM parameters.

> Tip: In our case, cross entropy performed better than the plain negative log-likelihood, so the function above is actually written as a cross entropy.

And

$$
\psi_s(\sigma) = \left\lVert \log\sigma-c\right\rVert ^2
$$

is the regularization on the variance, where $c$ is a constant. $(f_D(I,M,\theta_p)-\theta_s)^2$ is the mean squared error of the deterministic steering predictions, and $\left\lVert\phi\right\rVert_{p}$ is the $p$-norm regularization on the GMM weights.
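To make the loss concrete, below is a minimal TensorFlow sketch of the probabilistic part: the negative log-likelihood of the steering command under the GMM, plus the weight-norm and variance regularizers. It assumes the K*3 head sketched in Section 2.3, uses the plain NLL form rather than the cross-entropy variant mentioned in the tip above, and all constant values (`c`, `p`, `eps`) are illustrative.

```python
# Sketch of the probabilistic loss term, assuming the (batch, K, 3) head output
# from Section 2.3 ordered as (weight logit, mean, log std). Illustrative only.
import math
import tensorflow as tf

def gmm_loss(c=0.0, p=1.0, eps=1e-6):
    def loss_fn(theta_s, gmm_params):
        theta_s = tf.reshape(theta_s, (-1, 1))                  # observed curvature
        logits, mu, log_sigma = tf.unstack(gmm_params, axis=-1)
        phi = tf.nn.softmax(logits, axis=-1)                    # mixture weights
        sigma = tf.exp(log_sigma)                               # positive std dev
        # Gaussian density of theta_s under each component, mixed with phi.
        dens = tf.exp(-0.5 * tf.square((theta_s - mu) / sigma)) \
               / (sigma * math.sqrt(2.0 * math.pi))
        nll = -tf.math.log(tf.reduce_sum(phi * dens, axis=-1) + eps)
        phi_reg = tf.reduce_sum(tf.pow(phi, p), axis=-1)        # ||phi||_p penalty
        var_reg = tf.reduce_sum(tf.square(log_sigma - c), axis=-1)  # psi_s(sigma)
        return nll + phi_reg + var_reg
    return loss_fn
```

In a multi-output Keras model, the deterministic branch would simply carry a standard `mse` loss, so the combined objective matches the formula above.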
# 3. Experiments

In this part, we use the model below for training:

<center>![](https://leanote.com/api/file/getImage?fileId=6517953aab644179368cdf03)</center>
<center>Fig. 3.1: Architecture of the whole model</center>

----------

### 3.1 Input Data

For the camera inputs, the original size of our pictures is 640*402*3, which is too large for our model to train on, so we resize the images to 200*80*3 and normalize them by dividing by 255.

<center>![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf05)![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf06)![](https://leanote.com/api/file/getImage?fileId=6517953aab644179368cdf02)</center>
<center>Figure 3.2: Images from the camera input</center>

----------

For the map inputs, once we have the GPS information of the vehicle, osmnx gives us the surrounding road network and we can project the trajectory onto the map.

<center>![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf0d) ![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf0b)</center>
<center>Figure 3.3: Map before and after projection</center>

----------

After obtaining the entire route, we zoom in and crop the map images. Since the generated maps are already 50*50*3, no further modification is needed.

<center>![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf0c) ![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf09)</center>
<center>Figure 3.4: Map input</center>

----------

### 3.2 Result

Below is our final result, trained on 1,000 samples and tested on the same dataset. The labels are shifted forward in time so that the model predicts the curvature a few seconds ahead. The $R^2$ metric is about 0.86.

<center>![](https://leanote.com/api/file/getImage?fileId=6517953bab644179368cdf0a)</center>
<center>Figure 3.5: Test result</center>

----------

<center><iframe height=420 width=640 src="https://i.ibb.co/Rbnfht5/mnggiflab-compressed-mnggiflab-compressed-mnggiflab-compressed-test-1.gif"></iframe></center>
<center>Figure 3.6: Sample video of the prediction</center>

----------

# 4. Potential Problems

Although we used very little data to obtain a well-performing result, some potential problems cannot be avoided:

1. This end-to-end learning pipeline needs a huge amount of training data to cope with complicated road conditions.
2. Speed control is a tough objective: in this project the speed is assumed to be constant, yet when turning a corner a small change in speed leads to a big change in trajectory.
3. Obstacle avoidance is also difficult without sufficient training samples and corner cases.

# 5. Conclusion

In this project, we verify the feasibility of the end-to-end learning pipeline and obtain accurate predictions. What is more, this kind of solution addresses some drawbacks of traditional self-driving strategies and avoids the high cost of both hardware and computing. All in all, we consider it a novel solution for L4/L5 autonomy, one that will replace the older approaches.