A perception centred self-driving system without HD Maps

Building a fully autonomous self-driving system has been discussed for more than 20 years, yet it remains unsolved. Previous systems have limited ability to scale. Their localization subsystems need labor-intensive map recording before running in a new area, and their accuracy decreases after the environment changes. In this paper, a new localization method is proposed to solve these scalability problems, together with a new method for detecting and making sense of diverse traffic lines. Like a human driver, a self-driving system should not need an exact position to travel in most scenarios. As a result, without HD Maps, GPS or IMU, the proposed localization subsystem relies only on detecting driving-related features around the vehicle (like lane lines, stop lines, and merging lane lines). For spotting and reasoning about all these features, a new line detector is proposed and tested against multiple datasets.


I. INTRODUCTION
Ziegler's system [1] was able to drive fully autonomously for over 100 kilometers without any interruptions in 2014. Despite these early achievements, the industry leaders are still struggling to pass the necessary tests according to [2]. It is critical to inspect why current self-driving systems are hard to implement and deploy widely. Current systems rely on HD Maps to produce centimeter-level position accuracy. Readers are referred to [3] for more about the typical system architecture. The big question is whether accurate positions are necessary.
Human drivers make driving decisions based on what they see. They make sense of the surrounding environment and decide when to turn or keep the current driving direction. They cannot mark their exact position on a map, but they know how to travel through a complicated intersection based on the knowledge of which way they should take. Likewise, a self-driving system without accurate locations should be a viable solution.
In this paper, a new perception centered self-driving system is proposed and discussed in two driving scenarios: the cruising scenario and the turning scenario. The cruising scenario is when the vehicle cruises on parallel lanes. The turning scenario is when the vehicle drives through free spaces (defined as the drivable area outside of lanes, like intersections or parking areas).
The proposed system comes with several advantages in these two scenarios. Firstly, it does not rely on HD Maps, so it scales to new areas without recording new HD Maps. Secondly, the proposed feature detection method is not based on any specialized end-to-end deep learning solution. Hence it is easy to debug and visualize. Also, it does not need an additional time-consuming training process for scaling. Lastly, it performs more robustly in a severely changed environment (such as changes in season, weather or lighting conditions).
Just like human drivers, the system only deals with the relevant visual features (defined as traffic features, including traffic lines, traffic lights and traffic signs). The workflow of the detection and localization subsystem is shown in Fig. 1. In the cruising scenario, only the first step is needed, including 1.1 and 1.2. In the turning scenario, all four steps must be done. Note that the vehicle position from the localization subsystem is based on the rebuilt scene rather than a global map. The localization subsystem also projects the rebuilt scene onto a digital map (like Google Maps) to provide navigation instructions while crossing free spaces. The navigation instruction leads the car from one exit to the target entrance of the free space. The path planning system and control system also work on the rebuilt scene. Hence they are independent of the map.
The proposed system relies on traffic lines (including curbs) for tracking the vehicle's position. Hence, the lines detector is the priority. A general lines detector for understanding complicated traffic lines on the road is vital. The experiment covers several types of lines, including lane lines, stop lines, curbs, merging and splitting lines and intersections in a roundabout. For the popular lane lines detection problem, the proposed traffic lines detector performs as well as approaches supported by deep neural networks, leveraging prior knowledge of line positions and angles with simple erosion and clustering. This robust and straightforward method is then generalized and successfully detects other kinds of lines as well. After that, the process of localizing the vehicle in the rebuilt scene is discussed with examples and limitations. In that example, the system requires neither GPS signals nor IMU signals nor 3D HD Maps to locate the vehicle.

II. RELATED WORK
What is a perception centered self-driving system? Most self-driving systems rely on a map-based localization subsystem. They are categorized as localization centered systems because all other subsystems work in the map space provided by the localization subsystem. The perception centered system uses a local scene, instead of a global map, as the working space for all other subsystems. Limited research has been done in this direction. One of the exceptions is [4] by Bojarski from Nvidia. In this work, they tried to build an end-to-end system from camera images to control signals with the help of augmented learning. It is also independent of maps. However, this system only works for minimal lane-keeping tasks in the cruising scenario. It is not designed to work with other subsystems, and its scalability has not been tested on more sophisticated roads or sensor settings.

A. Localization
For most localization centred systems, all decision making and path planning are based on centimetre-level localization accuracy from the localization subsystem. Using GPS, with the aid of IMU, is a popular solution and provides accuracy better than 20 centimetres with SLAM over an HD Map [5]. The problem with GPS is that the signals are not always available, and the result tends to drift unpredictably. For quite a long time, SLAM has been considered the key to solving the localization problem for self-driving cars. The SLAM algorithm matches visual features stored in the HD Map against features extracted from the live camera on the self-driving car. Visual features are usually organized as bags of features (BoF) in the descriptor space. Without HD Maps or IMU, researchers can hardly reach centimetre-level accuracy like [6] and [7].
However, two problems of the SLAM based localization approach are tricky to solve. Firstly, the performance decreases once the environment changes. Changes in light angle cause different shadow shapes, and seasonal changes cause massive appearance changes in trees and grass. These changes yield new visual features which cannot be matched with the recorded ones on the HD Map. This problem requires labour-intensive map re-recording every time such changes occur. Secondly, the localization result tends to drift after long-range driving, and the error accumulates with growing driven distance, as discussed in [5]. The intrinsic reason for these problems is that the original SLAM algorithm was designed for indoor localization problems, where dramatic environment changes or long-distance movement are not considered. Hence these problems are hard to eliminate.
Recent researchers, like Ma [8], started to use as few visual features as possible for localization. Besides saving the storage for the BoF of these features, using fewer features decreases the risk of being affected by environment changes [9].
This trend brings the idea of using minimal features for localization. The LaneLoc system proposed by Schreiber [10] tried to match the exact appearance of lane markings against pre-recorded maps. This approach could be seen as counting the number of dashed fragments the vehicle has passed in order to localize the car. This approach still has several limitations. Firstly, it does not work with solid lines and ends up relying only on IMU without any visual aids. Secondly, the exact appearance will eventually change. Dashed lines are repainted or worn out, and both cases are prevalent. Thirdly, the performance is very fragile. Slight disturbances, like occlusions or heavy shadows, make the system miss one or more fragments and yield a persistent error as a result. Lastly, the labelling process is both complicated and hard to finish accurately, as discussed by Schreiber in their paper. The proposed system addresses these limitations by abstracting line features further into types and directions with the proposed lines detector.

B. Traffic Line Detection
Traffic line detection, or the narrower problem of lane detection, was the essence of many early driver assistance systems [11] like the Lane Departure Warning System (LDWS) and the Lane Keeping Assist System (LKAS). Many researchers, like Kim [12], used a Convolutional Neural Network (CNN) to reduce noise and segment the markings of those lines. Wang [13] used shapes extracted from OpenStreetMap (OSM) as prior knowledge to help detect the lanes. Some problems remain for the CNN supported approaches.
Firstly, they still cannot solve the long-tail challenging situations because CNNs rely heavily on the distribution of the training dataset. As a result, CNNs generally perform poorly in rare situations. Secondly, the segmentation result of the CNN approaches often has blurry edges when the network is not confident about the prediction. These blurry edges make it difficult for the following algorithms to form a line from the ambiguous pixels. Lastly, CNNs are significantly dataset related. They tend to work well only on the dataset they have been trained on [14]. This limitation arises because different datasets and sensor settings tend to create distinctive patterns of noise in the images. For example, in the KITTI dataset [15], the same line marks show different appearances in different locations in the BEV space. Lines far from the camera show clear artifacts caused by the BEV transformation. The self-driving related datasets often cover just one type of the available camera settings. A vast and comprehensive dataset like MS-COCO [16] for the object detection task does not exist for now.
As a result, CNNs are not used for lines detection in this paper. The proposed lines detector leverages the lines information from a topology map as prior knowledge, similar to what Wang did in [13] with OSM. The proposed lines detector separates different line types and boosts the performance further by using a different detector for each type of line (solid or dashed lines, straight or curved lines). It also uses a sliding window to detect and connect traffic lines, similar to what Tsai did in [17]. The sliding window approach has proved to be both robust and easy to visualize for debugging.

III. SYSTEM DESIGN
The overall workflow is shown in Fig. 1. In the cruising scenario, the detection subsystem finishes parts 1.1 and 1.2 to give the current lane number of the vehicle, which is enough for generating a driving path and control signal without involving the localization subsystem at all. However, the detection subsystem needs to continuously detect the traffic features of the next traffic part (which could be another lane ahead or a free space connected with an exit). The order of the series of traffic features is based on the topology map.
The topology map, used as the descriptor space for matching the digital map with the rebuilt scene, is the center of the system, and the relationship is shown in Fig. 2. The topology map should be drawn before the system can run in a new area. The topology map also provides lane information that helps lines detection as prior knowledge and helps the vehicle change to a preferred lane in advance. The topology map contains the following information: Each turning point on the digital map is used for finding the nearest entrance-exit pair with the correlated directions. Define $T = \{(\lambda_t, \phi_t), \alpha_t, \beta_t\}$ as the set of all turning points on the digital map, where $\lambda_t$ and $\phi_t$ are the latitude and longitude of turning point $t$, $\alpha_t$ is the direction before the turning and $\beta_t$ is the direction after the turning. $D_{in}$ and $D_{out}$ are the sets of all entry points and all exit points. The score function $f$ is the product of $g$ and $h$, as in equation (1), where $g$ is the Euclidean distance between two points and $h$ is the difference between two angles.
$$f = g \cdot h \qquad (1)$$
$P$ is the set of all legal pairs of entrances and exits. All legal pairs should connect to the same free space and follow the traffic law. For example, the exit at the end of a right turning lane cannot pair with the entrance ahead with the same direction. The optimal pair with the minimal $f$ score is the matched result, subject to the condition $(d^*_{in}, d^*_{out}) \in P$. This method assumes that the turning point on the digital map is the center point between the target exit and the target entrance.
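The matching step can be illustrated with a short sketch. It is not the implementation used in this paper: the exact arguments of g and h are only partly specified above, so the sketch assumes that g sums the distances from the turning point to both points of a pair, that h sums the heading differences against α and β, and that all positions are already projected to a local planar frame in metres. The constant `eps`, which keeps a perfect match in one factor from zeroing the product, and the names `Node`, `score` and `match_turning_point` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """An entry or exit point of a free space (hypothetical container)."""
    name: str
    x: float        # metres, local planar frame (assumed)
    y: float
    heading: float  # degrees, driving direction through this point


def angle_diff(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def score(turn_xy, alpha, beta, before, after, eps=1.0):
    """f = g * h for one candidate pair (before-turn point, after-turn point)."""
    # g: distance term (assumption: summed over both points of the pair)
    g = (math.dist(turn_xy, (before.x, before.y))
         + math.dist(turn_xy, (after.x, after.y)))
    # h: angle term (assumption: alpha vs. first point, beta vs. second point)
    h = angle_diff(alpha, before.heading) + angle_diff(beta, after.heading)
    return (g + eps) * (h + eps)


def match_turning_point(turn_xy, alpha, beta, legal_pairs):
    """Return the pair in P with the minimal score f."""
    return min(legal_pairs,
               key=lambda p: score(turn_xy, alpha, beta, p[0], p[1]))


# Toy intersection centred at the origin; headings: 0 = east, 90 = north.
south = Node("south_exit", 0.0, -10.0, 90.0)
east = Node("east_entrance", 10.0, 0.0, 0.0)
north = Node("north_entrance", 0.0, 10.0, 90.0)
west = Node("west_entrance", -10.0, 0.0, 180.0)
P = [(south, east), (south, north), (south, west)]

# A right turn from north-bound to east-bound.
best = match_turning_point((0.0, 0.0), alpha=90.0, beta=0.0, legal_pairs=P)
```

In this toy case all three pairs are equally close to the turning point, so the angle term alone selects the east-bound entrance.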
The data of P and D are manually initialized as part of the topology map. These data usually do not need to be changed unless the traffic features are changed, for example when an intersection is updated with an additional right changing lane or when new construction on the road temporarily changes the lane changing rules. The maintenance of the topology map is easy and fast since the only parts that need to be changed in the sets P and D are the data of the lanes.
For the turning scenario, the detection subsystem only needs to detect one pair of non-parallel lines to form an anchor to rebuild the scene. For example, in the intersection scenario shown in Fig. 4, the middle lane line and the stop line are enough for a strong anchor to rebuild the scene based on the given relative position from the topology map. The target entrance on the right side can be predicted and used for path planning. Once the vehicle has driven into the free space and passed the stop line, which will no longer be detected, the stop line of the target lane will be detected and provide a strong anchor to follow up. The starting point of the target lane will form a weak anchor as an additional clue for localization.
The detection of anchors might be affected by occlusions caused by other objects on the road. In addition, when the vehicle is crossing a large intersection, there may be areas where no anchor is in sight. In that case, the target lane direction and the current drivable area serve as a backup to help the vehicle finish the turn. The free space situation ends with the positive detection of the next detectable lane set. If there are multiple lines parallel to each other nearby, the system assumes the detected one is the nearest one based on the current lane level position.
The system needs to be initialized at the beginning of each run based on GPS signals and the current driving direction from the gyroscope to tell the system which lane the vehicle is in. The GPS signal does not need to be centimetre-level accurate, and the detection subsystem updates the lane number by counting the lines between the vehicle and the detected curbs. This paper does not cover behaviour decisions among crossing lanes because this can be considered a separate and solved problem thanks to previous research like [1]. These behaviour decisions include behaviours like yielding to vehicles coming out from other merging lanes. These rules are universal and consistent.
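The lane number update by line counting can be sketched as follows. This is a minimal illustration under stated assumptions: lateral offsets are measured in metres in the BEV space with the vehicle at 0, lanes are numbered from the detected curb starting at 1, and the function name `lane_number` is hypothetical.

```python
def lane_number(line_offsets, curb_offset):
    """Count lane lines between the vehicle (offset 0) and the curb.

    line_offsets: lateral offsets of detected lane lines, in metres.
    curb_offset:  lateral offset of the detected curb, in metres
                  (positive = right of the vehicle, negative = left).
    Returns the lane index counted from the curb (1 = lane next to the curb).
    """
    lo, hi = sorted((0.0, curb_offset))
    between = [x for x in line_offsets if lo < x < hi]
    return len(between) + 1


# One lane line lies between the vehicle and the right curb,
# so the vehicle is in the second lane from that curb.
current_lane = lane_number([-5.2, -1.7, 1.8], curb_offset=5.4)
```

Because only the count matters, the coarse GPS-based initialization needs to be correct only to the lane level, and later detections can correct it.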

C. General Lines Detector
The proposed lines detector in the detection subsystem can detect diverse types of lines. The code for lane lines detection for the KITTI dataset can be found on this repository. These types of lines were tested: (1) lane lines, (2) curbs, (3) stop lines, (4) merging or splitting points of two lines (pair of lines), (5) special lane lines or curbs (which are not parallel to the current ego lane). The lines detection problem was dissected by tracing back to the most significant visual feature of the lines, which is their long and narrow appearance. A sliding window was used to follow possible lines. All noise without this narrow feature was eliminated by applying these methods:
• Region Restriction: The detection subsystem leverages prior knowledge about the starting points to eliminate noise in unrelated regions. This knowledge comes either from previous lines detection results or from predictions based on the positive detection results of neighbour lines with the lane width given by the topology map. For dashed lines, the sliding window moves at a step size equal to the dash segment interval given by the topology map to ensure an optimal detecting position for each segment. The system tolerates minor errors in this interval distance. The more knowledge about the lines is available, the smaller the detection window can be. A smaller region of interest gives better resilience to challenges, helps the segment normalize better and speeds up the lines detection process.
• Special Convolution Kernel: The system uses a special kernel, as shown in Fig. 5. This proposed kernel helps to produce a cleaner result in the Hough space for the next steps with less noise. Also, this kernel is more friendly for detecting curves, merging lines and splitting lines than the simple vertical kernel.
• Directional Erosion: The system uses a special directional erosion structuring element to erode noise which does not span along a specific direction ($A \ominus B$, where $A$ is the set of pixels in the window and $B$ is a 5 by 1 narrow structuring element), as illustrated in Fig. 6. The direction of the target lines is given by the topology map. In a sliding window, the line segment can be considered a straight line. Sharp turning lines or circles are also eroded into small segments, which are then filtered out. Though there are other, more complicated ways to leverage direction information for lines detection [18], directional erosion is the simplest and it works.
• Types of Lines: The system leverages prior knowledge of the types of the lines to get better performance. For curves in each detection window, the turning angles are restricted to thresholds, which are usually very small and are given by the topology map. For straight lines, a much narrower detection window can be used. For dashed lines, the marks which are too long or too short are filtered out, as shown in Fig. 7. The topology map gives the length of the segments of the dashed lines.
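Two of the filters listed above can be sketched in a few lines. This is an illustrative sketch, not the released code: it handles only the vertical direction (lines running along the driving direction in the BEV window), it treats pixels outside the window as background, and the tolerance `tol` as well as the function names `erode_vertical` and `filter_dash_segments` are assumptions.

```python
def erode_vertical(img, length=5):
    """Binary erosion with a vertical `length` x 1 structuring element.

    img is a list of rows holding 0/1 values. A pixel survives only if all
    `length` pixels of the column segment centred on it are set, so blobs
    that do not extend along the vertical direction are removed.
    """
    half = length // 2
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            keep = True
            for dr in range(-half, half + 1):
                rr = r + dr
                # Out-of-window pixels count as background (a design choice).
                if rr < 0 or rr >= rows or img[rr][c] == 0:
                    keep = False
                    break
            out[r][c] = 1 if keep else 0
    return out


def filter_dash_segments(lengths, expected, tol):
    """Keep dash marks whose length is within `tol` of the expected length."""
    return [s for s in lengths if abs(s - expected) <= tol]


# A 7 x 5 window: a vertical line in column 2 plus a horizontal streak in row 3.
window = [[0, 0, 1, 0, 0] for _ in range(7)]
window[3] = [1, 1, 1, 1, 1]
eroded = erode_vertical(window)

# Marks of 0.4 m and 9.0 m are rejected for an expected 3.0 m dash.
kept = filter_dash_segments([0.4, 2.9, 3.1, 9.0], expected=3.0, tol=0.5)
```

After erosion the horizontal streak disappears and only the core of the vertical line remains, which is the behaviour the directional structuring element is chosen for.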
The proposed lines detector uses the Y channel from the YUV color channels since it was shown to perform better by Lin in [19]. The system works in the Bird's-Eye-View (BEV) space since the prior knowledge of those lines can be leveraged without predicting the camera pose or estimating the vanishing point (VP) [20]. More about the homography transformation from the camera image to a BEV space with a given camera pose can be found in [21].
For the feature detection in the Hough space, a low-high-low kernel was widely used by [22], [23] and [24]. A new low-middle-high kernel was used and then mirrored to run the detection on the left and right sides separately. Merging and splitting points and their directions (merging from / splitting to the left or the right) can then be detected by comparing the lengths of these two lines detection results. For example, where a line is splitting to the right, the line detection from the right side breaks and yields a shorter line than the left side, as shown in Fig. 8. To separate splitting and merging, two additional windows are created facing upwards and downwards. A positive lines result in the upwards window means splitting, and a positive result in the downwards window means merging.
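For reference, extracting the Y channel can be sketched as below. The text does not state which YUV variant is used, so the common BT.601 luma weights are assumed here, and the function name `luma_image` is hypothetical.

```python
def luma_image(rgb_image):
    """Convert rows of (R, G, B) tuples to the Y (luma) channel.

    Uses the BT.601 weights Y = 0.299 R + 0.587 G + 0.114 B (an assumption;
    the text does not specify the YUV variant).
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


# A yellow marking pixel next to a grey asphalt pixel.
y = luma_image([[(255, 255, 0), (90, 90, 90)]])
```

Yellow and white markings both keep a high Y value against asphalt, which is consistent with preferring this channel for thresholding.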
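The decision logic described above can be summarized in a short sketch. It assumes that the two mirrored kernels already produced a detected line length for each side and that the upward and downward windows report a boolean result; the threshold `min_gap` and the function name `classify_junction` are hypothetical.

```python
def classify_junction(left_len, right_len, up_hit, down_hit, min_gap=5):
    """Classify a merging or splitting point from the two mirrored detections.

    left_len, right_len: detected line lengths (pixels) from the left-side
                         and right-side kernels.
    up_hit, down_hit:    positive line detection in the upward / downward
                         additional window.
    min_gap:             minimum length difference treated as a break
                         (an assumed threshold).
    Returns None for a plain line, otherwise (kind, side).
    """
    if abs(left_len - right_len) < min_gap:
        return None
    # The side whose detection breaks early is the side of the branch.
    side = "right" if right_len < left_len else "left"
    if up_hit and not down_hit:
        kind = "splitting"
    elif down_hit and not up_hit:
        kind = "merging"
    else:
        kind = "unknown"
    return (kind, side)


# The right-side detection breaks early and a line is found in the upward window.
result = classify_junction(left_len=40, right_len=22, up_hit=True, down_hit=False)
```

Keeping the side decision and the split/merge decision separate mirrors the two-stage description in the text and makes each stage easy to visualize when debugging.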
Lastly, the procedure for stop lines detection is as follows.
After the detection of a window, if the line is broken at the upper end, two side windows are created. A horizontal line detection, using a horizontal convolution kernel and erosion structuring element, is applied to detect the stop lines. If the result is positive, then this lane line is marked as finished, and no window is created above.
For special lines which are not parallel to the current ego lines, an initial position for the sliding window is not available. However, the system can still use the direction information from the topology map. Spotting the anchors from the target entrance while turning in free spaces is one of the situations which requires detecting special lines, as shown in Fig. 4. The process is a little different, shown as follows:
The later part of this chapter shows how the localization method helps the vehicle travel through an intersection in the turning scenario.
For lane lines detection, the method was tested on KITTI [15] and Cityscapes [25]. For general traffic lines detection, the proposed method was tested on the Berkeley DeepDrive (BDD 100k) [26], KITTI and a self-recorded video. These results of general lines detection cannot be compared to other methods due to a lack of metrics. Lastly, the BDD 100k dataset and images from a self-recorded video are used for testing the localization method while passing free spaces.

A. Lane Lines Detection
The proposed lines detector, ECPrior (Erosion and Cluster with prior knowledge), performs as well as other approaches supported by deep neural networks [27] [28] [29] [30] based on the KITTI behaviour evaluation [31] metric. The result is shown in Table I. Some of the detection results are shown in the first row of Fig. 9. The proposed detector does not include object detection; hence it is affected by other cars close to the lines. A typical object detector can be added beforehand to get a better result, as Satzoda did in [32]. Object detection is usually a separate module, and the same feature should not be implemented again in the lines detection module. The proposed lines detector works equally well on Cityscapes, which shows its scalability, as shown in the second row of Fig. 9, even though the Cityscapes images have a very different aspect ratio from the KITTI images. The limitations of ECPrior are:
• Like all other methods, ECPrior relies on a stable and accurate BEV transformation. The transformation is hard to keep accurate when the ground is not flat. Although deep neural networks can learn to compensate for this on a specific dataset, it is still hard to scale over different datasets. On non-flat surfaces, the width of a lane might shrink, as shown in the first fail case in Fig. 10. Dynamic adjustment of the window width can prevent windows from merging. ECPrior can tolerate minor distortion of the BEV transformation.
• Because ECPrior is for general cases, the input images should not contain special artifacts which would disturb the detector, as shown in the second fail case in Fig. 10. For KITTI, these artifacts are mainly caused by the BEV transformation over low quality areas.

B. General Lines Detection
ECPrior can solve the problem caused by shadows or short breaks for general lines detection. ECPrior also proved to be robust under different lighting conditions. For stop lines, images from BDD 100k were used for testing. The result is shown in Fig. 11. The upper case in that image is in a lightly snowing daylight environment, and the lower case is in a night lighting environment. In both cases, ECPrior successfully detects the stop lines ahead.
ECPrior also detects special lines well. A self-recorded video was used for testing. An example in Fig. 12 shows the ability to detect special lines in a turning scenario travelling into a roundabout. In this situation, ECPrior needs to detect the rear inner side of the roundabout. The left side curb of the current lane and the inner side curb of the roundabout can then form a strong anchor used to rebuild the scene of the free space for localization.
ECPrior uses intense erosion and thresholding so that only a small portion of the target lines is detected at the pixel level. Hence the ECPrior detector is not a pixel-level detector. ECPrior, as a complete line detection module, provides lines detection results using regression for dash line segments and straight lines and using splines for the others. ECPrior inevitably relies on an accurate BEV transformation to leverage the prior knowledge of the lines. Distortion due to a camera behind the windshield or problematic camera settings also narrows the effective area for general lines detection; in that situation, only lines in the middle of the front view can be detected. As an example, the detector failed to detect the left side of the inner curb due to distortion in Fig. 12.
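The regression step mentioned above for straight lines and dash segments can be illustrated with an ordinary least-squares fit over the sparse pixels that survive erosion. This is a sketch under assumptions: the lateral coordinate x is regressed on the longitudinal coordinate y because lines in the BEV space run mostly along the driving direction, and the function name `fit_line` is hypothetical. The spline case is not covered.

```python
def fit_line(points):
    """Least-squares fit of x = a * y + b to (x, y) pixel coordinates."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if syy == 0:
        raise ValueError("need at least two points with different y values")
    sxy = sum((y - mean_y) * (x - mean_x) for x, y in points)
    a = sxy / syy
    b = mean_x - a * mean_y
    return a, b


# Four surviving pixels that lie on x = 0.5 * y + 10.
a, b = fit_line([(10.0, 0.0), (12.5, 5.0), (15.0, 10.0), (20.0, 20.0)])
```

Since only a few pixels per line survive the erosion, fitting a parametric model is what turns the sparse detections into a complete line for the following subsystems.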

C. Localization
Based on these results from previous examples, strong and weak anchors can be established to locate the vehicle in the turning scenario. The proposed localization approach relies on neither GPS nor IMU for vehicles to travel through urban areas. The system provides a stable and accurate position based on the rebuilt scene for path planning and control subsystems in the turning scenario. For the cruising scenario, the detection system gives a lane level localization result (which lane the vehicle is on) which is enough for the following subsystems.
There are several limitations to using the naive form of the proposed system for localization. Firstly, the proposed localization method relies on visual clues from specific traffic features. Heavy occlusion blocking most of the target traffic lines affects the localization result to some degree. In one situation, the vehicle was approaching an intersection with heavy traffic ahead, which blocked most of the upcoming stop lines. The localization system did not spot anchors until the vehicle was very close to the stop lines, leaving the following subsystems a short reaction time to stop. In another situation, the vehicle was about to turn right into a small alley based on the navigation. Several parked vehicles blocked the view of the right side curb. Hence the detection subsystem did not detect the right turning feature for the alley, and the vehicle missed the target turn.
In the first situation, the behaviour of other vehicles can be exploited as an input for the localization subsystem, like the way Gao leveraged the position of other vehicles in [33]. For example, when the system detects a line of stopped vehicles, it can assume that the position of the first stopped car indicates the position of the stop line, and use this prediction to extend the reaction time for the following subsystems. In the second situation, a more comprehensive drivable area analysis will show a right side road extension indicating the alley. Additionally, the localization subsystem is compatible with traffic lights, traffic signs and GPS as additional sources of information.
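The stop line prediction from a queue of stopped vehicles could be sketched as follows. This idea is not implemented or evaluated in this paper; the sketch assumes an upstream object detector that reports distance ahead and speed for vehicles in the ego lane, and the thresholds `speed_eps` and `max_gap` as well as the function name `predict_stop_line` are hypothetical.

```python
def predict_stop_line(vehicles, speed_eps=0.3, max_gap=12.0):
    """Estimate the stop-line distance from a queue of stopped vehicles.

    vehicles:  list of (distance_ahead_m, speed_mps) for the ego lane.
    speed_eps: speed below which a vehicle counts as stopped (assumed).
    max_gap:   largest spacing still treated as the same queue (assumed).
    Returns the distance to the head of the queue, or None if no vehicle
    ahead is stopped.
    """
    stopped = sorted(d for d, v in vehicles if d > 0 and abs(v) <= speed_eps)
    if not stopped:
        return None
    head = stopped[0]
    for d in stopped[1:]:
        if d - head > max_gap:
            break  # too far away to belong to the same queue
        head = d
    return head


# Three queued cars, one moving car and one unrelated stopped car far ahead.
observed = [(8.0, 0.0), (15.5, 0.1), (23.0, 0.0), (30.0, 9.0), (60.0, 0.0)]
stop_line_guess = predict_stop_line(observed)
```

The returned distance refers to the detected position of the first car in the queue, so it is only a coarse prediction until the stop line itself becomes visible as an anchor.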

V. CONCLUSIONS
This paper proposes a new perception centered self-driving system and focuses on testing the proposed general lines detector, ECPrior, and the localization method on several urban cases. The proposed system design is a skeleton and a starting point with the potential to work with additional modules to get better performance. For example, users can try to apply the method by Hillel in [34] to remove lens flare and make the detection of ECPrior more robust when driving towards the sun. This potential is much more promising than that of detection methods based on deep neural networks. Diverse types of scene rebuilding can be discussed in future work. Places like indoor parking areas without GPS signals will rely heavily on the rebuilt scene to localize the vehicle. Hence they should be prioritized.
In the end, I appeal to the community to reconsider the necessity of using SIFT-like visual features for localization, as well as the need to rely on deep neural networks for traffic lines detection in the context of self-driving.