Hotspot Identification Through Pick-Up and Drop-Off Analysis of Ride-Hailing Transport Service

It is important to extract hotspots in urban traffic networks to improve driver route efficiency. This research aims to identify hotspot pick-up and drop-off (PUDO) areas in ride-hailing transportation services using a clustering approach. However, there are challenges in applying clustering algorithms to trajectory data given as Global Positioning System (GPS) coordinates. This research therefore proposes modifications to the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm that consider the radius from the center of each cluster to determine the presence of amenities around the cluster. We used a dataset containing 55,988 trip trajectories of Grab drivers over a two-week period in Jakarta. A preliminary statistical analysis was carried out to understand the distribution of trips. Next, we identified the PUDO points of each trip for use in the clustering analysis. The research explores the parameters and settings of the clustering method and their impact on the results. The study found that the results obtained from the clustering method are sensitive to parameter selection, including the epsilon radius and the minimum number of points needed to form a cluster. The best parameters (eps: 0.25, minpts: 100) produced 17 clusters with a silhouette coefficient of 0.752 in the pick-up (PU) location analysis and 18 clusters with a silhouette coefficient of 0.694 in the drop-off (DO) location analysis. Overall, the research highlights the potential of the clustering analysis method for ride-hailing transportation.


A. Background
The rise of application-based transportation services such as ride-hailing has revolutionized the way people move around modern cities [1]. In the context of transportation services, "pick-up" denotes the moment when a vehicle (such as a taxi or ride-hailing car) arrives at a designated location to collect passengers or packages, while "drop-off" refers to the moment when the vehicle arrives at the destination and the passengers or packages are left there.
The exponential growth in the use of these services has created enormous opportunities to analyze and understand urban mobility patterns [2]. In the study of Zhang et al. [3], mobility-based data analysis is the main tool for uncovering complex urban movement patterns. Many studies have used taxi data to examine urban mobility. Veloso et al. [4] used taxi trajectory data from Lisbon to discuss the spatiotemporal variation of taxi services, correlations between pick-up and drop-off (PUDO) sites, and driver behavior, and Liu et al. [5] used taxi trajectory data from Shanghai to explore human movement patterns. Taxi traces have also been mined for characteristics and behaviors of the vehicular network, including anomalous behavior, driver behavior, dynamics of mobility patterns, interactions between vehicles, and the relationship between gender and mobility [6]. Additionally, Keler et al. [7] examined where automobile routes connect at specific rush hours in urban locations.
By identifying origin and destination (OD) flow clusters in urban travel data, it is possible to determine prospective routes for public transportation services [8]. To locate taxi OD hotspots, the available OD pairs from empirical mobility traces are first grouped [9]. A deep understanding of hotspots, namely areas with heavy travel activity, has great potential for shaping more efficient transportation planning and better traffic management [10]. Although previous studies such as Dutta et al. [11] have explored hotspot analysis in urban mobility, the use of more sophisticated and comprehensive clustering methods remains an important area of exploration.
Exploring urban mobility patterns has become a main focus in recent years, with previous research analyzing trajectory data [12]. In this context, it is essential to define the key terms and assumptions that form the basis of this study before presenting the method. We define a hotspot as a location with a high concentration of PUDO points, a definition adapted from [13]. We assume that the dataset used in this study is representative of the overall ride-hailing transportation system in the study area. Additionally, we assume that variables such as the time of day, the day of the week, and the location of well-known destinations have an impact on drivers' PUDO patterns.

B. Motivation
Most previous research on ride-hailing has focused on routing methods, with little emphasis on optimizing PUDO locations that are sensitive to the spatial and temporal distribution of demand [14]. When a trip is shared by multiple riders, PUDO optimization is vital to prevent pointless detours, whereas in the conventional system the vehicle is frequently obliged to pick up passengers at certain locations [15]. Detours made to pick up additional riders frequently result in longer travel times and higher costs. The positions of all potential PUDO points therefore need to be optimized. Shen et al. [2] suggest a cluster analysis approach for identifying various metropolitan online ride-hailing operation trends. Based on the suggested intensity and stability indicators of ride-hailing vehicle operational characteristics, the k-means++ clustering technique is applied. The results show that there are three distinct operating patterns for online ride-hailing services. In the study of Zhang et al. [3], the goal is to identify the distribution of areas with high travel demand as well as the relationship between travel demand and points of interest (POIs). Tang [16] utilizes taxi GPS trajectory data to analyze urban human activity and mobility in Harbin city. The researchers employ the DBSCAN algorithm for PUDO location clustering and develop four spatial interaction models to understand pick-up location searching behavior.
However, with the rise of ride-hailing services, there is a need to develop new methods for analyzing the PUDO patterns of drivers. Gunawan and Susilawati [17] address limitations in current ride-hailing PUDO location selection practices, which often prioritize spatial distribution and company interests over passenger needs. Research using neural networks in [18] examines the integration of clustering models and deep learning techniques and proposes a taxi hotspot prediction model that concurrently captures the spatial and temporal fluctuations of taxi hotspots.
Several studies have discussed applications of clustering analysis in transportation systems. Zhang et al. [19] used DBSCAN to cluster PUDO data from a ride-hailing service in China. They introduce a pick-up points recommendation model (PPRM) that uses DBSCAN to cluster historical orders, which makes it possible to find contextually relevant candidate PU points. Wang and Ren [20] introduce a two-level divide approach and enhance the K-means++ algorithm to refine the clustering of taxi passenger hotspots based on GPS location data. The method is validated using a week of New York City green taxi data, demonstrating superior accuracy and comparable time efficiency compared to the traditional K-means and DBSCAN methods.
Based on previous research, clustering analysis has become an increasingly popular approach in data mining and machine learning. Following Rafiq and McNally [21], clustering such data points lets ride-hailing companies gain insights into the traffic patterns and usage trends of their customers, which can help them optimize their operations and improve their overall service quality.
To build on prior research that examined trajectory data to investigate urban mobility patterns, this study employs a clustering approach to identify hotspot PUDO areas for ride-hailing transportation services. The DBSCAN algorithm was chosen for its ability to identify clusters of varying shapes and sizes. However, it has limitations in handling spatial datasets. This study therefore proposes modifying the DBSCAN algorithm to take into account the radius from the cluster's center in order to determine the presence of facilities in the cluster's vicinity. Our method identifies concentrations of PUDO locations that can be used to determine potential hotspots. This study does not provide a direct comparison with previous methods; the contribution of the proposed method is that it considers the radius from the center of the cluster to determine the presence of amenities around the cluster and that it uses a large dataset of ride-hailing trajectories.
The paper is structured as follows. Section II describes the data processing and the proposed method. The findings and results obtained with the methodology are discussed in Section III. Section IV concludes by summarizing the work and offering recommendations for future work.

A. Data Collection
The dataset is derived from Grab's food delivery and logistics services in Jakarta, and it includes 4,000 daily trajectories collected from 2019-04-08 to 2019-04-21 (inclusive, Coordinated Universal Time, UTC). The trajectories were gathered from drivers' phones while they were on the road. The total number of GPS pings in the collection is 61,549,964; each GPS ping has values for its trajectory ID, latitude, longitude, timestamp, accuracy level, bearing, and speed [22]. A raw data sample is shown in Table I.
The names of the fields in Table I are covered below.

1) Trajectory ID: A number used to identify different GPS mobility trajectories.
2) Latitude: A GPS location's latitude coordinate.
3) Longitude: A GPS location's longitude coordinate.
4) Timestamp: The time at which the GPS location was recorded, in UTC. The shortest sampling interval between GPS points is one second.
5) Accuracy level: The radius of the circle that, with a certain probability, contains the real location.
6) Bearing: The direction of travel in degrees relative to true north.
7) Speed: The instantaneous speed in meters per second.

B. Data Preprocessing
The next step is to preprocess the data into trajectories, each consisting of the GPS points of one captured trip. The trips were separated by trajectory ID, and the points with the minimum and maximum recorded timestamps of each trip were used to determine the PUDO information. The resulting dataset contains 55,988 trajectories in total; the distribution of the number of trips is shown in Fig. 1.
The daily distribution of trips is depicted in Fig. 1. We display only the relative frequency of travel requests in order to maintain data confidentiality. As seen in Fig. 1(a) for pick-up (PU) time, the temporal travel demand is typically distributed over the normal working day. It is relatively high between 6 A.M. and 12 noon, after which it declines until 9 P.M. and then rises once more until 10 P.M. The same trend occurs in drop-off (DO) time, presented in Fig. 1(b).
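The PUDO extraction described in this subsection can be illustrated with a minimal sketch. It is not the authors' code: the sample pings, the tuple layout, and the function names `extract_pudo` and `hourly_relative_frequency` are assumptions made for illustration, and hours are computed in UTC because the dataset timestamps are given in UTC.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical GPS pings: (trajectory_id, latitude, longitude, unix_timestamp).
pings = [
    ("t1", -6.2001, 106.8161, 1554715800),
    ("t1", -6.2050, 106.8200, 1554716400),
    ("t1", -6.2102, 106.8253, 1554717000),
    ("t2", -6.1754, 106.8272, 1554732000),
    ("t2", -6.1801, 106.8305, 1554733200),
]

def extract_pudo(pings):
    """Return {trajectory_id: (pick_up_ping, drop_off_ping)}.

    The pick-up is the ping with the minimum timestamp of a trajectory,
    the drop-off is the ping with the maximum timestamp.
    """
    pudo = {}
    for ping in pings:
        tid, ts = ping[0], ping[3]
        if tid not in pudo:
            pudo[tid] = (ping, ping)
            continue
        first, last = pudo[tid]
        if ts < first[3]:
            first = ping
        if ts > last[3]:
            last = ping
        pudo[tid] = (first, last)
    return pudo

def hourly_relative_frequency(timestamps):
    """Share of events that fall in each UTC hour of the day."""
    hours = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    total = sum(hours.values())
    return {hour: count / total for hour, count in sorted(hours.items())}

pudo = extract_pudo(pings)
pick_up_share = hourly_relative_frequency([pu[3] for pu, _ in pudo.values()])
```

Reporting shares rather than counts mirrors the relative frequencies shown in Fig. 1. Converting to Jakarta local time would require an offset that this sketch deliberately leaves out.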

C. Proposed Method
The main analysis applies the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) method to the preprocessed data to identify clusters of PUDO locations. DBSCAN, introduced by Ester et al. [23], is a clustering algorithm commonly used in data analysis to identify clusters of data points based on their spatial density.
It is particularly useful when dealing with data where clusters might have irregular shapes and varying densities. The algorithm categorizes data points into three main types [24]: core points, border points, and noise points. A core point is a data point that has at least a specified number of neighbor points (minpts) within a certain distance (ε). These points are at the heart of a cluster. A border point is a point that is within distance ε of a core point but does not have enough neighbors to be a core point itself. Finally, noise points are points that are neither core nor border points.
The algorithm consists of the following steps [25]:
1) Locate all neighbor points within ε of each point and identify the core points, that is, the points with at least minpts neighbors.
2) For each core point that is not already assigned to a cluster, create a new cluster.
3) Recursively find all points that are density connected to the core point and assign them to the same cluster.
4) Two points a and b are said to be density connected if there is a point c that has a sufficient number of points in its neighborhood and both a and b are within distance ε of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of a, then b is connected to a.
5) Iterate over the remaining unvisited points in the dataset. All points that do not belong to any cluster are considered noise.
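The steps above can be sketched in plain Python as follows. This is a minimal illustration of standard DBSCAN, not the authors' DBSCAN_FIT. It assumes Euclidean distance on raw coordinates and counts a point as its own neighbor; the function names and the sample points are hypothetical.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, minpts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < minpts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= minpts:
                queue.extend(j_neighbors)  # j is also a core point: keep chaining
    return labels

# Two dense groups of four points each and one isolated point.
points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.2, minpts=3)
```

The brute-force neighbor search is quadratic in the number of points. For a dataset of the size used here, a spatial index would be the usual choice.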
In this paper, we modified the DBSCAN algorithm from Kambe and Pe [25] to better suit the ride-hailing context by creating a function called DBSCAN_FIT that takes three parameters: x (the dataset), ε (epsilon, the maximum distance between two points to be considered in the same cluster), and minpts (the minimum number of samples in a cluster). This function uses the DBSCAN algorithm to cluster the data and produces visualizations of the clusters, including the points labeled as noise (points not belonging to any cluster). Additionally, the function calculates the centroid of each cluster and assesses the amenities around each centroid. The pseudocode for DBSCAN_FIT is shown in Algorithm 1.
To obtain the amenity information with the function get_amenity, we used OpenStreetMap (OSM) data to identify the facilities or points of interest around each centroid. Specifically, we used the OSM API to query the database for the amenities within a certain radius of each centroid. We then used the OSM tags to extract the names of the amenities and assigned them as labels to the corresponding clusters. To translate differences in degrees of latitude or longitude into distances, we use 111,320 as the conversion factor, which is the approximate number of meters in one degree (about 111.32 km).
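The centroid and radius logic can be sketched as follows. This is an illustration under stated assumptions rather than the authors' get_amenity: the OSM query is replaced by a small in-memory list of hypothetical amenities, the function names are invented, and the cosine correction for longitude is an added refinement that the text does not mention (it is close to 1 at Jakarta's latitude).

```python
import math

METERS_PER_DEGREE = 111_320  # conversion factor quoted in the text

def cluster_centroids(points, labels):
    """Mean (lat, lon) of every cluster; noise (label -1) is skipped."""
    sums = {}
    for (lat, lon), label in zip(points, labels):
        if label == -1:
            continue
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += lat
        s[1] += lon
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def approx_distance_m(a, b):
    """Flat-earth distance in meters between two (lat, lon) pairs."""
    dlat = (a[0] - b[0]) * METERS_PER_DEGREE
    mean_lat = math.radians((a[0] + b[0]) / 2)
    # Assumption: scale longitude differences by cos(latitude).
    dlon = (a[1] - b[1]) * METERS_PER_DEGREE * math.cos(mean_lat)
    return math.hypot(dlat, dlon)

def amenities_near(centroid, amenities, radius_m):
    """Names of the amenities within radius_m meters of the centroid."""
    return [
        name for name, pos in amenities
        if approx_distance_m(centroid, pos) <= radius_m
    ]

# Hypothetical clustered points and amenity list (stand-in for an OSM query).
points = [(-6.20, 106.80), (-6.22, 106.82), (-6.30, 106.90)]
labels = [0, 0, -1]
amenities = [
    ("Mall A", (-6.2105, 106.8102)),
    ("Station B", (-6.25, 106.85)),
]

centroids = cluster_centroids(points, labels)
nearby = amenities_near(centroids[0], amenities, radius_m=500)
```

Here the centroid of cluster 0 is close to (-6.21, 106.81), so only the amenity a few tens of meters away falls inside the 500 m radius.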

D. Cluster Performance Measure
To evaluate the quality and coherence of the clusters obtained through the clustering algorithm, the Silhouette Coefficient (SC) was used. It quantifies the cohesion within clusters and the separation between them. The SC ranges from -1 to 1; a higher SC indicates well-defined clusters in which data points are closer to the members of their own cluster than to others. A score close to 0 suggests data points on cluster boundaries, while negative scores indicate potential misassignments [26].
A higher SC would indicate well-defined clusters of PUDO points, highlighting their coherence and separation from other clusters.This metric becomes essential in assessing the accuracy of clustering results and validating the effectiveness of algorithms.The silhouette score guides the determination of how accurately identified hotspots represent distinct patterns, thereby enhancing the credibility of the hotspot identification approach in ride-hailing transport services.
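For reference, the silhouette of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the smallest mean distance to the members of any other cluster. The SC is the mean of s(i). The sketch below implements this definition in plain Python. Ignoring noise points and using Euclidean distance are assumptions, since the text does not state how noise was handled.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette over all clustered points (noise label -1 is ignored)."""
    clusters = {}
    for p, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for label, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            # (the distance from p to itself is 0, hence the n - 1 divisor).
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            denom = max(a, b)
            scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Two tight, well-separated pairs of points give a score close to 1.
sc = silhouette_coefficient(
    [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)],
    [0, 0, 1, 1],
)
```

For this toy input every point has a = 1 and b of about 10.02, so the coefficient is about 0.90.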

A. Experimental Result
Our experiments focus on the effect of the epsilon (ε) and minpts values in the DBSCAN_FIT algorithm on the clustering results. A preliminary analysis of the data showed that minpts values below 50 did not produce enough clusters to be useful for hotspot identification, while values above 100 resulted in too many clusters, making it difficult to identify meaningful patterns. We therefore focus on minpts values of 50 and 100 in our analysis and visualization, as these values produced the most meaningful results for hotspot identification. The outcomes of this investigation are summarized graphically in Fig. 2.
The graph in Fig. 2 illustrates the variation in the silhouette coefficient (SC) with changing values of ε. Two curves are shown, one for minpts set at 50 and the other for minpts set at 100. For minpts 50, the SC tends to decrease as ε increases, suggesting a negative impact on clustering quality. In contrast, for minpts 100, the SC fluctuates, with some ε values resulting in higher coefficients. From the graph, it can be inferred that for minpts 100 the SC is more stable and potentially yields better clustering results than minpts 50 within a specific range of ε values. The minpts value therefore plays a pivotal role: the minpts 100 configuration consistently outperforms minpts 50, yielding higher SC values. In this study, minpts 100 is thus used as the reference to find the optimal number of clusters. The SC values are compared with the number of clusters formed in Fig. 3.
The number of clusters formed ranges from 3 to 37, as shown in Fig. 3. In the context of hotspot identification, a higher number of clusters can provide a more detailed picture of variations in the distribution of hotspots. On the other hand, the SC value is also important because it reflects the quality and coherence of the clusters. Balancing the two, the optimal number of clusters was chosen as 17 for the PU locations and 18 for the DO locations. The resulting map distribution of PUDO locations is illustrated in Fig. 4. The distribution of PUDO locations inside each cluster is shown in Fig. 5, where the x-axis represents the cluster number and the y-axis represents the number of trips inside each cluster, in relative frequency. This information is crucial for locating hotspots for application-based transportation services. These findings are directly related to our study's primary objective, which was to develop a technique for identifying hotspots by closely analyzing PUDO patterns. With the parameters ε: 0.25 and minpts: 100, the best cluster structure produced 17 clusters for the PU location analysis, with an SC of 0.752, while the DO location analysis produced 18 clusters with an SC of 0.694. These clusters have varied levels of activity, as seen in the bar chart in Fig. 5, indicating the existence of distinct hotspots throughout the transportation network.
Fig. 4 and Fig. 5 show the most active hotspots for both pick-up and drop-off sites. These places represent prospective areas that could be used to improve driver trips. The most promising hotspots identified with the proposed method are presented in Table II along with their amenity labels.

B. Discussion
To assess the performance of the clustering, we conducted a comparative analysis with two distinct minpts values, 50 and 100, while exploring the influence of the ε parameter across a range from 0.15 to 0.5. The Silhouette Coefficient (SC) served as our evaluation metric [26]. Our findings are consistent with previous studies that have examined the impact of the ε parameter on DBSCAN clustering [23], [24]. For example, [23] found that larger values of ε can lead to the formation of overly large clusters, which can reduce the effectiveness of the clustering algorithm. Similarly, [24] found that larger values of ε can lead to a decrease in the quality of the clustering results. Our study builds on these findings by examining the impact of both the ε parameter and the minpts value on the performance of clustering in the context of ride-hailing services.
One important aspect of the study is the influence of the algorithm parameters on the clustering results. The results obtained from the clustering method are sensitive to parameter selection, including the epsilon radius and the minimum number of points needed to form a cluster. Across the ε range from 0.15 to 0.5, the SC for minpts 50 tends to decrease as ε increases, suggesting a negative impact on clustering quality. In contrast, for minpts 100, the SC fluctuates, with some ε values resulting in higher coefficients.
We found that analyzing pick-up and drop-off (PUDO) patterns was effective in identifying hotspots in ride-hailing transport services. Our analysis revealed commuting patterns of users and different hotspots in the transportation network. Several limitations of this study require careful consideration. Firstly, the dataset used in this research was limited to a single ride-hailing service, which may limit the generalizability of the analysis results to other services. Secondly, this study only accounted for time and location in the PUDO analysis, while other factors that may influence PUDO patterns, such as weather or special events in certain areas, were not considered.

IV. CONCLUSIONS
Our study shows that DBSCAN_FIT clustering with ε: 0.25 and minpts: 100 yields 17 clusters for PU locations and 18 clusters for DO locations, demonstrating their potential as hotspots in ride-hailing services. However, a limitation lies in the trade-off between cluster quantity and quality. A more comprehensive understanding of ride-hailing hotspots requires a balance between cluster granularity and silhouette coefficients. Other researchers can use our method of analyzing hotspots through PUDO patterns to find hotspots in other transportation networks. This approach can help improve efficiency and help drivers optimize the routing of transportation services.
Future studies should consider using datasets from multiple ride-hailing services to enhance the generalizability of the analysis results. Additional factors such as weather or special events in certain areas should also be taken into account in the PUDO analysis to provide a more comprehensive understanding of the factors that influence PUDO patterns.

Fig. 1. Average distribution of (a) pick-up time and (b) drop-off time.