2016 International Conference on Intelligent Transportation, Big Data & Smart City

Early Warning of Traffic Accident in Shanghai Based on Large Data Set Mining

Yang Yanbin, Zhou Lijuan, Leng Mengjun, Sun Ling
Shanghai Maritime University, College of Transport & Communications, Shanghai, 201306, China
[email protected]

978-1-5090-6061-0/17 $31.00 © 2017 IEEE. DOI 10.1109/ICITBS.2016.149

Abstract- Through classification and regression analysis of traffic accident statistics in Shanghai from July 2014 to April 2015, this paper puts forward a forecasting model of traffic accident incidence, from which we derive an index system of traffic accidents comprising month, week, weather and wind speed. The model is also used to calculate the range of the traffic accident rate. Finally, based on an analysis of safety levels, we make recommendations for controlling traffic accidents and for the related rescue work, which has guiding significance for traffic accident prevention and traffic safety management in China.

Keywords- data mining; traffic accident; regression analysis; incidence; safety levels

I. INTRODUCTION

According to global traffic and police department statistics, traffic accidents killed about 500 thousand people worldwide last year. About 104 thousand of these deaths occurred in China, roughly 1/5 of the world total, ranking first in the world. Many traffic accidents happen because the road itself is unreasonably laid out, so there is an urgent need to change the status quo and reduce the incidence of accidents.

At present, road traffic accident analysis and decision making are still largely done manually, and manual processing is the main cause of the low efficiency and poor accuracy of decision analysis on large volumes of accident data. It is therefore imperative to carry out scientific research on, and effective improvement of, the analysis and decision making for road traffic accidents. Existing navigation systems, however, only give voice prompts for speeding and for monitored or accident-prone road sections, which is a shortcoming. Warning drivers about accident-prone conditions on the road ahead would raise their vigilance and thus reduce the probability of road accidents.

This paper analyzes whether the various factors associated with Shanghai traffic accidents actually influence them. We collate a large initial record of accident data and screen the influencing factors by significance analysis to form a new accident record. The accident rate model is then fitted with Lingo, and the factors that influence the traffic accident rate are derived.

II. ACCIDENT DATA MINING

The goal of data mining is to discover hidden and meaningful knowledge from databases. There are many data mining algorithms, and they apply to broad functional areas, including classification, estimation and prediction, clustering, association, sequence discovery and characterization. Regression analysis, time series analysis and cluster analysis are among the general methods.

Data mining is the process of extracting knowledge from specific forms of data. For specific data and specific issues, one or more algorithms are chosen to find the hidden rules of the data, that is, implicit and meaningful knowledge, in order to provide scientific support for decision making. The basic process of data mining is as follows:

A. Data preparation

Select the data suitable for the data mining application, examine its quality in preparation for further analysis, and determine the analytical methods to be used. Our main data source is the record of traffic accidents in Shanghai in recent years. To make the mining more effective, we also include related data such as time, temperature and weather information for Shanghai.

B. Data reorganization and conversion

We build on the open data of the Shanghai municipal government, using SODA data, public data and private data. Because the accident data come from government statistics and manual sorting and are mainly intended for statistical reporting, they are incomplete, redundant and ambiguous. They cannot be fed directly to a data mining algorithm and must first be processed and classified.

C. Data mining

After cleaning and conversion, the original accident data become a data set suitable for mining. Data mining is carried out on this data set to extract knowledge and to find an appropriate knowledge model for decision analysis. For the specific data and issues at hand, one or more data mining algorithms are chosen to find the hidden rules and patterns and to provide a solution to the problem.

D. Result analysis

Interpret and evaluate the results of data mining, remove the meaningless parts, re-analyze the meaningful rules or patterns, and finally present them to decision makers in a form that is easy to understand and identify.

III. ROAD TRAFFIC ACCIDENTS DATA MINING IN SHANGHAI

This paper explores the correlation between Shanghai traffic accidents and various influencing factors, obtains the probability of road accidents under all circumstances, and points out specific measures. We therefore develop the analysis in the following aspects.
A. Classification

In order to establish a reasonable index system of traffic accidents, nine possible influencing factors are selected: month, week, time, temperature, weather, wind direction, wind speed, whether there is a camera and whether the road is smooth. We first classify all factors, coding month as 1 to 12 and week as 1 to 7. Time, temperature, weather, wind direction and wind speed are coded according to Tables 1 to 5 respectively.

Table 1 Time categories
Time             Reference value
0:00-1:59        1
2:00-3:59        2
4:00-5:59        3
6:00-7:59        4
8:00-9:59        5
10:00-11:59      6
12:00-13:59      7
14:00-15:59      8
16:00-17:59      9
18:00-19:59      10
20:00-21:59      11
22:00-23:59      12

Table 2 Temperature categories
Temperature      Reference value
-10-0℃           1
0-5℃             2
5-10℃            3
10-15℃           4
15-20℃           5
20-25℃           6
25-30℃           7

Table 3 Weather categories
Weather          Reference value
heavy rain       1
thundershower    2
moderate rain    3
rainstorm        4
clear            5
shower           6
overcast         7
sleet            8
light rain       9
cloudy           10

Table 4 Wind direction categories
Wind direction   Reference value
east wind        1
west wind        2
north wind       3
northeaster      4
northwester      5
southwester      6
southeaster      7
south wind       8

Table 5 Wind speed categories
Wind speed       Reference value
grade 3          1
grade 3-4        2
grade 3-5        3
grade 4-5        4
grade 4-6        5

We test the significance of the correlation between accident frequency and each influencing factor in order to screen out the factors that actually drive accidents, and we keep or remove factors according to the correlation analysis results. In the end, seven influencing factors are retained: month, week, time, temperature, weather, wind direction and wind speed.

B. Regression analysis

First, we process the accident frequency data by month and obtain the relation between month and accident frequency by fitting.

Figure 1. Relation fitting on month and traffic accident frequency

Figure 1 shows that the number of accidents in Shanghai is lowest in February and higher in September, October and November. In February, most people go home for the New Year and Shanghai's traffic volume falls to its lowest, so the number of accidents is at a minimum. In September, October and November the number of vehicles on the road increases, partly because the school term begins and partly because of the National Day holiday, so the number of accidents also increases, which is in line with reality.

Analyzing the other influencing factors in the same way, we reach the following conclusions:

1) Week
Most accidents occur on Monday, Thursday and Friday. On the first working day and the last two working days of the week, people generally become less disciplined and are prone to traffic accidents.

2) Time
As expected, more traffic accidents occur in the morning and evening peak hours than at other times, that is, in 6:00-8:00 and 16:00-19:00.

3) Temperature
The number of traffic accidents is relatively even across temperature ranges, but it increases slowly as the temperature rises.

4) Weather
Traffic accidents occur most frequently in light rain and cloudy weather, which makes people listless and inattentive. In heavy rain, rainstorm and similar weather, people are more careful, so accidents are less frequent.

5) Wind direction
Traffic accidents happen mostly under a southeaster, mainly because China lies in the east of the Eurasian continent, bordering the Pacific, and the southeast monsoon arrives in summer. This also corroborates the influence of temperature on accident frequency.

6) Wind speed
Accidents happen most often at grade 3 wind speed, and the number of accidents decreases slowly as the wind speed increases.
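The category coding of Tables 1 to 5 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the names `time_code`, `temperature_code` and `encode_record` are hypothetical, the temperature bins are assumed to include their lower bound (the tables do not say which bin a boundary value such as 5℃ belongs to), and Monday is assumed to be week value 1.

```python
from bisect import bisect_right
from datetime import datetime

# Reference values copied from Tables 3, 4 and 5.
WEATHER_CODES = {
    "heavy rain": 1, "thundershower": 2, "moderate rain": 3, "rainstorm": 4,
    "clear": 5, "shower": 6, "overcast": 7, "sleet": 8, "light rain": 9,
    "cloudy": 10,
}
WIND_DIRECTION_CODES = {
    "east wind": 1, "west wind": 2, "north wind": 3, "northeaster": 4,
    "northwester": 5, "southwester": 6, "southeaster": 7, "south wind": 8,
}
WIND_SPEED_CODES = {
    "grade 3": 1, "grade 3-4": 2, "grade 3-5": 3, "grade 4-5": 4, "grade 4-6": 5,
}
# Inner boundaries of the temperature bins in Table 2, in degrees Celsius.
TEMPERATURE_BOUNDS = [0, 5, 10, 15, 20, 25]


def time_code(hour):
    """Table 1: two-hour bins, 0:00-1:59 -> 1, ..., 22:00-23:59 -> 12."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return hour // 2 + 1


def temperature_code(celsius):
    """Table 2: -10-0 -> 1, 0-5 -> 2, ..., 25-30 -> 7 (lower bound inclusive, assumed)."""
    if not -10 <= celsius <= 30:
        raise ValueError("temperature outside the range covered by Table 2")
    return bisect_right(TEMPERATURE_BOUNDS, celsius) + 1


def encode_record(when, celsius, weather, wind_direction, wind_speed):
    """Return (month, week, time, temperature, weather, wind direction, wind speed) codes."""
    return (
        when.month,
        when.isoweekday(),  # Monday = 1 ... Sunday = 7 (assumed to match the paper)
        time_code(when.hour),
        temperature_code(celsius),
        WEATHER_CODES[weather],
        WIND_DIRECTION_CODES[wind_direction],
        WIND_SPEED_CODES[wind_speed],
    )


if __name__ == "__main__":
    # A made-up record for illustration only, not taken from the accident data.
    record = encode_record(datetime(2014, 10, 7, 17, 30), 22.0,
                           "cloudy", "southeaster", "grade 3")
    print(record)
```

Keeping the lookup tables as plain dictionaries makes it easy to check each code against the printed tables; an unknown weather or wind label raises a `KeyError` rather than being coded silently.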
From the fitting described above, we obtain the relationship between the number of accidents and each influencing factor. The resulting models of the number of accidents are listed in Table 6, where $y$ is the number of accidents and $x_1, \dots, x_7$ are the reference values of month, week, time, temperature, weather, wind direction and wind speed.

Table 6 Regression equations of the influencing factors and the number of accidents
(The operator signs between terms are not legible in the source text and are shown as $\pm$.)

Influence factor   R²       Formula   Regression equation
Month              0.8767   (2-1)     $\ln y = \pm 0.0033x_1^3 \pm 0.0642x_1^2 \pm 0.3009x_1 \pm 5.4948$
Week               0.8538   (2-2)     $\ln y = \pm 0.0039x_2^4 \pm 0.0683x_2^3 \pm 0.4083x_2^2 \pm 0.9363x_2 \pm 2.6051$
Time               0.9461   (2-3)     $\ln y = \pm 0.0008x_3^5 \pm 0.0247x_3^4 \pm 0.2828x_3^3 \pm 1.3777x_3^2 \pm 2.4163x_3 \pm 5.527$
Temperature        0.8395   (2-4)     $\ln y = \pm 0.0285x_4^3 \pm 0.43x_4^2 \pm 2.0652x_4 \pm 1.1673$
Weather            0.9809   (2-5)     $\ln y = \pm 1.6416\ln(x_5) \pm 0.5452$
Wind direction     0.8023   (2-6)     $\ln y = \pm 0.0274x_6^2 \pm 0.0879x_6 \pm 1.5834$
Wind speed         0.9026   (2-7)     $\ln y = \pm 0.5349x_7^3 \pm 4.8008x_7^2 \pm 13.243x_7 \pm 16.668$

Table 6 shows that the $R^2$ of every regression equation is greater than 0.8, which indicates that the fit is very good, the regression equations are significant, and the regression model holds.

C. Model of accident occurrence rate

Based on the relationship between the number of accidents and the various influencing factors, we first assume that the incidence rate $Y$ depends on the influencing factors as follows:

$$
\begin{aligned}
Y ={}& k_1x_1^3 + k_2x_1^2 + k_3x_1 + k_4x_2^4 + k_5x_2^3 + k_6x_2^2 + k_7x_2 \\
&+ k_8x_3^5 + k_9x_3^4 + k_{10}x_3^3 + k_{11}x_3^2 + k_{12}x_3 + k_{13}x_4^3 + k_{14}x_4^2 + k_{15}x_4 \\
&+ k_{16}\ln(x_5) + k_{17}x_6^2 + k_{18}x_6 + k_{19}x_7^3 + k_{20}x_7^2 + k_{21}x_7
\end{aligned}
\tag{2-8}
$$

According to the values we set and the corresponding $Y$ values, we impose the constraint:

$$
k_1 + k_2 + k_3 + \dots + k_{21} = 1 \tag{2-9}
$$

After processing the data, the coefficients of the function are fitted with Lingo.

Figure 2. Lingo example solution

As a result, we obtain the relationship between the incidence of accidents in Shanghai and the influencing factors:

$$
\begin{aligned}
Y ={}& -0.4508\times10^{-4}x_1^3 + 0.1834\times10^{-2}x_1^2 - 0.02058x_1 \\
&- 0.8176\times10^{-2}x_2^4 + 0.1345x_2^3 - 0.7518x_2^2 + 1.6172x_2 \\
&+ 0.01799\ln(x_5) + 0.8932\times10^{-2}x_7
\end{aligned}
\tag{2-10}
$$

The function shows that the number of accidents per day, namely the accident rate, is most strongly linked to month, week, weather and wind speed. We therefore select month, week, weather and wind speed as the four factors of the accident rate index system.

Figure 3. Index system of accident occurrence rate

In this way, we can calculate the probability of occurrence of traffic accidents from the month, week, weather and wind speed.

IV. MODEL APPLICATION

From the function obtained above and the value range of each variable, we find that the maximum value of the traffic accident rate is 1.2397 and the minimum value is 0.9303. That is, on a Tuesday in January with cloudy weather and grade 4-6 wind, the probability of traffic accidents reaches its maximum, and strict vigilance is needed. On a Monday in August with rainy weather and grade 3 wind, the probability of traffic accidents reaches its minimum. A possible reason is that people are more careful on a rainy day and less prone to traffic accidents, but they still need to be reminded to be careful.

According to the range of the traffic accident rate, we define the safety levels shown in Table 7.

Table 7 Safety level classification
Safety level     Range of accident rate
7                0.9303-0.9746
6                0.9747-1.0189
5                1.0190-1.0632
4                1.0633-1.1075
3                1.1076-1.1518
2                1.1519-1.1961
1                1.1962-1.2400

According to the safety level, we can take appropriate measures to prevent avoidable traffic accidents. For example, at the higher safety levels, an electronic warning system can remind people to be careful at sharp turns or where there are large crowds. At the lower safety levels, electronic warnings are not enough, and traffic police and other personnel are also needed to manage the traffic situation and prevent accidents.

We can also analyze the incidence of traffic accidents in a given area. With regard to the "golden 5 minutes" of rescue time after a traffic accident, a volunteer aid station can then be set up at a suitable place for every hospital. This addresses the problem that, because the public seriously lacks first-aid knowledge, the most effective rescue time is often missed.

V. CONCLUSIONS AND RECOMMENDATIONS

(1) Due to the rapid growth in motor vehicles, drivers and road mileage and the rapid development of the economy, traffic accidents in Shanghai, and the casualties and economic losses they cause, have also increased rapidly.

(2) By analyzing the traffic accident records and the data on the influencing factors from July 2014 to April 2015, we obtain a traffic accident rate index system with four parts: month, week, weather and wind speed. Applying the index system gives the rate of traffic accidents.

(3) According to the range of the traffic accident rate, we put forward safety levels, and for each safety level we provide corresponding measures and the concept of the volunteer aid station.

(4) To develop the traffic accident rate model further, the classification of the current traffic data needs to be more reasonable. In addition to the current traffic accident data, other data that could influence traffic accidents, such as vehicle mileage, road information and the number of lanes, need to be collected so that the model can be improved as soon as possible.
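As an illustration of how the index system could be applied, the fitted rate model (2-10) and the safety levels of Table 7 can be sketched in Python. This is a sketch under stated assumptions, not the authors' implementation. The function names `accident_rate` and `safety_level` are hypothetical. The coefficient signs are the ones for which the model reproduces the extremes reported in Section IV (1.2397 and 0.9303). Monday is assumed to be week value 1. Rates outside the tabulated range are clamped to the nearest level, which is also an assumption.

```python
import math


def accident_rate(month, week, weather, wind_speed):
    """Evaluate the rate model (2-10).

    month: 1-12, week: 1-7 (Monday = 1, assumed),
    weather: reference value 1-10 from Table 3,
    wind_speed: reference value 1-5 from Table 5.
    """
    month_term = (-0.4508e-4 * month ** 3
                  + 0.1834e-2 * month ** 2
                  - 0.02058 * month)
    week_term = (-0.8176e-2 * week ** 4
                 + 0.1345 * week ** 3
                 - 0.7518 * week ** 2
                 + 1.6172 * week)
    weather_term = 0.01799 * math.log(weather)
    wind_term = 0.8932e-2 * wind_speed
    return month_term + week_term + weather_term + wind_term


# Lower bound of each accident-rate range in Table 7, highest rate first.
SAFETY_LOWER_BOUNDS = [
    (1.1962, 1), (1.1519, 2), (1.1076, 3), (1.0633, 4),
    (1.0190, 5), (0.9747, 6), (0.9303, 7),
]


def safety_level(rate):
    """Map an accident rate to the safety level of Table 7 (7 = safest)."""
    for lower_bound, level in SAFETY_LOWER_BOUNDS:
        if rate >= lower_bound:
            return level
    return 7  # below the tabulated range: clamp to the safest level (assumption)


if __name__ == "__main__":
    # Tuesday in January, cloudy (10), wind grade 4-6 (5): reported maximum.
    highest = accident_rate(month=1, week=2, weather=10, wind_speed=5)
    # Monday in August, heavy rain (1), wind grade 3 (1): reported minimum.
    lowest = accident_rate(month=8, week=1, weather=1, wind_speed=1)
    print(round(highest, 4), safety_level(highest))
    print(round(lowest, 4), safety_level(lowest))
```

Storing only the lower bound of each range avoids gaps between adjacent ranges of Table 7, for example between 0.9746 and 0.9747.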