International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC 2017)

Weather Prediction: A novel approach for measuring and analyzing weather data

Mr. Sunil Navadia, Mr. Pintukumar Yadav, Mr. Jobin Thomas, Ms. Shakila Shaikh
Computer Engineering Dept., SJCEM, St. John College of Engineering and Management, Palghar, India

Abstract: The volume of data generated in the last few years has increased tremendously and is expected to keep growing, so analyzing huge chunks of weather data and performing predictive analysis on it with traditional methods is a tedious process. The project aims to forecast the chances of rainfall by using predictive analysis in Hadoop. The proposed system serves as a tool that takes rainfall data from a large data set as input and predicts future rainfall, together with the minimum, maximum and average rainfall, in an efficient manner. Predictive analytic models capture relationships among many factors in a data set to assess the risk associated with a particular set of conditions and assign a score or weight. These patterns of scores or weights found in historical data can be used for predicting the future.

Keywords: likelihood; rainfall; dataset; predictive; weightage.

I. INTRODUCTION

Weather prediction is the application of technology to predict the weather for a given location based on historical or current data, as applicable. Climate change has attracted attention for a long time because of the unexpected changes it brings. Several limitations stand in the way of better weather forecasting, and as a result it is difficult to predict short-term weather efficiently [1]. Predicting the climate has always proven to be important and useful.

Big data involves collecting large volumes of data, which poses a great challenge for Hadoop, a part of the big data ecosystem that uses MapReduce and Pig to maintain and process the data and helps to extract useful information efficiently [2]. Big data technologies maintain huge amounts of data and process them efficiently. Big data includes data sets whose size is beyond the ability of commonly used software tools to capture, manage and process. We use MapReduce and Pig commands to analyze the data sets and to perform various operations on them. Based on the historical weather data of previous years, we are able to predict future weather [3].

II. LITERATURE REVIEW

This section reviews earlier research in the prediction domain. It covers six papers on predictive analysis, together with the systems they implemented, and studies each of them in detail.

In [4], the authors describe the design of a patient-customized healthcare system. It consists of four modules. Medical Data Collection Module (MDCM): stores big data on patients' health and medical information in HBase. Text Mining Hadoop Module (TMHM): converts the collected unstructured data into structured data, such as patient information and family history, and stores the structured data in HBase using a MapReduce framework. Disease Rule Creation Module (DRCM): generates disease rules from the disease information stored in HBase. Disease Management Prediction Module (DMPM): reports the risk index or the result of the disease prediction.

In [5], the authors describe how storms can be predicted using the data of previous years. The data set contains a huge number of records and therefore lends itself to research. The paper defines a prediction solution based on the MapReduce framework.
The data is classified using a Support Vector Machine (SVM), which makes it possible to predict the maximum rainstorm. The MapReduce framework is used for the rainstorm prediction.

In [6], the authors describe how difficult it is for water supply agencies to decide on the level of water consumption from lakes, because future water levels are not easy to predict. The paper focuses on forecasting the future water level of lakes in order to avoid water scarcity. An Auto-Regressive Integrated Moving Average (ARIMA) model is used for forecasting, and Hadoop is used for handling the big data collected from the historical records of the lakes. ARIMA modeling consists of four stages: (1) model identification, (2) model estimation, (3) diagnostic checking and (4) forecasting. The R programming language is used along with the ARIMA model to predict future levels by applying data-driven analytics and data mining concepts. The model is applicable to any time series with various pattern changes, which makes it possible to predict the approximate future lake level.

978-1-5090-3243-3/17/$31.00 ©2017 IEEE

In [7], the authors describe how predicting the daily behavior of the stock market is a serious issue for stockholders. The stock market has attracted research in many fields because of its financial impact. Using linear regression, the authors predict the behavior of the S&P 500 index and then compare and evaluate the results of their method against other approaches. According to the authors, the system performs well on huge volumes of data, so stockholders can invest with more confidence. Using integrated collective data, it can determine market policies and their orientation, which ultimately leads to increases in productivity and income.
In [8], the authors describe how current video streaming algorithms use various estimation approaches to infer the variable bandwidth of cellular networks. This variable bandwidth sometimes leads to a reduced quality of experience. Because no accurate bandwidth estimate is available, achieving reliable video streaming over cellular networks has proven difficult. Most content providers now use adaptive bitrate (ABR) streaming, yet existing algorithms fail to fully utilize the available bandwidth. The authors use a Prediction-Based Adaptation (PBA) algorithm that combines short-term predictions. With PBA they achieve nearly 96% of optimal quality, and the accurate prediction also improves the quality of experience.

In [9], the authors describe how the latest technologies and advancements in education have led to the rise of online web-based educational content and assessment. In traditional education, the prediction of student performance is based on the student's academic report. The research presented there provides an approach, applied to the web-based educational system LON-CAPA, to predict the final grade based on features obtained from the data collected by such systems. The system consists of two large databases: the first contains educational resources, and the second contains information about student users, activity details and so on. Different classifiers are combined to obtain an optimal classifier, and a Genetic Algorithm (GA) is used to improve prediction accuracy. The GA improves the accuracy of the combined classifier by about 10 to 12% compared with the non-GA classifier.

III. METHODOLOGY

In the next version of our project we will use the Apache Hadoop framework and the MapReduce framework, and we will predict rain using the Naïve Bayes algorithm. Naïve Bayes is a classification technique based on Bayes' theorem. It is easy to build and particularly useful for large data sets. Using the Naïve Bayes equation we can find the future probability [12]. The equation is as follows:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \tag{1}$$

where P(c|x) is the posterior (future) probability of the class (c, target) given the predictor x, P(c) is the prior probability of the class, P(x|c) is the likelihood, which is the probability of the predictor given the class, and P(x) is the prior probability of the predictor.

The conditions for predicting rain in our project are as follows:

$$P(\text{Rain} \mid \text{MaxHumidity} = 100) = \frac{P(\text{MaxHumidity} = 100 \mid \text{Rain})\, P(\text{Rain})}{P(\text{MaxHumidity} = 100)} \tag{2}$$

$$P(\text{Rain} \mid \text{Precipitation} = 0) = \frac{P(\text{Precipitation} = 0 \mid \text{Rain})\, P(\text{Rain})}{P(\text{Precipitation} = 0)} \tag{3}$$

$$P(\text{Rain} \mid \text{MeanHumidity} = 65\text{-}100) = \frac{P(\text{MeanHumidity} = 65\text{-}100 \mid \text{Rain})\, P(\text{Rain})}{P(\text{MeanHumidity} = 65\text{-}100)} \tag{4}$$

Here Rain denotes the class, and MaxHumidity, Precipitation and MeanHumidity are the predictors analyzed in Section V.

After the probabilities of all the parameters have been obtained, rain is most likely if these probabilities are greater than or equal to 70%. If they lie between 50% and 69%, there might be rain; otherwise there will be no rain. Using the above probabilities we can thus predict the future chances of rain.

IV. DESIGN AND ANALYSIS

This section presents the architecture diagram of our project for measuring and analyzing weather data and explains its working model. Each block of the diagram is explained in detail with regard to the work it performs. The blocks include the weather data set, HDFS and MapReduce.

Figure 1: Architecture Diagram

A system architecture defines the behavior, structure and views of a system. An architecture description is a formal description and representation of a system; it supports reasoning about the structures and behavior of the system. A system architecture can consist of system components and the sub-systems developed, which work together to implement the overall system. Similarly, we designed the architecture diagram for our system, which has the following blocks:

A. Weather Data

This module contains the weather data that will be used for predicting rain.
It contains various parameters, one per column. The data set of our project is shown below:

B. Hadoop

Hadoop is open-source software used for storing large data sets in a distributed computing environment. It makes it possible to run applications on systems with hundreds of hardware nodes. Hadoop supports a range of related projects that can extend its capabilities [10]. Complementary software projects include Apache Pig, Apache Hive and Apache Spark. Apache Pig is a high-level platform for creating programs that run on Hadoop. The Hadoop Distributed File System provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure [11].

i. HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is similar to the Google File System (GFS). It provides a distributed file system that runs on large clusters in a fault-tolerant manner. HDFS follows a master/slave architecture. The master node includes a single NameNode that handles the metadata [13].

ii. MapReduce

MapReduce is a framework for easily writing applications that process large amounts of data on large clusters in a fault-tolerant manner. The term refers to two different tasks that Hadoop programs perform. Map is the first task: it takes input data and converts it into another set of data in which the values are broken down into key-value pairs. The Reduce task takes the output of a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task [14].
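The map and reduce tasks described above can be sketched in plain Python, without a Hadoop cluster. This is a minimal illustration under stated assumptions, not the project's implementation: the records, the choice of the year as the key, and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are all hypothetical. It computes the minimum, maximum and average rainfall per key, the three statistics named in the abstract.

```python
from collections import defaultdict

# Hypothetical (year, rainfall in mm) readings; not taken from the paper's data set.
RECORDS = [
    ("2015", 12.0),
    ("2015", 0.0),
    ("2016", 30.5),
    ("2015", 6.0),
    ("2016", 9.5),
]


def map_phase(records):
    """Map: turn each input record into a (key, value) pair."""
    for year, rainfall in records:
        yield year, rainfall


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into one (min, max, mean) tuple."""
    return {
        key: (min(values), max(values), sum(values) / len(values))
        for key, values in grouped.items()
    }


def run_job(records):
    """Run map, shuffle and reduce in sequence; reduce always follows map."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    for year, stats in sorted(run_job(RECORDS).items()):
        print(year, stats)
```

In a real Hadoop job the shuffle step is performed by the framework across nodes; it is written out here only to make the data flow between the two tasks visible.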
V. RESULT AND ANALYSIS

In the first version of the project, the analysis and prediction of rain was carried out successfully using Apache Pig. Pig provides an engine that executes data flows in parallel on Hadoop. It includes a language, Pig Latin, which is used to load, store and dump data and to perform various other operations. In our project we used commands such as SPLIT, LOAD and STORE [15].

grunt> SPLIT Result3 INTO maxh1 IF MaxHumidity == 100, maxh2 IF MaxHumidity == 60;

After storing the data set in Pig storage, we used the SPLIT command to find the days on which the maximum humidity is 100. The result for maximum humidity equal to 100 was stored in a folder named Resultnew33. The output of the query is given below:

Figure 2: Analysis of Maximum Humidity Parameter (chart with Series1 to Series8)

grunt> SPLIT maxh1 INTO precpt1 IF Precipitation == 0, precpt2 IF Precipitation == 60;

The output of the previous query is used to find the next result. Here we used the SPLIT command to find the entries with precipitation equal to 0. The result for precipitation equal to 0 is stored in a folder named Resultnew34.

Figure 3: Analysis of Precipitation Parameter (chart with Series1 to Series8)

grunt> SPLIT preci1 INTO meanh1 IF (65 < MeanHumidity AND MeanHumidity < 100), meanh2 IF MeanHumidity == 50;

The output of the previous step is used to find the final result, which is the prediction of rain. The SPLIT command selects the entries whose mean humidity is in the range 65-100. The output is stored in the Resultnew35 folder, which contains the result. Figure 4 shows the result.
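The three SPLIT steps above form a chain of filters, which can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names are taken from the Pig commands, while the sample records, the `day` field and the helper names `split` and `rain_candidates` are hypothetical. Unlike Pig's SPLIT, which can fill several output relations at once, the sketch keeps only the first branch of each command.

```python
# Hypothetical daily records; field names follow the Pig commands, values are invented.
DAYS = [
    {"day": 1, "MaxHumidity": 100, "Precipitation": 0, "MeanHumidity": 80},
    {"day": 2, "MaxHumidity": 100, "Precipitation": 0, "MeanHumidity": 50},
    {"day": 3, "MaxHumidity": 100, "Precipitation": 60, "MeanHumidity": 90},
    {"day": 4, "MaxHumidity": 60, "Precipitation": 0, "MeanHumidity": 70},
    {"day": 5, "MaxHumidity": 100, "Precipitation": 0, "MeanHumidity": 66},
]


def split(rows, predicate):
    """Keep the rows that satisfy one branch condition of a SPLIT command."""
    return [row for row in rows if predicate(row)]


def rain_candidates(rows):
    """Apply the three conditions in the order used in the text."""
    maxh1 = split(rows, lambda r: r["MaxHumidity"] == 100)
    precpt1 = split(maxh1, lambda r: r["Precipitation"] == 0)
    meanh1 = split(precpt1, lambda r: 65 < r["MeanHumidity"] < 100)
    return meanh1


if __name__ == "__main__":
    for row in rain_candidates(DAYS):
        print(row)
```

Each step narrows the output of the previous one, which mirrors how the result of one query is fed into the next.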
Figure 4: Analysis of Mean Humidity Parameter (chart with Series1 to Series8)

After obtaining the results, we plotted graphs of the chances of rain. In this way we can predict the chances of rain.

VI. CONCLUSION

We have successfully determined the chances of rain from the given data set using Apache Pig. This was the first version of our project; in the next version we will use the Naïve Bayes algorithm in the Hadoop framework. Apache Pig has some disadvantages, which will be overcome in the next version of this project. Earthquakes and floods can also be predicted using the Naïve Bayes algorithm; this is the future scope of our project.

REFERENCES

[1] A. Gautam and P. Bedi, "MR-VSM: Map Reduce based vector Space Model for user profiling-an empirical study on News data," 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Kochi, 2015, pp. 355-360.
[2] Anjali Gautam, Tulika, Radhika Dhingra, and Punam Bedi, "Use of NoSQL Database for Handling Semi Structured Data: An Empirical Study of News RSS Feeds," in Emerging Research in Computing, Information, Communication and Applications, 2015, in press.
[3] Viktor Mayer-Schoenberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think.
[4] Byung-Kwan Lee and EunHee Jeong, "A Design of a Patient-customized Healthcare System based on the Hadoop with Text Mining (PHSHT) for an efficient Disease Management and Prediction," International Journal of Software Engineering and Its Applications (IJSEIA), Vol. 8, No. 8 (2014), pp. 131-150, ISSN: 1738-9984.
[5] C. P. Shabariram, K. E. Kannammal, and T. Manojpraphakar, "Rainfall analysis and rainstorm prediction using MapReduce Framework," International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, Jan. 07-09, 2016, ISSN: 978-1-4673-6680-9/16/$31.00.
[6] Prashant Shrivastava, S. Pandiaraj, and J. Jagadeesan, "Big Data Analytics In Forecasting Lakes Levels," International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 3, Issue 3, March 2014, ISSN 2319-4847.
[7] Farhad Soleimanian Gharehchopogh, Tahmineh Haddadi Bonab, and Seyyed Reza Khaze, "A linear regression approach to prediction of stock market trading volume: a case study," International Journal of Managing Value and Supply Chains (IJMVSC), Vol. 4, No. 3, September 2013.
[8] Xuan Kelvin Zou, Jeffrey Erman, Vijay Gopalakrishnan, Emir Halepovic, and Rittwik Jana, "Can Accurate Predictions Improve Video Streaming in Cellular Networks?"
[9] Behrouz Minaei-Bidgoli, Deborah A. Kashy, Gerd Kortemeyer, and William F. Punch, "Predicting student performance: an application of data mining methods with the educational web-based system LON-CAPA," 33rd ASEE/IEEE Frontiers in Education Conference, Boulder, CO, November 5-8, 2003, ISSN: 0-7803-7444-4/03/$17.00.
[10] http://www.slideshare.net/AdamKawa/hadoop-intheoryandpractice/
[11] Tom White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[12] https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
[13] Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, and Rafael Coss, Hadoop for Dummies, 3rd ed., John Wiley & Sons, Inc.
[14] MapReduce: http://en.wikipedia.org/wiki/MapReduce/
[15] Alan Gates, Programming Pig. Sebastopol, CA: O'Reilly Media, Inc., 2011.