Analysis of a Top-Down Bottom-Up Data Analysis Framework and Software Architecture Design

Anton Wirsch
Working Paper CISL#
May 2014

Acknowledgement: Research reported in this publication was supported, in part, by the Charles Stark Draper Laboratory's University Research and Development program. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the Charles Stark Draper Laboratory.

Composite Information Systems Laboratory (CISL)
Sloan School of Management, Room E
Massachusetts Institute of Technology
Cambridge, MA 02142

Analysis of a Top-Down Bottom-up Data Analysis Framework and Software Architecture Design
by
Anton Wirsch
B.S. Electronics Engineering Technology (1998), Brigham Young University
M.S. Computer Engineering (2004), California State University, Long Beach

Submitted to the System Design and Management Program in Partial Fulfillment of the Requirements for the Degree of Master of Science in Engineering and Management at the Massachusetts Institute of Technology

May
Anton Wirsch, All rights reserved

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.
Signature of Author: Anton Wirsch, System Design and Management Program, May, 2014
Certified by: Stuart Madnick, John Norris Maguire (1960) Professor of Information Technology, MIT Sloan School of Management & Professor of Engineering Systems, MIT School of Engineering
Approved by: Patrick Hale, Director, System Design and Management Program

An Analysis of a Top-Down Bottom-up Framework and Proof of Concept Software Architecture
by Anton Wirsch

Submitted to the System Design and Management Program in Partial Fulfillment of the Requirements for the Degree of Master of Science in Engineering and Management

Abstract
Data analytics is currently a popular topic in academia and in industry. It is one form of bottom-up analysis, in which insights are gained by analyzing data. System dynamics is the opposite: a top-down methodology in which insight is gained by analyzing the big picture. Merging the two methodologies may provide greater insight. What greater insight can be gained is a question for future research. The focus of this paper is on the software connections for such a framework and how it can be automated. An analysis of the individual parts of the combined framework will be conducted, along with a review of current software tools that may be used. Lastly, a proposed software architecture design will be described.

Table of Contents
Abstract
Introduction
Motivation
Framework
Software Architecture and Tools
System Dynamics and Data Mining
Purpose
Summary of Chapters

Top-Down Bottom-up Overview
Bottom-up Overview
Data Mining, Machine Learning
Data Mining Flow
Top-down Overview
System Dynamics
System Dynamics Model Creation Method
Model Components
Information on Creating Models
Time

Top-Down Bottom-Up Framework Analysis
Overview
England Riots
Forecasting
Framework Uses
Bottom-up Data Sources
Multiple Models
Description of Monitoring Framework
3.8 Framework Data Flow Analysis
System Dynamics Model Creation
Variable Relationships
Top-down Bottom-up Interface
System Dynamics Output Variables
Automated Support for Validation and Tracking Model Forecasts vs. Actual Outcomes (Box 2)
Feedback
Automated Support for Comparing, Tracking & Balancing Effectiveness of Multiple Models (Box 3)
Automated Support for Model Parameter Calibration, Recalibration and Validation for Multiple Locales & Situations (Box 1)
Crowd Sourcing for Expert Opinion (Bottom-Up Output)
Automated Support for Sensitivity Analysis to Infer Behavior Modes and Data Values to be Monitored (Box 4)
Controller
Framework Analysis Modifications
Modifications
Riot Example

Top-Down and Bottom-Up Software Tools
System Dynamic Tools
Commercial System Dynamics Tools
Commercial System Dynamics Summary
Open Source System Dynamics Tools
Open Source Tools Summary
XMILE System Dynamics Standard
Other Modeling Tools
Data Mining Tools
Commercial Data Mining Tool
Commercial Data Mining Tools Summary and Score
4.2.3 Open Source Data Mining Tools
Open Source Data Mining Tools Summary and Score
PMML Data Mining Standard
Data Mining Software Tool Ranking
Candidate Tools

Top-Down Bottom-Up Software Architecture
Previous Software Implementations
Software Implementation
TD/BU Connection
Python Tools
Conceptual View
Software Architecture
Alternative Architectures
Python Implementation
Java Implementation
Stella, ithink, and Powersim

Conclusion
Reference
1 Introduction

1.1 Motivation
In recent years the amount of data generated by people and machines has greatly increased. Buzzwords such as Big Data, Internet of Things, and Machine-to-Machine Communication are commonly heard in mainstream media and indicate how prevalent the topic is. The potential benefit of vast amounts of data is that greater knowledge may be gained by analyzing it. This type of analysis is a bottom-up approach, and many organizations are implementing it. A top-down approach starts from general principles and works down to develop models of a process. This thesis investigates an architecture that combines the bottom-up approach with a top-down approach and reviews software tools that can realize the combined architecture.

1.2 Framework
A proposed framework of the combined methodologies has been provided, which will be discussed in detail in chapter 3. The framework consists of a top-down module and a bottom-up module, along with connections between the two and other blocks. The framework will be analyzed to determine which portions are applicable and which are not, as well as which portions are capable of automation. The resulting framework will then be used to design a software architecture that can be used to construct the framework.

1.3 Software Architecture and Tools
After the analysis of the top-down bottom-up framework, the resulting framework will be used to design a software architecture. Existing data mining and system dynamics tools will be leveraged to propose a software implementation of that architecture. The feature sets and automation capabilities of data mining and system dynamics tools will be analyzed to determine which of the tools are applicable to the software implementation.

1.4 System Dynamics and Data Mining
System dynamics and data mining are implementations of top-down and bottom-up approaches respectively. Both are heavily used in business.
One example of data mining in business is determining which subset of potential customers to advertise to. A company can analyze its database of customers to determine which types of people are the most common. Knowing this, the company can target those types of people for advertisement instead of covering all types. System dynamics is often used to model the policies of a corporation. A simple example is modifying the inventory policy of a corporation. Various inventory policies can be simulated to see how a change will affect inventory and the overall supply chain over a set period of time. A system dynamics model can be packaged as a flight simulator to allow managers to experiment with adjusting parameters and policies and to see how the system behaves.

The operational methods of the two systems differ. Data mining is used in a live setting where new data is processed on a continuous basis. It is also usually highly automated, with little to no human interaction required to operate the data mining system. The main use case for system dynamics, on the other hand, is an interactive simulation test environment. A user can set various parameters of the model and then execute a simulation to produce a time-series output. The combined framework and resulting software architecture will be a combination of the two. The framework will operate as an automated system, conduct simulations, and produce a time-series output at a predetermined time interval.

1.5 Purpose
While data mining and system dynamics are used in business, the combined framework as described here will not be used for business purposes. Instead, the use case will be to monitor and forecast various events that occur throughout the world. Another use case is to analyze historical events to help understand the important factors of the event. Riots are an example of such an event. They occur frequently throughout the world and can cause significant damage to a city, as in the 2011 England riots.
The goal in this example is to see if the combined framework can forecast a riot. This would allow authorities to allocate resources and take action to help prevent the riot or prepare for it. Years of research will be required to test this theory. This thesis will provide the framework and software architecture to enable the start of the research.

1.6 Summary of Chapters
Chapter 2 will provide an overview of data mining and system dynamics. It will cover their strengths and limitations as well as the process to implement both methods. Chapter 3 will analyze a framework that incorporates top-down and bottom-up methods. It will go over the various parts of the framework and any modifications that were made. Chapter 4 will explore the software tools that are available for data mining and system dynamics. A number of commercial and open source tools will be analyzed for their feature sets for use in the software architecture. Chapter 5 will discuss the software architecture and the data mining and system dynamics tools that can be used to construct it. Chapter 6 will provide a summary of the thesis.

2 Top-Down Bottom-up Overview

2.1 Bottom-up Overview
Bottom-up (BU) analysis consists of analyzing various forms of data, such as numbers, text, images, video, voice, etc., in order to find relationships and patterns and to gain knowledge from the data. Bottom-up analysis has experienced tremendous growth over the years. This type of analysis has spread to a number of sectors including finance, business, law enforcement, and defense, to name a few. Data mining, machine learning, and big data are common representatives of bottom-up analysis. This thesis will focus on the use of data mining when referring to bottom-up analysis.

Data Mining, Machine Learning
"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules."
[1]

The above quote provides a simple explanation of data mining. It is a system in which data is gathered, stored, and then analyzed by automated means. One business example of data mining is determining whether a person will apply for a credit card if provided an advertisement for a card. To understand what types of people are likely to apply for a credit card, credit card companies store attributes of each person who has joined. The attributes can include age, gender, occupation, income, marital status, home address, etc. A number of analytical methods can be used to determine which combinations of attributes make a person likely to apply for the card. This analysis portion is where machine learning is implemented. Past data is used to help create an algorithm that learns which combinations provide the highest probability that a person will apply for a credit card. The learned algorithm is called a fit algorithm. Once the fit algorithm has been developed, it is used within a data mining system for live operation. Returning to the example, instead of mailing credit card applications to millions of random people, a data mining system can parse through a database of non-credit-card holders and test the attributes of each person against the fit algorithm. Those who are determined to be likely to apply can be the recipients of a credit card application.

Machine learning is a method for learning from data in an automated fashion. The task of determining whether an email is spam or not will be used as an example to explain machine learning further. A program or model, which contains a number of parameters, can learn whether an email is spam through repeated exposure to spam and non-spam emails. Each exposure adjusts the parameters to improve performance. This is an automated process in which the model itself adjusts its parameters to improve performance. Once the performance is at an acceptable level, the model is considered fit.
This thesis will refer to this fit model as the data mining model.

The main categories of machine learning algorithms are classification, clustering, regression (or prediction), and association rules. For classification, the goal is to assign something to one of a predetermined set of categories. For example, if a person is provided an advertisement for a credit card, will the person be likely to sign up or not? Only two possibilities exist in this case. Clustering groups data into similar categories where the number of categories has not been predetermined. Data points, often referred to as records, that are similar are grouped together. A business example of clustering is comparing the income and debt of a set of people. Points that are close to each other will be clustered. The figure below shows a plot of the points and the resulting three clusters.

Figure 1 Cluster Example [2]

Regression predicts a value, such as the price of a house, from attributes such as the age of the house, number of rooms, neighborhood, etc. By analyzing the prices and attributes of houses that have sold over the years, a model can be created. The model can then be used to predict what price a house would sell for based on its attributes. Lastly, association rules determine which objects are usually associated with each other. Supermarkets are interested in this type of data. For example, they are interested in knowing what other items are usually purchased with hotdogs.

These four categories fall into two general groups: supervised learning and unsupervised learning. Supervised learning, which includes classification and regression algorithms, provides feedback. Whether the classification of a record is correct or incorrect, the feedback can be used for learning. Unsupervised learning, which includes clustering and association rules, does not provide any feedback. Therefore learning cannot be gained by processing historical data.
For example, clustering will group similar data points, but since there is no predetermined set of classes for the points to fall into, there is no feedback to learn whether the clusters are correct or not.

Each of the four categories can be implemented through a number of algorithms. For example, classification is possible through decision trees, neural nets, Bayesian classifiers, and Support Vector Machines, to list just a few. When a classification machine learning model is being constructed, a number of algorithms will be tested to see which performs the best. The same process is conducted with the other categories as well.

Data Mining Flow
The flow for data mining is described below.

Step 1: Develop an understanding of the purpose of the data mining project. Is it a one-time effort, or will the process be executed repeatedly?

Step 2: Obtain the dataset to be used in the analysis. If the amount of data is extremely large, it may be sufficient to randomly sample a portion of the data. A thousand records is usually enough for creating a model [3]. Data may need to be queried from multiple databases, both internal and external.

Step 3: Explore, clean, and preprocess the data. Data may be plentiful, but often it is not clean. Missing records in a dataset are common. A decision must be made on how to handle missing data. It can be ignored or replaced by the average of the surrounding records. Incorrect data is also common. In this case, obviously incorrect data can be checked for. For example, if the value of a record is expected to lie between two bounds and the record is outside that range, then the record holds incorrect data and can be ignored.

Step 4: Reduce and separate the variables. Not all variables may be needed. At this step, variables that are not required are removed from the analysis. The more variables are included, the more CPU time will be required for processing. Therefore it is ideal to keep the number of variables as low as possible.
For example, assume that a house has 20 attributes. If all 20 attributes are used to create a fit model, then the fit model will have to use all 20 attributes for each record it processes. If an equally or only slightly less accurate fit model can be created with just 5 attributes, then the CPU load will be considerably lower. Also, some variables will need to be modified or transformed. For example, if a variable is the age of a person, the resolution may be too fine. It may be easier to analyze if a number of age ranges were used instead. Lastly, when supervised training will be used, the data should be split into three groups: training, validation, and test. The training set is used to train a model. Once the model is trained, the validation and test sets are used to see how it performs with different sets of data.

Step 5: Choose the data mining task (for example, regression or clustering).

Step 6: Use algorithms to perform the task. This will usually take many attempts. Various combinations of variables as well as multiple variants of the same algorithm will be tested. Promising algorithms can be tested with the validation dataset to see how they perform against a fresh set of data.

Step 7: Interpret the results of the algorithms. One algorithm from the many tested in step 6 needs to be chosen. The chosen algorithm should also be tested against the test dataset to see how it performs with yet another set of new data. At this point the algorithm has been fitted for the task at hand.

Step 8: The fit algorithm is integrated into the system for use with real data. The system will execute the fit algorithm against new records to make a determination, such as which class or cluster a record belongs to, what the resulting numerical value is, or what the record is associated with. Appropriate action for each possible outcome must then be taken.
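The cleaning, splitting, fitting, and selection steps above can be illustrated with a minimal sketch. It is not taken from the thesis. It assumes synthetic credit card records, hypothetical attribute names (income, age, applied), a 60/20/20 split, and a very simple one-attribute threshold classifier standing in for the candidate algorithms of step 6.

```python
import random

def make_records(n, rng):
    """Synthetic, hypothetical customer records; a few are deliberately dirty."""
    records = []
    for _ in range(n):
        income = rng.uniform(10, 100)        # thousands per year (assumed unit)
        age = rng.randint(18, 80)
        applied = 1 if income >= 50 else 0   # label the model should learn
        if rng.random() < 0.05:
            income = None                    # simulate a missing value
        if rng.random() < 0.05:
            age = -1                         # simulate obviously incorrect data
        records.append({"income": income, "age": age, "applied": applied})
    return records

def clean(records):
    """Step 3: ignore records with missing or out-of-range values."""
    return [r for r in records
            if r["income"] is not None and 18 <= r["age"] <= 100]

def split(records, rng):
    """Step 4: shuffle, then split 60/20/20 into training, validation, test."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    a = int(0.6 * len(shuffled))
    b = int(0.8 * len(shuffled))
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def accuracy(feature, threshold, records):
    """Fraction of records where 'feature >= threshold' matches the label."""
    hits = sum((r[feature] >= threshold) == bool(r["applied"]) for r in records)
    return hits / len(records)

def fit_threshold(feature, training):
    """Step 6: choose the threshold with the best accuracy on the training set."""
    candidates = sorted({r[feature] for r in training})
    return max(candidates, key=lambda t: accuracy(feature, t, training))

rng = random.Random(0)
training, validation, test = split(clean(make_records(300, rng)), rng)

# Step 6: fit one candidate model per attribute using the training set only.
models = {f: fit_threshold(f, training) for f in ("income", "age")}

# Step 7: pick the candidate that does best on the validation set,
# then check it once more against the untouched test set.
best = max(models, key=lambda f: accuracy(f, models[f], validation))
test_accuracy = accuracy(best, models[best], test)
print(best, round(models[best], 1), round(test_accuracy, 3))
```

Because the synthetic label depends only on income, the income-based model should win the validation comparison, which also illustrates the point of step 4 that a model using fewer variables can be sufficient.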
2.2 Top-down Overview
While bottom-up analysis examines data to uncover patterns, a top-down method approaches a problem from the high-level view of a system. For example, in a bottom-up business example a company will look at sales and customer data to extract any patterns. A top-down approach could model a corporation and its strategy. Who are the target customers? What is the supply chain? What is the marketing? How are the departments divided? What are the corporate sales policies? How is the sales team incentivized? There are external factors that also need to be considered in this example, such as competition from other companies and the overall economy. If the divisions of the company are not all aligned with the corporate strategy, then it is easy to understand why the target sales and customer reach will not be optimal. By starting from the top, then deconstructing the parts and understanding the interactions between them, one can gain an understanding of what types of customers can be reached and attracted. This thesis will focus on system dynamics (SD) as the implementation of top-down analysis.

System Dynamics
Jay Forrester, who was a professor in the school of management at MIT, developed system dynamics in the 1950s [4]. System dynamics models complex systems and observes the behavior of the model over a period of time. The observations are conducted by executing simulations of the model and viewing the output through graphs and charts. Complex systems, where the aggregate relationships among multiple nodes are difficult to