Optimizing Data Movement in Hybrid Analytic Systems

Optimizing Data Movement in Hybrid Analytic Systems

by

Patrick Michael Leyshock

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Dissertation Committee:
David Maier, Chair
Kristin Tufte
Mark P. Jones
Christopher Monsere

Portland State University
2014

UMI Number: 3670195. All rights reserved.

INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMI 3670195. Published by ProQuest LLC (2014). Copyright in the dissertation held by the author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

© 2014 Patrick Michael Leyshock

ABSTRACT

Hybrid systems for analyzing big data integrate an analytic tool and a dedicated data-management platform, storing data and operating on the data at both components. While hybrid systems have benefits over alternative architectures, in order to be effective, data movement between the two hybrid components must be minimized. Extant hybrid systems either fail to address performance problems stemming from inter-component data movement, or else require the user to reason about and manage data movement. My work presents the design, implementation, and evaluation of a hybrid analytic system for array-structured data that automatically minimizes data movement between the hybrid components. The proposed research first motivates the need for automatic data-movement minimization in hybrid systems, demonstrating that, under workloads whose inputs vary in size, shape, and location, automation is the only practical way to reduce data movement. I then present a prototype hybrid system that automatically minimizes data movement. The exposition includes salient contributions to the research area, including a partial semantic mapping between hybrid components, the adaptation of rewrite-based query transformation techniques to minimize data movement in array-modeled hybrid systems, and empirical evaluation of the approach’s utility. Experimental results not only illustrate the hybrid system’s overall effectiveness in minimizing data movement, but also illuminate contributions made by various elements of the design.

DEDICATION

Dedicated to Kate, for always reminding me what is important.

ACKNOWLEDGEMENTS

Thanks to my advisor, David Maier, as well as Kristin Tufte, my committee, and the Portland State University Datalab. I could not have done this work without the support of my parents and friends – thank you all. Brent Dombrowski and Jason Nelson were instrumental in developing Bonneville and Agrios; they have my appreciation. Thanks too to Paradigm4 and the SciDB team, for the opportunity to contribute to their work. My research was funded by the National Science Foundation (Grant #1110917) and Intel’s Science and Technology Center for Big Data.

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
List of Tables
List of Figures
CHAPTER 1: INTRODUCTION
CHAPTER 2: ASSUMPTIONS, DEFINITIONS, AND BACKGROUND
  2.1. Assumptions
    2.1.1. List of Foundational Observations and Assumptions
    2.1.2. Discussion
  2.2. Definitions
    2.2.1. Agrios-Specific Terms and Concepts
    2.2.2. Terms and Concepts in Relational Query Processing
  2.3. Background
    2.3.1. R, Array Databases, and SciDB
    2.3.2. Hybrid Analytic Systems
    2.3.3. Query Optimization
  2.4. Conclusion
CHAPTER 3: AGRIOS’ CONCEPTUAL MODEL
  3.1. Agrios as Optimizer
  3.2. Staging
    3.2.1. Cost Model
    3.2.2. Transformation Types
    3.2.3. Search Engine
  3.3. Discussion
    3.3.1. Search-Space Representation
    3.3.2. Particular Refinements
    3.3.3. Why Bonneville?
  3.4. Conclusion
CHAPTER 4: AGRIOS IMPLEMENTATION
  4.1. Components of Agrios
    4.1.1. R
    4.1.2. SciDB
    4.1.3. Motivation
  4.2. Agrios as Integration
    4.2.1. Scope
    4.2.2. Architecture
  4.3. Configuration Options
  4.4. Conclusion
CHAPTER 5: THE CASE FOR STAGING
  5.1. General Overview
  5.2. Staging Costs
  5.3. Examination of Plan Costs
    5.3.1. Overview
    5.3.2. Methodology
    5.3.3. Results
    5.3.4. Additional Considerations
  5.4. Conclusion
CHAPTER 6: EXPERIMENTAL EVALUATION
  6.1. Overview
  6.2. Methodology
  6.3. Results
    6.3.1. Staging
    6.3.2. Transformation and Query Rewriting
    6.3.3. Query Accumulation
  6.4. Rule Types and Staging
  6.5. Conclusion
CHAPTER 7: CONCLUSION AND FUTURE WORK
  7.1. Conclusion
  7.2. Future Work
    7.2.1. Extensions to Agrios
    7.2.2. Applications to Other Settings
References

LIST OF TABLES

1.1 Inputs into Jane’s analysis
3.1 All rules currently implemented in Agrios
4.1 R operators currently implemented in Agrios
4.2 Data type equivalencies, R and SciDB
5.1 Query processing time for three different plans, Query 2
5.2 Catalogs used by Agrios’ test queries
5.3 Summary statistics for recycling plans, for Queries 1 and 3
6.1 Average percentage reduction in data elements moved: all placements (improved placements)
6.2 Results comparing Agrios to Greedy, for Queries 1, 2, and 3, “standard” catalogs

LIST OF FIGURES

1.1 The amount of data moved depends on the computation location
1.2 Satellites capturing images of the Earth’s surface
2.1 A sample query, represented in tree form
2.2 Several arrays with different properties
2.3 The amount of data moved during processing depends on where the computation is performed
3.1 Agrios is middleware integrating R and SciDB
3.2 The architecture and workflow of Agrios
3.3 An example of a reductive transformation
3.4 An example of a consolidating transformation
3.5 An example of how accumulation can reduce data movement without query rewriting
3.6 The movement-minimizing plans for the two queries in our example, prior to accumulation
3.7 An example of how accumulation and query rewriting can reduce data movement
3.8 An enforcer rule at work, in a relational database system
3.9 An enforcer rule at work, in Agrios
3.10 Plan space is infinite in size, and contains all possible equivalent plans and queries
3.11 The initial search space
3.12 Search space expansion
3.13 Plans are created from queries through the application of implementation rules
3.14 Plans are assigned costs, based on the staging, the cost model, and facts about the input data objects stored in the catalog
3.15 The plan with lowest estimated cost is selected for execution
3.16 Depth-first exploration of search space
3.17 The MEMO structure used by Bonneville to represent search space
4.1 An array computation in SciDB
4.2 Three instances of a matrix multiplication
4.3 A vector containing genomics data
4.4 The architecture and workflow of Agrios, reproduced from Chapter 3
4.5 An Agrios Abstract Expression Tree, represented as an S3 object in R
4.6 Examples of transformation rules
4.7 A simple plan
5.1 One placement for Query 2
5.2 A suboptimal plan for this placement of Query 2
5.3 The movement-minimizing plan for this placement of Query 2
5.4 The movement-minimizing plan for this placement of Query 3
5.5 A suboptimal plan for this placement of Query 3
5.6 Queries 1, 2, and 3
5.7 A section of staging space for Query 2
5.8 Histogram of placements for Query 2
5.9 Histogram of stagings for Query 3
5.10 Two instances of Query 2
5.11 A comparison of the movement-minimizing stagings for the two instances of Query 2 depicted in Figure 5.10
5.12 Another perspective on the plan space for Query 1
5.13 Normalized costs for all stagings of Query 2
5.14 Two instances of Query 2
5.15 A comparison of the movement-minimizing stagings for the two instances of Query 2 depicted in Figure 5.14
5.16 An example of unidimensional time-series growth for Query 3
5.17 Unidimensional time-series growth of a dataset
5.18 Cost comparison for recycled plans, Query 1, time_series_2
5.19 Cost comparison for recycled plans, Query 1, time_series_3
5.20 Cost comparison for recycled plans, Query 3, time_series_2
5.21 Cost comparison for recycled plans, Query 3, time_series_3
6.1 Data movement of cost-staged queries compared to naively-staged “do-it-all-at-one-place” queries, Query 1
6.2 Staging and query rewriting move fewer data elements than staging alone
6.3 Additional queries used for testing
6.4 Additional test result showing how staging and query rewriting move fewer data elements than staging alone
6.5 Query 1, subdivided into subqueries along dotted “cut planes”
6.6 Data movement of accumulated queries compared to unaccumulated queries, Query 1, “standard” catalog
6.7 Additional queries used for testing, reproduced from Figure 6.3 with cut planes added
6.8 Additional test results showing how accumulation, staging, and query rewriting move fewer data elements than staging and query rewriting alone
6.9 Comparison of plan costs by rule type, Query 1, “standard” catalog
6.10 Comparison of plan costs by rule type, Query 3, “standard” catalog
6.11 Plan costs as a function of optimization time, by rule type
6.12 Plan costs as a function of optimization time, by rule type

CHAPTER 1: INTRODUCTION

Many of today’s datasets are so large they cannot be adequately analyzed with conventional desktop tools.
The size and number of these massive datasets are growing at an increasing rate, in fields as diverse as astronomy, genetics, and engineering. Data scientists are responsible for transforming this data, through analysis, into actionable information. Behind the hype around “Big Data”, there are important research questions to be answered through the analysis of large, disk-resident datasets.

This abundance of data, and the desire to analyze it, motivates multiple lines of research. Database researchers develop methods for storing, organizing, and retrieving the data. Their main tool for the job is a database management system. Other researchers from a variety of fields – including statistics and computer science – focus on analyzing the data. Their goal is to extract meaningful information from the data, and their tools of choice are dedicated software systems implementing sophisticated statistical and machine-learning methods.

The limitations of these two tool types are exposed when the size of the datasets grows large. Database management systems excel at managing large collections of data, but they are poor tools for performing all but the most rudimentary analytics. Analytic systems are superb at performing complex analyses on small collections of data, but operate unacceptably when the size of the data exceeds the size of main memory.

Data scientists have developed workarounds for analyzing massive datasets using their traditional tools. Some resort to sampling the data; others process the data “one bite at a time,” dividing it into main-memory-sized chunks for iterative processing. Though often effective, these ad-hoc solutions are often both slower and more brittle than systems explicitly designed for analyzing big data. When workarounds fail, data scientists must switch to a dedicated big-data analytic system. There are three strategies.
In place of his or her existing analysis tool, a data scientist may:

- Replace the analytic tool with an analytic system explicitly designed to efficiently manage and analyze big data. Such systems range from traditional relational databases to newer data-processing platforms based on a MapReduce processing paradigm.
- Install an augmented version of the analytic tool, which has been extended to improve performance on big data through mechanisms such as parallelization or out-of-core libraries.
- Adopt a hybrid analytic system, which integrates the analytic tool with a big-data tool, capturing the best of both worlds: the sophisticated analytic abilities and functions common to analytic tools, plus the data-handling capabilities of big-data systems.

Though the first two strategies may bear fruit, it is difficult to believe that a database system such as Postgres can ever be extended to the point where its sophistication rivals an analytic system such as MATLAB, just as it is hard to expect that MATLAB might be augmented to provide the robust data-management features provided by Postgres. The limitations of MapReduce systems – as both an analytic system and a data-management system – are documented [1]. As an analytic system, MapReduce systems are effective primarily on “embarrassingly parallel” problems, an incomplete subset of common analytic tasks. As a data-management system, the MapReduce stack lacks features associated with data-management best practices, such as schemas and indexes.

The hybrid approach is a solid candidate for the best approach: it presents to data scientists a familiar interface with known functionality, while ably handling large disk-resident datasets. The fact that hybrid systems consist of two components, however, raises a problem not faced by the other two approaches. Two fundamental properties of hybrid systems are that: i) data can be stored at both components, and ii) analytic operations can be performed at both components.
These properties mean that execution locations of query operations must be specified; a particular specification determines what data moves where. We maintain that this decision-making process should not only be managed, but automatically managed in such a way that data movement is reduced or minimized. This claim motivates the research in this thesis.

Our research includes a solution satisfying the demands of this claim. The solution is named Agrios; it is a hybrid analytic system integrating R and SciDB. R is a powerful data-analysis software package, and SciDB is a database management system designed for managing disk-resident array-structured datasets. Agrios integrates these two components and, through the application of techniques pioneered in relational database optimization, automatically minimizes data movement between the hybrid components. My particular contributions include:

- Motivating the need for automated minimization of data movement in hybrid systems, through experimental evaluation. The need for automatic minimization versus alternative approaches may not be obvious, so we motivate our solution by exploring the problem space empirically.
- Theoretical contribution of a partial semantic mapping between the R language and SciDB’s Array Functional Language (AFL). This mapping enables the coupling of the two hybrid components.
- Design of a cost-based optimization technique for automatically minimizing data movement between R and SciDB. Our work builds off of proven techniques from database query optimization. These techniques were originally intended for use in databases using a relational data model; we extend, refine, and apply them to a hybrid system that uses an array data model.
- Prototype implementation of a hybrid system – named Agrios – constructed using R and SciDB. Agrios is the research platform upon which our experiments are conducted. The platform implements our cost model and optimization-technique designs.
- Validation of this hybrid approach, through experimental evaluation. We evaluate our hybrid approach, demonstrating the effectiveness of our optimization techniques. Our experimental work also examines some of the subtler aspects of the optimization process, including the relationship between data-movement minimization and optimization time, and the effectiveness of different optimization techniques.

Together these contributions advance the state of the art in analytic systems for large, disk-resident datasets.

We focus on minimizing data movement for two reasons. First, there is a dearth of research on the topic in the context of hybrid analytic systems. Our work is intended to fill this lacuna. Problems involving data movement, and techniques for resolving them, are common to many areas of computer science. The high-performance computing (HPC) community has developed numerous techniques for reducing data movement between computing nodes [2-4]. Researchers in distributed databases have explored optimization techniques and identified new algorithms for reducing data movement between components of distributed relational databases. These techniques for data-movement minimization typically assume the homogeneity of computing resources and data models, however, so are not immediately applicable to hybrid systems.

Second, data movement is becoming an increasingly significant cost in distributed and hybrid systems. Data movement between processing nodes or hybrid components comes with costs: it takes time and energy. Recent work in high-performance computing shows that time spent moving data between computing nodes often dominates the time spent computing with it [5-6]. Similarly, researchers in energy-efficient computing expect that inter-machine data-movement costs will soon rival computation costs for some scientific analyses [7-8].
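To make the relative magnitudes concrete, the claim that movement can dominate computation can be illustrated with a toy model. The sketch below is illustrative only and is not the dissertation’s cost model; the function names and all throughput figures are assumptions chosen for round numbers.

```python
# Toy comparison of data-movement time versus computation time for a large
# array shipped between hybrid components. All constants are assumed,
# round-number figures, not measurements from the dissertation.

def transfer_seconds(n_bytes, bandwidth_bytes_per_s=1.25e9):
    """Time on the wire over a 10-Gigabit link (~1.25 GB/s), ignoring overhead."""
    return n_bytes / bandwidth_bytes_per_s

def compute_seconds(n_elements, ops_per_element=1, throughput_ops_per_s=1e10):
    """Time for a cheap per-element operation at an assumed 10 Gop/s."""
    return n_elements * ops_per_element / throughput_ops_per_s

n = 100_000_000                  # a 100-million-element array of 8-byte doubles
move = transfer_seconds(n * 8)   # 0.64 s on the wire
work = compute_seconds(n)        # 0.01 s of computation
print(f"move: {move:.2f}s, compute: {work:.2f}s, ratio: {move / work:.0f}x")
```

Even under generous network assumptions, shipping the operand costs tens of times more than a cheap elementwise operation on it, which is why choosing where an operation executes matters so much.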
The growing importance of data movement relative to data processing is in part a result of the growing speed differences between computing hardware and communication hardware. DRAM access in a new server, for example, is already an order of magnitude faster than data access over a 10-Gigabit network connection [5]. Problems involving data movement are exacerbated by the rapid growth of data available for analysis. There is growing interest in converting workaday objects into data-collecting sensors: from phones and laptops to thermostats, toasters, and hot-water heaters. In some areas of science and engineering, the growth rates are remarkable: due to new sequencing techniques, genomics data is doubling every five months.

Though there are other factors that affect the cost of analysis – computation times at each component, obviously – reducing the cost of data movement is worthy of a dedicated, programmatic effort. There are multiple components of data-movement costs, including “time on the wire,” the overhead of setting up and maintaining communication connections, competition with other systems for network bandwidth and database access, and the formatting and restructuring required to map one system’s storage model to another. These costs are an agglomeration of computation and communication costs. Consider the work required to transfer a data object from R to SciDB:

- The R object is serialized (computation cost).
- The object is written from R’s process space to network buffers (shared computation and communication cost).
- A network connection between R and SciDB is established, if not already in place (communication cost).
- Data is transferred from R to SciDB (communication cost).
- The object is copied from the network buffer to SciDB process space (shared computation and communication cost).
- The object is deserialized at the SciDB master node (computation cost).
- The object is sharded and distributed among SciDB worker nodes, as applicable (computation and communication cost).

Similar steps apply to moving data from SciDB to R. (The process only gets more costly – and complicated – at greater levels of detail. If the data is compressed before transmission, for example, a compression and decompression step must be added to this workflow.) Given the potential aggregate costs of these tasks, we should find a way to reduce them. There are several strategies for lowering costs, including:

1. Reduce the cost of switching sto
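The per-step costs enumerated above suggest a simple additive model of transfer cost. The following sketch is a hypothetical illustration, not Agrios’ actual cost model; every function name and constant is an assumption.

```python
# Hedged sketch: summing the R-to-SciDB transfer steps into one estimate.
# Each throughput-bound step is modeled as bytes / throughput; connection
# setup is a flat charge. All constants are illustrative assumptions.

def transfer_cost_seconds(n_bytes,
                          serialize_bps=2.0e9,    # R-side serialization
                          wire_bps=1.25e9,        # 10-GbE "time on the wire"
                          deserialize_bps=2.0e9,  # SciDB-side deserialization
                          shard_bps=4.0e9,        # distribution to workers
                          connect_s=0.01):        # connection setup, if needed
    per_byte_steps = (serialize_bps, wire_bps, deserialize_bps, shard_bps)
    return connect_s + sum(n_bytes / bps for bps in per_byte_steps)
```

In such an additive model the cost scales linearly with payload size, so a plan that halves the data shipped (for example, by computing a reduction before moving its result) roughly halves the transfer cost.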