DATA WAREHOUSE AND BIG DATA INTEGRATION

Sonia Ordoñez Salinas and Alba Consuelo Nieto Lemus
Faculty of Engineering, Distrital F.J.C. University, Bogotá, Colombia

International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 2, April 2017. DOI: 10.5121/ijcsit.2017.92011

ABSTRACT

Big Data triggered a surge of research and speculation on concepts and processes that previously pertained to the Data Warehouse field. Some conclude that the Data Warehouse as such will disappear; others present Big Data as the natural evolution of the Data Warehouse (perhaps without identifying a clear division between the two); and finally, some others pose a future of convergence, partially exploring the possible integration of both. In this paper, we review the underlying technological features of Big Data and Data Warehouse, highlighting their differences and areas of convergence. Even when some differences exist, both technologies could (and should) be integrated, because they both aim at the same purpose: data exploration and decision-making support. We explore some convergence strategies based on the common elements in both technologies. We present a review of the state of the art in integration proposals from the point of view of purpose, methodology, architecture and underlying technology, highlighting the common elements that support both technologies and that may serve as a starting point for full integration, and we put forward a proposal for the integration of the two technologies.

KEYWORDS

Big Data, Data Warehouse, Integration, Hadoop, NoSQL, MapReduce, 7V's, 3C's, M&G

1. INTRODUCTION

Information is one of the most valuable resources of an institution, and its adequate use to support decision making has become a challenge of ever-increasing complexity. Enterprises invest in solutions that allow them to use big data in the best possible way, to generate new business strategies, improve customer service or develop public policies, among many other uses. Nowadays the data volume required to be processed within an enterprise can reach the order of exabytes [1]. This poses storage and processing challenges that require new technological solutions allowing not only storage, but also updating and efficient exploitation, and taking data requirements into account. These are sometimes referred to as the seven V's [1]: Volume, Variety, Velocity, Veracity, Value, Variability and Viability, and the three C's [1]: Cost, Complexity and Consistency. Given the limitations of the traditional techniques used so far and the new data requirements, enterprises face several challenges in managing large volumes of data. The concepts of Data Warehouse and Big Data tend to blend, and it is not easy to find a divide between them. While Data Warehouse is a mature management paradigm supported by widespread and well-established methodologies [2] [3] [4], Big Data is still a field under development, which seeks to address individual aspects of the problem but still lacks an integral solution.
As a result of the state-of-the-art review, we can conclude that some articles present Big Data as the replacement of the Data Warehouse, others as the evolution of the Data Warehouse [5], some propose extending the Data Warehouse to support some Big Data characteristics, and others partially explore the possibility of integrating the two.

This work presents a critical review of the elements that characterize the two technologies and that could allow their convergence in an architectural model that considers the processes of ingestion, pre-processing, validation, storage and analysis of the different data types and data sources that organizations currently face. The result of the analysis leads to the conclusion that integration is possible only if the different types of data, their life cycle and their treatment are explicitly considered. This integration is materialized in the proposal of a multi-layered architecture model that provides a systematic solution, recurrent in time, rather than an isolated one.

This article is divided into the following sections: section 2 reviews the purpose and scope of the two technologies; section 3 reviews the methodologies used for the development of Data Warehouse (DW) and Big Data (BD); section 4 reviews architectural models for DW and BD from the point of view of the sources, ETL processes, storage, processing and associated technologies; section 5 discusses the characteristics that describe each; section 6 refers to the Multilayer Staggered Architecture Model for Big Data; and finally, section 7 presents the conclusions.

2. PURPOSE AND SCOPE OF BIG DATA AND DATA WAREHOUSE

Data Warehouse emerged in the 1980s as an alternative for storing and organizing data in a consolidated and integrated manner, allowing users to perform statistical analysis and business intelligence. The term Big Data was coined in 1997 by Michael Cox and David Ellsworth [6], NASA researchers who had to work with very large data sets that exceeded the capacity of main memory, local disk and even remote disk. They called this the Big Data problem. Despite being so widely referenced today, Big Data does not have a rigorous and agreed-upon definition. It is usually associated with the treatment of massive data, extracted from different sources and without predefined structures. For some authors, Big Data is nothing more than a data set whose size is beyond the capability of typical database tools to capture, store, manage and analyze. Unlike Data Warehouse, Big Data goes beyond information consolidation, because it is used mainly for the storage and processing of any type of data, with a volume that potentially grows exponentially. Nevertheless, what is concluded in this paper is that both Data Warehouse and Big Data have a common ultimate goal: data exploration, with the purpose of describing situations and behaviours and looking for patterns, relationships and inferences. Data Warehouse has as a principle the integration and consolidation of the information in a rigid multidimensional structure. One example is the snowflake model [2] [3], used to perform Online Analytical Processing (OLAP) [7] when applying Business Intelligence (BI).
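To make the multidimensional principle concrete, the following minimal sketch (in Python with pandas) shows how a snowflake schema normalizes a dimension into sub-dimensions and how an OLAP-style roll-up is computed over the joined structure. The sales fact table, dimension tables and column names are illustrative assumptions, not taken from the paper.

```python
import pandas as pd

# Hypothetical snowflake schema: the product dimension is normalized
# into a sub-dimension (category), unlike a denormalized star schema.
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "date": ["2017-01-10", "2017-02-05", "2017-01-20", "2017-02-11"],
    "amount": [100.0, 150.0, 80.0, 60.0],
})
dim_product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "name": ["laptop", "mouse", "keyboard"],
    "category_id": [10, 20, 20],
})
dim_category = pd.DataFrame({
    "category_id": [10, 20],
    "category": ["computers", "accessories"],
})

# Join fact -> dimension -> sub-dimension (the "snowflake" traversal).
cube = (fact_sales
        .merge(dim_product, on="product_id")
        .merge(dim_category, on="category_id"))

# OLAP-style roll-up: aggregate sales by category and month.
cube["month"] = pd.to_datetime(cube["date"]).dt.to_period("M")
rollup = cube.groupby(["category", "month"])["amount"].sum()
print(rollup)
```

The roll-up is only possible because the schema fixes the structure in advance; as discussed next, this rigidity is precisely what Big Data relinquishes.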
On the other hand, Big Data does not have as a principle consolidation and integration under predefined structures; it is more about the storage and management of large volumes of raw data (of heterogeneous types, sources and arrival speeds [69]), for which a distributed infrastructure and a set of specialized hardware and software are required. Processing and data analysis use advanced data science techniques, in which consolidation is secondary, as it depends on the nature of the data and the particular problem.

3. METHODOLOGY

A fact that motivates this analysis has to do with how projects associated with Data Warehouse and Big Data are developed. While there are widely used methodologies for the development of DW projects, such as Kimball's data warehouse life cycle [2], Todman's methodology [3] and Inmon's enterprise reference flow model [4], these have fallen short in anticipating the exponential growth and the changing nature of the data, because great efforts are required to modify or include new requirements. Some less known methodologies propose including heterogeneous data types, such as streams and geo-referenced data, in the multidimensional modelling [9], but they still do not cover the entire life cycle of DW. Data Warehouse is considered a mature technology, widely supported in the research field and with proven results at the organizational level in multiple business contexts. BD does not yet have a standardized and widespread terminology [10]; this problem is being addressed by the standardization group NIST Big Data Working Group (NBD-WG) [11], whose results have not yet been published. Big Data is newer than DW, and there are still no standardized proposals for its development. It is presumed that, besides resolving the same problems and challenges present in Data Warehouse building methodologies, it will be necessary to consider the development life cycle, non-structured data, heterogeneous data sources and non-transactional data in general, as well as fast adaptation to change. Currently, projects seeking to extract added value from data must consider the V's and C's characteristics mentioned before. They can turn out to be complex, and it is therefore necessary to adopt management strategies for their development, maintenance and production support. Governance policies should be established to reach agreements, create communication mechanisms between different actors (internal and external) and include adaptation to change, management standards, control restrictions and adoption of best practices throughout the life cycle of a Big Data project, as well as general metadata management. In addition to the V's and the C's, Management and Governance (M&G) characteristics should be considered (see figure 1).

Figure 1. Big Data characteristics (Source: Author)

C's and V's characteristics are explicitly evident for Big Data, even though no methodologies for its development and no integrated framework capable of systematically solving any requirement exist (independently of the knowledge and expertise of the user). For traditional DWs, the V's and C's characteristics are not yet explicitly evident, and even though they are already considered in software suites for technological development (tools), the methodologies do not contemplate the role they play in the development life cycle.
There are certainly incentives for the integration of Big Data and Data Warehouse into one unique solution, but so far the definition of new technologies capable of handling the architecture, processing and data analysis is still required.

4. ARCHITECTURE

From the point of view of the logical abstraction of the architecture, both DW and BD have the same components: data sources, Extraction, Transformation and Loading (ETL) processes, storage, processing and analysis. Because of this, an overview of the architecture in terms of these components is presented below.

4.1. Sources and data types

4GL technologies (fourth-generation languages) facilitated the development of transactional applications that allowed the automation of algorithms on repetitive structured data. Structured data (SD) is characterized by being well defined, predictable and soundly handled by an elaborate infrastructure [12]. Technological developments, digitization, hyperconnected devices and social networks, among other enablers, brought unstructured information into the scope of enterprises. This includes information in digital documents, data coming from autonomous devices (sensors, cameras, scanners, etc.), and semi-structured data from web sites, social media, emails, etc. Unstructured data (USD) do not have a predictable and computer-recognizable structure, and may be divided into repetitive and non-repetitive data [12]. Unstructured repetitive data (US-RD) are data that occur many times over, may have a similar structure, are generally massive, and do not always have value for analysis. Samples or portions of these data can be utilized. Because of their repetitive nature, processing algorithms are amenable to repetition and reuse. A typical example of this category is data from sensors, where the objective is the analysis of the signal and for which specific algorithms are defined. Unstructured unrepetitive data (US-URD) have varying data structures, which implies that the algorithms are not reusable (and the task of predicting or describing their structure is already a complex one). Inmon places elements of a textual nature (which require techniques from Natural Language Processing and computational linguistics) inside this category [12]. From our perspective, besides free-form text, imagery, video and audio also pertain to this category. Traditional DWs were born with the purpose of integrating structured data coming from transactional sources and of maintaining historical information to support OLAP-based analysis. With the advent of new data types, some authors propose that DW adapt its architecture and processes, as suggested by Inmon with DW 2.0 [13] and by Kimball in The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics [14].

4.2. Extract-Transform-Load processes (ETL)

The construction of a Data Warehouse requires Extraction, Transformation and Loading (ETL) processes. These must consider several data-quality-related issues, such as duplicated data, possible data inconsistency, high risk to data quality, garbage data, creation of new variables through transformations, etc. This raises the need for specific processes to extract sufficient and necessary information from the sources, and for implementing processes for cleansing, transformation, aggregation, classification and estimation tasks.
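As an illustration only, the following minimal Python sketch shows the kind of cleansing and transformation tasks such ETL processes perform. The source file, column names and rules are hypothetical assumptions, not part of the original text.

```python
import pandas as pd

# Extract: read from a hypothetical transactional source.
raw = pd.read_csv("transactions.csv")  # assumed columns: id, customer, amount, date

# Transform: typical cleansing tasks described above.
clean = raw.drop_duplicates(subset="id")             # remove duplicated data
clean = clean.dropna(subset=["customer", "amount"])  # discard garbage rows
clean["amount"] = clean["amount"].clip(lower=0)      # guard against inconsistent values
clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
clean = clean.dropna(subset=["date"])                # drop rows with unparseable dates

# Derive a new variable through transformation, as the text describes.
clean["year_month"] = clean["date"].dt.to_period("M")

# Load: aggregate and write to a hypothetical warehouse staging area.
summary = clean.groupby(["customer", "year_month"])["amount"].sum().reset_index()
summary.to_parquet("staging/monthly_sales.parquet")
```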
All these tasks, besides the utilization of different tools for the different ETL processes, can result in fragmented metadata, inconsistent results, rigid relational or multidimensional data models, and thus a lack of flexibility to perform generic analyses and changes [15]. Thus, the need for more flexible ETL processes and improved performance gave birth to proposals such as real-time loading instead of batch loading [16]. Middleware, for instance a flow analysis engine, was also introduced. This engine makes a detailed exploration of incoming data (identifying atypical patterns and outliers) before it can be integrated into the cellar. Along the same line is the Operational Data Store (ODS) [17], which proposes a volatile temporal storage to integrate data from different sources before storing it in the cellar. The work presented in [18], unlike the traditional architectures, creates a real-time ETL subsystem alongside a periodic ETL process. Periodic ETL refers to the periodic importation in batch from the data sources, as opposed to the ETL in real time. Using Change Data Capture (CDC) tools, changes in the data sources are automatically detected and loaded into the real-time area. When the system identifies that certain conditions are met, data are loaded in batch into the cellar. The stored part can then be divided into a real-time area and a static area. Specialized queries for sophisticated analyses are made against the real-time storage. Static data are equivalent to the DW, and historical queries are thus handled in the traditional way. It is worth mentioning that some changes have been observed for DW, including temporal processing areas and individualized processes according to the data access opportunity.

4.2.1. ETL requirements for new data

With the need to manage unstructured repetitive data (US-RD) and unstructured unrepetitive data (US-URD) coming from diverse sources (like the ones previously mentioned), new requirements [73] were raised, among which we may count the following:

• Managing exponential data growth. DW is characterized by using the transactional databases of the organization as its main source, and eventually flat files and legacy systems. Although this data volume grows, it does so at a manageable pace. The new DW and BD provide solutions to the management of large data collections through the MapReduce programming model [19], which allows parallelizing the processing and then gathering the partial results, all of this supported by a distributed file system like the Hadoop Distributed File System (HDFS) [36] (a minimal sketch of this model is shown after this list).

• The frequency of arrival. This can range from periodic updates to bursts of information. Traditional DWs do not face this problem, since they always focus on data that can be extracted or loaded in a periodic and programmed way. Since BD was designed to receive all incoming information at any moment in time, it must use whatever amount of memory, storage and processing is required.

• Longevity, frequency and opportunity of use. Statistically, the most recently generated dataset will be used more frequently and in real time. As datasets become old, new data are added and thus the frequency of use may decrease. But old data cannot be ruled out, because they can always be used for historical analysis [12] [20].

• Integration of data. While the traditional DW was intended to integrate data through a multidimensional model, the appearance of unstructured repetitive data (US-RD) raised problems related to finding adequate ways to group the data under a context independent of the data type: for example, grouping pictures with dialogues or, even within the same data type, determining the context of an image from the image itself, finding the structure that best represents all the data, and developing algorithms to integrate, transform and represent such data [70]. With unstructured unrepetitive data, in addition to identifying the context and structure, an algorithm for each dataset may be required, which prevents reuse and increases complexity [71]. The integration of heterogeneous datasets may be the main difference between DW and BD. In DW, the underlying purpose of integration is to have a global and uniform vision of the organization, while in BD, integration is not the ultimate goal. For BD, some unstructured datasets not amenable to integration should be kept in raw format, allowing the possibility of further uses that may not be foreseeable now.
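To illustrate the MapReduce programming model mentioned in the first point of the list, below is a minimal, single-machine Python sketch of its three phases (map, shuffle, reduce) applied to a word count. In a real deployment these phases run distributed over a file system such as HDFS; the in-memory version here only illustrates the model.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data and data warehouse",
             "data warehouse integration"]

# Map phase: emit (key, value) pairs independently for each input split.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = [pair for doc in documents for pair in map_phase(doc)]

# Shuffle phase: group all values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the partial results for each key.
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in groups.items()}
print(counts)  # {'big': 1, 'data': 3, 'warehouse': 2, ...}
```

The key property is that the map calls are independent of each other, and each reduce only needs the values grouped under its own key, which is what makes the model parallelizable across a cluster.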
4.2.2. ETL for new data life cycle

Seeking to respond to the characteristics of the new data, Inmon's proposal in DW 2.0 [13] defines the life cycle of the data for BD and proposes three storage and processing sectors. First is the interactive sector, where most of the new data resides, updates are done online and high performance in response time is required. Second is the integration sector, where the interactive sector data are integrated and transformed; in this sector, data can remain longer depending on the needs of the organization. And third is the archive sector, which maintains historical data and has a lower access probability. Similarly, Kimball [20] presents the data highway, consisting of five caches arranged by the frequency and longevity of the data: a) raw and immediate-use data; b) real-time data with a frequency of use in seconds; c) data for business monitoring with a frequency of use in minutes; d) data to generate reports for business decisions with a frequency of use in hours; and e) aggregated data to support historical analysis with daily, monthly and annual frequency of use. Mayer's proposal [21] defines a reference architecture based on components that allow handling all kinds of data: acquisition, cleaning, integration, identification, analysis and management of data quality. It also includes transversal components for data storage, metadata, life cycle and security handling. As for products already in the market, the following are worth mentioning. Oracle [22] implements a global proposal including both structured and unstructured data and defines different storage areas where the life cycle of the data is handled in a similar way as proposed by Inmon and Kimball. The proposal relies on a set of tools that permit data gathering, organization, analysis and visualization. SAP Data Warehousing offers a solution that integrates features of BD and DW in real time, allowing the analysis and identification of patterns in complex structures of both structured and unstructured data. SAP supports ETL processes in the S