In this weeks Five Question Blitz we focus on data storage systems. In our previous article we covered how data serves as the foundation for every business. This naturally leads into how data storage systems are the mechanism by which data facilitates business operations and analysis.
The Five Question Blitz was created to answer five questions relevant to Data & Analytics. Topics will be broad and answers will be simplified. Our goal is to promote common definitions and increase the general knowledge of individuals with interest in Data & Analytics.
The two are very similar, both store data generated by a business but for very different purposes.
Databases support the day to day transactional operations of a business. That means providing information to facilitate actions such as sales and customer engagement while also recording the outcomes of those actions.
Data warehouses store historical database information structured to support specified analytic and decision making ends. That means information from disparate sources is denormalized, transformed, segmented, and optimized to support analysis, segmentations, forecasting, reporting, and ultimately data driven decisions.
Data warehouses and data lakes are very similar, both store data from disparate sources; however the purposes for each is very different.
Data lakes are unrestricted storage systems that can house raw data without a pre-determined purpose other than data storage. Data formats can include structured, semi-structured, or unstructured. Data lakes are also highly scalable solutions for storing data.
Data warehouses are structured storage systems designed to support a company’s specified analytic end. Data is transformed into a structured system facilitating analytic queries for the purpose of forecasting, segmentation, modeling, reporting, and decision support.
For this question we will define warehouse systems to include both data lakes and data warehouses.
In the simplest terms loading data into a warehouse system means that data must be (E)extracted from a separate data source and then (L)loaded into the warehouse system of choice.
Data warehouses typically include an additional step to (E)extraction and (L)loading; (T)transformation can be required to ensure data formats are consistent within the data warehouses structure.
ETL (Extract Transform Load) and ELT (Extract Load Transform) are the acronyms that refer to this process. ETL is quickly becoming a legacy process in favor of ELT. The main difference is the order of operation. The second and third steps, load and transform, are switched. ELT is more scalable than ETL in its ability to manage large amounts of data and reduce the timeframes for data availability in data warehouses.
A common method for understanding data warehouse structure was created by William Inmon. His method includes four characterizations:
Subject Oriented – This speaks to the ability of a data warehouses to answer questions pertaining to a subject. (i.e. sales questions, marketing questions, product questions, etc.)
Integrated – Data warehouses extract information from disparate sources and must transform data into consistent formats throughout.
Non-volatile – Once data has been loaded into the data warehouse it must be accurate and not change over time.
Time Variant – Data warehouses contain information that ideally spans the lifetime of the businesses operations. This creates the ability for time analysis and the identification of distributions, trends, and patterns.
The most common tool used to access data inside a data warehouse is SQL (Structured Query Language). SQL was created in the 80’s and has been the standardized method for querying data warehouses for the past 30 years. Over the years new SQL variants have emerged with slight syntax differences but at the core SQL is a consistent tool that every Data & Analytics professional needs to know.
Click here to read last weeks article, Five Question Blitz: Deep Dive into Data.
Here are some websites for reference:
About the Author: My name is Ion King and I am the Chief Executive Officer at SimDnA. My focus is on helping others passionate about growing careers in Data Science & Analytics achieve their goals. Connect with me on LinkedIn or find more of my articles on medium.