Join top executives in San Francisco on July 11-12, to hear how leaders are integrating and optimizing AI investments for success.
Table of contents
- What is a data warehouse?
- Importance and benefits of data warehouses for enterprises
- Data warehouse architectural design
- Types of data warehouse models and dimensions
- Best practices for implementing and managing an enterprise data warehouse in 2022
What is a data warehouse?
A data warehouse is a central repository that lets enterprises store and consolidate business data extracted from multiple source systems for historical and trend analysis reporting. The term first surfaced in an IBM paper published in the 1980s and is widely used in the modern data landscape.
Unlike data lakes, which offer a way to dump all sorts of raw data (structured and unstructured) for potential future use cases, data warehouses are more purpose-driven, focusing on certain specific subject areas. They keep a massive collection of organized and clean business data in an aggregate summary form (fitting into rows and columns), ready to help with business intelligence (BI) activities, particularly analytics, aimed at meeting a pre-defined business need.
In simple terms, you can describe a data warehouse as a water tank set up to meet the specific clean water (data and analysis) needs of one particular household.
Leading global organizations use data warehouses, either on-premises or in the cloud, plugging in BI tools such as Tableau to generate valuable insights for business decisions. A retail company, for instance, can use a data warehouse to answer questions about its best-selling products in the last six months, the demographics of the customers who purchased those products and where those customers are located. This data originates in different operational and transaction systems, such as sales and inventory, but the data warehouse aggregates it, serving as the single source of truth across multiple knowledge domains.
Without the data warehouse, gathering information from legacy databases and applications and using it to generate reports for decision-making becomes a mess.
Importance and benefits of data warehouses for enterprises
- Single source of truth: A data warehouse applies common standards (formatting and other processes developed by the enterprise) to the data collected from different sources. This ensures that all information is of high quality and that there are no duplicates or errors that could skew analysis.
- Faster operations: Since the data warehouse consolidates subject area-specific data in a single place with consistency and quality, retrieving data for downstream analytics and reporting becomes simpler, translating into faster decision-making. Without this, enterprise users would be required to log into every individual department system and take the help of the IT department to consolidate data and generate reports.
- Historical data: A data warehouse can keep historical records, which enables users to easily leverage this information to assess changes over time to make present and future decisions.
- Protection from source system updates: As the data is consolidated in the warehouse, source system updates will not lead to loss of data or the knowledge it brings. The information would be untouched in the warehouse.
- Better returns: The improved efficiency in terms of deriving insights and taking decisions also helps with better business outcomes in the long run. Organizations using a data warehouse can quickly identify trends to capitalize on business opportunities — such as changes in product demand based on geographic location — and take necessary actions to improve supply, revenue and profits. Similarly, it can also help cut costs.
- Security and regulatory compliance: A data warehouse can also help ensure security and regulatory compliance with strong levels of data encryption, such as AES256 and features for access control, user authentication and identity management. Early-stage companies find it hard to set up these layers for different application systems.
- Cloud benefits: Initially, data warehouses were built on-premises, which required massive investments in hardware, IT teams and training on all aspects of the repository. Now, companies have the option of a cloud-based data warehouse and can switch to a pay-as-you-go model, where they are charged only for the storage and compute they use. Everything else is handled by the cloud provider.
An increasing number of companies are moving to cloud-based enterprise data warehouses (EDWs) for this reason, as well as for the ability to instantly scale compute capacity up or down according to the demand at hand.
Data warehouse architectural design
While the architecture of a data warehouse can vary according to different organizational needs, most enterprises tend to follow a three-tier system with a bottom, middle and top layer.
- Bottom tier: In this tier, enterprises use extract, transform and load (ETL) tools to extract raw data from operational and external source systems, transform it into clean, high-quality information and then load it into a data warehouse server, which is usually a relational database management system (RDBMS). The information loaded can be categorized into three types: detailed, summarized and metadata.
Essentially, this layer collects information from multiple sources, preps it and puts it together in a single place.
- Middle tier: Once the data is loaded into the data warehouse, the middle tier comes into play with an online analytical processing (OLAP) server. It applies business logic and performs high-speed restructuring and analysis on the stored data to deliver the insights sought through submitted queries. Depending on the database in use, three types of OLAP models are available: relational OLAP (ROLAP), multidimensional OLAP (MOLAP) and hybrid OLAP (HOLAP).
- Top tier: The top tier is the front end or client layer of the architecture, which contains query and reporting tools and enables analysts and other enterprise users to gain the insights they need for their business. It lets users interact with the data, create dashboards and reports, monitor KPIs and build apps, among other things.
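To make the bottom and middle tiers concrete, here is a minimal ETL sketch in Python. It assumes a hypothetical "orders" feed; sqlite3 stands in for the warehouse RDBMS, and all table, column and value names are illustrative, not part of any standard.

```python
# Minimal ETL sketch: extract raw rows, transform (clean/standardize),
# load into a relational store. sqlite3 stands in for the warehouse RDBMS.
import sqlite3

# Extract: raw rows as they might arrive from an operational system
raw_rows = [
    {"order_id": "1001", "amount": " 250.00 ", "region": "west"},
    {"order_id": "1002", "amount": "99.50", "region": "EAST"},
]

def transform(row):
    """Clean and standardize a raw record before loading."""
    return (
        int(row["order_id"]),
        round(float(row["amount"].strip()), 2),
        row["region"].strip().upper(),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, region TEXT)"
)

# Load: insert the cleaned rows into the warehouse table
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [transform(r) for r in raw_rows])
conn.commit()

# The middle tier (OLAP) would then aggregate over this data, e.g.:
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 349.5
```

In a real deployment, the extract step would pull from operational databases or APIs and the load target would be a dedicated warehouse engine, but the shape of the pipeline is the same.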
Types of data warehouse models and dimensions
When it comes to structuring data for a database, there are two main modeling techniques to consider.
- Relational: Widely used for adding, updating and deleting data in relational databases that back online transaction processing (OLTP) systems
- Dimensional: Prevalently used in data warehouses or decision support systems with online analytical processing (OLAP) servers
Originally developed by Ralph Kimball, the dimensional modeling technique consists of fact tables (numerical measures) and dimension tables (descriptions or reference information providing context to the facts). It makes the information easier to read, summarize and analyze, allowing users to quickly retrieve data and generate reports.
For instance, when dealing with sales data, the actual quarterly sale number would be a fact while elements providing more context such as customer, product names or sale location would be classified as dimensions. Each dimension can also have its own attributes for quick search and filtering. In the above-mentioned example, the location dimension could have pin codes and states as attributes.
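The sales example above can be sketched as a simple star schema: one fact table of sale amounts joined to a location dimension whose attributes (state, pin code) support filtering. The table and column names below are illustrative, with sqlite3 again standing in for the warehouse.

```python
# Sketch of a dimensional (star schema) model for the sales example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive context, with attributes for filtering
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    state        TEXT,
    pin_code     TEXT
);
-- Fact table: numeric measures, keyed to dimensions
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    location_key INTEGER REFERENCES dim_location(location_key),
    quarter      TEXT,
    amount       REAL
);
INSERT INTO dim_location VALUES (1, 'CA', '94105'), (2, 'NY', '10001');
INSERT INTO fact_sales VALUES
    (1, 1, '2022-Q1', 500.0),
    (2, 1, '2022-Q1', 300.0),
    (3, 2, '2022-Q1', 200.0);
""")

# A typical dimensional query: slice the facts by a dimension attribute
rows = conn.execute("""
    SELECT d.state, SUM(f.amount)
    FROM fact_sales f JOIN dim_location d USING (location_key)
    GROUP BY d.state ORDER BY d.state
""").fetchall()
print(rows)  # [('CA', 800.0), ('NY', 200.0)]
```

This is the pattern that makes dimensional models fast to query: the numeric facts stay in one narrow table, and every descriptive attribute lives in a dimension that can be joined in as needed.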
There can be many types of dimensions in a data warehouse, but the ones used most often are the following:
- Slowly changing dimension
- Conformed dimension
- Degenerate dimension
- Junk dimension
- Role-playing dimension
- Static dimension
- Shrunken dimension
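As one concrete case from the list above, a slowly changing dimension handled in the common "Type 2" style preserves history by versioning rows rather than overwriting them. The sketch below illustrates the idea with hypothetical field names and in-memory rows standing in for a dimension table.

```python
# Type 2 slowly changing dimension sketch: when an attribute changes,
# the current row is closed out and a new versioned row is appended,
# so historical reports still see the old value.
from datetime import date

customer_dim = [
    {"customer_id": 42, "city": "Austin",
     "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """Close the current row and append a new version (SCD Type 2)."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None,
                "is_current": True})

apply_scd2(customer_dim, 42, "Denver", date(2022, 6, 1))
# History is preserved: reports for 2021 still see "Austin"
print([r["city"] for r in customer_dim])  # ['Austin', 'Denver']
```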
In addition to the three tiers described above, a data warehouse architecture can also include a staging layer, data marts and sandboxes.
- Staging layer: Since not all information might be available to load into the data warehouse server at once, organizations might set up a staging layer as a temporary data landing zone post ETL. This would ensure all the information is gathered first and simplify the data preparation process, especially for aspects like data consistency.
- Data marts: Organizations might also add data marts to their warehouse architecture, between the central database and end users, to serve specific business lines. This way, certain pieces of information from the main repository first flow to their respective data mart, which then serves user requests. Data marts enable users from specific departments and business lines to gain focused insights faster than would typically be possible when working with the broader, more comprehensive dataset stored in the warehouse.
- Sandboxes: Data warehouses can also include sandboxes: special experimental zones where enterprise users can try out different types of data and analytical techniques without having to comply with the standard rules and protocols of the data warehouse. Sandboxes are usually added to the top tier of the architecture.
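One simple way to picture a data mart is as a filtered slice carved out of the central warehouse tables, for example as a view. The sketch below uses an illustrative schema, with sqlite3 as a stand-in, to build a retail-only mart:

```python
# A department-specific "data mart" modeled as a filtered view over
# the central warehouse table. Schema and names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER, department TEXT, amount REAL);
INSERT INTO sales VALUES (1, 'retail', 100.0), (2, 'wholesale', 900.0),
                         (3, 'retail', 250.0);
-- Data mart for the retail business line: only its slice of the data
CREATE VIEW retail_mart AS
    SELECT sale_id, amount FROM sales WHERE department = 'retail';
""")

# Retail analysts query their mart without seeing other departments' data
rows = conn.execute("SELECT COUNT(*), SUM(amount) FROM retail_mart").fetchone()
print(rows)  # (2, 350.0)
```

In production, a mart is often a physically separate, pre-aggregated store rather than a view, but the principle is the same: each business line queries only its own slice.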
Best practices for implementing and managing an enterprise data warehouse in 2022
Though the cloud is widely seen as the next frontier in the data warehousing space, it is important to consider what works best for your organization when choosing how to implement a data warehouse.
1. Choose the right infrastructure
A cloud data warehouse, as mentioned earlier, frees up critical resources and provides scalability on a pay-as-you-go model. It is suitable if you want to avoid spending heavily on setting up and running a data warehouse. An on-premises infrastructure, by contrast, gives you additional control and stricter governance and regulatory compliance, but at the cost of the cloud's convenience: the entire setup must be built in-house, staff must be appointed to continuously manage the warehouse, and scaling is not as smooth.
There is also the hybrid option, which can bring the best of both worlds for different use cases while providing benefits such as a strong disaster recovery system. Consider what exactly you need and choose wisely.
2. Train the staff
Moving to the right infrastructure isn’t enough — the company mindset should also shift toward leveraging the technology in use. For this, training and education are critical. An external expert should be brought in to train teams on the benefits, rules and best practices for whatever standard architecture and infrastructure the business has adopted for its data warehouse.
3. Define the data required
As data warehouses are purpose-driven, it is critical to consider business needs and identify the data sources that should be connected to the central repository. Enterprises should follow the principle of "think first and load later" and create a data model that details the datasets that will be incorporated into the data warehouse, the relationships between those datasets and the business requirements the warehouse must serve.
Once the data sources are identified, make sure to document the location, structure and quality of the information to identify the gaps in it and develop standards and business rules to ensure consistency and quality in the warehouse.
4. Agile approach to delivery with frequent releases
In order to build a successful enterprise data warehouse, organizations should break the project down into phases and start with a subject use case that is high priority but small and easy to execute. Then, build on it with other subjects and market the results to get the team excited about using the new system to generate insights. This method also provides faster feedback, which can drive seamless process and system improvements.
5. Automation for maintenance
You can leverage machine learning to automate the technical management functions of the data warehouse and free up the resources that go toward maintaining it. This directly cuts operational costs and ensures that maintenance is done quickly.
6. Plan access control in advance
When a data warehouse brings together information from different systems, it becomes crucial to define who gets access to what and at what level. After all, not everyone has the same data needs.
This is where defining access control comes into play. It should be done well in advance and detail who can do what. Without clearly defined access controls and a governance policy, users can easily break the data pipeline, particularly in large organizations where many analysts work with the data warehouse at the same time.
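As a simple illustration of the idea, access control can start as an explicit mapping from roles to the tables or marts each role may query; a request is checked against this map before any query runs. The role and table names below are hypothetical.

```python
# Minimal role-based access control sketch: roles map to the tables
# (or marts) a user may query. Names are hypothetical.
ROLE_GRANTS = {
    "sales_analyst":   {"fact_sales", "dim_location"},
    "hr_analyst":      {"dim_employee"},
    "warehouse_admin": {"fact_sales", "dim_location", "dim_employee"},
}

def can_query(role, table):
    """Return True if the role has been granted access to the table."""
    return table in ROLE_GRANTS.get(role, set())

print(can_query("sales_analyst", "fact_sales"))  # True
print(can_query("hr_analyst", "fact_sales"))     # False
```

In practice, warehouse platforms provide this through built-in grants and roles; the point is that the role-to-data mapping should be designed up front, not bolted on after users are already querying.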