Exploring Data Warehousing
I had studied Data Warehousing as a subject in my Masters degree. However I'd never expected to come across it in the kind of work I thought I'd be getting into. But viola! It seems I now have an opportunity to try out those theories in practice. I've been given the responsibility to come up with a data warehousing solution for an enterprise software that I'm involved with. Deadline: 2 weeks!
Links
So here are some links I'm exploring at the moment:
Data Warehouse: Wikipedia Article
Designing the Star Schema Database
Data Mart vs Data Warehouse - The Great Debate (a blog entry)
A quick Data Warehouse tutorial
Problem Statement
Straight from my Manager:
"Come up with a data warehousing solution for our TeamBuilder system. It should store timestamped details of all the important business models of our system, ie, Person, People Request, Recruiting Request, Partner, Partner Team, etc. This should enable us to create a reporting framework on top so that we can easily come up with trending or time-based reports. Note, however, that the data warehouse should be usable for current info also so that day-to-day reports are also generated by the same reporting framework. Keep in mind that the reporting requirements are not well-known, ever-changing and cannot be anticipated. Hence, persist all relevant data from the operational database so that any report can be catered to."
Why Data Warehouse?
Why we think a DW can help:
- A DW is, after all, primarily used to store historical data.
- A DW's primary usage is reporting and data analysis; a DW schema is usually de-normalized and hence results in very fast query-processing.
- A DW's schema is usually simple and straight-forward, such that non-techies can themselves get the desired info out of it (using simple SQL or reporting tools) without requiring expert help.
Considerations
Strategic
- Since the problem statement says that we cannot anticipate the reporting requirements, we are saying that we do not know the facts part of the DW but only the dimensions part!! How do you go about designing a DW if this happens?? All the popular DW models like Star Schema, Snowflake Schema, etc, have the facts table at the centre and the dimension tables around it.
- We are not necessarily interested in counts only (eg, the number of Billable people in a particular BU) but also in lists of details (eg, details of all the people in the Pool). DW solutions are more geared towards aggregated data; what solutions exist for detailed level facts instead of aggregated facts?
Tactical
Some points to ponder about:
- Need ETL to bring about consistency among the various representations of the same concept. Glaring example: representation of BU is different in Person, People Request, Recruiting Request, and Partner.
- New custom fields can be added at any time in the operational system. How to deal with that? Solution: Add a new column in DW table each time?
- Custom fields can be renamed at any time. How to deal? Solution: Use a fixed name (like a human-readable ID) for each CF (represented as a column) in the DW table, and have a timestamped mapping between operational name for the CF and the fixed DW name?
- Duplicate CF names are allowed in the operational system (as it uses numeric IDs to differentiate CFs). Solution: Disciplined use of a convention of using only unique CF names in the operational system