Late to the party as usual. So what the blazes is a data lake?
Some quick research basically paints this picture for me:
- Store ALL data in 1 place
  - Relational data
  - Flat files
  - Images
- Schema on READ
There are other bullet points, but these were, to me, the main ones. The idea is to take all the data in its original form and just store it. Unlike a data warehouse, where the data would be transformed into the warehouse’s schema when it is loaded, in a lake you leave the data as is. The schema gets applied at the time of reading the data.
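To convince myself I understood "schema on read", I sketched it in a few lines of Python. This is only my own toy illustration, not how any real data lake platform works: the `lake` dict, the file path, and the `read_csv_with_schema` helper are all made up for the example. The point is that the bytes are stored untouched and each reader brings its own schema.

```python
import csv
import io

# The "lake": raw bytes stored exactly as they arrived (hypothetical path and data).
lake = {
    "sales/2024-01.csv": b"id,amount,sold_on\n1,19.99,2024-01-03\n2,5.00,2024-01-09\n",
}

def read_csv_with_schema(raw, schema):
    """Apply a schema (column name -> converter) while reading raw CSV bytes."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    # Only the columns named in the schema are kept, each cast on the way out.
    return [{col: cast(record[col]) for col, cast in schema.items()} for record in reader]

raw = lake["sales/2024-01.csv"]

# Reader 1 wants typed sales figures.
sales = read_csv_with_schema(raw, {"id": int, "amount": float, "sold_on": str})
total = sum(row["amount"] for row in sales)

# Reader 2 looks at the same bytes but only cares about ids, as strings.
ids_only = read_csv_with_schema(raw, {"id": str})
```

Two readers, two schemas, one copy of the data. Nothing about the stored file had to change to support the second reader, which I take to be the whole appeal.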
All of this seems like a pretty cool idea. Data storage today is fast and cheap, so why not? I can't think of a reason not to, and I don't yet see what damage it could cause. However, I can easily see data lakes turning into junk drawers if organizations don’t take the time to apply some governance over what goes in.
I also see issues in the details. How exactly does the platform apply “Schema on read”? What if I want to do a join between Northwind.dbo.Customers and a bunch of JPEG image files of the customers? Are we writing little utilities that do this “schema on read”, or is the platform doing it?
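If it does come down to writing little utilities, here is roughly what I imagine one looking like for the customers-and-images case. Everything in it is an assumption on my part: that the images are named after the customer ID (`ALFKI.jpg`), the `lake/images/customers/` paths, the helper names, and the customer rows, which are stand-ins for what would really come out of Northwind.dbo.Customers. It does not answer whether a platform would do this for me.

```python
from pathlib import PurePosixPath

# Stand-in rows; in practice these would be queried from the Customers table.
customers = [
    {"CustomerID": "ALFKI", "CompanyName": "Alfreds Futterkiste"},
    {"CustomerID": "ANATR", "CompanyName": "Ana Trujillo Emparedados y helados"},
]

# Hypothetical listing of image files sitting in the lake.
image_paths = [
    "lake/images/customers/ALFKI.jpg",
    "lake/images/customers/BONAP.jpg",
]

def index_images_by_customer(paths):
    """Assumed convention: the file name (minus extension) is the customer ID."""
    return {PurePosixPath(p).stem.upper(): p for p in paths}

def left_join_customers_to_images(rows, paths):
    """Keep every customer; ImagePath is None when no matching file exists."""
    by_id = index_images_by_customer(paths)
    return [{**row, "ImagePath": by_id.get(row["CustomerID"])} for row in rows]

joined = left_join_customers_to_images(customers, image_paths)
```

The "schema" for the images here is nothing more than a naming convention that the utility knows about and the storage does not, which is exactly the kind of detail I suspect ends up living in somebody's head.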
I don’t really see data lakes replacing data warehouses. In fact I think they’re complementary ideas.