DATA LAKE ARCHITECTURE

Big Data started out as a replacement for the data warehouse. The Big Data vendors are loath to mention this fact today. But if you were around in the early days of Big Data, one of the central topics of discussion was: if you have Big Data, do you need a data warehouse? From a marketing standpoint, Big Data was sold as a replacement for the data warehouse. With Big Data you were free from all that messy stuff that data warehouse architects were doing.

Much to the surprise of the Big Data vendors, the support for data warehousing was far, far stronger than they had ever imagined. There were (and still are) valid reasons why data warehouses existed. If you wanted integrated, believable data, you needed a data warehouse. Big Data had nothing to say about this aspect of data. The vendors just said, “Buy my product and your problems go away.”

So the Big Data vendors got feedback and pushback. Someone decided that Big Data needed an architectural construct. So the Big Data vendors came up with the Data Lake.

Now, the data lake was never a real architecture. It was just a buzzword used to counter the technicians who had already built a data warehouse.

The data lake was just a big collection of data that was thrown onto the Big Data infrastructure. The theory was that you put the data out there in the data lake, and people and data scientists were supposed to find the data and use it to solve previously unknown problems. But it didn’t take long for people to discover that the data lake was really just a glorified data garbage dump. The data sat there and no one used it. No one could use it.

The problem with garbage dumps is that they start to smell over time. Furthermore, the Big Data garbage dump was expensive. So the people who put this non-architecture out there were called upon to make their garbage dump useful. The first thing they discovered was that they needed metadata to describe what was in the data lake. Without metadata, they couldn’t find anything.
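To make the metadata point concrete, here is a minimal sketch of a catalog that records what each data set in the lake is and lets you search for it. Everything in it (the `DatasetEntry` fields, the `Catalog` class, the paths and data set names) is hypothetical and invented for illustration; a real catalog would be persisted, governed, and far richer.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Descriptive metadata for one data set in the lake (hypothetical fields)."""
    name: str
    location: str
    source_system: str
    description: str
    columns: list[str] = field(default_factory=list)


class Catalog:
    """An in-memory catalog, assumed here only for illustration."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Return the names of data sets whose metadata mentions the keyword."""
        needle = keyword.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join(
                [entry.name, entry.description, entry.source_system, *entry.columns]
            ).lower()
            if needle in haystack:
                hits.append(entry.name)
        return sorted(hits)


# Made-up entries: without them, nothing in the lake can be found by meaning.
catalog = Catalog()
catalog.register(DatasetEntry(
    name="crm_customers",
    location="/lake/raw/crm/customers",
    source_system="CRM",
    description="Customer master records exported nightly",
    columns=["cust_id", "full_name", "email"],
))
catalog.register(DatasetEntry(
    name="web_clicks",
    location="/lake/raw/web/clicks",
    source_system="web server logs",
    description="Clickstream events",
    columns=["session_id", "url", "ts"],
))

print(catalog.search("customer"))
```

Note what the sketch does not do: it tells you where the customer data is, but nothing about whether that data can be trusted or combined with anything else.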

The next discovery they made was that finding data is not enough. They needed data they could rely upon. They needed to build analyses from the data lake, and connecting data from disparate sources is not easy to do. Metadata alone wasn’t the solution; it was merely the first step up a long ladder. They then discovered that in order to make sense of data from one analysis to the next, they needed to refine the data against a common data model. This is the next step up the ladder.
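Refining data against a common data model can be sketched in a few lines. The example below assumes two hypothetical source systems (a CRM and a billing system) that describe the same customer with different keys, name formats, and country encodings; the `Customer` model, the field names, and the sample rows are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """The common model every source is refined into (hypothetical)."""
    customer_id: str
    name: str
    country: str  # ISO 3166-1 alpha-2 code


def from_crm(row: dict) -> Customer:
    # Assumed CRM layout: numeric id, "Last, First" names, full country names.
    last, first = [part.strip() for part in row["cust_name"].split(",")]
    country = {"United States": "US", "Germany": "DE"}[row["country"]]
    return Customer(str(row["cust_no"]), f"{first} {last}", country)


def from_billing(row: dict) -> Customer:
    # Assumed billing layout: zero-padded string id, lower-case country codes.
    return Customer(row["id"].lstrip("0"), row["name"].strip(), row["cc"].upper())


# Made-up sample rows; in raw form they share no key and no format.
crm_rows = [{"cust_no": 42, "cust_name": "Doe, Jane", "country": "United States"}]
billing_rows = [{"id": "00042", "name": " Jane Doe ", "cc": "us"}]

crm = {c.customer_id: c for c in map(from_crm, crm_rows)}
billing = {c.customer_id: c for c in map(from_billing, billing_rows)}

# Only after both sources conform to the common model can they be compared.
matched = [cid for cid in crm if billing.get(cid) == crm[cid]]
print(matched)
```

The mapping functions are where the work and forethought go: each one encodes a decision about what a customer id, a name, and a country mean across the organization.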

I don’t know if the path the data lake is walking sounds familiar. But it is the same path that the people doing analysis walked a long time ago. They need a data warehouse architecture. That is exactly 180 degrees from what the vendors promised their buyers years ago. (It is hard for vendors to admit they made a mistake.)

When you bring in the data lake, if you don’t want it turning into a garbage dump, you have got to impose the discipline of the data warehouse architecture over it. Stated differently, the data lake doesn’t solve problems; it merely introduces them.

Yes, Big Data and Data Lake enthusiasts, there is such a thing as architecture. There is a need to have integrity of data. Integrity of data does not just magically happen. It requires a lot of work and forethought. And we in the world of data warehousing get to say to you, “I told you so.” Yes, we do remember who was so condescending to us in years past. We remember who called us an “old” idea and architecture. We remember the derision that was tossed at us. We remember who sold the IT community on the idea that we weren’t needed. We remember when we were told that we were yesterday’s news and to get lost.

You could have saved a lot of time and energy (and money) by trying to build on the past rather than trying to sweep us away. We remember.

The last laugh is really the best laugh. But Big Data and Data Lakes did not need to be the problem child that they are. The arrogance of the vendors is to blame for the mess that has been made.

This post originally appeared on Bill Inmon’s LinkedIn Page.

Bill Inmon is an author located in Colorado. Bill’s latest books include DATA ARCHITECTURE: SECOND EDITION, Elsevier Press, TURNING TEXT INTO GOLD, Technics Publications, HEARING THE VOICE OF YOUR CUSTOMER, Technics Publications, and THE UNIFIED STAR SCHEMA, Technics Publications. Bill was named by ComputerWorld as one of the ten most influential people in the history of computing.

Photo courtesy of Philippe Donn.