Everyone knows that Big Data contains unstructured data. That is an essential part of the architecture of Big Data. But what happens when you put structured data into Big Data? For example, suppose I put click stream data – tens of thousands of repetitive click stream records – into Big Data. Or suppose I put hundreds of thousands of repetitive telephone call-level detail records into Big Data. I have now essentially put structured data into Big Data. Do I now have structured data in Big Data?
In the sense that I have thousands of repetitive structured records in Big Data, I do indeed have structured data in Big Data. But in the sense that I have an infrastructure of attributes, rows, and columns managed by a central dbms, I still do not have structured data in Big Data.
So it depends on how you look at the data. If my definition of structured data is lots of repetitive data with the same repeating structure, then I do have structured data in Big Data. But if my definition of structured data is data whose structure is defined and enforced by a central dbms, then I do not have structured data in Big Data, even though I have thousands of repetitive records there.
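As a sketch of the first definition, consider repetitive click stream records landed as raw lines in a Big Data environment. The structure is purely a repeating convention known to the reader; no central dbms declares or enforces it. (The field layout and values below are hypothetical, for illustration only.)

```python
# Hypothetical repetitive click stream records as they might land in Big Data:
# a repeating field layout, but no central dbms enforcing it.
raw_records = [
    "2014-03-01T09:15:02|user123|/home|200",
    "2014-03-01T09:15:05|user123|/products|200",
    "2014-03-01T09:15:09|user456|/home|404",
]

# The "schema" lives only in this parsing code; nothing in the system
# prevents a malformed line from being loaded.
def parse(line):
    timestamp, user, page, status = line.split("|")
    return {"timestamp": timestamp, "user": user,
            "page": page, "status": int(status)}

records = [parse(line) for line in raw_records]
```

The records are structured in the sense that they repeat, but the only description of that structure is the convention embodied in `parse` – exactly the situation the two definitions above pull apart.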
So why should it matter? Whether I do or do not have structured data makes a big difference to the analyst who must try to get important information out of Big Data. [Here we make the distinction between a search and an analysis; see previous articles for an elucidation of the differences between the two processes.] If I want to search thousands of repetitive records found in Big Data, I can indeed do searches. But if I want to analyze those thousands of repetitive records, I have a harder time, because analysis requires context. The only context I have when I load thousands of repetitive records into Big Data is whatever context comes with the data at the moment of loading. And I have no assurance from the system that the repetitive data has been loaded correctly, or that any relationships among the data are maintained and remain accurate over time.
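To make the search-versus-analysis distinction concrete, here is a minimal sketch using made-up call-detail records: a search simply scans the raw records for a value, while an analysis depends on context, such as knowing which field is the caller and which is the duration, and in what units.

```python
from collections import defaultdict

# Hypothetical call-detail records loaded as raw strings into Big Data.
# Assumed layout: caller|callee|date|duration_in_seconds.
cdrs = [
    "5551234|5559876|2014-03-01|180",
    "5551234|5550000|2014-03-02|60",
    "5559999|5559876|2014-03-03|300",
]

# A search: scan for records mentioning a value. No context required.
hits = [r for r in cdrs if "5559876" in r]

# An analysis: average call duration per caller. This works only because
# we *know* field 0 is the caller and field 3 is seconds -- context that
# came with the data at load time, not from the system itself.
totals, counts = defaultdict(int), defaultdict(int)
for r in cdrs:
    caller, _, _, duration = r.split("|")
    totals[caller] += int(duration)
    counts[caller] += 1
avg = {c: totals[c] / counts[c] for c in totals}
```

If the load had silently mixed in records with a different field order, the search would still "work", but the analysis would quietly produce wrong answers – which is the assurance problem described above.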
So I can do limited amounts of analysis against my “structured” records of data once they have been loaded into Big Data. But I do not have the freedom of analysis I would have if the data were managed by a standard dbms, where there is plenty of context in the infrastructure itself.
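By contrast, a standard dbms carries context in the infrastructure itself: declared column names, types, and constraints that the engine enforces on every load. A minimal sketch with Python's built-in sqlite3 module (the table and columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The dbms itself records the structure: names, types, and a constraint.
conn.execute("""
    CREATE TABLE calls (
        caller   TEXT    NOT NULL,
        callee   TEXT    NOT NULL,
        call_day TEXT    NOT NULL,
        seconds  INTEGER NOT NULL CHECK (seconds >= 0)
    )
""")
conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)",
                 [("5551234", "5559876", "2014-03-01", 180),
                  ("5551234", "5550000", "2014-03-02", 60)])

# Analysis can lean on the catalog: the engine already knows that
# seconds is a number, so aggregation needs no out-of-band context.
avg = conn.execute(
    "SELECT caller, AVG(seconds) FROM calls GROUP BY caller").fetchall()
```

Here the context travels with the data: any analyst, and any tool, can discover the structure from the dbms catalog rather than from tribal knowledge about how the records were loaded.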
This makes a big difference when considering what data I am operating on. If I have lots and lots of repetitive records, then I may indeed be able to do analytical processing (or at least some limited forms of it) against the data found in Big Data. But if the data in Big Data is not repetitive, then I cannot do any serious analytical processing against it, even though I may still be able to search it.