Big Data is fashionable. It is interesting to see how the vast majority of people have heard the term Big Data at some time even though they do not belong to the business or technological world.
But it is just as interesting to hear the great variety of definitions that emerge when people are asked about Big Data. In this article I am going to try to answer a question that many of you have: What is Big Data?
Big Data is a set of techniques and technologies that allow data analysis.
These techniques and technologies allow us to store, transform, analyze and visualize data efficiently. Thanks to this, we can meet the analysis needs of today's organisations, which are far more demanding than they were a few years ago.
That is, Big Data should be used in scenarios where a traditional Business Intelligence (BI) solution cannot analyze the data efficiently enough to meet the required analysis objectives.
Such scenarios have historically been associated with what is known as the 3 V’s: volume, velocity and variety. Some people include other V’s in this list such as veracity, volatility, validity, visibility and variability, but the most common definition is still that of the 3 V’s.
Volume refers to a very large amount of data. Data at this scale can no longer be handled efficiently by traditional data repositories. These are, in the vast majority of cases, relational databases, which, despite having evolved in recent years to be more efficient and to run on more powerful hardware than before, are still a bottleneck for the efficient storage and querying of large volumes of data.
Using this type of storage system to analyze large volumes of data can push it beyond the limits for which it was designed, degrading performance when storing and accessing data. These limits vary depending on the hardware and software, so it is almost impossible to draw a line that marks where massive data begins. A few years ago this limit was on the order of gigabytes, while today, with recent innovations in hardware and software, it is around a few terabytes.
When someone analyzes data, they do so with the aim of finding an answer to a question, within a timeframe in which that answer will bring some value. If that answer arrives late, it loses all of its value and the opportunity is gone.
For example, analyzing the location of vehicles and mobile devices can provide information on traffic flow. In this scenario, the question we want to answer could be: “At what speed are vehicles moving?”. If the vehicle and mobile device data could be obtained and analyzed in a very short timeframe, it would be very useful, since we could visualize the data on a map to offer up-to-date information on the traffic density of each road (urban or interurban). However, if this answer is obtained one hour late, it will not be useful to drivers.
Therefore, it is clear that velocity is a key factor when making decisions.
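To make the traffic example more concrete, here is a minimal sketch of how the speed of one vehicle could be estimated from two location readings. The reading format `(timestamp_seconds, latitude, longitude)` and the function names are assumptions made for illustration, and the distance uses the standard haversine formula on a spherical Earth, so the result is only an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def speed_kmh(reading_a, reading_b):
    """Average speed between two readings of the same vehicle.

    Each reading is a hypothetical tuple: (timestamp_seconds, latitude, longitude).
    """
    t1, lat1, lon1 = reading_a
    t2, lat2, lon2 = reading_b
    elapsed_hours = (t2 - t1) / 3600
    if elapsed_hours <= 0:
        raise ValueError("the second reading must be later than the first")
    return haversine_km(lat1, lon1, lat2, lon2) / elapsed_hours


# Two made-up readings taken one minute apart on the same road.
first = (0, 40.4000, -3.7000)
second = (60, 40.4100, -3.7000)
print(round(speed_kmh(first, second), 1))  # roughly 67 km/h
```

The calculation itself is trivial. The difficulty in a real scenario lies in running it continuously over the readings of many vehicles and still delivering the result in time.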
The speed with which an answer is obtained from the data can be broken down into two components: the data loading speed (obtaining, processing and storing) and the speed of information analysis (extraction of knowledge through data analysis techniques such as statistics or artificial intelligence).
If either of these components is slow, there is a risk of exceeding the upper bound for the response time, which leaves the answer with no value to the user.
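The two components above can be sketched as a simple time budget. This is only an illustration: the class name, the figures and the 60-second bound are assumptions, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class AnswerLatency:
    """Time, in seconds, spent on each component of producing an answer."""
    load_seconds: float      # obtaining, processing and storing the data
    analysis_seconds: float  # extracting knowledge from the stored data

    def total(self) -> float:
        return self.load_seconds + self.analysis_seconds

    def is_useful(self, max_seconds: float) -> bool:
        """True if the answer arrives within the allowed response time."""
        return self.total() <= max_seconds


# Hypothetical figures: a daily batch load versus a continuous load.
daily_batch = AnswerLatency(load_seconds=24 * 3600, analysis_seconds=60)
continuous = AnswerLatency(load_seconds=5, analysis_seconds=2)

TRAFFIC_BUDGET = 60  # assumed upper bound for the traffic scenario, in seconds

print(daily_batch.is_useful(TRAFFIC_BUDGET))  # False
print(continuous.is_useful(TRAFFIC_BUDGET))   # True
```

Note that a fast analysis cannot compensate for a slow load: the budget applies to the sum of both components.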
A traditional BI system, due to its design and architecture, has a delay in bringing the data into the repository which usually ranges from a few minutes (in specific cases such as Lambda architectures) to 24 hours (in a scenario of daily data loads), although it could be higher. If we take the previous scenario (traffic), a traditional BI system clearly could not satisfy the requirement of having the information updated in near real time.
Traditionally, three data types have been used to store data: numbers, character strings and dates. Historically, when there was a need to analyze data types beyond these, specialized applications were used, which fell outside of what are considered BI tools.
For example, for years there have been applications and libraries that can analyze images and answer questions such as “Does a green color appear in the image?” (which could be very useful for tracking the growth of a fungus in a laboratory culture over time). But those applications and libraries were not integrated into traditional BI tools.
Therefore, the analysis of data types beyond the traditional ones was not considered, in the past, as something feasible within a BI solution.
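As a rough illustration of the image question mentioned above, the sketch below checks whether any pixel of an image is green. It treats the image as a plain grid of RGB tuples instead of relying on a particular imaging library, and the rule for what counts as green (the green channel exceeding red and blue by a margin) is an assumption chosen for simplicity.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each from 0 to 255


def is_green(pixel: Pixel, margin: int = 40) -> bool:
    """Assumed rule: green exceeds both red and blue by at least `margin`."""
    r, g, b = pixel
    return g - r >= margin and g - b >= margin


def contains_green(image: List[List[Pixel]]) -> bool:
    """True if at least one pixel in the image counts as green."""
    return any(is_green(pixel) for row in image for pixel in row)


# Tiny hand-made 2x2 images standing in for photographs of a culture.
culture_day_1 = [[(200, 200, 200), (190, 185, 180)],
                 [(210, 205, 200), (195, 190, 188)]]
culture_day_5 = [[(200, 200, 200), (60, 160, 70)],
                 [(210, 205, 200), (195, 190, 188)]]

print(contains_green(culture_day_1))  # False
print(contains_green(culture_day_5))  # True
```

The logic is simple, but it works on pixel data, not on the numbers, strings and dates that a traditional BI tool expects.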
Currently, with the growth of data available in organizations and on the Internet, there is an increasing need to find answers in non-basic data types, including audio, photographs, videos, geolocation data, etc. When this is a requirement, the use of Big Data is a must.
Without going into technicalities, the following table tries to summarize the most important differences between traditional BI and Big Data:
| |Traditional BI|Big Data|
|---|---|---|
|Volume|A few terabytes|Terabytes and above|
|Velocity|Periodic data loads (typically daily)|Higher frequency in data loads → Real time|
|Variety|Basic data types|Virtually any data type|
|Computation|Centralized in a single computer|Distributed|
|Hardware|High specifications|Any (commodity hardware)|
|Data quality (veracity)|Very important|Relative importance (a certain degree of uncertainty is assumed)|
Big Data allows us to take data analysis beyond the capabilities of traditional BI. It is a response to the needs of users, just as BI was in the past with respect to older technology. This does not mean that BI should be put aside as a valid alternative when analyzing data. On the contrary, it should always be an option to explore.
However, when the users’ needs include the use of massive data (volume), responses obtained in a very short timeframe (velocity) or answers obtained from complex data types (variety), we must discard traditional BI due to its limitations and opt for a Big Data solution.
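The rule of thumb above can be written down as a small check. The thresholds below are assumptions loosely inspired by the comparison table. As noted earlier, the real limits depend on the hardware and software in use, so treat this as a sketch and not as a definitive test.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnalysisNeeds:
    data_terabytes: float     # volume of data to analyze
    max_delay_seconds: float  # how old an answer may be and still be useful
    complex_types: bool       # images, audio, video, geolocation data, ...


# Assumed limits of a traditional BI solution (hypothetical values).
BI_MAX_TERABYTES = 5.0
BI_MIN_DELAY_SECONDS = 10 * 60


def big_data_reasons(needs: AnalysisNeeds) -> List[str]:
    """Return which of the 3 V's rule out traditional BI.

    An empty list means a traditional BI solution may still be enough.
    """
    reasons = []
    if needs.data_terabytes > BI_MAX_TERABYTES:
        reasons.append("volume")
    if needs.max_delay_seconds < BI_MIN_DELAY_SECONDS:
        reasons.append("velocity")
    if needs.complex_types:
        reasons.append("variety")
    return reasons


monthly_sales_report = AnalysisNeeds(0.2, 24 * 3600, False)
live_traffic_map = AnalysisNeeds(50.0, 60, True)

print(big_data_reasons(monthly_sales_report))  # []
print(big_data_reasons(live_traffic_map))      # ['volume', 'velocity', 'variety']
```

A single reason is enough to look beyond traditional BI, which is why the function collects the reasons instead of requiring all three at once.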