Big Data describes data collections so large that they are difficult to process with traditional data management techniques. It is often defined by three Vs: volume (how much data), velocity (the speed at which data is generated), and variety (the range of data types). While many definitions focus on volume, i.e. the order of magnitude of the available data, large amounts of data in particular bring heterogeneous formats and a wide range of possible data sources with them.
Structured numerical data
Examples are structured numerical data or unstructured data such as text, images, or videos. This diversity and the wide variance of data sources offer many opportunities to gain insights. Recent technical advances (e.g. cloud computing) make it possible to store and analyze data at large scale.
For many (new) data types, the exact business value is still unclear and requires systematic research. The available data is often chaotic, and even when cleaned it can be overwhelming and too complex even for professional data scientists. The value the data contributes is, of course, context-specific and varies with the business case and application.
The challenge is to identify the right data
One of the biggest challenges is to identify the data that best fits the business needs. Too many projects in the Big Data context are approached from the wrong perspective: it is not the amount of data itself that determines which insights to pursue. Rather, it is the relevant use cases and the business value of the required answers that determine how the data is structured, integrated, and analyzed.
Adapting the IT infrastructure to embed analytics solutions and integrate different data sources is not strictly necessary in the first step, but it becomes important as the project progresses. The IT infrastructure consists of the following core layers:
Data ingestion layer
This layer comprises the data transfer from a source system to an analysis environment. A suitable tool and a corresponding process therefore need to be defined.
Traditional extract, transform, load (ETL) tools and relational databases are combined with Hadoop for large data sets.
The latter in particular covers scenarios involving less structured, high-volume, or streamed data. Analysis use cases draw on anything from data-warehouse data to completely unstructured data.
This range challenges classical architectures and requires adaptable schemas. Which data sources are to be integrated depends on the respective application.
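To make the ingestion layer concrete, here is a minimal sketch of an extract-transform-load step in Python. The CSV event feed, the table schema, and the field names are assumptions for illustration; SQLite stands in for the actual analysis store.

```python
import csv
import io
import sqlite3

# Hypothetical example: raw, semi-structured event data arrives as CSV text
# (in practice this would come from a file, queue, or API).
raw = io.StringIO(
    "user_id,event,value\n"
    "1,view,\n"
    "2,purchase,19.99\n"
    "1,purchase,5.00\n"
)

# Extract: read the rows from the source
rows = list(csv.DictReader(raw))

# Transform: coerce types and default missing values to 0.0
cleaned = [
    (int(r["user_id"]), r["event"], float(r["value"] or 0.0))
    for r in rows
]

# Load into the analysis environment (SQLite stands in for the warehouse)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, event TEXT, value REAL)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", cleaned)

total = con.execute(
    "SELECT SUM(value) FROM events WHERE event = 'purchase'"
).fetchone()[0]
print(round(total, 2))  # 24.99
```

The same extract/transform/load pattern applies whether the target is a relational warehouse or a Hadoop-based store; only the connectors change.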
Data value exploration layer
Based on the business requirements and the corresponding use case, data is examined, tested, and sampled in this layer. Depending on the complexity and business issues, a suitable analysis scheme is developed.
Business and exploratory analyses based on OLAP (Online Analytical Processing) models in the storage layer are supplemented or extended by advanced analysis methods and integrations (e.g. R or Python plug-ins).
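The combination described above can be sketched in a few lines of Python: an OLAP-style roll-up over a small fact table, supplemented by a simple least-squares trend as the "advanced analytics" step. The fact table and its values are assumed sample data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fact table: (region, month, revenue) -- assumed sample data
facts = [
    ("north", 1, 100.0), ("north", 2, 120.0), ("north", 3, 140.0),
    ("south", 1, 80.0),  ("south", 2, 85.0),  ("south", 3, 90.0),
]

# OLAP-style roll-up: total revenue per region
totals = defaultdict(float)
for region, _, revenue in facts:
    totals[region] += revenue

def trend_slope(xs, ys):
    """Least-squares slope -- the advanced-analytics supplement."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Monthly growth trend per region
trends = {
    region: trend_slope(
        [m for r, m, _ in facts if r == region],
        [v for r, _, v in facts if r == region],
    )
    for region in totals
}
print(totals["north"], trends["north"])  # 360.0 20.0
```

In a real project the roll-up would run in the OLAP engine and the statistical model in an attached R or Python environment; the division of labour is the same.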
Data consumption layer
Here the results are consumed, e.g. through visualization. End-users can work with the data without deep technical understanding (e.g. via self-service business intelligence).
Data Thinking: Turning data into values
Companies still have difficulty using data in a meaningful way, or do not believe they have the right skills to do so.
However, the very first, and one of the most important, challenges in analysis projects is identifying the business needs and the guiding questions that define the insights one hopes to gain.
The Data Thinking approach implies different starting points for the analysis process and different innovation paths, which can usually be summarised in three standard situations or scenarios.
These scenarios are determined by the previously defined core areas of the analysis requirements: business needs, data, analysis, and infrastructure.
Initial situation before data analysis
The starting situation before a Big Data analysis can differ. The four perspectives listed above can be rated differently with regard to the requirements of the analysis project, and depending on this evaluation matrix there are inevitably different starting points for analytics projects.
Based on the experience from our various customer projects, we distinguish three scenarios. In scenario 1, the data analysis is motivated by a defined requirement, such as market observation during the rollout of a new website, app, or similar. A suitable data source still has to be identified; because the data is missing, neither the exact analysis nor the infrastructure with regard to existing data sources or databases is defined yet.
Ideas then have to be developed as to which data sources could be relevant and which problems could be solved on that basis. Different data analysis methods are then used to generate new insights. In scenario 2, the data source and infrastructure are clearly defined, and the specific questions need to be identified. One approach is to assess the insight value of a specific data source that has not yet been analyzed in detail.
For example, a department has an internal database and wants to add a forecasting component to its business intelligence system. In this case, the scope is clearer than in the first scenario, and an exploratory data analysis can be implemented right away. In scenario 3, there is a precise analytical problem that one would like to professionalize.
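A forecasting component of the kind mentioned in the scenario-2 example could, as a minimal sketch, be a one-step-ahead forecast via simple exponential smoothing. The sales history and the smoothing factor are assumptions for illustration.

```python
# Hypothetical departmental sales history (assumed sample data)
history = [100.0, 104.0, 101.0, 108.0, 112.0]

def forecast_next(series, alpha=0.5):
    """One-step-ahead forecast via simple exponential smoothing.

    alpha close to 1 weights recent observations heavily;
    alpha close to 0 smooths aggressively.
    """
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

print(forecast_next(history))  # 108.375
```

In practice such a component would be fitted and validated against historical data before being wired into the BI system, but even this simple model illustrates how a forecasting step attaches to an existing, well-defined data source.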
Analytical design
A first analytical design based on available Big Data tools shows promising results, and in the next step the solution can be scaled and institutionalized at the infrastructure level. This requires guidance on architectural decisions, for example to ensure data quality and integration during further scaling.
In the second part of our article, we will divide the analysis process into four phases and further discuss the importance of the Data Thinking approach for Big Data analysis.