“Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”
By Dan Ariely.
Big Data is a term that describes the large volume of data, both structured and unstructured, that floods businesses every day. But it is not the amount of data that is important. What matters with Big Data is what organizations do with the data. Big Data can be analyzed to obtain insights that lead to better decisions and strategic business moves.
What is Big Data?
When we talk about Big Data, we refer to data sets or combinations of data sets whose size, complexity and speed of growth make it difficult to capture, manage, process or analyze them with conventional technologies and tools, such as relational databases and conventional statistics or visualization packages, within the time necessary for them to be useful.
Although the size used to determine whether a given data set is considered Big Data is not firmly defined and continues to change over time, most analysts and practitioners currently refer to data sets ranging from 30-50 Terabytes to several Petabytes.
The complexity of Big Data is mainly due to the unstructured nature of much of the data generated by modern technologies, such as web logs, radio frequency identification, sensors incorporated in devices, machinery, vehicles, Internet searches, social networks like Facebook, laptops, smartphones and other mobile phones, GPS devices and call center records.
In most cases, to use Big Data effectively, it must be combined with structured data from a more conventional business application, such as an ERP or a CRM.
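As a minimal sketch of what that combination can look like, the snippet below joins a few clickstream events with a CRM extract and counts page views per customer segment. The record layouts, the field names and the page_views_by_segment helper are assumptions made up for this illustration, not part of any particular ERP or CRM product.

```python
from collections import Counter

# Hypothetical CRM extract: structured records keyed by customer ID.
crm_customers = {
    "C001": {"name": "Alice", "segment": "premium"},
    "C002": {"name": "Bob", "segment": "standard"},
}

# Hypothetical clickstream events, as they might be derived from web logs.
click_events = [
    {"customer_id": "C001", "page": "/pricing"},
    {"customer_id": "C002", "page": "/home"},
    {"customer_id": "C001", "page": "/checkout"},
    {"customer_id": "C999", "page": "/home"},  # no matching CRM record
]


def page_views_by_segment(events, customers):
    """Count page views per CRM segment; unmatched events count as 'unknown'."""
    counts = Counter()
    for event in events:
        customer = customers.get(event["customer_id"])
        segment = customer["segment"] if customer else "unknown"
        counts[segment] += 1
    return dict(counts)


views = page_views_by_segment(click_events, crm_customers)
print(views)
```

Keeping the unmatched events in an "unknown" bucket, rather than dropping them, makes gaps between the two sources visible.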
Why is Big Data so important?
What makes Big Data so useful for many companies is the fact that it provides answers to many questions that companies did not even know they had. In other words, it provides a point of reference. With such a large amount of information, the data can be molded or tested in whatever way the company considers appropriate. By doing so, organizations are able to identify problems in a more understandable way.
Collecting large amounts of data and searching for trends within them allows companies to move much more quickly, smoothly and efficiently. It also allows them to eliminate problem areas before those problems damage their profits or reputation.
Big Data analysis helps organizations take advantage of their data and use it to identify new opportunities. This leads to smarter business moves, more efficient operations, higher profits and happier customers. The companies that are most successful with Big Data achieve value in the following ways:
- Cost reduction. Big Data technologies, such as Hadoop and cloud-based analytics, provide significant cost advantages when it comes to storing large amounts of data, in addition to identifying more efficient ways of doing business.
- Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new data sources, companies can analyze information immediately and make decisions based on what they have learned.
- New products and services. With the ability to measure customer needs and satisfaction through analysis comes the power to give customers what they want. With Big Data analytics, more companies are creating new products to meet the needs of customers.
Big Data analysis is also being applied in many different sectors:
- Tourism: Keeping customers happy is key to the tourism industry, but customer satisfaction can be difficult to measure, especially at the right time. Resorts and casinos, for example, have only a small window in which to turn around a bad customer experience. Big Data analysis offers these companies the ability to collect customer data, apply analysis and identify potential problems immediately, before it is too late.
- Health care: Big Data appears in large quantities in the healthcare industry. Patient records, health plans, insurance information and other types of information can be difficult to manage, but they are full of key information once the analytics are applied. That’s why data analysis technology is so important for health care. By analyzing large amounts of information – both structured and unstructured – quickly, diagnoses or treatment options can be provided almost immediately.
- Public administration: Government bodies face a big challenge: maintaining quality and productivity with tight budgets. This is particularly problematic in the justice system. Technology streamlines operations while giving management a more holistic view of activity.
- Retail: Customer service has evolved in recent years, as smarter buyers expect retailers to understand exactly what they need, when they need it. Big Data helps retailers meet those demands. Armed with endless amounts of data from customer loyalty programs, buying habits and other sources, retailers not only have a deep understanding of their customers, but they can also predict trends, recommend new products and increase profitability.
- Manufacturing companies: These deploy sensors on their products to receive telemetry data. Sometimes this is used to offer communications, security and navigation services. This telemetry also reveals usage patterns, failure rates and other product improvement opportunities that can reduce development and assembly costs.
- Advertising: The proliferation of smartphones and other GPS devices offers advertisers the opportunity to reach consumers when they are near a store, a coffee shop or a restaurant. This opens up new revenue streams for service providers and offers many companies the opportunity to win new prospects.
Challenges of data quality in Big Data
The special characteristics of Big Data mean that its data quality faces multiple challenges. These characteristics are known as the 5 V's: Volume, Velocity, Variety, Veracity and Value, which together define the problem of Big Data.
These 5 characteristics make it hard for companies to extract real, high-quality data from data sets that are so massive, changing and complicated.
Until the arrival of Big Data, ETL let us load the structured information that we had stored in, for example, our ERP and CRM systems. Now we can also load additional information that lies outside the company's own domain: comments or likes on social networks, results of marketing campaigns, third-party statistics, etc. All these data help us to know whether our products or services are working well or, on the contrary, are having problems.
Some challenges that the data quality of Big Data faces are:
Many sources and types of data
With so many sources, data types and complex structures, the difficulty of data integration increases.
Big Data sources are very broad:
- Internet and mobile data.
- Internet of Things data.
- Sectoral data collected by specialized companies.
- Experimental data.
And the data types are just as varied:
- Unstructured data types: documents, videos, audios, etc.
- Semi-structured data types: software, spreadsheets, reports.
Tremendous volume of data
As we have already seen, the volume of data is enormous, and that complicates the execution of a data quality process within a reasonable time.
It is difficult to collect, clean, integrate and obtain high quality data quickly. It takes a lot of time to transform unstructured types into structured types and process that data.
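To make the cleaning and structuring step more concrete, here is a minimal sketch that turns raw log lines into structured records, sets aside lines that do not match, and drops exact duplicates. The line layout ("ip timestamp method path status") and the parse_log_lines helper are assumptions for this example; real logs would need their own pattern.

```python
import re

# Assumed line layout: "<ip> <timestamp> <METHOD> <path> <status>".
LINE_PATTERN = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<timestamp>\S+)\s+"
    r"(?P<method>GET|POST|PUT|DELETE)\s+"
    r"(?P<path>\S+)\s+"
    r"(?P<status>\d{3})$"
)


def parse_log_lines(lines):
    """Return (records, rejected): structured rows and the lines that failed."""
    records, rejected = [], []
    seen = set()
    for raw in lines:
        line = raw.strip()
        match = LINE_PATTERN.match(line)
        if match is None:
            rejected.append(raw)  # keep bad lines for later inspection
            continue
        if line in seen:
            continue  # skip exact duplicates
        seen.add(line)
        record = match.groupdict()
        record["status"] = int(record["status"])  # give the field a real type
        records.append(record)
    return records, rejected


raw_lines = [
    "10.0.0.1 2024-01-01T10:00:00Z GET /home 200",
    "10.0.0.1 2024-01-01T10:00:00Z GET /home 200",  # exact duplicate
    "  10.0.0.2 2024-01-01T10:00:05Z POST /cart 201  ",
    "corrupted ### line",
]
records, rejected = parse_log_lines(raw_lines)
print(len(records), "structured records,", len(rejected), "rejected")
```

Even this toy version has to touch every line once, which hints at why the same work becomes slow and expensive at Big Data volumes.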
A lot of volatility
The data changes quickly, which means it stays valid for only a very short time. Keeping up with it requires very high processing power.
If we do not handle this well, processing and analysis based on these data can produce erroneous conclusions, which can lead to mistakes in decision making.
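One simple way to guard against stale data is to check the age of each record before it is used. The sketch below assumes every record carries an observed_at timestamp and that 15 minutes is an acceptable maximum age; both are illustrative assumptions, and the right limit depends entirely on the use case.

```python
from datetime import datetime, timedelta, timezone


def split_by_freshness(records, now, max_age):
    """Separate records that are still valid from those that are too old."""
    fresh, stale = [], []
    for record in records:
        age = now - record["observed_at"]
        if age <= max_age:
            fresh.append(record)
        else:
            stale.append(record)
    return fresh, stale


# A fixed "now" keeps the example reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
readings = [
    {"sensor": "A", "observed_at": now - timedelta(minutes=2)},
    {"sensor": "B", "observed_at": now - timedelta(minutes=45)},
]

fresh, stale = split_by_freshness(readings, now, max_age=timedelta(minutes=15))
print([r["sensor"] for r in fresh], [r["sensor"] for r in stale])
```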
There are no unified data quality standards
In 1987, the International Organization for Standardization (ISO) published the ISO 9000 standards to guarantee the quality of products and services. However, the study of data quality standards did not begin until the 1990s, and it was not until 2011 that ISO published the ISO 8000 data quality standards.
These standards need to mature and be perfected. In addition, research on the data quality of big data has started recently and there are hardly any results.
The data quality of Big Data is key, not only to obtain competitive advantages but also to keep us from making serious strategic and operational errors based on erroneous data, with consequences that can become very serious.
How to build a Data Governance plan in Big Data
Governance means making sure that data is authorized, organized and covered by the necessary user permissions in a database, with the fewest possible errors, while maintaining privacy and security.
This does not seem an easy balance to achieve, especially when the reality of where and how the data is hosted and processed is in constant motion.
Below, we will see some recommended steps when creating a Data Governance plan in Big Data.
Granular Data Access and Authorization
You cannot have effective data governance without granular controls.
These granular controls can be achieved through access control expressions. These expressions use grouping and Boolean logic to control data access and authorization flexibly, with role-based permissions and visibility settings.
At the lowest level, confidential data is protected by hiding it, and at the top, confidential contracts are in place for data scientists and BI analysts. This can be done with data masking capabilities and different views, where raw data is blocked as much as possible and more access is gradually provided until, at the top, administrators are given greater visibility.
Having different levels of access gives more integrated security.
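The idea of access control expressions with masked views can be sketched in a few lines. In this illustration an expression is a nested tuple such as ("and", ("role", "admin"), ("not", ("role", "contractor"))); that tuple format, the column policy and the role names are all assumptions invented for the example, not the syntax of any specific platform.

```python
def evaluate(expression, user_roles):
    """Evaluate a nested Boolean access control expression against a set of roles."""
    operator, *operands = expression
    if operator == "role":
        return operands[0] in user_roles
    if operator == "and":
        return all(evaluate(op, user_roles) for op in operands)
    if operator == "or":
        return any(evaluate(op, user_roles) for op in operands)
    if operator == "not":
        return not evaluate(operands[0], user_roles)
    raise ValueError(f"unknown operator: {operator!r}")


# Hypothetical policy: one expression per column.
COLUMN_POLICY = {
    "customer_id": ("or", ("role", "analyst"), ("role", "admin")),
    "purchase_total": ("or", ("role", "analyst"), ("role", "admin")),
    "card_number": ("and", ("role", "admin"), ("not", ("role", "contractor"))),
}


def mask(value):
    """Hide a value, keeping only the last four characters of longer values."""
    text = str(value)
    if len(text) <= 4:
        return "*" * len(text)
    return "*" * (len(text) - 4) + text[-4:]


def view_row(row, user_roles):
    """Return the row as this user may see it; columns without a rule are masked."""
    visible = {}
    for column, value in row.items():
        rule = COLUMN_POLICY.get(column)
        allowed = rule is not None and evaluate(rule, user_roles)
        visible[column] = value if allowed else mask(value)
    return visible


row = {"customer_id": "C001", "purchase_total": 129.5, "card_number": "4111111111111111"}
analyst_view = view_row(row, {"analyst"})
admin_view = view_row(row, {"admin"})
print(analyst_view["card_number"])
print(admin_view["card_number"])
```

Masking by default, so that a column with no rule stays hidden, follows the principle of blocking raw data as much as possible and opening access only where a rule explicitly grants it.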
Perimeter security, data protection and integrated authentication
Governance does not work without security at the end point of the chain. It is important to build a good perimeter and place a firewall around the data, integrated with the existing authentication systems and standards. When it comes to authentication, it is important that companies synchronize with proven systems.
Authentication is about how to integrate with LDAP (Lightweight Directory Access Protocol), Active Directory and other directory services. You can also support tools such as Kerberos for authentication. The important thing is not to create a separate infrastructure, but to integrate with the existing one.
Encryption and Data Tokenization
The next step after protecting the perimeter and authenticating all granular data access that is being granted is to make sure that files and personally identifiable information (PII) are encrypted and tokenized from end to end of the data pipeline.
Once someone has passed the perimeter and has access to the system, protecting PII is extremely important. This data must be encrypted so that, regardless of who has access to it, they can run the analyses they need without exposing any of it.
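As an illustration of the tokenization half of this step, the sketch below replaces PII fields with keyed, deterministic tokens built with HMAC-SHA256 from Python's standard library. This is only one possible approach and an assumption of this example: the tokens are one-way, unlike vault-based tokenization where the original value can be looked up again, and encryption at rest and in transit is not shown. The field names and the key are placeholders.

```python
import hashlib
import hmac


def tokenize(value, secret_key):
    """Return a deterministic, non-reversible token for a PII value."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]  # truncated for readability in this demo


def tokenize_record(record, pii_fields, secret_key):
    """Copy a record, replacing the configured PII fields with tokens."""
    return {
        field: tokenize(str(value), secret_key) if field in pii_fields else value
        for field, value in record.items()
    }


# Placeholder key: in practice it would come from a secrets manager.
SECRET_KEY = b"demo-key-do-not-use-in-production"
PII_FIELDS = {"email", "phone"}

orders = [
    {"email": "ana@example.com", "phone": "555-0100", "total": 40},
    {"email": "ana@example.com", "phone": "555-0100", "total": 25},
    {"email": "luis@example.com", "phone": "555-0199", "total": 10},
]
safe_orders = [tokenize_record(order, PII_FIELDS, SECRET_KEY) for order in orders]

# The same customer always maps to the same token, so grouping still works.
print(safe_orders[0]["email"] == safe_orders[1]["email"])
```

Because the tokens are deterministic, analysts can still join and aggregate by customer without ever seeing the e-mail address or phone number.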
Constant Audit and Analysis
The strategy does not work without an audit. That level of visibility and accountability at each step of the process is what allows IT to “govern” the data instead of simply setting policies and access controls and hoping for the best. It is also how companies can keep their strategies up to date in an environment in which the way we see data, and the technologies we use to manage and analyze it, change every day.
We are in the infancy of Big Data and IoT (Internet of Things), and it is essential to track access and recognize patterns in the data.
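A very small sketch of access tracking and pattern detection: every access is appended to an audit log, and a helper flags users who read a sensitive resource more often than a chosen threshold. The log structure, the resource names and the threshold are assumptions for illustration; a real deployment would write to durable, tamper-resistant storage.

```python
from collections import Counter
from datetime import datetime, timezone

audit_log = []  # in-memory stand-in for a durable audit store


def record_access(user, resource, action, when=None):
    """Append one access event to the audit log and return it."""
    entry = {
        "user": user,
        "resource": resource,
        "action": action,
        "at": when or datetime.now(timezone.utc),
    }
    audit_log.append(entry)
    return entry


def users_over_threshold(log, resource, threshold):
    """List users who accessed a resource more than `threshold` times."""
    counts = Counter(entry["user"] for entry in log if entry["resource"] == resource)
    return sorted(user for user, count in counts.items() if count > threshold)


for _ in range(5):
    record_access("svc_export", "customers.pii", "read")
record_access("ana", "customers.pii", "read")
record_access("ana", "sales.summary", "read")

flagged = users_over_threshold(audit_log, "customers.pii", threshold=3)
print(flagged)
```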
A unified data architecture
Ultimately, the IT manager who oversees the enterprise data management strategy must think about the details of granular access, authentication, security, encryption, and auditing. But they should not stop there. Rather, they should think about how each of these components is integrated into the overall data architecture. They should also think about how that infrastructure will need to be scalable and secure, from data collection and storage to BI, analytics and other third-party services. Data governance is as much about rethinking strategy and execution as it is about the technology itself.
It goes beyond a set of security rules. It is a single architecture in which these roles are created and synchronized throughout the platform and all the tools that plug into it.
I hope this post has been useful to you, and I look forward to your comments!