We live in the era of big data, and it is leaving no industry untouched, be it financial services, consumer goods, e-commerce, transportation, manufacturing or social media. Across all industries, enterprises now have access to a growing number of internal and external big data sources. Internal sources typically track demographics, online or offline transactions, sensors, server logs, website traffic and emails. This list is not exhaustive and varies from industry to industry. External sources, on the other hand, mostly consist of social media information: online discussions, comments and feedback shared about a particular product or service. Another major source of big data is machine data, which consists of real-time information from sensors and web logs that monitor customers’ online activity. In the coming years, as technological advancements give us new ways of generating data online and offline, the one safe prediction is this: the growth of big data is not going to stop.
Although, at a high level, big data is about data captured from multiple sources and about sheer size, several technical definitions provide more clarity. The O’Reilly Strata group states that “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures”. In simple terms, big data needs multiple systems, rather than a single system, to handle and process it efficiently. Say an online e-commerce enterprise in a single region generates about 1,000 gigabytes (1 terabyte) of data daily, which can be handled and processed using traditional database systems. After expanding operations to a global level, its daily data generation has increased 10,000 times and now stands at 10 petabytes (1 petabyte = 1,000,000 gigabytes). Traditional database systems do not have the capabilities required to handle this kind of data, and enterprises need to depend on big data technologies such as Hadoop, which uses a distributed computing framework. We will learn more about these technical topics in subsequent chapters.
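The arithmetic in the e-commerce example can be checked with a few lines of Python. This is only a minimal sketch: it assumes decimal (SI) units, and the names `GB`, `TB`, `PB` and `scale_daily_volume` are illustrative, not part of any library.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15


def scale_daily_volume(daily_gb: int, growth_factor: int) -> int:
    """Return the daily volume in bytes after growing by growth_factor."""
    return daily_gb * GB * growth_factor


# Single-region enterprise: about 1,000 GB per day.
regional_bytes = scale_daily_volume(1000, 1)
# Global enterprise: 10,000 times the regional volume.
global_bytes = scale_daily_volume(1000, 10_000)

print(regional_bytes // TB)  # 1  (terabyte per day)
print(global_bytes // PB)    # 10 (petabytes per day)
```

Working in integer byte counts avoids rounding surprises when moving between units.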
To further simplify our understanding of big data, we can rely on its three major characteristics, namely volume, variety and velocity, which are commonly referred to as the 3 V’s of big data. Occasionally, some resources also mention a less common characteristic, veracity, which is referred to as the 4th V of big data. Together, these four characteristics provide more detail about the nature, scope and structure of big data.
Volume deals with the size aspect of big data. With technical advancements in global connectivity and social media, the amount of data generated on a daily basis is growing exponentially. Every day, about 30 billion pieces of information are shared globally. An IDC Digital Universe study estimated that global data volume was about 1.8 zettabytes as of 2011 and would grow about five times by 2015. A zettabyte is a quantity of information or storage capacity equal to one billion terabytes, that is, 1 followed by 21 zeroes, in bytes. Across many enterprises, these increasing volumes pose an immediate challenge to traditional database systems with regard to storing and processing data.
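To get a feel for these magnitudes, the sketch below converts raw byte counts into readable units and applies the "about five times" growth figure quoted above. The helper `human_readable` and the constants are hypothetical names introduced for illustration, and the 2015 value is simply the quoted 2011 estimate multiplied by five, not an independent figure.

```python
# Decimal (SI) unit labels, each 1,000 times the previous one.
SI_UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

TERABYTE = 10**12
ZETTABYTE = 10**21


def human_readable(num_bytes: int) -> str:
    """Format a byte count using the largest SI unit that fits."""
    value = float(num_bytes)
    unit_index = 0
    while value >= 1000 and unit_index < len(SI_UNITS) - 1:
        value /= 1000
        unit_index += 1
    return f"{value:.1f} {SI_UNITS[unit_index]}"


# One zettabyte expressed in terabytes: one billion.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# 1.8 zettabytes (the quoted 2011 estimate), kept as an exact integer.
volume_2011 = 18 * ZETTABYTE // 10
# "About five times" growth, as quoted for 2015.
projected_2015 = 5 * volume_2011

print(terabytes_per_zettabyte)
print(human_readable(volume_2011))
print(human_readable(projected_2015))
```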
Variety refers to the range of sources and formats in which big data arrives. Big data comes from sources such as conversations on social media, media files shared on social networks, online transactions, smartphone usage, climate sensor data, financial market data and many more. The underlying formats of the data coming out of these sources vary, from Excel sheets and text documents to audio files and server log files, and can be broadly classified as either structured or unstructured. Structured data formats store data in a defined way, i.e. in clearly marked rows and columns, whereas unstructured data formats have no predefined organization and mostly cover text, audio and video data. Unstructured data at this scale is a more recent phenomenon, and traditional database systems do not possess the capabilities required to process this kind of information.
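The difference between the two data types can be illustrated with a small Python sketch. The order records and the customer review below are made-up sample data, and the keyword check is a deliberately naive stand-in for real text processing.

```python
import csv
import io

# Structured data: clearly marked rows and columns (sample CSV content).
structured_source = io.StringIO(
    "order_id,customer,amount\n"
    "1001,alice,25.50\n"
    "1002,bob,40.00\n"
)
rows = list(csv.DictReader(structured_source))

# Because every row has the same columns, a question such as
# "what is the total order amount?" is a direct column lookup.
total_amount = sum(float(row["amount"]) for row in rows)

# Unstructured data: free text with no columns to query.
unstructured_review = (
    "Ordered last week, delivery was quick but the box arrived damaged."
)

# Extracting meaning requires interpreting the text itself;
# a simple keyword search is the crudest possible approach.
mentions_damage = "damaged" in unstructured_review.lower()

print(total_amount)
print(mentions_damage)
```

The structured rows can be aggregated immediately, while the review needs some form of text analysis before it can answer even a simple question.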
Increased volumes of data have put a lot of stress on the processing abilities of traditional database systems. Enterprises should be able to quickly process incoming data from various sources and then share it with the business to ensure the smooth functioning of day-to-day operations. This quick flow of data within an enterprise is the velocity characteristic of big data. Another important aspect is the ability to provide relevant services to the end user in real time. For example, Amazon provides instant recommendations based on the user’s search and location. Given the entered keyword, these services need to search through the entire transaction history and return relevant results, which will hopefully convert into a purchase. The effects of velocity are very similar to those of volume, and enterprises need to rely on advanced processing technologies to handle big data efficiently.
Veracity concerns the quality and consistency of big data. Though enterprises have access to a lot of big data, some aspects of it may be missing or unreliable. Traditionally, data quality issues have arisen from human entry errors or from individuals withholding information. In the big data era, where most data-capturing processes are automated, the same issues can occur because of technical lapses such as system or sensor malfunctions. Whatever the reason, one should take care to deal with inconsistencies in big data before using it for any kind of analysis.
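A simple first check for such inconsistencies might look like the sketch below. The temperature readings, the valid range and the function name `split_by_quality` are all assumptions made for illustration; real pipelines would use rules suited to their own sensors and data.

```python
# Hypothetical temperature readings in degrees Celsius.
# None marks a missing value; -999.0 mimics a sensor fault code.
readings = [21.4, None, 22.1, -999.0, 21.9]

# Assumed plausible range for this sensor.
VALID_RANGE = (-50.0, 60.0)


def split_by_quality(values):
    """Separate usable readings from missing or out-of-range ones."""
    clean, suspect = [], []
    low, high = VALID_RANGE
    for index, value in enumerate(values):
        if value is None or not (low <= value <= high):
            # Keep the position so the problem can be traced to its source.
            suspect.append((index, value))
        else:
            clean.append(value)
    return clean, suspect


clean_readings, suspect_readings = split_by_quality(readings)

print(clean_readings)
print(suspect_readings)
```

Recording the position of each suspect value, rather than silently dropping it, makes it easier to investigate whether a particular system or sensor is malfunctioning.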