To understand ‘Big Data’, we first need to know what ‘data’ is. The Oxford dictionary defines ‘data’ as:
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
So, ‘Big Data’ is also data, but of enormous size. ‘Big Data’ is a term used to describe collections of data that are huge in volume and growing exponentially with time – voluminous amounts of structured, semi-structured and unstructured data that have the potential to be mined for information.
Big data is often characterized by the 3Vs: the extreme volume of data, the wide variety of data types and the velocity at which the data must be processed. Although big data doesn’t equate to any specific volume of data, the term is often used to describe terabytes, petabytes and even exabytes of data captured over time.
Volume: Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
Variety: Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
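The contrast between these formats can be made concrete with tiny samples. The records and field names below are invented purely for illustration:

```python
import json

# Hypothetical samples of the three broad data formats (invented for illustration):
structured = {"order_id": 1001, "amount": 49.99}            # fixed-schema record
semi_structured = '{"user": "alice", "tags": ["a", "b"]}'   # JSON: flexible schema
unstructured = "Great product, but shipping was slow."      # free text review

# Semi-structured data still parses into navigable fields:
parsed = json.loads(semi_structured)
print(parsed["tags"])  # ['a', 'b']
```

Structured data fits fixed columns, semi-structured data carries its schema with it, and unstructured data needs further processing (e.g. natural language processing) before it can be analyzed.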
Why Is Big Data Important?
The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable:
- Businesses to utilize outside intelligence when making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
- Improved customer service
Traditional customer feedback systems are being replaced by new systems built with ‘Big Data’ technologies, in which natural language processing is used to read and evaluate consumer responses.
- Early identification of risk to the product/services, if any
- Better operational efficiency
‘Big Data’ technologies can be used to create a staging area, or landing zone, for new data before identifying what should be moved to the data warehouse. Such integration of ‘Big Data’ technologies with the data warehouse also helps an organization offload infrequently accessed data.
- Cost reductions
- Time reductions
- New product development and optimized offerings
- Smart decision making.
When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
- Determining root causes of failures, issues and defects in near-real time.
- Generating coupons at the point of sale based on the customer’s buying habits.
- Recalculating entire risk portfolios in minutes.
- Detecting fraudulent behaviour before it affects your organization.
Examples Of ‘Big Data’
Following are some of the examples of ‘Big Data’:
The New York Stock Exchange generates about one terabyte of new trade data per day.
Social Media Impact
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges and comments.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
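A back-of-the-envelope calculation makes that scale concrete. The flight count below is an assumed round figure for illustration, not a statistic from the text:

```python
# Rough estimate of daily engine-data volume.
TB_PER_ENGINE_HALF_HOUR = 10   # from the figure above: 10+ TB per 30 minutes
FLIGHTS_PER_DAY = 25_000       # assumed global flight count (hypothetical)

daily_tb = TB_PER_ENGINE_HALF_HOUR * FLIGHTS_PER_DAY
daily_pb = daily_tb / 1024     # 1 PB = 1024 TB

print(f"~{daily_pb:.0f} PB per day")  # → ~244 PB per day
```

Even with one engine per flight and only half an hour of data each, the total lands in the hundreds of petabytes per day.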
The human side of big data analytics
Ultimately, the value and effectiveness of big data depend on the human operators tasked with understanding the data and formulating the proper queries to direct big data projects. Some big data tools serve specialized niches and allow less technical users to make various predictions from everyday business data. Other tools, such as Hadoop appliances, are appearing to help businesses implement a suitable compute infrastructure for big data projects while minimizing the need for hardware and distributed-computing expertise.
But these tools only address limited use cases. Many other big data tasks, such as determining the effectiveness of a new drug, can require substantial scientific and computational expertise from the analytical staff. There is currently a shortage of data scientists and other analysts who have experience working with big data in a distributed, open source environment.
Big data can be contrasted with small data, another evolving term that’s often used to describe data whose volume and format make it easy to use for self-service analytics. A commonly quoted axiom is that “big data is for machines; small data is for people.”
The next step has two components: data integration, and “big little data”, i.e. the tremendous proliferation of small data sets across the web.
The success of big data has been in finding correlations and trends only discernible across a large dimension like time or population. But even the largest of the big data sets are a small fraction of all data, and any one data set only reveals a small facet of a broader story. Data integration enables combining disparate data sets along their common facets. A few examples include augmenting your data with sentiment data, government inflation / jobs / demographics or other published metrics, and data from predictive models to help plan for customer demand and infrastructure needs.
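At its simplest, data integration means joining two data sets along a shared facet, such as a time period. A minimal sketch, with figures invented for illustration:

```python
# Two small, otherwise unrelated data sets sharing one facet: the month.
sales = {"2024-01": 120, "2024-02": 135, "2024-03": 150}      # internal metric
inflation = {"2024-01": 3.1, "2024-02": 3.2, "2024-03": 3.5}  # published metric

# Combine along the common facet so trends can be compared side by side.
combined = {
    month: {"sales": sales[month], "inflation_pct": inflation[month]}
    for month in sales.keys() & inflation.keys()
}
print(combined["2024-02"])  # {'sales': 135, 'inflation_pct': 3.2}
```

Real integration is far harder because the common facet is rarely this clean, but the goal is the same: one combined view along shared dimensions.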
As for “big little data”, this is an interesting challenge in making use of the tremendous amount of structured data in HTML tables throughout the web, while handling the ambiguous meaning and context surrounding the table.
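Harvesting such tables can be sketched with the standard library alone; a real pipeline would still have to resolve the ambiguous meaning and context surrounding each table, which this sketch skips:

```python
from html.parser import HTMLParser

# Minimal extractor for "big little data": rows from an HTML table.
class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# A hypothetical page fragment:
page = "<table><tr><th>City</th><th>Pop</th></tr><tr><td>Oslo</td><td>700000</td></tr></table>"
parser = TableExtractor()
parser.feed(page)
print(parser.rows)  # [['City', 'Pop'], ['Oslo', '700000']]
```

Extracting the cells is the easy part; deciding what “Pop” means, in what units, and for which year is where the ambiguity lives.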
These two components come together in an interesting way. As with databases, data integration has been around for some time but has not evolved at the same pace as data management systems. Data integration is currently not suited to blending massive data sets with numerous small ones, because a major bottleneck is the human involvement required to identify the common facets between two otherwise unrelated data sets. This involves data cleaning, entity resolution and other challenging tasks for which the human brain is still a better pattern-matching system than our algorithms. Improving those algorithms to make integration scalable is an important open challenge.
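One of those tasks, entity resolution, amounts to deciding whether two records from unrelated data sets name the same real-world entity. A crude sketch using difflib’s string similarity as a stand-in for the much harder matching real integration requires:

```python
import difflib

# Naive entity resolution: treat two names as the same entity when their
# normalized string similarity clears a threshold. The threshold of 0.8
# is an arbitrary choice for this illustration.
def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    a, b = a.lower().strip(), b.lower().strip()
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

print(same_entity("Facebook Inc", "Facebook, Inc."))  # True
print(same_entity("IBM", "Apple Inc."))               # False
```

A heuristic like this fails on exactly the cases humans handle well (abbreviations, aliases, context), which is why human involvement remains the bottleneck.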
Finally, researchers will likely advance data integration before any systems mature for managing the proliferation of web data.