Big Data - Hadoop Interview Questions – Sandeep Kanao
What is Big Data? - Hadoop Interview Questions – Sandeep Kanao
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
What are grid, cloud and cluster? - Hadoop FAQ Questions – Sandeep Kanao
Cloud: a cloud is an aggregate of computing power. For your purposes, you can think of the entire "cloud" as a single server. It is conceptually much like an old-school mainframe to which you submit jobs and which returns the results, except that nowadays the concept is applied more widely (i.e. not just raw computing, but also entire services, storage and so on).
Grid: a grid is many computers that together solve a given problem or crunch data. The fundamental difference between a grid and a cluster is that in a grid each node is relatively independent of the others; problems are solved in a divide-and-conquer fashion.
Cluster: conceptually, a cluster combines many machines into one really big and powerful one. This is a much more difficult architecture to get right than a cloud or a grid, because you have to orchestrate all nodes to work together and keep things such as cache, memory and even clocks consistent. Clouds have much the same problem, but unlike a cluster a cloud is not conceptually one big machine, so the architecture does not have to treat it as such. In a cloud you might, for instance, not allocate the full capacity of your data center to a single request, whereas that is the point of a cluster: to be able to throw 100% of its capacity at a single problem.
What are the examples of Big Data? - Hadoop Interview Questions – Sandeep Kanao
Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
Stock Exchange Data: Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on shares of different companies.
Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
Transport Data: Transport data includes model, capacity, distance and availability of a vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data falls into three types:
Structured data: Relational data.
Semi Structured data: XML data.
Unstructured data: Word, PDF, Text, Media Logs.
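As a rough illustration of the three types above, the following sketch reads the same kind of information from each form. The sample records, field names and the regular expression are made up for this example and are not taken from any real dataset.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: every row follows the same fixed schema (hypothetical sample data).
structured = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: XML is tagged, but records may carry different fields.
semi_structured = (
    "<users>"
    "<user id='1'><name>Asha</name></user>"
    "<user id='2'><name>Ravi</name><city>Delhi</city></user>"
    "</users>"
)
root = ET.fromstring(semi_structured)
users = [
    {"id": user.get("id"), **{child.tag: child.text for child in user}}
    for user in root.findall("user")
]

# Unstructured: free text has no schema, so values must be extracted by pattern.
unstructured = "Asha from Pune called about order 1042; Ravi emailed from Delhi."
order_numbers = re.findall(r"order (\d+)", unstructured)

print(rows)
print(users)
print(order_numbers)
```

Note that the first XML record has no city field at all, which a fixed relational schema would not allow without a null value.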
What are Big Data Technologies? - Hadoop Interview Questions – Sandeep Kanao
There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology:
Operational Big Data
These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
Analytical Big Data
These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
MapReduce provides a method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled from a single server to thousands of high-end and low-end machines.
What are the major challenges associated with Big Data? - Hadoop Interview Questions – Sandeep Kanao
The major challenges associated with big data are as follows:
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
Many of these problems can be addressed with MapReduce, a programming model introduced by Google. MapReduce divides a task into small parts, assigns them to many computers, and collects their results, which are then integrated to form the result dataset.
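The divide, assign and collect steps can be sketched with a word count, the usual introductory MapReduce example. This is a minimal single-process illustration, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and in a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the part of the input given to one computer.
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    # Integrating the per-key results forms the final result dataset.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

Because each call to map_phase depends only on its own chunk, the map step can run in parallel without coordination; only the shuffle step needs to bring data from different machines together.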