Big Data Hadoop Interview Questions
1) What is Big Data?
Nowadays, data comes from many different sources such as Facebook, Twitter, Gmail, supermarkets, sensors, e-commerce sites, hospitals, and offices, and it arrives in both structured and unstructured formats. Big data matters to both business and society, but the real issue is not that you are acquiring large amounts of data; it is what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness the relevant data, and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision-making.
2) What is Hadoop?
Hadoop is an open-source project built on a Java framework. It is mainly used for storing and processing large volumes of structured and unstructured data in a distributed computing environment. Hadoop runs across clusters of commodity servers and is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
3) How does Hadoop solve the big data problem?
First, consider the challenges of big data.
Big data includes structured, semi-structured, and unstructured data. We cannot store and process such large data sets in a traditional RDBMS, which cannot cope with storing billions of rows of data, so we use Hadoop to store and process them (including the unstructured and semi-structured data).
Hadoop is built to run on a cluster of machines: the actual data is stored on different nodes in the cluster with a very high degree of fault tolerance and high availability.
4) What are the characteristics of big data?
Volume: Many factors contribute to growing data volumes from different sources: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue, but with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from the relevant data.
Velocity: Data is streaming in at unprecedented speed and must be dealt with in a timely manner. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
Variety: Data today arrives in many different formats: structured, numeric data in traditional databases; information created by line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data, and financial transactions. Managing, merging, and governing these different varieties of data is something many organizations still grapple with.
5) What is MapReduce?
MapReduce is the processing model in Hadoop, and it can process any type of data, structured or unstructured. It is based on the divide-and-conquer strategy. A MapReduce job is a unit of work consisting of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into two kinds of tasks: map tasks and reduce tasks.
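The map/group/reduce flow can be sketched in plain Python. This is a toy illustration of the model, not the Hadoop API: the map task emits (key, value) pairs, the framework groups them by key, and the reduce task aggregates each group.

```python
from collections import defaultdict

def map_task(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_task(word, counts):
    # Reduce phase: aggregate all values emitted for one key.
    return (word, sum(counts))

def run_job(lines):
    # The "shuffle": group all mapped values by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_task(line):
            grouped[key].append(value)
    return dict(reduce_task(k, v) for k, v in grouped.items())

print(run_job(["big data", "big clusters"]))  # {'big': 2, 'data': 1, 'clusters': 1}
```

In real Hadoop the map and reduce tasks run in parallel on different nodes, and the shuffle moves data between them over the network.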
6) What are the different types of filesystems?
A filesystem controls how data is stored and retrieved. There are many different filesystems, each with its own structure and logic and its own properties of speed, flexibility, security, size, and more. Disk filesystems are filesystems placed on hard drives and memory cards and are designed for that type of hardware; common examples include NTFS, ext3, HFS+, UFS, and XFS, and flash drives commonly use disk filesystems such as FAT32. Distributed filesystems such as HDFS, by contrast, store data across a network of machines.
7) What is HDFS?
HDFS is the distributed filesystem used in Hadoop. Hadoop is an open-source distributed computing framework provided by Apache, and companies such as Amazon and Facebook use it to build large systems. The cores of Hadoop are MapReduce and HDFS. HDFS is designed for storing very large files with streaming data access, running on clusters of commodity hardware.
8) What is the default block size in HDFS?
The default block size in HDFS is 64 MB. The block size is configurable, so it can be set to a different value.
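As a quick illustration of how a file maps onto fixed-size blocks (a sketch using the 64 MB default mentioned above; the file size is made up):

```python
BLOCK_SIZE_MB = 64  # HDFS default block size

def block_sizes(file_size_mb):
    # A file is stored as full-size blocks plus, if needed,
    # one final block holding only the remainder.
    full, rest = divmod(file_size_mb, BLOCK_SIZE_MB)
    return [BLOCK_SIZE_MB] * full + ([rest] if rest else [])

print(block_sizes(200))  # [64, 64, 64, 8]
```

Note that, unlike a disk filesystem, the final 8 MB block does not occupy a full 64 MB of storage; HDFS blocks only use the space the data needs.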
9) What is the difference between GFS and HDFS?
The Google File System (GFS) is a distributed filesystem developed by Google, specially designed to provide efficient, reliable access to data using large clusters of commodity servers. Files are divided into chunks of 64 megabytes and are usually appended to or read; they are only very rarely overwritten or shrunk. Compared with traditional filesystems, GFS is designed and optimized to run in data centers, providing extremely high data throughput and low latency while surviving individual server failures.
Inspired by GFS, the open-source Hadoop Distributed File System (HDFS) stores large files across multiple machines. It achieves reliability by replicating the data across multiple servers. As in GFS, data is stored on multiple nodes, and the filesystem is built from a cluster of datanodes, each of which serves blocks of data over the network using a block protocol specific to HDFS. To perform computations on the data in GFS and HDFS, a programming model is required: Google developed its own model, MapReduce, and Apache adopted the ideas of Google's MapReduce to develop the open-source Hadoop MapReduce.
10) What are the related projects of Hadoop (the Hadoop ecosystem)?
Commonly listed ecosystem projects include Hive, Pig, HBase, ZooKeeper, Sqoop, Flume, Oozie, and Avro.
11) What is the difference between structured data and unstructured data?
Structured data has labels (a schema), and using those labels we can process the data; examples include databases and spreadsheets.
Unstructured data has no labels; examples include free-form text, images, videos, and web logs.
12) What is the difference between an RDBMS and MapReduce?
An RDBMS typically handles data on the scale of gigabytes, whereas MapReduce can handle petabytes of data.
An RDBMS supports reading and writing data many times, whereas MapReduce follows a write-once, read-many-times pattern.
13) What is the NameNode?
An HDFS cluster has two types of nodes, operating as master and slaves; the namenode is the master. The namenode stores the filesystem namespace, which contains the filesystem tree and the metadata for all of the files and directories in the tree. This information is stored on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which the actual data is located. Without the namenode, the filesystem cannot be used.
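A toy model of the namenode's in-memory metadata may make this concrete: the namespace maps each file to its block IDs, and a separate block map records which datanodes hold each block. All paths, block IDs, and node names here are illustrative.

```python
# Filesystem tree: file path -> ordered list of block IDs.
namespace = {"/user/data.txt": ["blk_1", "blk_2"]}

# Block map: block ID -> datanodes holding a replica.
block_map = {"blk_1": ["dn1", "dn2", "dn3"],
             "blk_2": ["dn2", "dn3", "dn4"]}

def locate(path):
    """Return, per block, the datanodes a client could read from."""
    return [block_map[blk] for blk in namespace[path]]

print(locate("/user/data.txt"))  # [['dn1', 'dn2', 'dn3'], ['dn2', 'dn3', 'dn4']]
```

Because all of this bookkeeping lives in the namenode's memory, the number of files and blocks a cluster can hold is bounded by that one machine's RAM, which is what HDFS Federation (question 15) addresses.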
14) What is a DataNode?
Datanodes are the workers (slaves) of the filesystem. They store and retrieve blocks of data when told to by clients, who first contact the namenode, which holds the metadata for the files and directories.
15) What is HDFS Federation?
The namenode keeps a reference to every file and block in the filesystem in memory, which means that on a large cluster with many files, memory becomes the limiting factor. HDFS Federation, available in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace; for example, /user could be managed by namenode1 and /tmp by namenode2. The namenodes are independent of each other, so a failure of namenode1 does not affect namenode2.
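The partitioning idea can be sketched as routing a path to the namenode that owns its namespace slice (the mount table and namenode names below are illustrative, not real configuration):

```python
# Mount table: namespace prefix -> namenode responsible for it.
mounts = {"/user": "namenode1", "/tmp": "namenode2"}

def route(path):
    # Find the namenode whose namespace portion covers this path.
    for prefix, namenode in mounts.items():
        if path.startswith(prefix):
            return namenode
    raise KeyError("no namenode manages " + path)

print(route("/user/alice/file"))  # namenode1
print(route("/tmp/scratch"))      # namenode2
```

Since each namenode holds only its own slice of the metadata in memory, adding namenodes raises the total number of files the cluster can track.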
16) What is the JobTracker?
There are two types of nodes that control the job execution process: the jobtracker and the tasktrackers. The jobtracker coordinates all jobs run on the system by scheduling tasks to run on tasktrackers.
17) What is a TaskTracker?
Tasktrackers run the tasks given to them by the jobtracker and send progress reports back to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
18) What is the Secondary NameNode?
The secondary namenode usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the namenode to perform its merges. It keeps a merged copy of the namespace image, which can be used if the namenode fails.
19) In which modes can we run Hadoop?
Hadoop can run in three modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
20) What is the replication factor in Hadoop?
In an HDFS cluster, every block of data is replicated across the cluster. The default replication factor in HDFS is 3.
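A minimal sketch of replica placement with the default factor of 3. Real HDFS placement is rack-aware; the round-robin choice and node names here are simplifications for illustration only.

```python
REPLICATION = 3  # HDFS default replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def place(block_index):
    # Pick three distinct datanodes for this block, rotating through
    # the cluster so replicas spread across nodes.
    return [datanodes[(block_index + i) % len(datanodes)]
            for i in range(REPLICATION)]

print(place(0))  # ['dn1', 'dn2', 'dn3']
print(place(2))  # ['dn3', 'dn4', 'dn1']
```

With three replicas per block, the cluster can lose any single datanode (and usually two) without losing data.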
21) How can we start Hadoop?
start-all.sh starts all of the components in a cluster.
22) How can we stop Hadoop?
stop-all.sh stops all of the components in a cluster.
23) What are file permissions in HDFS?
HDFS has a permissions model for files and directories that is much like POSIX.
There are three types of permissions: read (r), write (w), and execute (x).
Each file and directory has an owner, a group, and a mode.
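The POSIX-style mode can be unpacked with Python's standard library, which shows how the familiar rwx string for owner, group, and others is derived from the mode bits:

```python
import stat

def describe(mode):
    # Check each permission bit in owner/group/other order and
    # build the familiar nine-character rwx string.
    bits = [(stat.S_IRUSR, "r"), (stat.S_IWUSR, "w"), (stat.S_IXUSR, "x"),
            (stat.S_IRGRP, "r"), (stat.S_IWGRP, "w"), (stat.S_IXGRP, "x"),
            (stat.S_IROTH, "r"), (stat.S_IWOTH, "w"), (stat.S_IXOTH, "x")]
    return "".join(ch if mode & bit else "-" for bit, ch in bits)

print(describe(0o644))  # rw-r--r--
print(describe(0o755))  # rwxr-xr-x
```

In HDFS the execute bit is ignored for files (there is no concept of executing a file), but it is required on a directory to access its children.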
24) How does a client read data from HDFS?
1) The client calls open() on the DistributedFileSystem (DFS).
2) DFS calls the namenode through RPC.
3) The namenode returns to DFS the addresses of the datanodes that have a copy of each block.
4) DFS returns an FSDataInputStream to the client for reading data from the datanodes (the DFSInputStream inside it, which has stored the datanode addresses, connects to the closest datanode).
5) When the end of a block is reached, the DFSInputStream closes the connection to that datanode.
6) It then selects the best datanode for the next block, reads it, and again closes the connection when the end of the block is reached.
This repeats until the client has finished reading data from the datanodes.
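The steps above can be sketched as a toy simulation: for each block, ask the (toy) namenode for replica locations, read from the first ("closest") replica, then move on to the next block. All block and datanode names are illustrative.

```python
# Toy namenode answer: block ID -> datanodes holding a replica.
block_locations = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn2"]}

# Toy datanode storage: (datanode, block ID) -> block contents.
stored = {("dn1", "blk_1"): b"hello ", ("dn3", "blk_1"): b"hello ",
          ("dn2", "blk_2"): b"world"}

def read_file(blocks):
    data = b""
    for blk in blocks:
        datanode = block_locations[blk][0]  # pick the "closest" replica
        data += stored[(datanode, blk)]     # stream the block, then move on
    return data

print(read_file(["blk_1", "blk_2"]))  # b'hello world'
```

The key design point is that the data itself never flows through the namenode; the namenode only hands out locations, and the client streams blocks directly from datanodes.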
25) How does a client write data to HDFS?
1) The client creates the file by calling create() on DFS, which makes an RPC call to the namenode to create a new file in the filesystem namespace.
2) DFS returns an FSDataOutputStream to the client for writing data.
3) As the client writes data, the FSDataOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
4) The DFSOutputStream also maintains an internal queue of packets (the ack queue) that are waiting for acknowledgments from the datanodes.
5) When the client has finished writing data, it calls close().
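A toy simulation of this write path may help: data is split into packets on a data queue, each packet is streamed to every datanode in the pipeline, and the packet sits on an ack queue until all replicas confirm. Packet size, pipeline, and node names are all illustrative.

```python
from collections import deque

PACKET_SIZE = 4  # toy packet size in bytes

def write(data, pipeline):
    # Data queue: the client's data split into packets.
    data_queue = deque(data[i:i + PACKET_SIZE]
                       for i in range(0, len(data), PACKET_SIZE))
    ack_queue = deque()
    stored = {dn: b"" for dn in pipeline}
    while data_queue:
        packet = data_queue.popleft()
        for dn in pipeline:       # DataStreamer sends to each replica in turn
            stored[dn] += packet
        ack_queue.append(packet)  # packet waits for acknowledgments...
        ack_queue.popleft()       # ...all replicas acked, so drop it
    return stored

result = write(b"hello hdfs", ["dn1", "dn2", "dn3"])
print(result["dn1"])  # b'hello hdfs'
```

In real HDFS the datanodes form a chain (the first forwards to the second, and so on) rather than the client writing to each replica directly; the queues above only illustrate the packet/ack bookkeeping.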
26) What is data integrity in HDFS?
HDFS transparently checksums all data written to it and, by default, verifies the checksums when reading data. A separate checksum is created for each chunk of data (the default chunk size is 512 bytes, and a CRC-32 checksum is only 4 bytes). Datanodes are responsible for verifying the data they receive before storing the data and its checksums.
It is possible to disable checksum verification by passing false to the setVerifyChecksum() method on FileSystem before using the open() method to read a file.
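The scheme can be demonstrated with Python's standard zlib.crc32: compute a 4-byte CRC-32 per 512-byte chunk on write, then recompute and compare on read. The data here is made up; this only illustrates the chunked-checksum idea, not Hadoop's actual on-disk format.

```python
import zlib

CHUNK = 512  # HDFS default bytes-per-checksum

def checksums(data):
    # One CRC-32 per 512-byte chunk.
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, sums):
    # Recompute on read and compare against the stored checksums.
    return checksums(data) == sums

data = b"x" * 1200                      # 3 chunks: 512 + 512 + 176 bytes
sums = checksums(data)
print(verify(data, sums))               # True
print(verify(data[:-1] + b"y", sums))   # False: corruption detected
```

At 4 bytes of checksum per 512 bytes of data, the storage overhead is under 1%, which is why HDFS can afford to checksum everything.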
27) What is the LocalFileSystem in Hadoop?
The Hadoop LocalFileSystem performs client-side checksumming. When a client writes a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory, containing the checksums for each chunk of the file. The chunk size is stored as metadata in the .crc file, so the file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when the file is read, and if an error is detected, the LocalFileSystem throws a ChecksumException.
28) Which compression formats are supported by Hadoop?
a) DEFLATE b) gzip c) bzip2 d) LZO e) LZ4 f) Snappy
29) How are codecs useful in Hadoop?
A codec is an implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.
30) Which compression format should we use?
It depends on considerations such as file size, format, and the tools used for processing.
A good option is a container file format such as SequenceFile or RCFile, which support both compression and splitting, combined with a fast compressor such as LZO, LZ4, or Snappy.
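As a small, self-contained illustration of one of the supported formats, here is a gzip compress/decompress round trip using Python's standard library (the sample data is made up; Hadoop itself would do this through a CompressionCodec):

```python
import gzip

original = b"big data " * 100          # repetitive data compresses well
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

print(len(original), len(compressed))  # compressed size is much smaller
print(restored == original)            # True: lossless round trip
```

Note that plain gzip is not splittable: a MapReduce job must read a whole .gz file in a single map task, which is exactly the trade-off the answer above is about.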
31) Are large files supported for compression in MapReduce?
For large files, you should not use a compression format that does not support splitting of the whole file, because you would lose data locality and make your MapReduce applications very inefficient.