Dear Readers, welcome to Hadoop Framework Interview Questions and Answers. These questions have been designed to acquaint you with the kind of questions you may encounter during a job interview on the Hadoop framework. They are important for campus placement tests and job interviews. In my experience, good interviewers rarely plan to ask any particular question during an interview, but model questions like these are asked in the online technical tests and interviews of many IT companies.
Hadoop is not a database; it is a framework with a file system called HDFS. Data stored in HDFS is not placed in any predefined containers.
A relational database, by contrast, stores data in predefined containers (tables with a fixed schema).
HDFS stands for Hadoop Distributed File System. It is a distributed file system that stores large amounts of data as files spread across the many machines of a Hadoop cluster.
MapReduce is the programming model, together with the programs written for it, used to access and process large data sets in parallel over a Hadoop cluster.
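The map, shuffle and reduce steps can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for the illustration, and a real job would run many mappers and reducers on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    # Keys are visited in sorted order, mirroring the sorted input a reducer sees.
    return dict(reduce_phase(key, grouped[key]) for key in sorted(grouped))


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

On a cluster the same three roles exist, but the framework handles the shuffle and distributes the work.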
An InputSplit is the slice of data processed by a single Mapper. It is generally the same size as a block stored on a DataNode.
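As a rough illustration of how split size drives the number of Mappers, the sketch below estimates the split count for one file. It assumes one split per block-sized slice, with the last split allowed to be smaller; the real FileInputFormat applies additional rules (minimum and maximum split sizes, for example), so treat estimate_splits as a hypothetical helper, not the framework's algorithm.

```python
import math

MB = 1024 * 1024


def estimate_splits(file_size_bytes, split_size_bytes=64 * MB):
    """Estimate the number of input splits (and so Mappers) for one file.

    Simplifying assumption: each split is one block-sized slice and the
    final slice may be shorter than the others.
    """
    if file_size_bytes <= 0 or split_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(file_size_bytes / split_size_bytes)


if __name__ == "__main__":
    print(estimate_splits(200 * MB))            # 64 MB splits
    print(estimate_splits(200 * MB, 128 * MB))  # 128 MB splits
```

Under these assumptions, a larger block size means fewer, longer-running Mappers for the same file.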
The replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3, which also means that you need 3 times the raw storage to hold the data. Each file is split into data blocks, and the blocks are spread across the cluster.
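The storage arithmetic behind the replication factor can be sketched as follows. The two helper names are hypothetical, and the figures ignore any overhead other than replication itself (no compression, no space reserved for temporary data).

```python
def raw_storage_needed(data_size_gb, replication_factor=3):
    """Raw disk space consumed when every block is stored replication_factor times."""
    return data_size_gb * replication_factor


def usable_capacity(raw_capacity_gb, replication_factor=3):
    """Amount of distinct data a cluster can hold given its raw disk capacity."""
    return raw_capacity_gb / replication_factor


if __name__ == "__main__":
    print(raw_storage_needed(10))   # raw GB needed for 10 GB of data
    print(usable_capacity(300))     # GB of data that fit on 300 GB of raw disk
```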
By default, Hadoop uses a replication factor of 3. You can also set the replication level individually for each file in HDFS. In addition to providing fault tolerance, replicas allow jobs that consume the same data to run in parallel. Also, when replicas of the data exist, Hadoop can attempt to run multiple copies of the same task and take whichever finishes first. This is useful if, for some reason, a machine is running slowly.
Most Hadoop administrators set the default replication factor for their files to three. The main assumption here is that if you keep three copies of the data, your data is safe. We have found this to be true in the large clusters that we manage and operate.
The default block size is 64 MB in Hadoop 1.x. A block size of 128 MB is typical, and it is the default from Hadoop 2.x onward.
The NameNode is one of the daemons that run on the master node. It holds the metadata that records where each chunk of data resides (that is, on which DataNode). Based on this metadata, an incoming job is mapped to the corresponding DataNodes.
In total, 5 daemons run in the Hadoop master-slave architecture.
On the master node: NameNode, JobTracker and Secondary NameNode
On each slave node: DataNode and TaskTracker
However, it is recommended to run the Secondary NameNode on a separate machine that has the same capacity as the master node.
I define Hadoop in 2 parts:
Distributed Processing : MapReduce
Distributed Storage : HDFS
The NameNode holds the metadata, and the DataNodes hold the actual data and the MapReduce programs that run on it.
FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat (usually written as a custom format) are input formats used in the Hadoop framework.
How can we control which reducer a particular key goes to?
By using a custom partitioner.
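The idea behind a custom partitioner can be sketched in plain Python. In Hadoop itself you would subclass Partitioner, override getPartition, and register the class with job.setPartitionerClass; the default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The code below is only an illustration: zlib.crc32 stands in for the Java hashCode, and the routing rule (keys starting with "error" go to reducer 0) is a made-up example.

```python
import zlib


def stable_hash(key):
    """Deterministic hash of a string key (a stand-in for Java's hashCode)."""
    return zlib.crc32(key.encode("utf-8"))


def default_partition(key, num_reducers):
    """Hash-style partitioning: spread keys evenly over all reducers."""
    return stable_hash(key) % num_reducers


def custom_partition(key, num_reducers):
    """Hypothetical rule: keys starting with 'error' always go to reducer 0.

    All other keys are hashed over the remaining reducers 1..num_reducers-1.
    """
    if num_reducers == 1 or key.startswith("error"):
        return 0
    return 1 + stable_hash(key) % (num_reducers - 1)


if __name__ == "__main__":
    for key in ["error:disk", "error:net", "info:start", "warn:slow"]:
        print(key, default_partition(key, 4), custom_partition(key, 4))
```

Whatever rule is chosen, it must send every occurrence of the same key to the same reducer and return a value between 0 and num_reducers - 1.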
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.
Through checksums. Every piece of data is stored together with a checksum. If the checksum computed on read does not match the stored one, a data corruption error is reported.
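A minimal sketch of checksum-based corruption detection is shown below. It assumes a CRC-32 checksum per 512-byte chunk and uses Python's zlib.crc32; the helper names are hypothetical, and the chunk size and checksum type are configurable in a real cluster, so this is an illustration of the technique, not of HDFS internals.

```python
import zlib

CHUNK_SIZE = 512  # assumed number of data bytes covered by each checksum


def compute_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC-32 value per chunk of the data, computed at write time."""
    return [
        zlib.crc32(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, stored_checksums, chunk_size=CHUNK_SIZE):
    """Recompute checksums at read time and list the chunks that no longer match.

    Assumes data has the same length as when stored_checksums was computed.
    """
    current = compute_checksums(data, chunk_size)
    return [
        index
        for index, (now, stored) in enumerate(zip(current, stored_checksums))
        if now != stored
    ]


if __name__ == "__main__":
    original = bytes(range(256)) * 8          # 2048 bytes, i.e. 4 chunks
    stored = compute_checksums(original)
    damaged = bytearray(original)
    damaged[600] ^= 0xFF                      # flip bits inside chunk 1
    print(find_corrupt_chunks(bytes(damaged), stored))
```

When a mismatch is found on one replica, the data can be read from another replica instead, which is one more reason replication matters.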
The number of reducers can be set to zero. In that case the mapper output is the final output and is stored in HDFS.