Big Data and Hadoop Interview Questions and Answers

What is BIG DATA?

Big Data represents a huge and complex data that is difficult to capture, store, process, retrieve and analyze with the help of on-hand traditional database management tools.

What are the three major characteristics of Big Data?

According to IBM, the three characteristics of Big Data are:

Volume: Facebook generating 500+ terabytes of data per day.

Velocity: Analyzing 2 million records each day to identify the reason for losses.

Variety: images, audio, video, sensor data, log files, etc.

What is Hadoop?

Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity hardware(computers) using a simple programming model.

What is the basic difference between traditional RDBMS and Hadoop?

Traditional RDBMS is used for transactional systems to store and process the data, whereas Hadoop is used to store and process large amount of data in the distributed file system.

What are the basic components of Hadoop?

HDFS and MapReduce are the basic components of hadoop.

HDFS is used to store large data sets and MapReduce is used to process such large data sets.

What is HDFS?

HDFS stands for Hadoop Distributed File System and it is designed for storing very large files with streaming data access patterns, running clusters on commodity hardware.

What is Map Reduce?

Map Reduce is a java based programming paradigm of Hadoop framework that provides scalability across various Hadoop clusters

How Map Reduce works in Hadoop?

MapReduce distributes the workload into two different jobs namely 1. Map job and 2. Reduce job that can run in parallel.

1.The Map job breaks down the data sets into key-value pairs or tuples.

2.The Reduce job then takes the output of the map job and combines the data tuples into smaller set of tuples.

What is a Name node?

Name node is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the data nodes. It is a high-availability machine and single point of failure in HDFS.

What is a Data node?

Data nodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.

What is a job tracker?

Job tracker is a daemon that runs on a name node for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. If the job tracker goes down all the running jobs are halted.

How job tracker works?

When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks.

What is a task tracker?

Task tracker is also a daemon that runs on data nodes. Task Trackers manage the execution of individual tasks on slave node.

How task tracker works?

Task tracker is majorly responsible to execute the work assigned by the job tracker and while performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat.

What is Heart beat?

Task tracker communicate with job tracker by sending heartbeat based on which Job tracker decides whether the assigned task is completed or not. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.

Is Namenode machine same as datanode machine as in terms of hardware?

It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another machine. For instance, in a single node cluster, there is only one machine, whereas in the development or in a testing environment, Namenode and datanodes are on different machines.

What is a commodity hardware?

Commodity hardware is a non-expensive systems which is not of high quality or high-availability. Hadoop can be installed in any average commodity hardware. We don’t need super computers or high-end hardware to work on Hadoop.

Is Namenode also a commodity?

No. Namenode can never be a commodity hardware because the entire HDFS rely on it. It is the single point of failure in HDFS. Namenode has to be a high-availability machine.

What is a metadata?

Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.

What is a daemon?

Daemon is a process or service that runs in background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is “services” and in Dos is ” TSR”.

Are Namenode and job tracker on the same host?

No, in practical environment, Namenode is on a separate host and job tracker is on a separate host.

What is a ‘block’ in HDFS?

A ‘block’ is the minimum amount of data of default block size 64MB that can be read or written from or to the HDFS.

If a data Node is full how it’s identified?

When data is stored in datanode, then the metadata of that data will be stored in

the Namenode. So Namenode will identify if the data node is full.

If datanodes increase, then do we need to upgrade Namenode?

While installing the Hadoop system, Namenode is determined based on the size of

the clusters. Most of the time, we do not need to upgrade the Namenode because

it does not store the actual data, but just the metadata, so such a requirement

rarely arise.

On what basis Namenode will decide which datanode to write on?

As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.

Is client the end user in HDFS?

No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker) or datanode (task tracker).

What is a rack?

Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.

What is Hadoop Single Point Of Failure (SPOF)

If the Namenode fails, the entire Hadoop system goes down. This is called Hadoop Single Point Of Failure.

What is a Secondary Namenode?

The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk or the file system.

Which are the three modes in which Hadoop can be run?

The three modes in which Hadoop can be run are:

1.standalone (local) mode

2.Pseudo-distributed mode

3.Fully distributed mode

What are the features of Stand alone (local) mode?

In stand-alone mode there are no daemons, everything runs on a single JVM. It has no DFS and utilizes the local file system. Stand-alone mode is suitable only for running MapReduce programs during development. It is one of the most least used environments.

What are the features of Pseudo mode?

Pseudo mode is used both for development and in the QA environment. In the

Pseudo mode all the daemons run on the same machine.

Can we call VMs as pseudos?

No, VMs are not pseudos because VM is something different and pesudo is very

specific to Hadoop.

What are the features of Fully Distributed mode?

Fully Distributed mode is used in the production environment, where we have ‘n’

number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster

of machines. There is one host onto which Namenode is running and another host

on which datanode is running and then there are machines on which task tracker

is running. We have separate masters and separate slaves in this distribution.

In which directory Hadoop is installed?

Cloudera and Apache has the same directory structure. Hadoop is installed in

cd/usr/lib/hadoop/

What are the port numbers of Namenode, job tracker and task tracker?

The port number for Namenode is ’50070′, for job tracker is ’50030′ and for task

tracker is ’50060′.

What is the Hadoop-core configuration?

Hadoop core is configured by two xml files:

1.hadoop-default.xml which was renamed to

2.hadoop-site.xml.

These files are written in xml format. We have certain properties in these xml files,

which consist of name and value.

What are the Hadoop configuration files at present?

There are 3 configuration files in Hadoop:

1.core-site.xml

2.hdfs-site.xml

3.mapred-site.xml

These files are located in thehadoop/conf/subdirectory.

How to exit the Vi editor?

To exit the Vi Editor, press ESC and type :q and then press enter.

Which are the three main hdfs-site.xml properties?

The three main hdfs-site.xml properties are:

1.dfs.name.dir which gives you the location on which metadata will be stored and

where DFS is located – on disk or onto the remote.

2.dfs.data.dir which gives you the location where the data is going to be stored.

3.fs.checkpoint.dir which is for secondary Namenode.

What is Cloudera and why it is used?

Cloudera is the distribution of Hadoop. It is a user created on VM by default.

Cloudera belongs to Apache and is used for data processing.

How can I restart Namenode?

1.Click on stop-all.sh and then click on start-all.sh OR

2.Write sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter)

and then /etc/init.d/hadoop-namenode start (press enter).

What does ‘jps’ command do?

This command checks whether your Namenode, datanode, task tracker, job

tracker, etc are working or not.

How can we check whether Namenode is working or not?

To check whether Namenode is working or not, use the command

/etc/init.d/hadoop-namenode status.

How can we look for the Namenode in the browser?

If you have to look for Namenode in the browser, you don’t have to give

localhost:8021, the port number to look for Namenode in the brower is 50070.

Which files are used by the startup and shutdown commands?

Slaves and Masters are used by the startup and the shutdown commands.

What do slaves consist of?

Slaves consist of a list of hosts, one per line, that host datanode and task tracker

servers.

What do masters consist of?

Masters contain a list of hosts, one per line, that are to host secondary namenode

servers.

What does hadoop-env.sh do?

hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set

over here.

Can we have multiple entries in the master files?

Yes, we can have multiple entries in the Master files.

Where is hadoop-env.sh file present?

hadoop-env.sh file is present in the conf location.

In Hadoop_PID_DIR, what does PID stands for?

PID stands for ‘Process ID’.

What does /var/hadoop/pids do?

It stores the PID.

What does hadoop-metrics.properties file do?

hadoop-metrics.properties is used for ‘Reporting‘ purposes. It controls the reporting

for Hadoop. The default status is ‘not to report‘.

What are the network requirements for Hadoop?

The Hadoop core uses Shell (SSH) to launch the server processes on the slave

nodes. It requires password-less SSH connection between the master and all the

slaves and the secondary machines.

On which port does SSH work?

SSH works on Port No. 22, though it can be configured. 22 is the default Port

number.

Can you tell us more about SSH?

SSH is nothing but a secure shell communication, it is a kind of a protocol that

works on a Port No. 22, and when you do an SSH, what you really require is a

password.

Why password is needed in SSH localhost?

Password is required in SSH for security and in a situation where passwordless

communication is not set.

Do we need to give a password, even if the key is added in SSH?

Yes, password is still required even if the key is added in SSH.

What if a Namenode has no data?

If a Namenode has no data it is not a Namenode. Practically, Namenode will have

some data.

What happens to job tracker when Namenode is down?

When Namenode is down, your cluster is OFF, this is because Namenode is the

single point of failure in HDFS.

What happens to a Namenode, when job tracker is down?

When a job tracker is down, it will not be functional but Namenode will be present.

So, cluster is accessible if Namenode is working, even if the job tracker is not

working.

Can you give us some more details about SSH communication between Masters and the Slaves?

SSH is a password-less secure communication where data packets are sent across

the slave. It has some format into which data is sent across. SSH is not only between

masters and slaves but also between two hosts.

What is formatting of the DFS?

Just like we do for Windows, DFS is formatted for proper structuring. It is not

usually done as it formats the Namenode too.

Does the HDFS client decide the input split or Namenode?

No, the Client does not decide. It is already specified in one of the configurations

through which input split is already configured.

In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu can we do it?

Yes, you can go ahead with this! There are installation steps for creating a new

cluster. You can uninstall your present cluster and install the new cluster.

Can we create a Hadoop cluster from scratch?

Yes we can do that also once we are familiar with the Hadoop environment.