Hadoop Interview Questions and Answers

1. Hadoop Interview Questions

This blog provides the top 100 Hadoop Interview Questions and Answers, covering scenario-based questions for both freshers and experienced candidates. Additionally, it includes guidance on how to explain a Hadoop project during the interview, which can carry significant weight. These real-time Hadoop interview questions will assist you in easily cracking a Hadoop Interview.

2. Basic Hadoop Interview Questions and Answers

This section covers basic Hadoop interview questions and answers for both freshers and experienced candidates. Subsequent sections will focus on Hadoop interview questions related to HDFS and MapReduce.

Q.1 What is Apache Hadoop?

Hadoop emerged as a solution to “Big Data” problems. It is part of the Apache project sponsored by the Apache Software Foundation (ASF). It is an open-source software framework for distributed storage and processing of large datasets. Open source means it is freely available, and we can even change its source code as per our requirements. Apache Hadoop makes it possible to run applications on a system with thousands of commodity hardware nodes. Its distributed file system has the provision of rapid data transfer rates among nodes. It also allows the system to continue operating in case of node failure. Apache Hadoop provides:

  • Storage layer – HDFS
  • Batch processing engine – MapReduce
  • Resource management layer – YARN

Q.2 Why do we need Hadoop?

The concept of Hadoop was developed to tackle the challenges posed by Big Data. Some of these challenges are:

  • Storage: Due to the sheer size of the data, storing it becomes a difficult task.
  • Security: Ensuring the security of such a large volume of data is a major challenge.
  • Analytics: In many cases, we may not be aware of the nature of the data we are dealing with. This makes analyzing the data even more difficult.
  • Data Quality: Big Data is often messy, inconsistent, and incomplete.
  • Discovery: Extracting patterns and insights from such large volumes of data requires powerful algorithms, which makes discovery difficult.

Hadoop is an open-source software framework that facilitates the storage and processing of large data sets. Apache Hadoop is the best solution for storing and processing Big Data because:

  • Apache Hadoop stores huge files as raw data without specifying any schema.
  • High scalability: We can add any number of nodes, which significantly enhances performance.
  • Reliability: It stores data reliably on the cluster despite machine failure.
  • High availability: In Hadoop, data is highly available despite hardware failure. If a machine or hardware crashes, we can access data from another path.
  • Economy: Hadoop runs on a cluster of commodity hardware, which is not very expensive.

Q.3 What are the core components of Hadoop?

Hadoop is an open-source software framework for distributed storage and processing of large datasets. The Apache Hadoop core components are HDFS, MapReduce, and YARN.

  • HDFS - Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. It can store very large files running on a cluster of commodity hardware. HDFS works on the principle of storing fewer large files rather than a huge number of small files. HDFS stores data reliably even in the case of hardware failure. It provides high throughput access to an application by accessing in parallel.
  • MapReduce - MapReduce is the data processing layer of Hadoop. It allows you to write applications that process large structured and unstructured data stored in HDFS. MapReduce processes a huge amount of data in parallel by dividing the job (submitted job) into a set of independent tasks (sub-job). In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce. The Map is the first phase of processing, where you specify all the complex logic code. Reduce is the second phase of processing where you specify lightweight processing like aggregation/summation.
  • YARN - YARN is the processing framework in Hadoop. It provides resource management and allows multiple data processing engines, such as real-time streaming, data science, and batch processing.

Q.4 What are the Features of Hadoop?

The various features of Hadoop are:

  • Open Source – Apache Hadoop is an open-source software framework. This means it is freely available, and we can change its source code as per our requirements.
  • Distributed processing – HDFS stores data in a distributed manner across the cluster, while MapReduce processes data in parallel on the node cluster.
  • Fault Tolerance – Apache Hadoop is highly fault-tolerant. By default, each block creates three replicas across the cluster, and we can customize this as per our needs. If a node goes down, we can recover data from other nodes. The framework recovers failures of nodes or tasks automatically.
  • Reliability – It stores data reliably on the cluster despite machine failure.
  • High Availability – Data is highly available and accessible despite hardware failure. In Hadoop, when a machine or hardware crashes, we can access data from another path.
  • Scalability – Hadoop is highly scalable, as we can add new hardware to the nodes.
  • Economic – Hadoop runs on a cluster of commodity hardware, which is not very expensive. We do not need any specialized machine for it.
  • Ease of use – There is no need for a client to deal with distributed computing, as the framework takes care of everything. Therefore, it is easy to use.

Q.5 Compare Hadoop and RDBMS?

Apache Hadoop can store and process volumes of data that traditional databases cannot handle. The main differences between Hadoop and RDBMS are as follows:

  • Architecture – Traditional RDBMS has ACID properties, whereas Hadoop is a distributed computing framework consisting of two main components: Distributed file system (HDFS) and MapReduce.
  • Data acceptance – RDBMS accepts only structured data, while Hadoop can accept both structured and unstructured data. This is a great feature of Hadoop, as it allows us to store everything in our database without any data loss.
  • Scalability – RDBMS is a traditional database that provides vertical scalability. If the data increases, we need to increase the system configuration. In contrast, Hadoop provides horizontal scalability, so we just need to add one or more nodes to the cluster to increase data storage.
  • OLTP (Real-time data processing) and OLAP – A traditional RDBMS supports OLTP (real-time data processing), whereas Apache Hadoop does not. Instead, Apache Hadoop supports large-scale batch processing workloads (OLAP).
  • Cost – RDBMS is licensed software, so we need to pay for it. In contrast, Hadoop is an open-source framework, so we don’t need to pay for it.

3. Hadoop Interview Questions for Freshers

The following Hadoop Interview Questions are meant for freshers and students, but experienced individuals can also refer to them to revise the basics.

Q.6 What are the modes in which Hadoop can run?

Apache Hadoop can run in three modes:

  • Local (Standalone) Mode – Hadoop runs by default in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. It is used for debugging purposes and does not support the use of HDFS. Furthermore, in this mode, there is no custom configuration required for configuration files.
  • Pseudo-Distributed Mode – Just like Standalone mode, Hadoop runs on a single node in Pseudo-distributed mode. The difference is that each Hadoop daemon runs in a separate Java process in this mode. Pseudo-distributed mode requires configuring all four configuration files (listed in the next question). In this case, all daemons run on one node, so the Master and Slave node are the same.
  • Fully-Distributed Mode – In this mode, all daemons execute in separate nodes forming a multi-node cluster. Thus, it allows separate nodes for Master and Slave.

Q.7 What are the features of Standalone (local) mode?

By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations, making it suitable for debugging purposes. However, it does not support the use of HDFS. Standalone mode is only suitable for running programs during development and testing. Furthermore, in this mode, there is no need for custom configuration in configuration files, which include:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml

Q.8 What are the features of Pseudo mode?

In Pseudo mode, Hadoop can run on a single node just like in Standalone mode. However, in Pseudo-distributed mode, each Hadoop daemon runs in a separate Java process. To use Pseudo mode, configuration for all four files mentioned above is required. In this case, all daemons are running on one node, meaning that the Master and Slave nodes are the same.

Pseudo mode is suitable for both development and testing environments. All daemons run on the same machine in Pseudo mode.

Q.9 What are the features of Fully-Distributed mode?

In Fully-Distributed mode, all daemons execute in separate nodes forming a multi-node cluster, allowing separate nodes for both Master and Slave.

This mode is used in production environments, where ‘n’ number of machines form a cluster. Hadoop daemons run on a cluster of machines, with one host running the NameNode and other hosts running DataNodes. NodeManager is installed on every DataNode and is responsible for executing tasks on each individual DataNode.

The ResourceManager manages all of these NodeManagers. It receives processing requests and passes the corresponding parts of the request to the appropriate NodeManager.

Q.10 What are configuration files in Hadoop?

core-site.xml – It contains configuration settings for the Hadoop core, such as I/O settings common to HDFS and MapReduce. The fs.defaultFS property specifies the NameNode hostname and port; port 9000 is the most commonly used port.

Here is an example configuration:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml – This file contains the configuration settings for HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml – In this file, we specify the framework name for MapReduce by setting the mapreduce.framework.name property.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml – This file provides configuration settings for NodeManager and ResourceManager.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

Q.11 What are the limitations of Hadoop?

Hadoop has various limitations, including:

  • Issue with small files – Hadoop is not suited for storing small files. Small files create major problems in HDFS because they are significantly smaller than the HDFS block size (default 128 MB). HDFS is designed to work with a small number of large files, so storing a very large number of small files overloads the NameNode, which keeps metadata for every file and block. Solutions for this issue include HAR files, Sequence files, and HBase.
  • Processing Speed – While Hadoop uses parallel and distributed algorithms to process large datasets, the MapReduce process takes a long time to execute as it performs the tasks of Map and Reduce. As data is distributed and processed over the cluster in MapReduce, it increases processing time and reduces speed.
  • Support only Batch Processing – Hadoop supports only batch processing and does not process streamed data, resulting in slower overall performance. The MapReduce framework does not leverage the memory of the cluster to the maximum.
  • Iterative Processing – Hadoop is not efficient for iterative processing, as it does not support cyclic data flow, which is the chain of stages in which the input to the next stage is the output from the previous stage.
  • Vulnerable by nature – Hadoop is written entirely in Java, a widely used language that has been a frequent target of exploits, which has contributed to a number of security incidents.
  • Security – Managing complex applications in Hadoop is challenging. Encryption is missing at the storage and network levels, which is a major point of concern, and the Kerberos authentication that Hadoop supports is hard to manage.

4. Hadoop Interview Questions for Experienced

The following Hadoop Interview Questions are for experienced, but freshers and students can also read and refer to them for advanced understanding.

Q.12 Compare Hadoop 2 and Hadoop 3?

  • In Hadoop 2, the minimum supported version of Java is Java 7, while in Hadoop 3, it is Java 8.
  • Hadoop 2 handles fault tolerance by replication, which results in wastage of space. While Hadoop 3 handles it by Erasure coding.
  • For data balancing, Hadoop 2 uses HDFS balancer, whereas Hadoop 3 uses Intra-data node balancer.
  • In Hadoop 2, some default ports fall within the Linux ephemeral port range, causing binding failure during startup. But in Hadoop 3, these ports have been moved out of the ephemeral range.
  • Hadoop 2’s HDFS has 200% overhead in storage space, while Hadoop 3 has 50% overhead in storage space.
  • Both Hadoop 2 and Hadoop 3 address the NameNode single point of failure (SPOF) with automatic failover, so no manual intervention is needed when the NameNode fails. Hadoop 3 additionally supports more than one standby NameNode.

13) What is Data Locality in Hadoop?

One of the major drawbacks of Hadoop was cross-switch network traffic due to the huge volume of data. To overcome this, Data locality was introduced, which refers to the ability to move the computation closer to where the actual data resides on the node, instead of moving large data to computation. This increases the overall throughput of the system.

HDFS in Hadoop stores datasets, which are divided into blocks and stored across the datanodes in the Hadoop cluster. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the datanodes on which data is available related to the MapReduce job.

Data locality has three categories:

  • Data local – In this category, data is on the same node as the mapper working on the data. The proximity of the data is closer to the computation, making this the most preferred scenario.
  • Intra-Rack – In this scenario, the mapper runs on a different node but on the same rack. It is not always possible to execute the mapper on the same datanode due to constraints.
  • Inter-Rack – In this scenario, the mapper runs on a different rack. It is not possible to execute the mapper on a different node in the same rack due to resource constraints.

14) What is Safemode in Hadoop?

Safemode in Apache Hadoop is a maintenance state of the NameNode during which modifications to the file system are not allowed. During Safemode, the HDFS cluster is in read-only mode and blocks are not replicated or deleted. Upon startup, the NameNode performs the following actions:

  • It loads the file system namespace from the last saved FsImage into its main memory along with the edits log file.
  • It merges the edits log file with the FsImage, resulting in a new file system namespace.
  • It then receives block reports containing information about block locations from all datanodes.

During SafeMode, NameNode collects block reports from datanodes. NameNode enters SafeMode automatically during startup and leaves it after DataNodes report that most of the blocks are available. To check the status of SafeMode, use the command:

  • hadoop dfsadmin -safemode get

To enter SafeMode, use:

  • hadoop dfsadmin -safemode enter

To exit SafeMode, use:

  • hadoop dfsadmin -safemode leave

The NameNode front page displays whether SafeMode is on or off.

15) What is the issue with small files in Hadoop?

Hadoop is not well-suited for small data. Hadoop HDFS lacks the ability to support the random reading of small files. A small file in HDFS is smaller than the HDFS block size, which is typically 128 MB by default. If we store a large number of small files, HDFS cannot handle them efficiently. HDFS works best with a small number of large files for storing large datasets, rather than a large number of small files. This is because a large number of small files can overload the NameNode, which stores the namespace of HDFS.

Solution:

Hadoop Archive (HAR) files deal with the small file issue. HAR introduces a layer on top of HDFS that provides an interface for file access. Using the hadoop archive command, we can create HAR files, which run a MapReduce job to pack the archived files into a smaller number of HDFS files. However, reading files in a HAR is no more efficient than reading files in HDFS; in fact it is slower, because each HAR file access requires reading two index files as well as the data file.

Sequence Files also deal with the small file problem. In this case, we use the filename as the key and the file contents as the value. If we have 10,000 files of 100 KB, we can write a program to put them into a single sequence file. We can then process them in a streaming fashion.
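
As a rough illustration of the Sequence File approach, the sketch below packs a directory of small local files into a single sequence file, using the file name as the key and the raw bytes as the value. Both paths are hypothetical, and the code assumes the standard org.apache.hadoop.io.SequenceFile API:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path("/user/dataflair/smallfiles.seq");   // hypothetical HDFS path

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {

      // Key = file name, Value = file contents.
      for (File f : new File("/home/dataflair/smallfiles").listFiles()) {   // hypothetical local dir
        byte[] contents = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(contents));
      }
    }
  }
}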

16) What is a Distributed Cache in Apache Hadoop?

In Hadoop, data chunks are processed independently and in parallel across DataNodes, using a program written by the user. If we need to access some files from all the DataNodes, we can put those files in the distributed cache.

The MapReduce framework provides Distributed Cache to cache files that the applications need. It can cache read-only text files, archives, jar files, etc.

Once we have cached a file for our job, Hadoop makes it available on each DataNode where map/reduce tasks are running. We can then access the files from all the DataNodes in our map and reduce job.

An application that needs to use distributed cache should ensure that the files are available on URLs, which can be either hdfs:// or http://. If the file is present on the mentioned URLs, the user can specify it as a cache file to the distributed cache. This framework copies the cache file on all the nodes before starting tasks on those nodes. By default, the size of the distributed cache is 10 GB, but we can adjust it using local.cache.size.
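
As a minimal sketch, the newer MapReduce API exposes the distributed cache through Job.addCacheFile() and Context.getCacheFiles(); the HDFS URI and the lookup file name below are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheExample {

  public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // The cached file is localized on every node running tasks; the "#lookup"
      // fragment in the URI creates a symlink named "lookup" in the task's working directory.
      if (context.getCacheFiles() != null && context.getCacheFiles().length > 0) {
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            // ... load lookup data into an in-memory structure for use in map() ...
          }
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... use the in-memory lookup data while processing each record ...
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "distributed cache example");
    job.setJarByClass(DistributedCacheExample.class);
    job.setMapperClass(CacheMapper.class);
    // Register the HDFS file with the distributed cache (hypothetical path).
    job.addCacheFile(new URI("hdfs://localhost:9000/user/dataflair/lookup.txt#lookup"));
    // ... set input/output paths and formats, then submit ...
  }
}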

17) How is security achieved in Hadoop?

Apache Hadoop achieves security by using Kerberos.

At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server:

  • Authentication: The client authenticates itself to the authentication server and receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization: The client uses the TGT to request a service ticket from the Ticket Granting Server.
  • Service Request: The client uses the service ticket to authenticate itself to the server.

18) Why does one frequently add or remove nodes in a Hadoop cluster?

One of the most important features of Hadoop is its utilization of commodity hardware, which can lead to frequent Datanode crashes in a Hadoop cluster. However, Hadoop also has the advantage of easy scalability to accommodate rapid growth in data volume.

Therefore, administrators frequently add or remove DataNodes in a Hadoop cluster to address these issues.

19) What is throughput in Hadoop?

Throughput refers to the amount of work done in a unit of time. HDFS provides good throughput for the following reasons:

  • HDFS follows a Write Once Read Many (WORM) model. This simplifies data coherency issues as the data is written only once and cannot be modified. Thus, it provides high-throughput data access.
  • Hadoop utilizes the Data Locality principle, which moves computation to data instead of data to computation. This reduces network congestion and enhances the overall system throughput. You can read more about Data Locality in Hadoop.

20) How to restart NameNode or all the daemons in Hadoop?

You can restart the NameNode using the following methods:

  • Stop the NameNode individually using the command /sbin/hadoop-daemon.sh stop namenode. Then, start the NameNode using the command /sbin/hadoop-daemon.sh start namenode.
  • Use the command /sbin/stop-all.sh to stop all daemons first, followed by /sbin/start-all.sh to start all daemons. These script files are located in the sbin directory inside the Hadoop directory.

21) What does the jps command do in Hadoop?

The jps command helps to check if the Hadoop daemons are running or not. It shows all the Hadoop daemons that are currently running on the machine. The daemons include Namenode, Datanode, ResourceManager, NodeManager, etc.

22) What are the main hdfs-site.xml properties?

The hdfs-site.xml file contains configuration settings for HDFS daemons. It also specifies the default block replication and permission checking on HDFS.

The three main hdfs-site.xml properties are:

  1. dfs.name.dir specifies the location where the NameNode stores its metadata (the FsImage and edit logs), which can be on a local disk or a remote directory.
  2. dfs.data.dir specifies the location of DataNodes where it stores the data.
  3. fs.checkpoint.dir is the directory on the file system where the Secondary NameNode stores temporary copies of the edit logs and FsImage, which are merged there to create a checkpoint for backup.

23) What is fsck?

fsck stands for File System Check. Hadoop’s HDFS uses the fsck (filesystem check) command to check for various inconsistencies and report problems with the files in HDFS, such as missing blocks or under-replicated blocks. It differs from the traditional fsck utility for the native file system and does not correct the errors it detects.

Normally, the NameNode automatically corrects most recoverable failures, and filesystem check also ignores open files. However, it provides an option to select all files during reporting. The HDFS fsck command is not a Hadoop shell command and can also run as bin/hdfs fsck. Filesystem check can run on the whole file system or a subset of files.

Usage:

hdfs fsck <path>
  [-list-corruptfileblocks]
  [-move | -delete | -openforwrite]
  [-files [-blocks [-locations | -racks]]]
  [-includeSnapshots]

  • path – Start checking from this path.
  • -delete – Delete corrupted files.
  • -files – Print out the files being checked.
  • -files -blocks – Print out the block report.
  • -files -blocks -locations – Print out locations for every block.
  • -files -blocks -racks – Print out the network topology for DataNode locations.
  • -includeSnapshots – Include snapshot data if the given path indicates or includes a snapshottable directory.
  • -list-corruptfileblocks – Print the list of missing blocks and the files they belong to.

24) How to Debug Hadoop Code?

To debug Hadoop code, follow these steps:

  1. Check the list of MapReduce jobs currently running.
  2. Check whether any orphaned jobs are running. If yes, determine the location of the ResourceManager (RM) logs.
  3. Run ps -ef | grep -i ResourceManager and look for the log directory in the output. Find the job ID in the displayed list, and then check whether there is an error message associated with that job.
  4. On the basis of the RM logs, identify the worker node that is involved in the execution of the task.
  5. Log in to that node and run ps -ef | grep -i NodeManager.
  6. Examine the NodeManager log.
  7. Most errors come from user-level logs for each MapReduce job.

25) What is Hadoop Streaming?

Hadoop distribution provides a generic application programming interface (API), which enables writing Map and Reduce jobs in any desired programming language. Hadoop Streaming is a utility that allows creating and running jobs with any executable as Mapper and Reducer.

For example:

hadoop jar hadoop-streaming-3.0.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc

In the example, both the Mapper and Reducer are executables that read input from stdin (line by line) and emit output to stdout. This utility allows for creating and submitting Map/Reduce jobs to an appropriate cluster, as well as monitoring job progress until completion.

Hadoop Streaming uses both streaming command options and generic command options. It is important to place the generic options before the streaming options, otherwise the command will fail.

The general syntax for the command line is:

hadoop command [genericOptions] [streamingOptions]

26) What is the hadoop-metrics.properties file used for?

Hadoop daemons expose statistical information known as Metrics, which the Hadoop framework uses for monitoring, performance tuning, and debugging purposes.

By default, there are many metrics available, making them very useful for troubleshooting.

The hadoop-metrics.properties file is used for performance reporting in the Hadoop framework. It controls the reporting for Hadoop, and the API provides an abstraction that allows for implementation on top of a variety of metrics client libraries. The choice of client library is a configuration option, and different modules within the same application can use different metrics implementation libraries.

This file is located in the /etc/hadoop directory.

27) How does Hadoop’s CLASSPATH play a vital role in starting or stopping Hadoop daemons?

The CLASSPATH includes all directories containing jar files required to start/stop Hadoop daemons. For example, HADOOP_HOME/share/hadoop/common/lib contains all the utility jar files. It is not possible to start/stop Hadoop daemons if the CLASSPATH is not set.

We can set the CLASSPATH inside the /etc/hadoop/hadoop-env.sh file. The next time Hadoop is run, the CLASSPATH will automatically be added. Therefore, there is no need to add CLASSPATH in the parameters each time it is run.

28) What are the different commands used to startup and shutdown Hadoop daemons?

  • To start all Hadoop daemons, use: ./sbin/start-all.sh. To stop all Hadoop daemons, use: ./sbin/stop-all.sh.
  • You can also start all HDFS daemons together using ./sbin/start-dfs.sh, all YARN daemons together using ./sbin/start-yarn.sh, and the MR Job History Server using ./sbin/mr-jobhistory-daemon.sh start historyserver. To stop these daemons, use:

./sbin/stop-dfs.sh

./sbin/stop-yarn.sh

./sbin/mr-jobhistory-daemon.sh stop historyserver

  • Finally, you can start all daemons individually and stop them individually:

./sbin/hadoop-daemon.sh start namenode

./sbin/hadoop-daemon.sh start datanode

./sbin/yarn-daemon.sh start resourcemanager

./sbin/yarn-daemon.sh start nodemanager

./sbin/mr-jobhistory-daemon.sh start historyserver

29) What is the role of /etc/hosts in setting up a Hadoop cluster?

The /etc/hosts file contains the hostname and IP address of each host, and maps the IP address to the hostname. In a Hadoop cluster, we store all the hostnames (master and slaves) with their IP addresses in /etc/hosts. This makes it easy to use hostnames instead of IP addresses.

30) How is file splitting invoked in the Hadoop framework?

Input files for Hadoop MapReduce tasks usually reside in HDFS. The InputFormat determines how these input files are split and read. It also creates the InputSplit, which is the logical representation of the data. InputFormat divides the split into records, and the mapper processes each record as a key-value pair.

To invoke file splitting, the Hadoop framework calls the getSplits() method of the InputFormat class configured by the user (for example, FileInputFormat and its subclasses), which computes the InputSplits for the job.
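
For context, a minimal driver sketch is shown below; the class name and input path are illustrative. The driver only selects the InputFormat, and the framework invokes its getSplits() method when the job is submitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitExampleDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split example");
    job.setJarByClass(SplitExampleDriver.class);
    // The chosen InputFormat decides how input files are divided into InputSplits
    // and how each split is read as key-value records by the mapper.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/user/dataflair/input"));  // hypothetical path
    // At submission time the framework calls the InputFormat's getSplits() method
    // to compute the logical splits; one map task is scheduled per InputSplit.
    // ... set Mapper/Reducer, output path, then job.waitForCompletion(true) ...
  }
}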

31) Is it possible to provide multiple inputs to Hadoop? If yes, then how?

Yes, it is possible using the MultipleInputs class.

For example:

Suppose we have weather data from the UK Met Office and want to combine it with the NCDC data for our maximum temperature analysis. We can set up the input as follows:

MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, MaxTemperatureMapper.class);

MultipleInputs.addInputPath(job, metofficeInputPath, TextInputFormat.class, MetofficeMaxTemperatureMapper.class);

The above code replaces the usual calls to FileInputFormat.addInputPath() and job.setMapperClass(). Since both the Met Office and NCDC data are text-based, we use TextInputFormat for each. We will also use two different mappers, as the two data sources have different line formats. The MaxTemperatureMapper reads NCDC input data and extracts the year and temperature fields. The MetofficeMaxTemperatureMapper reads Met Office input data and extracts the year and temperature fields.

32) Is it possible to have Hadoop job output in multiple directories? If yes, how?

Yes, it is possible using the following approaches:

a. Using the MultipleOutputs class

This class simplifies writing output data to multiple outputs.

MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);

The API provides two overloaded write methods to achieve this.

multipleOutputs.write("OutputFileName", new Text(key), new Text(value));

Then we use the overloaded write method that takes an extra parameter for the base output path, which allows the output files to be written to separate output directories.

multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);

We then vary the baseOutputPath in each of our implementations.

b. Rename/Move the file in driver class-

This is the easiest workaround for writing output to multiple directories: write all the output files to a single output directory, giving each category a distinct file name (for example, with MultipleOutputs), and then rename or move the files into separate directories from the driver class after the job completes.
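
Putting the pieces above together, a reducer using MultipleOutputs might look like the following sketch; the named output, key/value types, and the way the base output path is derived from the key are all illustrative choices:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // The baseOutputPath argument sends the record to a sub-directory
      // named after the key, e.g. <output>/UK/part-r-00000.
      mos.write("categoryOut", key, value, key.toString() + "/part");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();  // flush and close all named outputs
  }
}

The named output still has to be registered in the driver with MultipleOutputs.addNamedOutput(), as shown earlier.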

5. HDFS Hadoop Interview Questions and Answers

In this section, we have covered the top HDFS Hadoop Interview Questions and Answers. The following Hadoop Interview questions cover what HDFS is, its components, indexing, read-write operations, blocks, replication, and more. So, let’s get started with the Hadoop Interview Questions on HDFS.

1) What is HDFS- Hadoop Distributed File System?

Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files running on a cluster of commodity hardware. It works on the principle of storing fewer large files rather than a huge number of small files. HDFS stores data reliably, even in the case of hardware failure. It provides high throughput access to the application by accessing in parallel.

Components of HDFS:

  • NameNode – Also known as Master node, it stores metadata such as the number of blocks, their replicas, and other details.
  • DataNode – Also known as Slave, in Hadoop HDFS, DataNode is responsible for storing actual data. DataNode performs read and write operations as per requests from clients in HDFS.

2) Explanation of NameNode and DataNode in HDFS?

I. NameNode – Also known as the Master node. The Namenode stores meta-data such as the number of blocks, their location, replicas, and other details. This meta-data is stored in memory in the master for faster data retrieval. The NameNode manages and assigns tasks to the slave nodes. It should be deployed on reliable hardware as it is the centerpiece of HDFS.

Tasks of NameNode

  • Manages the file system namespace.
  • Regulates client access to files.
  • In HDFS, the NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.

II. DataNode – also known as the Slave. In Hadoop HDFS, the DataNode is responsible for storing actual data in HDFS. The DataNode performs read and write operations upon client requests. DataNodes can be deployed on commodity hardware.

Tasks of DataNode

  • In HDFS, the DataNode performs various operations such as block replica creation, deletion, and replication according to the instructions of the NameNode.
  • The DataNode manages the data storage of the system.

3) Why is block size set to 128 MB in Hadoop HDFS?

A block is a continuous location on a hard drive that stores data. In general, a file system stores data as a collection of blocks. HDFS stores each file as blocks and distributes them across the Hadoop cluster. In HDFS, the default size of a data block is 128 MB, which can be configured as per requirements. The block size is set to 128 MB:

  • To reduce disk seeks (I/O). The larger the block size, the fewer blocks a file has, so fewer disk seeks and transfers are needed, and those transfers can still be done in parallel within reasonable limits.
  • HDFS has huge data sets, i.e. terabytes and petabytes of data. If we take a 4 KB block size for HDFS, just like the Linux file system, which has a 4 KB block size, then we would have too many blocks and too much metadata. Managing this huge number of blocks and metadata will create huge overhead and traffic which is something we don’t want. So, the block size is set to 128 MB.

On the other hand, the block size can’t be so large that the system waits a long time for the last unit of data processing to finish its work.

4) How is data or a file written into HDFS?

When a client wants to write a file to HDFS, it communicates with the Namenode for metadata. The Namenode responds with details of the number of blocks and the replication factor. Then, based on the information from the Namenode, the client splits the file into multiple blocks and starts sending them to the first Datanode. The client sends block A to Datanode 1 along with the details of the other two Datanodes.

When Datanode 1 receives block A sent from the client, it copies the same block to Datanode 2 of the same rack. As both the Datanodes are on the same rack, the block transfer happens via a rack switch. Now, Datanode 2 copies the same block to Datanode 3. As both the Datanodes are on different racks, the block transfer happens via an out-of-rack switch.

After the Datanode receives the blocks from the client, it sends a write confirmation to the Namenode, and then to the client. The same process will repeat for each block of the file. Data transfer happens in parallel for faster writing of blocks.
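
The same write path can be exercised from a client program through the FileSystem API. The sketch below copies a local file into HDFS; the NameNode address and both paths are hypothetical:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; should match fs.defaultFS in core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

    try (InputStream in = new BufferedInputStream(new FileInputStream("/home/dataflair/test.txt"));
         FSDataOutputStream out = fs.create(new Path("/user/dataflair/test.txt"))) {
      // The client streams data to the first DataNode of the write pipeline;
      // replication to the remaining DataNodes is handled by HDFS itself.
      IOUtils.copyBytes(in, out, 4096, false);
    }
  }
}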

6. HDFS Hadoop Interview Questions for Freshers

Questions 5 through 8 are basic questions for HDFS Hadoop interviewees who are new to the field. However, experienced individuals can also review these questions for a better understanding of the basics.

5) Can multiple clients write to an HDFS file concurrently?

No, multiple clients cannot write to an HDFS file at the same time. Apache Hadoop HDFS follows a single writer, multiple reader model. When a client opens a file for writing, the NameNode grants a lease. If another client wants to write to that file, it asks the NameNode for the write operation. The NameNode checks whether it has granted the lease for writing to someone else or not. If someone else has already acquired the lease, the write request of the other client will be rejected.

6) How is data or a file read in HDFS?

To read from HDFS, a client first communicates with the NameNode for metadata. The NameNode responds with the name of files and their location. It also provides details of the number of blocks and replication factor. The client then communicates with the DataNode where the blocks are present. The client starts reading data in parallel from the DataNode based on the information received from the NameNode.

Once the client or application receives all the blocks of the file, it combines these blocks to form a file. To improve read performance, the location of each block is ordered by its distance from the client. HDFS selects the replica that is closest to the client to reduce read latency and bandwidth consumption. It first reads the block in the same node, then another node in the same rack, and finally another DataNode in another rack.
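
For completeness, here is a small client-side read sketch using the FileSystem API (the NameNode address and file path are hypothetical):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
    try (FSDataInputStream in = fs.open(new Path("/user/dataflair/test.txt"))) {
      // HDFS serves each block from the closest available replica.
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}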

7) Why does HDFS store data using commodity hardware despite the higher chance of failures?

HDFS stores data using commodity hardware because it is highly fault-tolerant. HDFS provides fault tolerance by replicating data blocks and distributing them among different DataNodes across the cluster. The default replication factor is 3, but it is configurable. Replicating data solves the problem of data loss in unfavorable conditions, such as node crashes or hardware failures. When any machine in the cluster goes down, clients can still access their data from another machine that contains the same copy of data blocks.

8) How Does HDFS Perform Indexing?

Hadoop indexes data in its own way. The Hadoop framework stores data in blocks of a configured size, and HDFS keeps track of where the next portion of the data resides by recording that location at the end of each block. This block-level metadata is the basis of indexing in HDFS.

7. HDFS Hadoop Interview Questions for Experienced

The following Hadoop Interview Questions from 9-23 are intended for experienced individuals, but novices and students can also read them for advanced understanding.

9) What is a Heartbeat in HDFS?

A Heartbeat is a signal that the NameNode receives from the DataNodes to indicate that they are functioning (alive). The NameNode and DataNodes communicate using heartbeats. If the NameNode does not receive a heartbeat from a DataNode within a certain time, that node is considered dead, and the NameNode then schedules the creation of new replicas of that node's blocks on other DataNodes.

Heartbeats from a Datanode also carry information about total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.

The default heartbeat interval is 3 seconds, but it can be changed using dfs.heartbeat.interval in hdfs-site.xml.

10) How to copy a file into HDFS with a different block size than the existing block size configuration?

To copy a file into HDFS with a different block size, use the following command:

  • -Ddfs.blocksize=block_size, where block_size is specified in bytes.

Here’s an example:

Suppose you want to copy a file called “test.txt” of size 128 MB into HDFS. For this file, you want the block size to be 32 MB (33554432 bytes) instead of the default 128 MB. To do this, issue the following command:

hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs

You can check the HDFS block size associated with this file using the following command (the %o format option prints the block size):

hadoop fs -stat %o /sample_hdfs/test.txt

Alternatively, you can also use the NameNode web UI to view the HDFS directory.

11) Why does HDFS perform replication, even though it results in data redundancy?

HDFS performs replication to provide fault tolerance. Data replication is one of the most important and unique features of HDFS. Replicating data solves the problem of data loss in unfavorable conditions, such as node crashes or hardware failures. By default, HDFS creates three replicas of each block across the Hadoop cluster, but the user can change this replication factor as needed. If a node goes down, we can recover the data on that node from the other nodes.

Replication in HDFS consumes a lot of space, but users can always add more nodes to the cluster if necessary. It is very rare to have free space issues in a practical cluster, as the primary purpose of deploying HDFS is to store huge datasets. Users can also change the replication factor to save space in HDFS, or use different codecs provided by Hadoop to compress the data.

12) What is the default replication factor and how can you change it?

The default replication factor is 3. There are three ways to change it:

  • Add the following property to hdfs-site.xml: <property> <name>dfs.replication</name> <value>5</value> <description>Block Replication</description> </property>
  • Change the replication factor on a per-file basis using the command: hadoop fs -setrep -w 3 /file_location
  • Change the replication factor for all files in a directory using: hadoop fs -setrep -w 3 -R /directory_location
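
Besides the configuration file and the shell commands above, the replication factor of an existing file can also be changed programmatically through the FileSystem API; a small sketch with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Roughly equivalent to: hadoop fs -setrep 5 /user/dataflair/test.txt
    // (without the -w flag, so it does not wait for re-replication to finish).
    boolean changed = fs.setReplication(new Path("/user/dataflair/test.txt"), (short) 5);
    System.out.println("Replication change requested: " + changed);
  }
}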

13) What are Hadoop Archives?

Apache Hadoop HDFS is designed to store and process large data sets, typically in the terabyte range. However, storing a large number of small files in HDFS is inefficient because each file is stored in a block, and block metadata is held in memory by the namenode.

Reading through small files normally results in lots of seeks and results in inefficient data access patterns, often requiring hopping from datanode to datanode to retrieve each small file.

Hadoop Archive (HAR) is designed to address the issue of small files. HAR packs a number of small files into a large file, so you can access the original files in parallel, transparently (without expanding the files) and efficiently.

Hadoop Archives are special format archives that map to a file system directory. They always have a “.har” extension. Moreover, Hadoop MapReduce uses Hadoop Archives as input.

14) What is meant by the High Availability of a NameNode in Hadoop HDFS?

In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients, including MapReduce jobs, would be unable to read or write files or list files. In such an event, the entire Hadoop system would be out of service until a new NameNode is brought online.

Hadoop 2.0 overcomes this single point of failure by providing support for multiple NameNodes. The high availability feature adds an extra NameNode (active standby NameNode) to the Hadoop architecture, which is configured for automatic failover. If the active NameNode fails, then the standby NameNode takes over all the responsibilities of the active node and the cluster continues to work.

The initial implementation of HDFS NameNode high availability provided only for a single active NameNode and a single standby NameNode. However, some deployments require a high degree of fault tolerance. This is enabled by the new version 3.0, which allows the user to run multiple standby NameNodes. For instance, by configuring three NameNodes and five journal nodes, the cluster is able to tolerate the failure of two nodes rather than just one.

15) What is Fault Tolerance in HDFS?

Fault tolerance in HDFS refers to the ability of a system to operate effectively even in unfavorable conditions, such as node crashes or hardware failures. HDFS ensures fault tolerance through the process of replica creation. When a client stores a file in HDFS, the file is divided into blocks and distributed across different machines in the HDFS cluster. Each block is replicated on other machines in the cluster. By default, HDFS creates three copies of each block on other machines in the cluster. If any machine in the cluster goes down or fails due to unfavorable conditions, the data can still be accessed from other machines in the cluster where the block replica is present.

16) What is Rack Awareness?

Rack Awareness is a concept that reduces network traffic while reading or writing files in Hadoop, by choosing a DataNode on the same rack or a nearby rack. The NameNode maintains the rack ID of each DataNode and uses this rack information when choosing DataNodes. The NameNode also ensures that all replicas of a block are not stored on a single rack. This approach, known as the Rack Awareness algorithm, reduces latency and improves fault tolerance.

The default replication factor in Hadoop is 3. Following the Rack Awareness algorithm, the first replica of a block is stored on the local node (or a node on the local rack), the second replica is stored on a DataNode in a different rack, and the third replica is stored on a different node within that same remote rack.

Rack Awareness is essential in Hadoop because it improves:

  • Data high availability and reliability.
  • The performance of the cluster.
  • Network bandwidth.

17) What is the Single Point of Failure in Hadoop?

In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients are unable to read/write files, and the entire Hadoop system is out of service until a new NameNode is up and running.

Hadoop 2.0 addresses this SPOF by providing support for multiple NameNodes. The high availability feature adds an extra NameNode to the Hadoop architecture, providing automatic failover. If the active NameNode fails, the Standby NameNode takes over all the responsibilities of the active node, and the cluster continues to function.

The initial implementation of NameNode high availability provided for a single active/standby NameNode. However, some deployments require a higher degree of fault-tolerance. In version 3.0, this feature is enabled by allowing the user to run multiple standby NameNodes, for example, by configuring three NameNodes and five journal nodes. This way, the cluster can tolerate the failure of two nodes instead of just one.

18) What is Erasure Coding in Hadoop?

By default, HDFS replicates each block three times for redundancy and to shield against the failure of datanodes. However, replication is very expensive and has a 200% overhead in storage space and other resources.

To address this issue, Hadoop 3.x introduced Erasure Coding, a new feature that provides the same level of fault tolerance with less storage space and a 50% storage overhead compared to replication.

Erasure Coding borrows ideas from RAID (Redundant Array of Inexpensive Disks) and implements EC through striping. It divides logically sequential data (such as a file) into smaller units, such as bits, bytes, or blocks, and stores them on different disks.

Encoding: In this process, the codec calculates and stores parity cells for each stripe of data cells, and errors are recovered from the parity. Erasure Coding extends a message with redundant data for fault tolerance. The EC codec operates on uniformly sized data cells: it takes a number of data cells as input and produces parity cells as output. The data cells and parity cells together are called an erasure coding group.

Two algorithms are available for Erasure Coding: XOR Algorithm and Reed-Solomon Algorithm.

19) What is Disk Balancer in Hadoop?

Disk Balancer is a command-line tool provided by HDFS to distribute data evenly across all disks of a DataNode. It moves blocks from one disk to another by creating a plan (a set of statements) and executing that plan on the DataNode. The plan describes how much data should move between two disks and consists of multiple move steps. Each move step has a source disk, a destination disk, and the number of bytes to move. The plan executes against an operational DataNode.

By default, disk balancer is not enabled. To enable disk balancer, you must set dfs.disk.balancer.enabled to true in hdfs-site.xml.

When a new block is written to HDFS, the datanode uses a volume choosing policy to select the disk for the block. Each directory is considered a volume in HDFS terminology. Two such policies are Round-robin, which distributes new blocks evenly across available disks, and Available space, which writes data to the disk with the most free space (by percentage).

20) How can you check if your NameNode is operational?

There are different ways to check the status of the NameNode. One common method is to use the jps command, which displays the status of all daemons running in the HDFS.

21) Is the Namenode machine the same as the DataNode machine in terms of hardware?

Unlike DataNodes, the NameNode is a highly available server that manages the file system namespace and maintains metadata, including the number of blocks, their locations, replicas, and other details. It also executes file system operations such as opening, closing, and renaming files and directories.

Therefore, the NameNode requires higher RAM for storing metadata for millions of files, whereas the DataNode is responsible for storing actual data in HDFS and performing read and write operations as per client requests. Consequently, the DataNode needs to have a higher disk capacity for storing large datasets.

22) Understanding File Permissions in HDFS and How HDFS Checks Permissions for Files or Directories.

In the Hadoop distributed file system (HDFS), a permissions model is implemented for files and directories. For each file or directory, we can manage permissions for three distinct user classes: the owner, group, and others. Each user class has three different permissions: Read (r), Write (w), and Execute (x).

For files, the r permission allows users to read the file, while the w permission allows users to write to the file. For directories, the r permission allows users to list the contents of the directory, the w permission allows users to create or delete files and subdirectories within it, and the x permission allows users to access a child of the directory.

When it comes to checking permissions for files or directories in HDFS, the system follows these steps:

  • If the username matches the owner of the directory, HDFS checks the owner’s permissions.
  • If the group matches the directory’s group, Hadoop tests the user’s group permissions.
  • If the owner and the group names don’t match, Hadoop tests the “other” permission.
  • If none of the permission checks succeed, the client’s request is denied.
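
A short sketch of inspecting and changing HDFS permissions through the FileSystem API is shown below; the directory path and the chosen permission bits are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/dataflair/reports");   // hypothetical directory

    // Print the current owner, group, and permission bits.
    FileStatus status = fs.getFileStatus(dir);
    System.out.println(status.getOwner() + " " + status.getGroup() + " " + status.getPermission());

    // Set rwx for the owner, r-x for the group, and no access for others
    // (equivalent to: hadoop fs -chmod 750 /user/dataflair/reports).
    fs.setPermission(dir, new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
  }
}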

23) If the number of DataNodes increases, do we need to upgrade the NameNode?

The NameNode stores metadata, such as the number of blocks, their location, and replicas. This metadata is available in memory on the master node for faster retrieval of data. The NameNode manages the slave nodes and assigns tasks to them. It also executes file system execution, such as naming, closing, and opening files/directories.

The NameNode is sized (particularly its RAM) according to the expected size of the cluster at installation time. In most cases, increasing the number of DataNodes does not require upgrading the NameNode, because the NameNode stores only metadata, not the actual data. The need to upgrade it arises only rarely, when the volume of metadata itself grows very large.

8. MapReduce Hadoop Interview Questions and Answers

This section discusses the top MapReduce Hadoop Interview Questions for freshers and experienced professionals. The questions cover topics such as Hadoop MapReduce, key-value pairs, mappers, combiners, reducers, and more. Let’s get started with the MapReduce Interview questions on MapReduce.

Questions 1-6 are intended for freshers, but experienced professionals can also refer to these Hadoop Interview Questions for basic understanding:

1) What is Hadoop MapReduce?

MapReduce is the data processing layer of Hadoop. It is a framework for writing applications that process vast amounts of data stored in the HDFS. MapReduce processes a huge amount of data in parallel by dividing the job into a set of independent tasks (sub-job). In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce.

  • Map- This is the first phase of processing. In this phase, you specify all the complex logic, business rules, and costly code. The map takes a set of data and converts it into another set of data. It also breaks individual elements into tuples (key-value pairs).
  • Reduce- This is the second phase of processing. In this phase, you specify light-weight processing like aggregation or summation. Reduce takes the output from the map as input. After that, it combines tuples (key-value) based on the key. Then, it modifies the value of the key accordingly.
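
The classic word-count program is a minimal sketch of these two phases using the standard Hadoop MapReduce API: the map phase emits (word, 1) pairs and the reduce phase sums them.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: break each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word (aggregation/summation).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}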

2) Why Hadoop MapReduce?

When we store a huge amount of data in HDFS, the first question that arises is: how can we process this data?

Transferring all this data to a central node for processing is not practical; we would have to wait forever for the data to transfer over the network. Google faced this same problem with its Distributed Google File System (GFS) and solved it using a MapReduce data processing model.

Challenges before MapReduce:

  • Costly – Storing all the data (terabytes) in a single server or database cluster is very expensive and hard to manage.
  • Time-consuming – Analyzing terabytes of data using a single machine takes a lot of time.

How MapReduce overcomes these challenges:

  • Cost-efficient – It distributes the data over multiple low-config machines.
  • Time-efficient – We can write the analysis code in the Map function and the integration code in the Reduce function and execute it. This MapReduce code will go to every machine that has a part of our data and execute on that specific part. Instead of moving terabytes of data, we move only kilobytes of code. This type of movement is time-efficient.

3) What is the key-value pair in MapReduce?

Hadoop MapReduce implements a data model that represents data as key-value pairs. Both input and output to the MapReduce Framework should be in key-value pairs only.

In Hadoop, if the schema is static, we can directly work on the column instead of key-value. However, if the schema is not static, we will work on keys and values. Keys and values are not intrinsic properties of the data. Instead, the user analyzing the data chooses a key-value pair. A key-value pair in Hadoop MapReduce is generated in the following way:

  • InputSplit- It is the logical representation of data. InputSplit represents the data that individual Mapper will process.
  • RecordReader- It communicates with the InputSplit (created by InputFormat) and converts the split into records. Records are in the form of key-value pairs that are suitable for reading by the mapper. By default, RecordReader uses TextInputFormat for converting data into a key-value pair.

Key- It is the byte offset of the beginning of the line within the file. Thus, it is unique when combined with the file name.

Value- It is the contents of the line, excluding line terminators. For example, if the file content is “on the top of the crumpetty Tree,” the key would be 0 and the value would be “on the top of the crumpetty Tree.”

4) Why does MapReduce use key-value pairs to process data?

MapReduce is capable of working with unstructured and semi-structured data, in addition to structured data. While structured data can be read by columns, handling unstructured data is possible using key-value pairs. The idea at the core of MapReduce is based on these pairs. The framework maps data into a collection of key-value pairs using the mapper and reducer on all pairs with the same key. According to Google’s research publication, in most computations:

  • The map operation applies to each logical “record” in our input, computing a set of intermediate key-value pairs.
  • The reduce operation is then applied to all values that share the same key, combining the derived data properly.

In conclusion, we can say that key-value pairs are the best solution for working on data problems with MapReduce.

5) How many Mappers run for a MapReduce job in Hadoop?

Each Mapper task processes an input record and generates a key-value pair. The number of mappers in a MapReduce job depends on two factors:

  • The amount of data to be processed, along with the block size. This depends on the number of InputSplit. For example, if we have a block size of 128 MB and we expect 10 TB of input data, we will have 82,000 maps. Ultimately, the number of maps is determined by the InputFormat.
  • The configuration of the slave, including the number of cores and the RAM available. Ideally, the number of maps per node should be between 10 and 100. The Hadoop framework should give 1 to 1.5 cores of the processor to each mapper; thus, for a 15-core processor, 10 mappers can run.

In a MapReduce job, changing the block size can control the number of Mappers. The number of InputSplit increases or decreases accordingly. Additionally, one can manually increase the number of map tasks using JobConf’s conf.setNumMapTasks(int num).

The formula for the number of mappers is: Mapper = (total data size) / (input split size). For example, if the data size is 1 TB and the input split size is 100 MB, then the number of mappers is Mapper = (1000 * 1000) / 100 = 10,000.

6) How many Reducers run for a MapReduce job in Hadoop?

A reducer takes a set of intermediate key-value pairs produced by the mapper as input. It then runs a reduce function on each pair to generate the final output, which it stores in HDFS. Typically, the reducer performs aggregation or summation computations.

The number of reducers for a job can be set by the user with Job.setNumReduceTasks(int). The optimal number of reducers can be determined with the formula:

0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)

With 0.95, all reducers can launch immediately and start transferring map outputs as soon as the maps finish.

With 1.75, the faster nodes finish their first round of reduces and then launch a second wave of reduces, which achieves better load balancing.

Increasing the number of reducers:

  • Increases framework overhead
  • Improves load balancing
  • Reduces the cost of failures
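A minimal driver sketch applying the 0.95 heuristic is shown below; the node and container counts are assumed example values, not recommendations:

```java
// Minimal sketch of applying the 0.95 heuristic when configuring a job;
// the node and container counts are assumed example values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");

        int nodes = 10;               // assumed cluster size
        int containersPerNode = 8;    // assumed maximum containers per node
        int reducers = (int) (0.95 * nodes * containersPerNode);   // 76 reducers

        job.setNumReduceTasks(reducers);
    }
}
```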

9. MapReduce Hadoop Interview Questions for Experienced

The following Hadoop interview questions are more technical and targeted towards experienced professionals, although freshers can also refer to them for better understanding.

7) What is the difference between Reducer and Combiner in Hadoop MapReduce?

The Combiner is a mini-reducer that performs a local reduce task: it runs on the output of each map task, and its output becomes the input to the reducer. A combiner is mainly used to reduce the volume of data sent over the network. The Reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reduce function on each group of values to generate the output; the output of the reducer is the final output. A word-count sketch that wires in a combiner follows the list below.

  • Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output key and value types of the mapper.
  • A combiner may run zero, one, or several times on a subset of the map output, so it can only be used for operations that are commutative and associative (for example, summation).
  • Combiner functions take input from a single mapper, while reducers can take data from multiple mappers as a result of partitioning.
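Below is the hedged word-count sketch referred to above (all class names are ours). The same reducer class is reused as a combiner, which is valid because summation is commutative and associative:

```java
// Hedged sketch (all class names are ours): a word-count job in which the same
// reducer class is reused as a combiner.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerDemo {

    // Emits <word, 1> for every token in the input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Sums the counts for one key; its input/output types match the mapper's
    // output types, so it can serve as both combiner and reducer.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-demo");
        job.setJarByClass(CombinerDemo.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local, map-side aggregation
        job.setReducerClass(IntSumReducer.class);    // final, cluster-wide aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```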

8) What happens if the number of reducers is 0 in Hadoop?

If the number of reducers is set to 0, no reducer runs and no aggregation takes place. In such cases, a “Map-only job” in Hadoop is preferable: each map task does all of the work on its InputSplit, the reducer does nothing, and the map output is the final output.

Between the map and reduce phases there is a shuffle and sort phase, which sorts the keys in ascending order and groups the values that share the same key. This phase is expensive: when the data size is huge, a large amount of map output travels across the network to the reducers. If the reduce phase is not required, it should be avoided, because skipping it eliminates the shuffle and sort entirely and avoids unnecessary network congestion.
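A map-only job is configured simply by setting the reducer count to zero, as in this minimal sketch:

```java
// Minimal sketch of a map-only job: with zero reducers the map output is the
// final output and the shuffle/sort phase never runs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-demo");
        job.setNumReduceTasks(0);   // no reducer, no aggregation, no shuffle or sort
    }
}
```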

9) What do you mean by shuffling and sorting in MapReduce?

Shuffling and sorting take place after the completion of the map task. The shuffle and sort phases in Hadoop occur simultaneously.

Shuffling is the process of transferring data from the mapper to the reducer. During this process, the system sorts the key-value output of the map tasks and transfers it to the reducer.

The shuffle phase is necessary because it provides the reducer with its input. Shuffling can start even before the entire map phase has finished, since the outputs of completed map tasks can be copied as soon as they are available; this saves time and completes the job sooner.

Sorting - The mapper generates the intermediate key-value pair. Before starting the reducer, the MapReduce framework sorts these key-value pairs by the keys. Sorting helps the reducer easily distinguish when a new reduce task should start, saving time.

Shuffling and sorting are not performed if you specify zero reducers (setNumReduceTasks(0)).
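The sort step itself is also configurable. The hedged sketch below (class names are ours, assuming the standard MRv2 Java API) plugs a custom comparator into the shuffle/sort phase so that Text keys reach the reducer in descending rather than the default ascending order:

```java
// Hedged sketch (class names are ours): plugging a custom sort comparator into
// the shuffle/sort phase so that Text keys arrive at the reducer in descending order.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class DescendingSortDemo {

    public static class DescendingTextComparator extends WritableComparator {
        public DescendingTextComparator() {
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);   // invert the default ascending order
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "descending-sort-demo");
        // The framework sorts intermediate keys with this comparator before
        // handing them to the reducer.
        job.setSortComparatorClass(DescendingTextComparator.class);
    }
}
```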


10) What is the fundamental difference between a MapReduce InputSplit and HDFS block?

Definition:

  • Block – A block is a contiguous region on disk where HDFS stores data. In general, a file system stores data as a collection of blocks; similarly, HDFS stores each file as blocks and distributes them across the Hadoop cluster.
  • InputSplit – An InputSplit represents the data that an individual Mapper will process. Each split is further divided into records, and each record (a key-value pair) is processed by the map function.

Data Representation:

  • Block – A block is the physical representation of data.
  • InputSplit – An InputSplit is the logical representation of data. During data processing in a MapReduce program or other processing techniques, InputSplit is used. In MapReduce, it is important to note that InputSplit does not contain the input data. It is merely a reference to the data.

Size:

  • Block: The default size of an HDFS block is 128 MB, which can be configured based on our requirements. All blocks in a file are of the same size, except for the last block. The last block can be the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in the Hadoop Filesystem.
  • InputSplit: The split size is approximately equal to the block size by default.

Example:

Let’s consider an example where we need to store a file in HDFS. HDFS stores files as blocks, which are the smallest unit of data that can be stored or retrieved from disk. The default block size is 128MB. Since we have a file of 130MB, HDFS will break this file into two blocks and store them on different nodes in the cluster.

However, if we run a MapReduce job directly on the blocks, a record (line) that starts in the first block and ends in the second would be cut in half. This is where InputSplit comes in. An InputSplit is a logical chunk that can span block boundaries, so a record that begins in one block and ends in the next is still read in full by a single mapper. The split stores the location of the next block and the byte offset of the data needed to complete the record.

It’s important to note that an InputSplit is only a logical chunk of data that contains information about block addresses and locations. During MapReduce execution, Hadoop scans through the blocks and creates InputSplits; the split acts as a broker between the blocks and the mapper.
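The sketch below (class name is ours) shows how a mapper can inspect the InputSplit it was handed; with FileInputFormat-based jobs the split is a FileSplit that carries only the file path, a byte offset, and a length, not the data itself:

```java
// Hedged sketch: a mapper inspecting its logical InputSplit. With
// FileInputFormat-based jobs the split is a FileSplit referencing the file
// path, starting offset, and length; the actual bytes live in HDFS blocks.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInspectingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("File:   " + split.getPath());
        System.out.println("Start:  " + split.getStart());   // byte offset into the file
        System.out.println("Length: " + split.getLength());  // size of the logical chunk
    }
}
```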

11) What is Speculative Execution in Hadoop MapReduce?


In MapReduce, jobs are broken down into tasks that run in parallel rather than sequentially, which reduces overall execution time. However, slow tasks can hold back the whole job, and it can be difficult to diagnose the cause, since a slow task still completes successfully, only taking more time than expected.

Instead of trying to diagnose and fix slow-running tasks, Apache Hadoop detects them and runs backup tasks called speculative tasks. Only after all the tasks for a job have been launched does Hadoop MapReduce launch speculative tasks, and only for tasks that have been running for some time (at least a minute) and have made noticeably less progress than the other tasks of the job. If the original task completes before the speculative task, the speculative task is killed; conversely, if the speculative task finishes first, the original task is killed.
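Speculative execution is enabled by default and can be toggled per job. The sketch below is illustrative only and uses the MRv2 property names:

```java
// Illustrative sketch: disabling speculative (backup) tasks for one job using
// the MRv2 property names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", false);     // map-side backup tasks
        conf.setBoolean("mapreduce.reduce.speculative", false);  // reduce-side backup tasks
        Job job = Job.getInstance(conf, "speculation-demo");
    }
}
```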

12) How can extra files (jars, static files) be submitted for a MapReduce job during runtime?

The MapReduce framework provides Distributed Cache to cache files needed by applications. It can cache read-only text files, archives, jar files, etc.

To distribute a file with the distributed cache, the application must ensure that the file is available at a URL, which can be either an hdfs:// or an http:// URL. If the file is present at such a URL, the user can register it as a cache file. The framework copies the cache file to every node before starting tasks on that node. The files are copied only once per job, and applications should not modify them.

By default, the size of the distributed cache is 10 GB. It can be adjusted using the local.cache.size property.
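A minimal sketch of registering a cache file with the Job-based API is shown below; the HDFS path is an assumed example:

```java
// Hedged sketch: registering a read-only file with the distributed cache via
// the Job API. The HDFS path is an assumed example.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
        // Copied once per job to every node that runs a task for this job;
        // tasks can later list the cached files with context.getCacheFiles().
        job.addCacheFile(new URI("hdfs:///user/example/lookup.txt"));
    }
}
```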
