How does Hadoop configuration work?

I see that we can configure different parameters in a Hadoop cluster. I'm a bit confused: if we configure the master, are those configurations replicated to the client nodes, or does every node have to be configured separately?
For example, if I set the block size to 128MB on the master, will all client nodes use 128MB, or will they stay at the default of 64MB since they were never configured? And if the master's setting is the one that counts, how are configs that depend on node-specific system parameters, such as the number of cores, handled?

Configuration in Hadoop is more complex than that. Hadoop lets API users decide how the configuration is used.
For example, let's look at how the file block size is determined. The file block size uses the value of fs.local.block.size in the configuration.
fs.local.block.size is not set in the client-side configuration
In this situation, conf.get("fs.local.block.size") returns null on the client side.
If you use the following code (running in your client) to create a file in HDFS,
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream output = fs.create(new Path("/new/file/in/hdfs"));
// write your data to output...
then fs.local.block.size uses the default value, which is 32MB (32 * 1024 * 1024).
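For instance, here is a minimal sketch (property name and default value taken from the discussion above) of how a client could check which value is actually in effect:
// uses org.apache.hadoop.conf.Configuration
Configuration conf = new Configuration();
String raw = conf.get("fs.local.block.size");      // null if no client-side config sets it
long blockSize = conf.getLong("fs.local.block.size", 32 * 1024 * 1024); // falls back to the 32MB default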
However, if you write a MapReduce job that outputs files (I assume you use TextOutputFormat; a custom output format may change the following behaviour), the file block size is determined by the configuration of the TaskTracker. So in this situation, if your configuration is inconsistent across nodes, you may find that the MapReduce output files have different block sizes.
fs.local.block.size is set in the client-side configuration
In this situation, conf.get("fs.local.block.size") returns the configured value on the client side.
If you use the following code (running in your client) to create a file in HDFS,
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream output = fs.create(new Path("/new/file/in/hdfs"));
// write your data to output...
then FileSystem.create uses the block size returned by conf.get("fs.local.block.size").
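As a minimal sketch (the 128MB value is only an example), the client could set the property before obtaining the FileSystem:
Configuration conf = new Configuration();
conf.setLong("fs.local.block.size", 128L * 1024 * 1024); // example: 128MB
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream output = fs.create(new Path("/new/file/in/hdfs"));
// write your data to output...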
However, if you write a MapReduce job that outputs files, it's a little more complex.
If fs.local.block.size is not marked final on a TaskTracker, the block size of output files written by that TaskTracker will be the client-side fs.local.block.size, because the job configuration is submitted to the TaskTracker.
If fs.local.block.size is marked final on the TaskTracker, it cannot be overridden by the job configuration, so the block size on that TaskTracker will be the TaskTracker node's own fs.local.block.size. In this situation, if your configuration is inconsistent across nodes, you may find that the MapReduce output files have different block sizes.
The above analysis applies only to fs.local.block.size. For other configuration parameters, you may need to read the related source code.
Finally, I recommend keeping the configuration consistent across all nodes to avoid running into strange behaviour.

Related

When is the Block Placement Policy used?

I know that the dfs.block.replicator.classname property can be used to change the BlockPlacementPolicy. I want to know when exactly this policy is used to place data. For example, is it used when -copyFromLocal/-put are executed?
I think the output of a job will also be placed according to this policy.
Secondly, when specified in the conf file, the property affects the entire Hadoop cluster. If I am using a shared cluster, is there a way to change the BlockPlacementPolicy only for jobs executed under my user, or a way to change the policy per job?
I am using the hadoop streaming jar on a 4 node cluster.
The block placement policy is used whenever a new block of data is written to HDFS, for example when data is ingested into HDFS or when a job writes data to HDFS. It is used for optimal placement of blocks, so that blocks are distributed uniformly across the HDFS cluster.
For example, the algorithm used by the default block placement policy class (BlockPlacementPolicyDefault) is:
The replica placement strategy is that if the writer is on a datanode,
the 1st replica is placed on the local machine, otherwise a random datanode.
The 2nd replica is placed on a datanode that is on a different rack. The 3rd
replica is placed on a datanode which is on a different node of the rack as
the second replica.
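For example, with a replication factor of 3 and a writer running on a DataNode in rack A, the first replica lands on that DataNode, the second on a DataNode in a different rack B, and the third on another DataNode in rack B.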
The block placement policy is also used by the following HDFS utilities:
Balancer: Balances disk space usage on HDFS. In this case the BlockPlacementPolicy can be used for placing blocks on other nodes in order to re-balance the cluster.
NamenodeFsck: Utility to check HDFS for inconsistencies. In this case the BlockPlacementPolicy is used for checking the number of mis-replicated blocks.
You can have your own custom block placement class. To do that, extend the BlockPlacementPolicy class and set the configuration parameter dfs.block.replicator.classname to your custom class name in hdfs-site.xml.
By default, the BlockPlacementPolicyDefault class is used for block placement:
final Class<? extends BlockPlacementPolicy> replicatorClass = conf.getClass(
DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_DEFAULT,
BlockPlacementPolicy.class);
You can't change the block placement policy per job. The reason is that the block placement policy is instantiated only once, when the NameNode comes up.
The following sequence of calls initializes the BlockPlacementPolicy; these steps are executed when the NameNode is started:
Initialize the NameNode when it is started:
NameNode::initialize(conf); // Initialize NameNode
NameNode::loadNamesystem(conf); // Load name system
Initialize FSNamesystem. FSNamesystem does all the bookkeeping work on the NameNode:
FSNamesystem.loadFromDisk(conf); // Loads FS Image from disk
Instantiate the BlockManager. This is called while instantiating FSNamesystem:
this.blockManager = new BlockManager(this, conf);
Instantiate the BlockPlacementPolicy. This is called by the BlockManager:
blockplacement = BlockPlacementPolicy.getInstance(
conf, datanodeManager.getFSClusterStats(),
datanodeManager.getNetworkTopology(),
datanodeManager.getHost2DatanodeMap());
Since the policy is instantiated only once, you can't change it per job.

Hadoop: specify yarn queue for distcp

On our cluster we have set up dynamic resource pools.
The rules are set so that YARN first looks at the specified queue, then at the username, then at the primary group...
However, with distcp I can't seem to specify a queue; it just falls back to the primary group.
This is how I run it now (which doesn't work):
hadoop distcp -Dmapred.job.queue.name:root.default .......
You are making a mistake in how you specify the parameter: the key/value pair must be separated with "=", not ":".
The command should be
hadoop distcp -Dmapred.job.queue.name=root.default .......
The MRv2 (non-deprecated) equivalent of this property is mapreduce.job.queuename:
-Dmapreduce.job.queuename=root.default
Similarly, hadoop archive can be instructed to target a custom queue:
hadoop archive -Dmapreduce.job.queuename=<leaf.queue.name> ...
I'll take the opportunity of this answer to give a tip for hadoop archive: it creates one map task per destination file to create (by default, the destination file size is 2GB), which can lead to thousands of maps when archiving terabytes of data.
The size of the part-* files of a hadoop archive is controlled by the undocumented har.partfile.size property: you can increase it by setting a value (in bytes) higher than 2GiB with -Dhar.partfile.size=<value in bytes>.

Hadoop Yarn - how to request fix number of containers

How can Apache Spark or Hadoop Mapreduce request a fixed number of containers?
In Spark yarn-client mode, it can be requested by setting the configuration spark.executor.instances, which directly determines the number of YARN containers it gets. How does Spark translate this into a YARN parameter that YARN understands?
I know that by default it can depend on the number of splits and on the configuration values yarn.scheduler.minimum-allocation-mb and yarn.scheduler.minimum-allocation-vcores. But Spark is able to request an exact, fixed number of containers. How can any ApplicationMaster do that?
In Hadoop MapReduce, the number of containers for map tasks is decided by the number of input splits, which in turn depends on the size of the source files. One map container is requested per input split.
By default, the number of reducers per job is one. It can be customized via mapreduce.reduce.tasks. Pig and Hive have their own logic for deciding the number of reducers (this can also be customized).
One container (a reduce container, usually bigger than a map container) is requested per reducer.
The total number of mappers and reducers is defined in the job configuration at job submission time.
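As a small illustration (a sketch only; the reducer count here is arbitrary), the reducer count can be fixed on the job object, while the mapper count still follows the input splits:
// uses org.apache.hadoop.mapreduce.Job
Job job = Job.getInstance(conf, "example-job");
job.setNumReduceTasks(10); // one reduce container will be requested per reducer
// the number of map containers is still driven by the number of input splits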
I think it's done through the ApplicationMaster API that YARN provides: the AM can call rsrcRequest.setNumContainers(numContainers); see http://hadoop.apache.org/docs/r2.5.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client
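For illustration, a minimal sketch of such a request using the YARN records API (the priority, memory and vcore values are placeholders):
// uses org.apache.hadoop.yarn.api.records.*
Priority priority = Priority.newInstance(0);
Resource capability = Resource.newInstance(1024, 1); // 1024MB, 1 vcore per container
ResourceRequest rsrcRequest = ResourceRequest.newInstance(
    priority, ResourceRequest.ANY, capability, numContainers); // numContainers = the fixed count
// the AM then includes this request in the allocate() calls it sends to the ResourceManager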
I had a similar discussion on another question: Yarn container understanding and tuning

Hadoop: force 1 mapper task per node from jobconf

I want to run one task (mapper) per node on my Hadoop cluster, but I cannot modify the configuration that the TaskTracker runs with (I'm just a user).
For this reason, I need to be able to push the option through the job configuration. I tried setting mapred.tasktracker.map.tasks.maximum=1 on the hadoop jar command line, but the TaskTracker ignores it because it has a different setting in its own configuration file.
By the way, the cluster uses the Capacity Scheduler.
Is there any way I can force 1 task per node?
Edited:
Why? I have a memory-bound task, so I want each task to use all the memory available to the node.
When you set the number of mappers, whether through the configuration files or by some other means, it's just a hint to the framework; it doesn't guarantee that you'll get exactly that number of mappers. The number of mappers is actually governed by the number of splits, and split creation is carried out by the logic in the InputFormat you are using. If you really want just one mapper to process an entire file, make isSplitable return false in the InputFormat class you are using (see the sketch below). But why would you do that? The power of Hadoop lies in distributed parallel processing.
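A minimal sketch of that approach (the class name is made up for illustration), using the new MapReduce API:
// extends org.apache.hadoop.mapreduce.lib.input.TextInputFormat
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: each input file becomes one split, hence one mapper
    }
}
// then register it on the job: job.setInputFormatClass(WholeFileTextInputFormat.class);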

How can I force Flume-NG to process the backlog of events after a sink failed?

I'm trying to setup Flume-NG to collect various kinds of logs from a bunch of servers (mostly running Tomcat instances and Apache Httpd) and dump them into HDFS on a 5-node Hadoop cluster. The setup looks like this:
Each application server tails the relevant logs into one of the Exec sources (one for each log type: java, httpd, syslog), which pass them through a FileChannel to an Avro sink. On each server the different sources, channels and sinks are managed by one agent. Events get picked up by an AvroSource which resides on the Hadoop cluster (the node that also hosts the SecondaryNameNode and the JobTracker). For each log type there is an AvroSource listening on a different port. The events go through a FileChannel into the HDFS sink, which saves the events using the FlumeEventAvro EventSerializer and Snappy compression.
The problem: the agent on the Hadoop node that manages the HDFS sinks (again, one per log type) failed after some hours because we hadn't changed the heap size of the JVM. From then on, lots of events collected in the FileChannel on that node, and after that also in the FileChannels on the application servers, because the FileChannel on the Hadoop node reached its maximum capacity. When I fixed the problem, I couldn't get the agent on the Hadoop node to process the backlog quickly enough to resume normal operation. The size of the tmp dir where the FileChannel saves events before sinking them keeps growing all the time. Also, HDFS writes seem to be really slow.
Is there a way to force Flume to process the backlog first before ingesting new events? Is the following configuration optimal? Maybe related: the files that get written to HDFS are really small, around 1-3 MB or so. That's certainly not optimal with the HDFS default block size of 64MB and with regard to future MR operations. What settings should I use to collect the events in files large enough for the HDFS block size?
I have a feeling the config on the Hadoop node is not right, I'm suspecting the values for BatchSize, RollCount and related params are off, but I'm not sure what the optimal values should be.
Example config on Application Servers:
agent.sources=syslogtail httpdtail javatail
agent.channels=tmpfile-syslog tmpfile-httpd tmpfile-java
agent.sinks=avrosink-syslog avrosink-httpd avrosink-java
agent.sources.syslogtail.type=exec
agent.sources.syslogtail.command=tail -F /var/log/messages
agent.sources.syslogtail.interceptors=ts
agent.sources.syslogtail.interceptors.ts.type=timestamp
agent.sources.syslogtail.channels=tmpfile-syslog
agent.sources.syslogtail.batchSize=1
...
agent.channels.tmpfile-syslog.type=file
agent.channels.tmpfile-syslog.checkpointDir=/tmp/flume/syslog/checkpoint
agent.channels.tmpfile-syslog.dataDirs=/tmp/flume/syslog/data
...
agent.sinks.avrosink-syslog.type=avro
agent.sinks.avrosink-syslog.channel=tmpfile-syslog
agent.sinks.avrosink-syslog.hostname=somehost
agent.sinks.avrosink-syslog.port=XXXXX
agent.sinks.avrosink-syslog.batch-size=1
Example config on Hadoop node
agent.sources=avrosource-httpd avrosource-syslog avrosource-java
agent.channels=tmpfile-httpd tmpfile-syslog tmpfile-java
agent.sinks=hdfssink-httpd hdfssink-syslog hdfssink-java
agent.sources.avrosource-java.type=avro
agent.sources.avrosource-java.channels=tmpfile-java
agent.sources.avrosource-java.bind=0.0.0.0
agent.sources.avrosource-java.port=XXXXX
...
agent.channels.tmpfile-java.type=file
agent.channels.tmpfile-java.checkpointDir=/tmp/flume/java/checkpoint
agent.channels.tmpfile-java.dataDirs=/tmp/flume/java/data
agent.channels.tmpfile-java.write-timeout=10
agent.channels.tmpfile-java.keepalive=5
agent.channels.tmpfile-java.capacity=2000000
...
agent.sinks.hdfssink-java.type=hdfs
agent.sinks.hdfssink-java.channel=tmpfile-java
agent.sinks.hdfssink-java.hdfs.path=/logs/java/avro/%Y%m%d/%H
agent.sinks.hdfssink-java.hdfs.filePrefix=java-
agent.sinks.hdfssink-java.hdfs.fileType=DataStream
agent.sinks.hdfssink-java.hdfs.rollInterval=300
agent.sinks.hdfssink-java.hdfs.rollSize=0
agent.sinks.hdfssink-java.hdfs.rollCount=40000
agent.sinks.hdfssink-java.hdfs.batchSize=20000
agent.sinks.hdfssink-java.hdfs.txnEventMax=20000
agent.sinks.hdfssink-java.hdfs.threadsPoolSize=100
agent.sinks.hdfssink-java.hdfs.rollTimerPoolSize=10
There are a couple of things I see in your configuration that can cause issues:
Your first agent seems to have an Avro sink with a batch size of 1. You should bump this up to at least 100. This is because the Avro source on the second agent would be committing to the channel with a batch size of 1, and each commit causes an fsync, making file channel performance poor. The batch size on the exec source is also 1, causing that channel to be slow as well. You can increase the batch size (or use the Spool Directory Source - more on that later).
You can have multiple HDFS sinks reading from the same channel to improve performance. Just make sure that each sink writes to a different directory or uses a different "hdfs.filePrefix", so that multiple HDFS sinks don't try to write to the same files.
Your batch size for the HDFS sink is 20000, which is quite high, and your callTimeout is the default of 10 seconds. You should increase "hdfs.callTimeout" if you want to keep such a huge batch size. I'd recommend reducing the batch size to 1000 or so and using a timeout of about 15-20 seconds. (Note that at the current batch size, each file holds only 2 batches, so reduce the batch size and increase the rollInterval and timeout.)
If you are using tail -F, I'd recommend trying out the new Spool Directory Source. To use this source, rotate out your log files to a directory, which the Spool Directory Source processes. This source will only process files which are immutable, so you need to rotate the log files out. Using tail -F with exec source has issues, as documented in the Flume User Guide.
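As a rough sketch of the kind of changes described above (the values are illustrative, not tuned for your workload), the relevant lines would look something like:
# on the application servers
agent.sources.syslogtail.batchSize=100
agent.sinks.avrosink-syslog.batch-size=100
# on the Hadoop node
agent.sinks.hdfssink-java.hdfs.batchSize=1000
agent.sinks.hdfssink-java.hdfs.callTimeout=20000
agent.sinks.hdfssink-java.hdfs.rollInterval=600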
