HDFS sequence file performance tuning

I'm trying to use Hadoop to process a lot of small files stored in sequence files. My program is heavily IO bound, so I want to make sure the IO throughput is high enough.
I wrote an MR program that reads the small files from the sequence files and writes them to a RAM disk (/dev/shm/test/). A separate standalone program deletes the files written to the RAM disk without doing any computation, so the test should be almost purely IO bound. However, the IO throughput is not as good as I expected.
I have 5 datanodes and each datanode has 5 data disks. Each disk can provide about 100MB/s of throughput, so theoretically this cluster should be able to provide 100MB/s * 5 (disks) * 5 (machines) = 2500MB/s. However, I only get about 600MB/s. I ran "iostat -d -x 1" on the 5 machines and found that the IO load is not well balanced: usually only a few of the disks are at 100% utilization, some disks have very low utilization (10% or less), and some machines have no IO load at all at times. Here's the screenshot. (Of course, the load on each disk/machine varies quickly.)
Here's another screenshot that shows the CPU usage from the "top -cd1" command:
Here is some more detailed configuration for my case:
Hadoop cluster hardware: 5 Dell R620 machines, each equipped with 128GB of RAM and 32 CPU cores (actually 2 Xeon E5-2650). 2 HDDs form a RAID 1 volume for CentOS, and 5 data disks are used for HDFS, so you can see 6 disks in the above screenshot.
Hadoop settings: block size 128MB; datanode handler count 8; 15 map slots per task tracker; 2GB heap for each MapReduce child process.
Test file set: about 400,000 small files, 320GB in total, stored in 160 sequence files of about 2GB each. I also tried storing the files in sequence files of several other sizes (1GB, 512MB, 256MB, 128MB), but the performance didn't change much.
I don't expect the whole system to reach 100% of the theoretical IO throughput (2500MB/s), but I think 40% (1000MB/s) or more should be reasonable. Can anyone provide some guidance on performance tuning?

I solved the problem myself. Hint: the high CPU usage.
It's very abnormal that the CPU usage is so high since it's an almost pure IO job.
The root cause is that each task node receives about 500 map tasks, and each map task uses exactly one JVM. By default, Hadoop MapReduce is configured to create a new JVM for every map task.
Solution: change the value of "mapred.job.reuse.jvm.num.tasks" from 1 to -1, which means the JVM is reused without limit.
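For reference, here is a minimal sketch of that change as a mapred-site.xml-style property (the key name is the classic MRv1 one; on Hadoop 2 the equivalent alias is mapreduce.job.jvm.numtasks, if I recall correctly):
mapred.job.reuse.jvm.num.tasks=-1
With the default value of 1, every one of the ~500 map tasks per node pays the cost of starting its own JVM, which is where the unexpected CPU time goes; with -1, a task tracker reuses each JVM for an unlimited number of tasks of the same job.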

Related

How many mappers and reducers would be advised to process 2TB of data in Hadoop?

I am trying to develop a Hadoop project for one of our clients. We will be receiving around 2 TB of data per day, and as part of reconciliation we would like to read the 2 TB of data and perform sorting and filtering operations.
We have set up a Hadoop cluster with 5 data nodes running on t2.xlarge AWS instances with 4 CPU cores and 16GB of RAM each. What is the advisable number of mappers and reducers to launch to complete the data processing quickly?
Take a look at these:
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-1/
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-2/
This depends on the nature of the task, whether it is RAM- or CPU-bound, and on how parallel your system can be.
Given that every node has 4 CPU cores and 16GB of RAM, I would suggest on average 4 to 6 MapReduce tasks on each node.
Creating too many MapReduce tasks will degrade CPU performance, and you may run into container failures caused by insufficient memory.
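As a rough starting point only (a sketch, not a tuned recommendation; the exact values depend on your jobs), reserving about 4GB per node for the OS and Hadoop daemons and sizing map containers at 2GB gives roughly 6 concurrent tasks per node:
yarn.nodemanager.resource.memory-mb=12288
yarn.scheduler.minimum-allocation-mb=2048
mapreduce.map.memory.mb=2048
mapreduce.reduce.memory.mb=4096
mapreduce.map.java.opts=-Xmx1638m
mapreduce.reduce.java.opts=-Xmx3276m
The heap (-Xmx) is kept at roughly 80% of the container size so the JVM fits inside the container limit.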

What is the ideal NameNode memory size when there are a lot of files in HDFS?

I will have 200 million files in my HDFS cluster. We know each file occupies about 150 bytes of NameNode memory, plus 3 blocks, so that is roughly 600 bytes per file in the NameNode.
So I plan to give my NameNode 250GB of memory to comfortably handle 200 million files. My question: will such a large heap put too much pressure on GC? Is a 250GB heap for the NameNode feasible?
The ideal NameNode memory size is roughly the total space used by the metadata, plus the OS and the daemons, plus 20-30% headroom for processing-related data.
You should also consider the rate at which data arrives in your cluster. If data is coming in at 1TB/day, you must plan for more memory or you will soon run out.
It is always advisable to keep at least 20% of the memory free at any point in time; this helps keep the NameNode from going into a full garbage collection.
As Marco mentioned earlier, you may refer to NameNode Garbage Collection Configuration: Best Practices and Rationale for the GC configuration.
In your case 256GB looks fine if you are not going to ingest a lot of new data or run lots of operations on the existing data.
Refer: How to Plan Capacity for Hadoop Cluster?
Also refer: Select the Right Hardware for Your New Hadoop Cluster
You can have 256 GB of physical memory in your NameNode. If your data grows in huge volumes, consider HDFS federation. I assume you already have multiple cores (with or without hyperthreading) in the NameNode host. The link below should address your GC concerns:
https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
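On the GC side, the usual place to tune this is the NameNode JVM options in hadoop-env.sh. A sketch for a large heap (the sizes and flags here are illustrative, not a recommendation tuned to your workload; the article above covers the reasoning in detail):
export HADOOP_NAMENODE_OPTS="-Xms250g -Xmx250g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly ${HADOOP_NAMENODE_OPTS}"
Setting -Xms equal to -Xmx avoids heap resizing pauses, and CMS with an early initiating occupancy fraction helps keep full GCs rare on a heap this large.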

Tuning Hadoop job execution on YARN

A bit of intro - I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (a master and 3 workers). The master runs the ResourceManager and the NameNode.
Now I'm testing my algorithm with increasing amounts of data (300MB, 3GB). I'm looking for pointers on how to tune my mini-cluster. Specifically, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How do I determine the min/max memory per container, the reserved memory for containers, the sort allocation memory, and the map and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, for the following reasons:
I ran a job on a 30MB dataset (I set the block size for this job to 8MB, since the data is small and the processing is intensive) - execution time 30 minutes.
I ran the same job on the same dataset duplicated 10 times - 300MB (same block size, 8MB) - execution time 2 hours.
Then the same amount of data - 300MB - but with a 128MB block size: about the same execution time, maybe even a bit longer than 2 hours.
The HDFS block size is 128MB, so I expected that to speed things up, but that is not the case. My suspicion is that the cluster setup (min/max container RAM, map and reduce RAM) is not good, so it cannot improve even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Set the property below in the YARN CapacityScheduler configuration so that a single user gets roughly 33% of the queue's memory; you can adjust the value to your requirements. The default user-limit-factor is 1:
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
If you need further info on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/
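A sketch of where that setting lives, assuming the default queue of the CapacityScheduler: add the property to capacity-scheduler.xml on the ResourceManager, then refresh the queues so it takes effect without a restart:
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
yarn rmadmin -refreshQueues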

Yarn and MapReduce resource configuration

I currently have a pseudo-distributed Hadoop system running. The machine has 8 cores (16 virtual cores) and 32 GB of RAM.
My input files range from a few MB to ~68 MB (gzipped log files, which get uploaded to my server once they reach >60MB, hence there is no fixed maximum size). I want to run some Hive jobs on about 500-600 of those files.
Due to the inconsistent input file sizes, I haven't changed the block size in Hadoop so far. As I understand it, the best case would be block size = input file size; but if a file is smaller than the block size, does Hadoop still reserve the full block? And how do the size and number of input files affect performance, as opposed to, say, one big ~40 GB file?
And what would the optimal configuration for this setup look like?
Based on this guide (http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/) I came up with this configuration:
32 GB of RAM, with 2 GB reserved for the OS, gives me 30720 MB that can be allocated to YARN containers.
yarn.nodemanager.resource.memory-mb=30720
With 8 cores I thought a maximum of 10 containers should be safe, so each container gets (30720 / 10) = 3072 MB of RAM.
yarn.scheduler.minimum-allocation-mb=3072
For map task containers I doubled the minimum container size, which would allow for a maximum of 5 map tasks:
mapreduce.map.memory.mb=6144
And since I want a maximum of 3 reduce tasks, I allocate:
mapreduce.map.memory.mb=10240
With JVM heap size to fit into the containers:
mapreduce.map.java.opts=-Xmx5120m
mapreduce.reduce.java.opts=-Xmx9216m
Do you think this configuration would be good, or would you change anything, and why?
Yes, this configuration is good, but there are a few changes I would like to mention.
For the reducer memory, it should be
mapreduce.reduce.memory.mb=10240 (I think it's just a typo in your question).
Also, one major addition I would suggest is the CPU configuration.
You should set
Container Virtual CPU Cores=15
For the reducers, as you are running only 3 of them, you can give
Reduce Task Virtual CPU Cores=5
And for the mappers
Map Task Virtual CPU Cores=3
The number of (map or reduce) containers that run in parallel = min(total RAM / mapreduce.(map|reduce).memory.mb, total vcores / (Map|Reduce) Task Virtual CPU Cores).
Please refer to http://openharsh.blogspot.in/2015/05/yarn-configuration.html for a detailed explanation.
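If you are configuring this outside Cloudera Manager, the display names above appear to map to plain YARN/MapReduce properties; a sketch of the equivalent keys, with the values carried over from the suggestions above (treat them as illustrative):
yarn.nodemanager.resource.cpu-vcores=15
mapreduce.map.cpu.vcores=3
mapreduce.reduce.cpu.vcores=5
With these numbers the formula above gives min(30720/6144, 15/3) = 5 parallel mappers and min(30720/10240, 15/5) = 3 parallel reducers, matching the intended 5 map / 3 reduce layout.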

The memory consumption of hadoop's namenode?

Can anyone give a detailed analysis of the memory consumption of the NameNode? Or is there some reference material? I cannot find anything online. Thank you!
I suppose the memory consumption depends on your HDFS setup, i.e. on the overall size of HDFS, and is relative to the block size.
From the Hadoop NameNode wiki:
Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
From https://twiki.opensciencegrid.org/bin/view/Documentation/HadoopUnderstanding:
Namenode: The core metadata server of Hadoop. This is the most critical piece of the system, and there can only be one of these. This stores both the file system image and the file system journal. The namenode keeps all of the filesystem layout information (files, blocks, directories, permissions, etc) and the block locations. The filesystem layout is persisted on disk and the block locations are kept solely in memory. When a client opens a file, the namenode tells the client the locations of all the blocks in the file; the client then no longer needs to communicate with the namenode for data transfer.
The same site recommends the following:
Namenode: We recommend at least 8GB of RAM (the minimum is 2GB), preferably 16GB or more. A rough rule of thumb is 1GB per 100TB of raw disk space; the actual requirement is around 1GB per million objects (files, directories, and blocks). The CPU requirement is any modern multi-core server CPU. Typically, the namenode will only use 2-5% of your CPU.
As this is a single point of failure, the most important requirement is reliable hardware rather than high performance hardware. We suggest a node with redundant power supplies and at least 2 hard drives.
For a more detailed analysis of memory usage, check this link out:
https://issues.apache.org/jira/browse/HADOOP-1687
You also might find this question interesting: Hadoop namenode memory usage
There are several technical limits to the NameNode (NN), and facing any of them will limit your scalability.
Memory. The NN consumes about 150 bytes per block. From this you can calculate how much RAM you need for your data. There is a good discussion here: Namenode file quantity limit.
IO. The NN performs 1 IO for each change to the filesystem (such as creating or deleting a block), so your local IO should keep up with that; it is hard to estimate exactly how much you need. Given that the number of blocks is already limited by memory, you will not hit this limit unless your cluster is very big. If it is, consider an SSD.
CPU. The NameNode carries a considerable load keeping track of the health of all blocks on all datanodes, since each datanode periodically reports the state of all of its blocks. Again, unless the cluster is very big, this should not be a problem.
Example calculation:
200 node cluster
24TB/node
128MB block size
Replication factor = 3
How much NameNode memory is required?
# blocks = 200 * 24 * 2^20 MB / (128 MB * 3) ≈ 13 million blocks
Using the rule of thumb of 1GB per million blocks, that is roughly 13,000 MB (about 13GB) of NameNode memory.
I think we should distinguish between how NameNode memory is consumed by each NameNode object and the general recommendation for sizing the NameNode heap.
For the first case (consumption), AFAIK each NameNode object takes about 150 bytes of memory on average. NameNode objects are files, blocks (not counting the replicated copies) and directories. So a file occupying 3 blocks costs 4 objects (1 file and 3 blocks) x 150 bytes = 600 bytes.
For the second case, the recommended heap size for a NameNode, the general guideline is to reserve 1GB per 1 million blocks. If you calculate the raw consumption (150 bytes x 1 million blocks) you get only about 150MB, much less than 1GB per million blocks, but you should also account for the number of files and directories and leave headroom.
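To plug real numbers into that estimate, you can count the objects in your namespace directly; a sketch using standard HDFS commands (the path / is just an example):
hadoop fs -count /                  # prints DIR_COUNT FILE_COUNT CONTENT_SIZE for the namespace
hdfs fsck / | grep 'Total blocks'   # total block count as seen by the NameNode
Multiply (directories + files + blocks) by ~150 bytes for the working-set estimate, then budget the heap with the 1GB-per-million-blocks rule on top of that.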
I guess the guideline is a conservative, safe-side recommendation. Check the following links for a more general discussion and examples:
Sizing NameNode Heap Memory - Cloudera
Configuring NameNode Heap Size - Hortonworks
Namenode Memory Structure Internals
