How do I control which tasks run on which hosts? - hadoop

I'm running Giraph, which executes on our small CDH4 Hadoop cluster of five hosts (four compute nodes and a head node - call them 0-3 and 'w') - see versions below. All five hosts are running mapreduce tasktracker services, and 'w' is also running the jobtracker. Resources are tight for my particular Giraph application (a kind of path-finding), and I've discovered that some configurations of the automatically-scheduled hosts for tasks work better than others.
More specifically, my Giraph command (see below) specifies four Giraph workers, and when it executes, Hadoop (ZooKeeper actually, IIUC) creates five tasks that I can see in the jobtracker web UI: one master and four slaves. When it puts three or more of the map tasks on 'w' (e.g., 01www or 1wwww), that host maxes out RAM, CPU, and swap, and the job hangs. However, when the system spreads the work out more evenly so that 'w' has only two or fewer tasks (e.g., 123ww or 0321w), the job finishes fine.
My question is: 1) what program decides the task-to-host assignment, and 2) how do I control it?
Thanks very much!
Versions
CDH: 4.7.3
Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha" (CHANGELOG starts with: Release 1.0.0 - 2013-04-15)
Zookeeper Client environment: zookeeper.version=3.4.5-cdh4.4.0--1, built on 09/04/2013 01:46 GMT
Giraph command
hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=wright.cs.umass.edu:2181 \
-libjars ${LIBJARS} \
relpath.RelPathVertex \
-wc relpath.RelPathWorkerContext \
-mc relpath.RelPathMasterCompute \
-vif relpath.JsonAdjacencyListVertexInputFormat \
-vip $REL_PATH_INPUT \
-of relpath.JsonAdjacencyListTextOutputFormat \
-op $REL_PATH_OUTPUT \
-ca RelPathVertex.path=$REL_PATH_PATH \
-w 4
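If it helps frame the question: my (possibly wrong) understanding is that in MRv1 it is the JobTracker, not ZooKeeper, that assigns tasks to TaskTrackers, based on the map/reduce slots each one advertises. The only knob I can think of is capping the map slots on 'w' in its mapred-site.xml and restarting its tasktracker, roughly like this (the value 2 is just illustrative):
<property>
  <!-- max number of map tasks this TaskTracker will run simultaneously -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
But I don't know whether that is the intended mechanism, or whether there is a per-job way to influence placement.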

Related

Solr indexing performance

We are experiencing some performance issues with Solr batch indexing: we have a cluster composed of 4 workers, each equipped with 32 cores and 256 GB of RAM. YARN is configured to use 100 vCores and 785.05 GB of memory. The HDFS storage is managed by an EMC Isilon system connected through a 10 Gb interface. Our cluster runs CDH 5.8.0 with Solr 4.10.3 and is Kerberized.
With the current setup we can index about 25 GB of compressed data per day (roughly 500 GB per month) using MapReduce jobs. Some of these jobs run daily and take almost 12 hours to index 15 GB of compressed data. In particular, MorphlineMapper jobs last approximately 5 hours and TreeMergeMapper jobs last about 6 hours.
Is this performance normal? Can you suggest any tweaks that could improve our indexing performance?
We are using the MapReduceIndexerTool and there are no network problems. We are reading compressed files from HDFS and decompressing them in our morphline. This is the way we run our script:
cmd_hdp=$(
HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf" hadoop --config /etc/hadoop/conf.cloudera.yarn \
jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D morphlineVariable.ZK_HOST=hostname1:2181/solr \
-D morphlineVariable.COLLECTION=my_collection \
-D mapreduce.map.memory.mb=8192 \
-D mapred.child.java.opts=-Xmx4096m \
-D mapreduce.reduce.java.opts=-Xmx4096m \
-D mapreduce.reduce.memory.mb=8192 \
--output-dir hdfs://isilonhostname:8020/tmp/my_tmp_dir \
--morphline-file morphlines/my_morphline.conf \
--log4j log4j.properties \
--go-live \
--collection my_collection \
--zk-host hostname1:2181/solr \
hdfs://isilonhostname:8020/my_input_dir/
)
The MorphlineMapper phase takes all available resources, while the TreeMergeMapper phase takes only a couple of containers.
We don't need to run queries for the moment; we just need to index historical data. We are wondering whether there is a way to speed up indexing and then optimize the collections for searching once indexing is complete.
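For the final optimize step, what we have in mind is simply calling the Solr update handler with optimize=true once the load is done (host and port below are placeholders for one of our Solr nodes; --negotiate -u : because the cluster is Kerberized), but we are mainly unsure about the indexing speed itself:
curl --negotiate -u : \
  "http://solrhost.example.com:8983/solr/my_collection/update?optimize=true"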

Merging small files in hadoop

I have a directory (Final Dir) in HDFS into which small files (e.g., 10 MB) are loaded every minute.
After some time I want to combine all the small files into one large file (e.g., 100 MB). But the user is continuously pushing files to Final Dir; it is a continuous process.
So the first time, I need to combine the first 10 files into a large file (e.g., large.txt) and save it back to Final Dir.
Now my question is: how will I get the next 10 files, excluding the first 10?
Can someone please help me?
Here is one more alternative. It is still the legacy approach pointed out by #Andrew in his comments, but with extra steps: use your input folder as a buffer to receive the small files, push them to a tmp directory at regular intervals, merge them there, and push the result back to input.
Step 1: create a tmp directory
hadoop fs -mkdir tmp
Step 2: move all the small files to the tmp directory at a point in time
hadoop fs -mv input/*.txt tmp
Step 3: merge the small files with the help of the hadoop-streaming jar
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/input" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
Step 4: move the output to the input folder
hadoop fs -mv output/part-00000 input/large_file.txt
Step 5: remove the output directory
hadoop fs -rm -R output/
Step 6: remove all the files from tmp
hadoop fs -rm tmp/*.txt
Create a shell script covering steps 2 to 6 and schedule it to run at regular intervals (perhaps every minute, depending on your needs) so that the smaller files are merged periodically.
Steps to schedule a cron job for merging small files
Step 1: create a shell script /home/abc/mergejob.sh from the steps above (2 to 6).
Important note: specify the absolute path of the hadoop binary in the script so that cron can find it.
#!/bin/bash
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
wait
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/input" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
Step 2: schedule the script with cron to run every minute.
a) edit the crontab by choosing an editor
>crontab -e
b) add the following line at the end and exit from the editor
* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
The merge job is now scheduled to run every minute.
Hope this was helpful.
#Andrew pointed you to a solution that was appropriate 6 years ago, in a batch-oriented world.
But it's 2016: you have a micro-batch data flow running and require a non-blocking solution.
Here's how I would do it:
create an EXTERNAL table with 3 partitions, mapped on 3 directories, e.g. new_data, reorg and history (see the sketch just below)
feed the new files into new_data
implement a job to run the batch compaction, and run it periodically
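A rough sketch of that table layout in HiveQL (column names, types and locations are illustrative, reusing the blahblah table from the query below):
CREATE EXTERNAL TABLE blahblah (a STRING, b STRING, c STRING, d STRING)
PARTITIONED BY (stage STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/blahblah';

-- one partition per directory: new files land in new_data, compaction input
-- sits in reorg, compacted output goes to history
ALTER TABLE blahblah ADD
  PARTITION (stage='new_data') LOCATION '/data/blahblah/new_data'
  PARTITION (stage='reorg')    LOCATION '/data/blahblah/reorg'
  PARTITION (stage='history')  LOCATION '/data/blahblah/history';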
Now the batch compaction logic:
make sure that no SELECT query will be executed while the compaction is running, else it would return duplicates
select all files that are ripe for compaction (define your own criteria) and move them from the new_data directory to reorg
merge the content of all these reorg files, into a new file in history dir (feel free to GZip it on the fly, Hive will recognize the .gz extension)
drop the files in reorg
So it's basically the old 2010 story, except that your existing data flow can continue dumping new files into new_data while the compaction is safely running in separate directories. And in case the compaction job crashes, you can safely investigate / clean-up / resume the compaction without compromising the data flow.
By the way, I am not a big fan of the 2010 solution based on a "Hadoop Streaming" job -- on one hand, "streaming" has a very different meaning now; on the second hand, "Hadoop streaming" was useful in the old days but is now out of the radar; on the gripping hand [*] you can do it quite simply with a Hive query e.g.
INSERT INTO TABLE blahblah PARTITION (stage='history')
SELECT a, b, c, d
FROM blahblah
WHERE stage='reorg'
;
With a couple of SET some.property = somevalue before that query, you can define what compression codec will be applied on the result file(s), how many file(s) you want (or more precisely, how big you want the files to be - Hive will run the merge accordingly), etc.
Look into https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties under hive.merge.mapfiles and hive.merge.mapredfiles (or hive.merge.tezfiles if you use TEZ) and hive.merge.smallfiles.avgsize and then hive.exec.compress.output and mapreduce.output.fileoutputformat.compress.codec -- plus hive.hadoop.supports.splittable.combineinputformat to reduce the number of Map containers since your input files are quite small.
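A hedged example of what such a preamble might look like, using only the properties mentioned above (values are illustrative and depend on your target file size and codec):
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=268435456;   -- trigger a merge when avg output file size is below ~256 MB
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.hadoop.supports.splittable.combineinputformat=true;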
[*] very old SF reference here :-)

Input Split always 2 in hadoop cluster HDinsight

I deployed a 5-node Hadoop MR cluster in Azure. I am using a bash script to perform job chaining, and the Hadoop streaming API, as my implementation is in Python.
My input data is always in a single file, but the size of that file ranges from 1 MB to 2 GB.
I want to create multiple mappers to handle this. I tried running using this command:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -D mapred.max.split.size=5000000 -files map2.py,red2.py -mapper map2.py -reducer red2.py -input wasb:///example/${ipfileA} -output wasb:///example/${opfile}
Here I have set the maximum split size to 5 MB.
I also tried setting the maximum block size to 5 MB.
However, when I run this, the number of input splits is always 2:
mapreduce.JobSubmitter: number of splits:2
And the number of map tasks launched is also always 2.
I want the number of map tasks to be set dynamically based on the size of the data. What should I do?
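One variant I have not tried yet, in case the deprecated mapred.max.split.size name is being ignored, is to set the MR2-style split-size keys explicitly (same job, only the -D options changed):
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -D mapreduce.input.fileinputformat.split.maxsize=5000000 \
  -D mapreduce.input.fileinputformat.split.minsize=1000000 \
  -files map2.py,red2.py -mapper map2.py -reducer red2.py \
  -input wasb:///example/${ipfileA} -output wasb:///example/${opfile}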

Morphline Read one big file

I have a Hive table that I am trying to index into SolrCloud using a morphline; however, the data behind the Hive table is ONE big 20 GB file, which the morphline is taking a long time to process.
Instead of running multiple mappers and reducers, only 1 mapper runs, probably because we have only one file.
yarn jar /opt/<path>/search-mr-1.0.0-cdh5.5.1-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphlines.conf \
--output-dir hdfs://<outputdir> \
--zk-host node1.datafireball.com:2181/solr \
--collection <collectionname> \
--input-list <filewherethedatais> \
--mappers 6
And it still kicked off only 1 mapper... and this is taking forever. Can anyone shed some light on this?
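The only workaround I can think of (assuming the tool runs at most one mapper per input file) is to pre-split the big file into several HDFS files with an identity streaming job and index the resulting directory instead. Paths and the streaming jar location below are placeholders, and this only makes sense if the records are plain line-oriented text:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapred.reduce.tasks=6 \
  -input /user/me/one_big_file \
  -output /user/me/presplit \
  -mapper cat \
  -reducer cat
# then point MapReduceIndexerTool at /user/me/presplit (or list its part files
# in the --input-list file) and keep --mappers 6
But I am not sure whether that is the right approach, or whether the mapper count can be raised directly.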
Resources you might find helpful:
Cloudera Mapreduce Batch Index into Solrcloud
Kite SDK, which Morphlines belongs to.

Hadoop streaming with single mapper

I am using Hadoop streaming, and I start the script as follows:
../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
-mapper ../tests/mapper.php \
-reducer ../tests/reducer.php \
-input data \
-output out
"data" is 2.5 GB txt file.
however in ps axf I can see only one mapper. i tried with -Dmapred.map.tasks=10, but result is the same - single mapper.
how can I make hadoop split my input file and start several mapper processes?
To elaborate on my comments: if your file isn't in HDFS and you're running with the local job runner, then the file will only be processed by a single mapper.
A large file is typically processed by several mappers because it is stored in HDFS as several blocks.
A 2.5 GB file with a block size of 512 MB will be split into ~5 blocks in HDFS. If the file is splittable (plain text, or compressed with a splittable codec such as snappy, but not gzip), then Hadoop will launch one mapper per block to process the file.
Hope this helps explain what you're seeing.
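If it is useful, a quick way to check both points (paths below are just examples):
# 1. make sure the file actually lives in HDFS rather than on the local disk
hadoop fs -put data /user/me/data
# 2. see how many blocks it occupies -- for splittable input, roughly one mapper per block
hadoop fsck /user/me/data -files -blocks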
