I am using DSE 3.2.4.
I have created three tables: one with 10M rows, another with 50k rows, and a third with just 10 rows.
When I run a simple Pig or Hive query over these tables, it runs the same number of mappers for all three.
In Pig, pig.splitCombination is true by default, and with that it runs only one map.
If I set it to false, it now runs 513 maps.
In Hive it runs 513 maps by default.
I tried setting the following properties:
mapred.min.split.size=134217728 in `mapred-site.xml`: now it runs 513 maps for all tables
set pig.splitCombination=false in the Pig shell: now it runs only 1 map for all the tables
But no luck.
Finally, I found mapred.map.tasks = 513 in job.xml.
I tried to change this in mapred-site.xml, but it is not reflected.
Please help me with this.
The number of mappers is driven by the split size, so don't configure it through the Hadoop settings. Pass `&split_size=` in your Pig URL, and set `cassandra.input.split.size` for Hive.
The default is 64M.
If your Cassandra cluster uses vnodes, it creates many splits, so if your data is not big enough, turn off vnodes on the Hadoop nodes.
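For example, a minimal sketch of both settings (the keyspace, table, and split size value here are illustrative, not from your setup):

-- Pig: append split_size to the storage URL
rows = LOAD 'cassandra://MyKeyspace/MyTable?split_size=131072' USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Hive: set the split size before querying the table
set cassandra.input.split.size=131072;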
I tried to process a large dataset (about 150 GB, word-labeling of sentences) using Tez, but the problem is that it took a very long time (a week or more), so I tried to specify the number of mappers.
Although I set mapred.map.tasks=2000, I cannot stop the mapper count from settling at about 150, so I cannot do what I want to do.
I specify the map value in the Oozie workflow file and use Tez.
How can I specify the number of mappers?
Ultimately I want to speed up the process; it is OK not to use Tez.
In addition, I would like to count the labeled sentences with a reducer, which also takes a long time.
I would also like to know how to adjust the memory size used by each mapper and reducer process.
To manually set the number of mappers in a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used: set tez.grouping.split-count=4 will create 4 mappers.
https://community.pivotal.io/s/article/How-to-manually-set-the-number-of-mappers-in-a-TEZ-Hive-job
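For your case, a minimal sketch in the Hive session (the values are illustrative, sized toward the ~2000 mappers you were aiming for):

set tez.grouping.split-count=2000;     -- ask Tez to group the input into ~2000 splits
-- or steer grouping by size instead of by count:
set tez.grouping.min-size=134217728;   -- 128 MB lower bound per grouped split
set tez.grouping.max-size=1073741824;  -- 1 GB upper bound per grouped split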
However, overall, you should optimize the storage format and the Hive partitions before you even begin tuning the Tez settings. Do not try to process data STORED AS TEXT in Hive; convert it to ORC or Parquet first.
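For example, a one-off conversion could be as simple as this (the table names are hypothetical):

CREATE TABLE sentences_orc STORED AS ORC AS SELECT * FROM sentences_text;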
If Tez isn't working out for you, you can always try Spark. Besides, labelling sentences is probably a Spark MLlib workflow you can find somewhere.
I have a problem in Hive.
When a Hive query runs against a large table, no application gets created: the app does not appear on the YARN monitoring web page, and Beeline just keeps waiting.
But the same query on a small table works normally.
I do not even know why the app cannot be registered.
As far as I know, when you run a query in Hive, the app gets registered (even if resources are insufficient).
The table size is 3.3 TB (Parquet format), and my cluster environment looks like this:
Hadoop version: CDH 5.8 (Hadoop 2.6)
Hive: 1.1
Nodes: 17 datanodes
Total memory: 1.6 TB
Total cores: 136
Hive execution engine: I tested both MR and Spark, with the same result.
I have a Hive external table with 255 columns and an input data size of around 25 GB. This is a single-node cluster set up with Hadoop 1.2.1 and Hive 0.11.0.
I am able to create tables, databases, etc., but when I try a count(*) query in Hive, the mappers succeed and the reducers never start; they are stuck at 0% forever.
The single-node machine has 1 TB of memory. Any inputs here will be greatly appreciated.
My suggestion is to use Beeline instead of the Hive CLI. The Hive CLI is deprecated, so some of its issues will never be resolved.
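Connecting with Beeline looks roughly like this (the host, port, and user are illustrative; 10000 is the usual HiveServer2 default port):

beeline -u jdbc:hive2://localhost:10000 -n hadoopuser

and then run the same count(*) query from the Beeline prompt.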
I am having a hard time figuring out why I get a different number of mappers when I run a query through Hive versus when I run an MR job on the same Hive table using HCatalog.
The difference is significant for the same input volume:
With a Hive query: 913 mappers
With MR + HCatalog: 3106 mappers
I am using the RC file storage format on the table I am accessing,
and I did not apply any tweak to the input split size in either place (Hive or MR).
Any clues why this is happening? I tried set mapred.max.split.size=536870912 while running the MR job, and that did have the effect of reducing the number of mappers.
I am new to Pig and HDFS. Here is what I am trying to do.
I have a lot of ill-formatted, LZO-compressed, flat-text server log files, about 2 GB each, generated by around 400 servers daily.
I am trying to take advantage of MapReduce to format and clean up the data in HDFS using my Java formatter, and then load the output into Hive.
My problem is that my Pig script spawns only one mapper, which takes around 15 minutes to read the file sequentially. This is not practical for the amount of data I have to load into Hive daily.
Here is my Pig script.
SET default_parallel 100;
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
SET mapred.min.split.size 256000;
SET mapred.max.split.size 256000;
SET pig.noSplitCombination true;
SET mapred.max.jobs.per.node 1;
register file:/apps/pig/pacudf.jar
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING PigStorage() as (field1);
pac = foreach raw1 generate pacudf.filegenerator(field1);
store pac into '/data/bazooka/';
It looks like the mapred.min.split.size setting isn't working. I can see only 1 mapper being initiated, which works on the whole 2 GB file on a single server of the cluster. As we have a 100-node cluster, I was wondering if I could make use of more servers in parallel by spawning more mappers.
Thanks in advance
Compression support in PigStorage does not provide splitting ability. For splittable LZO compression support with Pig, you would need the elephant-bird library from Twitter. Also, to get splitting to work properly with existing regular LZO files, you would need to index them prior to loading them in your Pig script.
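A rough sketch of that flow, assuming the stock hadoop-lzo and elephant-bird jars (the jar paths here are illustrative):

# index the .lzo file so it becomes splittable (writes a .lzo.index file next to it)
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /data/serverx/20120710/serverx_20120710.lzo

Then load through elephant-bird instead of PigStorage in the script:

register file:/apps/pig/elephant-bird-pig.jar
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader() AS (field1:chararray);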