I want each Hadoop mapper to process a separate portion of the data in an M/R job, and I would like to test, on a pseudo-distributed (single-node) setup, the case where a larger input would require many mappers. Given the size of my current input and the standalone mode I am experimenting on, I only ever see 1 map task.
My input comes from an HBase table, and I assumed that the number of regions in the table equals the number of mappers used to process its data.
So, to reproduce a case where many mappers process the input data, I pre-split the table from the shell like this:
create 't1', 'f1', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
I also tried 'UniformSplit' as the SPLITALGO. The number of mappers does increase to the specified number of regions (after importing data into the table), but in a subsequent test job that reads from this table, all of the input rows pass through a single mapper and the others process none.
I work on a pseudo-distributed (single-node) setup and I really don't know how to solve this. Does anyone have any ideas? Thanks!
Are you scanning the entire table or just a section of it? If you are scanning a section of the table, then that might be the cause of your problem as your data source isn't big enough to trigger multiple mappers.
You can try decreasing the region size in your hbase-site.xml configuration and restarting HBase to achieve the desired effect.
Lastly, how many mapper slots do you have in your mapred-site.xml configuration? If it is just 1, this will not limit the number of map tasks, but it will limit how many map tasks can run at the same time on that server.
Other than that, I don't think you have much control over the number of mappers per job, unlike the number of reducers.
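One possible explanation for the "all rows in one mapper" symptom, not confirmed by the poster's setup: pre-splitting only helps if the row keys actually spread across the split boundaries. The sketch below assumes that HexStringSplit with 4 regions yields the boundaries 40000000, 80000000 and c0000000 (my understanding of its behaviour), and that regions are assigned by unsigned byte-wise comparison of the row key. The class name, method names and sample row keys are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class HexSplitSketch {
    // Evenly spaced 8-character lowercase hex boundaries over [00000000, ffffffff].
    // Assumption: this mirrors what HexStringSplit produces; it is not HBase code.
    static List<String> splitPoints(int numRegions) {
        List<String> points = new ArrayList<>();
        long range = 0x100000000L; // 2^32 possible 8-digit hex values
        for (int i = 1; i < numRegions; i++) {
            points.add(String.format("%08x", range * i / numRegions));
        }
        return points;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Index of the region owning rowKey: the count of boundaries <= rowKey.
    static int regionFor(String rowKey, List<String> points) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        int region = 0;
        for (String p : points) {
            if (compareBytes(key, p.getBytes(StandardCharsets.UTF_8)) >= 0) {
                region++;
            }
        }
        return region;
    }

    public static void main(String[] args) {
        List<String> points = splitPoints(4);
        System.out.println("boundaries: " + points);
        // Hypothetical keys: the first three share a textual prefix, the rest are hex-like.
        String[] keys = {"row-000001", "row-000002", "row-999999", "1a2b3c4d", "5f000000", "e0aa0000"};
        for (String k : keys) {
            System.out.println(k + " -> region " + regionFor(k, points));
        }
    }
}
```

If the real keys look like the "row-..." samples, every one of them sorts after the last boundary, so one region (and therefore one mapper) receives all rows. Choosing split points that match the actual key distribution, or hashing the key prefix, would be the usual ways around this.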
I tried to process a large data set (about 150 GB) with Tez; the job labels the words of each sentence. The problem is that it took far too long (a week or more), so
I tried to specify the number of mappers.
Although I set mapred.map.tasks=2000,
the number of mappers still ends up at about 150,
so I can't do what I want to do.
I specify the map value in the Oozie workflow file and use Tez.
How can I specify the number of mappers?
Ultimately I want to speed up the process, and it is fine not to use Tez.
In addition, I would like to count the labeled sentences in a reducer, and that takes a long time too.
I would also like to know how to adjust the memory available to each mapper and reducer process.
In order to manually set the number of mappers in a Hive query when TEZ is the execution engine the configuration tez.grouping.split-count can be used...
... set tez.grouping.split-count=4 will create 4 mappers
However, overall, you should optimize the storage format and the Hive partitions before you even begin tuning the Tez settings. Do not try and process data STORED AS TEXT in Hive. Convert it to ORC or Parquet first.
If Tez isn't working out for you, you can always try Spark. Plus, labelling sentences is probably a Spark MLlib workflow you can find somewhere.
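The thread does not say why roughly 150 mappers appeared. One reading, under the assumption that Tez groups input splits up to a maximum group size of about 1 GiB (my understanding of the default for tez.grouping.max-size, which the answer above does not state), is that 150 GB divided by 1 GiB gives about 150 tasks. The sketch below only does that arithmetic; the class and method names are hypothetical.

```java
public class TezSplitEstimate {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;

    // Number of tasks if the input is cut into groups of at most bytesPerGroup.
    static long estimateTasks(long totalBytes, long bytesPerGroup) {
        return (totalBytes + bytesPerGroup - 1) / bytesPerGroup; // ceiling division
    }

    // Group size needed to end up with roughly the requested number of tasks.
    static long bytesPerTask(long totalBytes, long desiredTasks) {
        return (totalBytes + desiredTasks - 1) / desiredTasks; // ceiling division
    }

    public static void main(String[] args) {
        long input = 150L * GIB; // the poster's data set, taken as 150 GiB

        // Assumed 1 GiB maximum group size.
        System.out.println("tasks at 1 GiB per group: " + estimateTasks(input, GIB));

        // What each of 2000 mappers would have to read.
        long perTask = bytesPerTask(input, 2000);
        System.out.println("bytes per task for 2000 tasks: " + perTask
                + " (about " + perTask / MIB + " MiB)");
    }
}
```

Under these assumptions, getting close to 2000 mappers means either setting tez.grouping.split-count as described above or lowering the grouping size to somewhere around 77 MiB.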
I would like to know how I can map (transform) the value of a key.
I know that it can be done with a Get followed by a Put. Is there a more efficient way to do it? 'checkAndPut' is not very helpful.
Can it be done with something like:
(key,value) => value+g()
I have read the book HBase: The Definitive Guide, and it seems that a MapReduce job is translated into Put/Get operations on top of HBase. Does that mean it is not a 'bulk operation' (since it is one operation per key)?
How is Spark relevant here, if at all?
HBase has scans (1) to retrieve multiple rows; and MapReduce jobs can and do use this command (2).
For HBase, 'bulk' mostly [or solely] means 'bulk load'/'bulk import', where one adds data by constructing HFiles and 'injecting' them into the HBase cluster (as opposed to PUTs) (3).
Your task can be implemented as a MapReduce job as well as a Spark app (4 being one example, maybe not the best one), or a Pig script, or a Hive query if you use the HBase table from Hive (5); pick your poison.
If you set up a Table with a counter then you can use an Increment to add a certain amount to the existing value in an atomic operation.
From a MapReduce job you would aggregate your input in micro batches (wherever you have your incremental counts), group them by key/value, sum them up, and then issue a Put from your job (1 Put per key).
What I mentioned above is not a 'bulk' operation, but it would probably work just fine if the number of rows you modify in each batch is relatively small compared to the total number of rows in your table.
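The "group by key, sum, then one write per key" step can be sketched without any HBase dependency. The class and method names below are hypothetical, and the final write is only indicated in a comment; in a real job each summed entry would become a single Put or Increment against the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MicroBatchAggregator {
    // Collapses a batch of (rowKey, delta) pairs into one summed delta per row key.
    static Map<String, Long> sumByKey(String[] rowKeys, long[] deltas) {
        if (rowKeys.length != deltas.length) {
            throw new IllegalArgumentException("rowKeys and deltas must have the same length");
        }
        Map<String, Long> totals = new LinkedHashMap<>();
        for (int i = 0; i < rowKeys.length; i++) {
            totals.merge(rowKeys[i], deltas[i], Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical micro batch: five incoming counts touching three row keys.
        String[] rowKeys = {"user1", "user2", "user1", "user3", "user2"};
        long[] deltas = {1, 5, 2, 7, 1};

        for (Map.Entry<String, Long> e : sumByKey(rowKeys, deltas).entrySet()) {
            // Placeholder for the real write: one Put or Increment per row key.
            System.out.println("write " + e.getKey() + " += " + e.getValue());
        }
    }
}
```

Pre-aggregating like this turns five writes into three here; the saving grows with how often the same key repeats inside a batch.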
IFF you expect to modify your entire table in each batch, then you should look at Bulk Loads. This will require you to write a job that reads your existing values from HBase and your new values from the incremental sources, adds them together, and writes them back to HBase (in a 'bulk load' fashion, not directly).
A Bulk Load writes HFiles directly to HDFS without going through the HBase 'write pipeline' (Memstore, minor compactions, major compactions, etc.), and then issues a command to swap the existing files with the new ones. The swap is FAST! Note that you could also generate the new HFiles outside the HBase cluster (so as not to overload it), then copy them over and issue the swap command.
I am using
Hbase:0.92.1-cdh4.1.2, and
I have a mapreduce program that will load data from HDFS to HBase using HFileOutputFormat in cluster mode.
In that MapReduce program I'm using HFileOutputFormat.configureIncrementalLoad() to bulk load an 800000-record
data set of 7.3 GB, and it runs fine, but it does not run for a 900000-record data set of 8.3 GB.
With the 8.3 GB data set my MapReduce program has 133 maps and one reducer, and all maps complete successfully. The reducer stays in Pending status for a long time. There is nothing wrong with the cluster, since other jobs run fine and this job also runs fine with up to 7.3 GB of data.
What could i be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the JobTracker logs, I noticed there was not enough free space for the single reducer to run on any of my nodes:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_slave01.mydomain.com:localhost/ has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
This 503 GB refers to the free space available on one of the hard drives of that particular slave ("tracker_slave01.mydomain.com"), so the reducer apparently needs to copy all of the data to a single drive.
The reason this happens is your table only has one region when it is brand new. As data is inserted into that region, it'll eventually split on its own.
A solution to this is to pre-create your regions when creating your table. The Bulk Loading Chapter in the HBase book discusses this, and presents two options for doing this. This can also be done via the HBase shell (see create's SPLITS argument I think). The challenge though is defining your splits such that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
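// "my_table" is a placeholder; the original snippet omitted the table name.
HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor("my_table");
desc.addFamily(new HColumnDescriptor("my_col_fam"));
// Pre-split into 100 regions spanning the int key range 0..Integer.MAX_VALUE.
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);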
An alternative solution would be to not use configureIncrementalLoad, and instead: 1) just generate your HFiles via MapReduce with no reducers; 2) use the completebulkload feature in hbase.jar to import your records into HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
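To see why pre-creating regions helps with the "No room for reduce task" warning, it is enough to divide the expected reduce input from the log line by the number of reducers, since configureIncrementalLoad starts one reducer per region. This is a rough sketch that assumes keys are spread evenly across regions and that the scheduler compares per-reducer input against free space; the class and method names are hypothetical.

```java
public class ReducerSizing {
    // Expected input per reducer when the total is spread evenly over the regions.
    static long inputPerReducer(long totalReduceInput, int numRegions) {
        return (totalReduceInput + numRegions - 1) / numRegions; // ceiling division
    }

    // Smallest number of regions for which one reducer's share fits in the free space.
    static long minRegionsToFit(long totalReduceInput, long freeBytesPerNode) {
        return (totalReduceInput + freeBytesPerNode - 1) / freeBytesPerNode; // ceiling division
    }

    public static void main(String[] args) {
        long expectedReduceInput = 978136413988L; // from the log line above
        long freeBytes = 503777017856L;           // from the log line above

        System.out.println("regions needed to fit: "
                + minRegionsToFit(expectedReduceInput, freeBytes));
        System.out.println("bytes per reducer with 100 regions: "
                + inputPerReducer(expectedReduceInput, 100));
    }
}
```

With the figures from the log, a single reducer cannot fit on that node, while 100 evenly loaded regions would bring each reducer's share down to under 10 GB.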
Your job is running with a single reducer, which means 7 GB of data is being processed by a single task.
The main reason for this is that HFileOutputFormat starts a reducer that sorts and merges the data to be loaded into the HBase table.
Here, the number of reducers = the number of regions in the HBase table.
Increase the number of regions and you will achieve parallelism in the reducers. :)
You can get more details here:
In short, I need a way to give the Hadoop MapReduce API a hint about which host I'd like a certain reducer to run on, based on its partition. Is there any way?
Somewhat longer story:
I have a few mapper tasks which generate (or import from another source) records for a certain HBase table. Emitted records have ImmutableBytesWritable keys. The number of reducers for this job exactly matches the number of table regions, and a custom partitioner distributes the records so that the records of each region go to the corresponding reducer.
The reducers are intended to generate HFile images, one image per region, so that bulk load can be used on them later. The only serious problem is that I'd like the reducers to at least 'try to run' on the same hosts where the corresponding region servers are running. This is to get a good probability of HDFS locality of the generated HFiles for the corresponding HBase region servers.
Any idea how to get this behavior?
An alternative would be a way to 'request' that an HDFS file 'become local'. With that, I could start another MR job with mappers bound to the region servers (through splits) and request that the corresponding HFile become local.
There is no out-of-the-box way to do this yet, short of writing a custom scheduler, which would be overkill.
An upstream ticket does track this feature request at https://issues.apache.org/jira/browse/MAPREDUCE-199.
I have several HBase tables. I wish to run a map task on each table (each map being a different Mapper class, since each table contains heterogeneous data) followed by one reduce.
I cannot work out if this is possible without explicitly reducing the data after each map into an interim SequenceFile.
Any help would be gratefully received.
It seems you can only run an MR on one table at a time (see TableMapReduceUtil). So most probably, your best bet is as you suspected: save the output of each table into an interim location (e.g. SequenceFile or a tmp hbase table) and then write a final MR job that takes that location as an input and merges the results. Also, if each MR job outputs data in a common format, you may not even need the last MR merge job.