I have gone through the Sqoop documentation and did not find any information on why Sqoop 1 does not have a reducer phase. Can someone please explain this?
The purpose of the Reducer is to aggregate the input values and return a single output value.
Look at the simple example of WordCount in MapReduce. The Reducer is used to aggregate the number of occurrences of a single word.
Since the nature of a Sqoop job is to fetch the input records from the given RDBMS and put the records into the given output directory in HDFS or into a Hive table, the job does not require any aggregation and therefore no Reduce phase is needed.
The Reduce phase is not needed when all tasks can be executed independently in parallel.
Sqoop does not need a reducer because it simply imports/exports data between an RDBMS and HDFS (or Hive tables).
Since an RDBMS contains structured data, there is no need to shuffle or sort, and any aggregation can be done in the mapper itself.
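To make this concrete, here is a minimal sketch of a map-only MapReduce job in Java. This is not Sqoop's actual code, just an illustration of the principle: each mapper writes its input records straight to the output, and setNumReduceTasks(0) removes the reduce phase entirely, which is conceptually what a Sqoop import task does with the rows it fetches.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative map-only job: every record is passed through unchanged,
    // so there is nothing to shuffle, sort, or reduce.
    public class MapOnlyCopy {

      public static class CopyMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Write each record straight through; no aggregation needed.
          context.write(value, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only copy");
        job.setJarByClass(MapOnlyCopy.class);
        job.setMapperClass(CopyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0); // the key line: zero reducers = map-only job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }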
Related
I tried to process a large dataset (about 150 GB) with Tez (the job labels the words in each sentence), but it took far too long (a week or more), so I tried to specify the number of mappers.
Although I set mapred.map.tasks=2000, the number of mappers stays at about 150, so I can't do what I want to do.
I specify the value in an Oozie workflow file and run the job on Tez.
How can I specify the number of mappers?
Ultimately I just want to speed up the process; it is fine if that means not using Tez.
In addition, I would like to count the labeled sentences with a reducer, and that also takes a long time.
I would also like to know how to adjust the memory size used by each mapper and reducer process.
In order to manually set the number of mappers in a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used:
set tez.grouping.split-count=4 will create 4 mappers
https://community.pivotal.io/s/article/How-to-manually-set-the-number-of-mappers-in-a-TEZ-Hive-job
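If you are driving a plain MapReduce-on-Tez job programmatically instead of going through the Hive CLI, the same property can be set on the job configuration. A rough sketch, assuming the property behaves as described in the linked article (the class name and value here are just for illustration):

    import org.apache.hadoop.conf.Configuration;

    // Illustrative only: put the Tez split-grouping hint on a Configuration.
    // From Oozie, the same key/value pair would go into the action's
    // <configuration> block; from Hive, use "set tez.grouping.split-count=...".
    public class TezSplitCountExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("tez.grouping.split-count", 2000); // ask Tez to group input splits into ~2000 tasks
        System.out.println(conf.get("tez.grouping.split-count"));
      }
    }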
However, overall, you should optimize the storage format and the Hive partitions before you even begin tuning the Tez settings. Do not try and process data STORED AS TEXT in Hive. Convert it to ORC or Parquet first.
If Tez isn't working out for you, you can always try Spark. Plus, labelling sentences is probably a Spark MLlib workflow you can find somewhere.
I would like to know how I can map (transform) the value of a key.
I know that it can be done with a Get and then a Put operation. Is there any other way to do it efficiently? 'checkAndPut' is not very helpful.
Can it be done with something like:
(key,value) => value+g()
I have read the book HBase: The Definitive Guide, and it seems like a MapReduce job is translated into Put/Get operations on top of HBase. Does that mean it is not a 'bulk operation' (since it is one operation per key)?
Is Spark relevant here, and if so, how?
HBase has scans (1) to retrieve multiple rows; and MapReduce jobs can and do use this command (2).
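For context, a scan through the plain client API looks roughly like this (the table, column family, and qualifier names are placeholders):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: retrieve multiple rows with a Scan instead of one Get per key.
    public class ScanExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("mytable"))) {
          Scan scan = new Scan();
          scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
              byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
              System.out.println(Bytes.toString(result.getRow()) + " -> " + Bytes.toString(value));
            }
          }
        }
      }
    }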
For HBase, 'bulk' mostly [or solely] means 'bulk load'/'bulk import', where one adds data by constructing HFiles and 'injecting' them into the HBase cluster (as opposed to issuing Puts) (3).
Your task can be implemented as a MapReduce job, as a Spark app (4 being one example, maybe not the best one), as a Pig script, or as a Hive query if you use the HBase table from Hive (5); pick your poison.
If you set up a Table with a counter then you can use an Increment to add a certain amount to the existing value in an atomic operation.
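A rough sketch of that, assuming a placeholder table "counters" with a column family "cf" and qualifier "count":

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: add to a counter cell atomically on the server side,
    // with no read-modify-write round trip in the client.
    public class IncrementExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("counters"))) {
          Increment increment = new Increment(Bytes.toBytes("row-1"));
          increment.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), 5L);
          table.increment(increment); // adds 5 to whatever value is stored
          // For a single column there is also the shortcut
          // table.incrementColumnValue(row, family, qualifier, amount).
        }
      }
    }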
From a MapReduce job you would aggregate your input in micro batches (wherever you have your incremental counts), group them by key/value, sum them up, and then issue a Put from your job (1 Put per key).
What I mentioned above is not a 'bulk' operation, but it would probably work just fine if the number of rows that you modify in each batch is relatively small compared to the total number of rows in your table.
IFF you expect to modify your entire table at each batch, then you should look at Bulk Loads. This will require you to write a job that reads your existing values in HBase and your new values from the incremental sources, adds them together, and writes them back to HBase (in a 'bulk load' fashion, not directly).
A Bulk Load writes HFiles directly to HDFS without going through the HBase 'write pipeline' (Memstore, minor compactions, major compactions, etc.), and then issues a command to swap the existing files with the new ones. The swap is FAST! Note, you could also generate the new HFiles outside the HBase cluster (so as not to overload it) and then copy them over and issue the swap command.
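Very roughly, and using the HBase 1.x APIs (the table name, paths, and the omitted mapper are placeholders), the generate-then-swap flow looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch of a bulk load: a job writes HFiles to a staging directory,
    // then LoadIncrementalHFiles moves them into the live table regions.
    public class BulkLoadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("mytable");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             RegionLocator regionLocator = connection.getRegionLocator(tableName);
             Admin admin = connection.getAdmin()) {

          Job job = Job.getInstance(conf, "hfile generation");
          job.setJarByClass(BulkLoadSketch.class);
          // ... set a mapper here that emits (rowkey, Put) pairs ...
          HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
          FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles-staging"));

          if (job.waitForCompletion(true)) {
            // The "swap": hand the generated HFiles over to the table.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("/tmp/hfiles-staging"), admin, table, regionLocator);
          }
        }
      }
    }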
I have a MapReduce job (say Job1) in which the mapper extends
Mapper<Object, Object, KeySet, ValueSet>
Let's say I want to do a summation of all values in ValueSet in the reduction step.
After reducing (key, Iterable), I want to write the final reduced values to an HBase table instead of HDFS, in the reducer of Job1. The table in HBase will be used for future jobs.
I know I can write a mapper-only Job2 which reads the reduced file in HDFS (written by Job1) and imports the data into the HBase table, but I want to avoid two redundant I/O operations.
I don't want to change the Mapper class of Job1 to write to HBase, because there are only specific values that I want to write to the HBase table; the others I want to continue writing to HDFS.
Has anyone tried something similar and can provide pointers?
I've looked at HBase mapreduce: write into HBase in Reducer but my question is different since I don't want to write anything to HBase in the mapper.
I have written a MapReduce job for the data in HBase. It contains multiple mappers and just a single reducer. The reducer takes in the data supplied by the mappers and does some analytics on it. After the processing is complete for all the data in HBase, I want to write the result back to a file in HDFS through the single reducer. Presently I am able to write data to HDFS every time new data comes in, but I am unable to figure out how to write only the final conclusion to HDFS at the very end.
So, if you are trying to write a final result from a single reducer to HDFS, you can try any one of the approaches below:
1. Use the Hadoop API FileSystem's create() function to write to HDFS from the reducer.
2. Emit a single key and value from the reducer after the final calculation.
3. Override the Reducer's cleanup() function and do point (1) there (see the sketch after the link below).
Details on 3:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html#cleanup-org.apache.hadoop.mapreduce.Reducer.Context-
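A minimal sketch of option 3, assuming the "final conclusion" is something you can accumulate in a field across reduce() calls (the class name, field, and output path below are made up):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: accumulate across all reduce() calls, then write the single final
    // result to HDFS from cleanup(), which runs once after the last reduce() call.
    public class FinalResultReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

      private long runningTotal = 0; // replace with whatever "final conclusion" you compute

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        for (LongWritable value : values) {
          runningTotal += value.get();
        }
        // Per-key output (if any) can still go through context.write(...) here.
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        // Point (1) done inside point (3): open an HDFS file and write the final result once.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path out = new Path("/user/example/final-result.txt"); // placeholder path
        try (FSDataOutputStream stream = fs.create(out, true)) {
          stream.writeBytes("final total = " + runningTotal + "\n");
        }
      }
    }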
Hope this helps.
I have several HBase tables. I wish to run a map task on each table (each map being a different Mapper class, since each table contains heterogeneous data) followed by one reduce.
I cannot work out if this is possible without explicitly reducing the data after each map into an interim SequenceFile.
Any help would be gratefully received.
It seems you can only run an MR job on one table at a time (see TableMapReduceUtil). So most probably your best bet is as you suspected: save the output of each table into an interim location (e.g. a SequenceFile or a temporary HBase table) and then write a final MR job that takes that location as its input and merges the results. Also, if each MR job outputs data in a common format, you may not even need the last MR merge job.
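For reference, one of the per-table jobs along those lines might be wired up roughly like this (the table name, mapper logic, and interim path are placeholders; the final merge job is not shown):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    // Sketch: one map-only job per HBase table, each with its own TableMapper,
    // all writing a common (Text, LongWritable) format to SequenceFiles that a
    // final merge job can read from the interim locations.
    public class PerTableJob {

      public static class Table1Mapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          // Translate this table's heterogeneous data into the common output format.
          String rowKey = Bytes.toString(row.get(), row.getOffset(), row.getLength());
          context.write(new Text(rowKey), new LongWritable(1L));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "map table1");
        job.setJarByClass(PerTableJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);      // common settings for MR scans
        scan.setCacheBlocks(false);

        TableMapReduceUtil.initTableMapperJob(
            "table1", scan, Table1Mapper.class, Text.class, LongWritable.class, job);

        job.setNumReduceTasks(0);  // map-only; the merge happens in a later job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/interim/table1"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }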