How will RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob) call my custom mappers and reducers? - hadoop

I am running the Mahout in Action chapter 6 example using the command:
"hadoop jar target/mia-0.1-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output --usersFile input/users.txt --booleanData"
But the mappers and reducers in the chapter 6 example are not working. Why?

You have to change the code to use the custom Mapper and Reducer classes you have in mind. Otherwise, of course, it runs the ones that are currently in the code. Add your classes, change the caller to use them, recompile, and run it all on Hadoop. I am not sure what you are referring to as not working.

Related

Spark streaming jobs duration in program

How do I get, inside my program (which runs the Spark Streaming job), the time taken for each RDD job?
For example:
val streamrdd = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topicsSet)
val processrdd = streamrdd.map(some operations...).savetoxyz
In the above code, a job is run for the map and saveto operations on each micro-batch RDD.
I want to get the time taken for each streaming job. I can see the jobs in the UI on port 4040, but I want to get the timing in the Spark code itself.
Pardon me if my question is not clear.
You can use a StreamingListener in your Spark app. This interface provides a method, onBatchCompleted, that gives you the total time taken by the batch's jobs.
context.addStreamingListener(new StatusListenerImpl());
StatusListenerImpl is a class that you have to write yourself by implementing StreamingListener.
The listener has other methods as well; you should explore them too.

Hadoop passing variables from reducer to main

I am working on a MapReduce program. I'm trying to pass parameters to the context configuration in the reduce method using the setLong method, and then read them in main after the job completes.
In the reducer:
context.getConfiguration().setLong(key, someLong);
In main, after the job completes, I try to read the value using:
long val = job.getConfiguration().getLong(key, -1);
But I always get -1.
When I read the value inside the reducer, I see that it is set and I get the correct answer.
Am I missing something?
Thank you
You can use counters: set and update their values in the reducers, and then read them in your client application (main) after the job finishes.
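The counter approach can be sketched roughly as follows. This is a minimal sketch against the org.apache.hadoop.mapreduce API; the enum ReducerStats, the class names, and the summing logic are hypothetical and only illustrate incrementing a counter in the reducer and reading it in the driver once the job has finished.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CounterExample {

    // Hypothetical counter key; any enum constant can name a counter.
    public enum ReducerStats { TOTAL }

    // Hypothetical reducer that sums its values and mirrors the sum in a counter.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            // Unlike configuration changes, counter updates are reported back to the client.
            context.getCounter(ReducerStats.TOTAL).increment(sum);
            context.write(key, new LongWritable(sum));
        }
    }

    // Call this in the driver after job.waitForCompletion(true) has returned.
    public static long readTotal(Job job) throws IOException {
        Counter total = job.getCounters().findCounter(ReducerStats.TOTAL);
        return total.getValue();
    }
}
```

Counters hold long values and the framework adds up the increments from all tasks, so this fits totals and counts rather than arbitrary per-key results.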
You can pass configuration from main to a map or reduce task, but you cannot pass it back. Configuration is propagated as follows:
1. The MapReduce client generates a configuration file from the configuration you set in main and pushes it to an HDFS path used only by that job. The file is read-only.
2. When a map or reduce task is launched, it pulls the configuration file from that HDFS path and initializes its own configuration from the file.
If you want to pass values back, you can use another HDFS file: write it in the reducer and read it in main after the job completes.
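The one-way nature of this propagation can be illustrated with a small stand-alone model. This is not Hadoop code: java.util.Properties stands in for the Hadoop Configuration, a byte array stands in for the job file on HDFS, and all names are hypothetical. It only shows why a value set in a task's private copy never reaches the client.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class ConfigCopyModel {

    // Client side: serialize the job configuration once, before any task starts.
    public static byte[] publish(Properties clientConf) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        clientConf.store(out, "job configuration snapshot");
        return out.toByteArray();
    }

    // Task side: each task builds its own private copy from the published bytes.
    public static Properties loadInTask(byte[] jobFile) throws IOException {
        Properties taskConf = new Properties();
        taskConf.load(new ByteArrayInputStream(jobFile));
        return taskConf;
    }

    public static void main(String[] args) throws IOException {
        Properties clientConf = new Properties();
        clientConf.setProperty("threshold", "10");

        byte[] jobFile = publish(clientConf);
        Properties taskConf = loadInTask(jobFile);

        // The task sees what the client set before submission.
        System.out.println("task reads threshold = " + taskConf.getProperty("threshold"));

        // A value set inside the task stays in the task's copy only.
        taskConf.setProperty("result", "42");
        System.out.println("task reads result = " + taskConf.getProperty("result"));
        System.out.println("client reads result = " + clientConf.getProperty("result", "-1"));
    }
}
```

In this model the client falls back to the default -1, which mirrors what the question observed with job.getConfiguration().getLong(key, -1).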

Create Snapshot of FS from Spark Job

I would like to create a snapshot of the underlying HDFS when running a Spark job. The particular step involves deleting the contents of some Parquet files. I want to create a snapshot, perform the delete operation, verify the results, and proceed with the next steps.
However, I am unable to find a good way to access the HDFS API from my Spark job. The directory I want to snapshot is marked snapshottable in HDFS. The command-line method of creating the snapshot works; however, I need to do this programmatically.
I am running Spark 1.5 on CDH 5.5.
Any hints or clues as to how I can perform this operation?
Thanks
Ramdev
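I have not verified this, but at least I do not get compile errors, and in theory this solution should work.
This is Scala code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

val sc = new SparkContext()
val fs = FileSystem.get(sc.hadoopConfiguration)
// The directory must already be snapshottable.
val dir = new Path("path to createsnapshot of")
val snapshotPath = fs.createSnapshot(dir, "snapshot name")
.....
.....
if (condition satisfied) {
  // deleteSnapshot takes the snapshotted directory, not the path returned by createSnapshot.
  fs.deleteSnapshot(dir, "snapshot name")
}
I assume this will work in theory.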

Hive execution hook

I need to register a custom execution hook in Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop : Cloudera version 4.1.2
Operating system : Centos
Thanks,
Arun
There are several types of hooks, depending on the stage at which you want to inject your custom code:
Driver run hooks (Pre/Post)
Semantic analyzer hooks (Pre/Post)
Execution hooks (Pre/Failure/Post)
Client statistics publisher
If you run a script, the processing flow looks as follows:
1. Driver.run() takes the command
2. HiveDriverRunHook.preDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
3. Driver.compile() starts processing the command: creates the abstract syntax tree
4. AbstractSemanticAnalyzerHook.preAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
5. Semantic analysis
6. AbstractSemanticAnalyzerHook.postAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
7. Create and validate the query plan (physical plan)
8. Driver.execute(): ready to run the jobs
9. ExecuteWithHookContext.run() (HiveConf.ConfVars.PREEXECHOOKS)
10. ExecDriver.execute() runs all the jobs
11. For each job at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval:
    - ClientStatsPublisher.run() is called to publish statistics (HiveConf.ConfVars.CLIENTSTATSPUBLISHERS)
    - If a task fails: ExecuteWithHookContext.run() (HiveConf.ConfVars.ONFAILUREHOOKS)
12. Finish all the tasks
13. ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
14. Before returning the result: HiveDriverRunHook.postDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
15. Return the result.
For each of the hooks I indicated the interface you have to implement. In brackets is the corresponding configuration property key you have to set, at the beginning of the script, in order to register the class.
E.g., setting the pre-execution hook (9th stage of the workflow):
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks:
set hive.exec.pre.hooks=com.example.MyPreHook;
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think that the Cloudera distribution differs (too much).
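A pre-execution hook class could look roughly like the sketch below. It assumes the Hive ql classes (org.apache.hadoop.hive.ql.hooks) are on the compile classpath. The class name MyPreHook mirrors the set command in the example; it is shown without a package declaration for brevity, so it would need to live in the com.example package to match that command. The logging is purely illustrative.

```java
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical pre-execution hook: Hive instantiates the class named in
// hive.exec.pre.hooks and calls run() before the query's jobs are started.
public class MyPreHook implements ExecuteWithHookContext {

    @Override
    public void run(HookContext hookContext) throws Exception {
        // Illustrative only: print the id of the query that is about to run.
        String queryId = hookContext.getQueryPlan().getQueryId();
        System.err.println("pre-exec hook called for query " + queryId);
    }
}
```

The jar containing the hook class has to be on Hive's classpath (for example through hive.aux.jars.path) before the set command can register it.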
A good starting point is http://dharmeshkakadia.github.io/hive-hook/
It contains examples.
Note: the Hive CLI shows the hook's messages on the console. If you execute from Hue, add a logger and you can see the results in the HiveServer2 role log.

error in hadoop mapreduce program

I am trying to write data from HBase to HDFS and encountered this error during compilation. Is it a problem with the reducer code or something else?
HbaseFile.java:36: setReducerClass(java.lang.Class) in org.apache.hadoop.mapreduce.Job cannot be applied to (java.lang.Class)
job.setReducerClass(CountWordReducer.class);
^
HbaseFile.java:38: setOutputPath(org.apache.hadoop.mapred.JobConf,org.apache.hadoop.fs.Path) in org.apache.hadoop.mapred.FileOutputFormat cannot be applied to (org.apache.hadoop.mapreduce.Job,org.apache.hadoop.fs.Path)
FileOutputFormat.setOutputPath(job, new Path(args[0]));
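Judging from the packages you are using, you are mixing the old and the new API. To fix this problem you will have to pick one and use it consistently.
Notice that your Job is the new API's org.apache.hadoop.mapreduce.Job, but you are trying to use the old API to set the output path. I can tell because that method takes the old org.apache.hadoop.mapred.JobConf. With the new API, use org.apache.hadoop.mapreduce.lib.output.FileOutputFormat instead. The first error most likely has the same cause: CountWordReducer is probably written against the old org.apache.hadoop.mapred.Reducer interface, while Job.setReducerClass expects a subclass of org.apache.hadoop.mapreduce.Reducer.
If you see "org.apache.hadoop.mapreduce" and "org.apache.hadoop.mapred" in your code at the same time, you are probably mixing the APIs and should change things around to use just one.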
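A driver that stays entirely on the new API might look like the sketch below. It assumes Hadoop 2.x or later (for Job.getInstance). The reducer name CountWordReducer comes from the error message, but its type parameters, the job name, and the configure helper are hypothetical, and the HBase mapper setup is left out.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
// New-API output format: note "mapreduce.lib.output", not "mapred".
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiDriver {

    // Hypothetical reducer: it extends the new-API Reducer class, so
    // Job.setReducerClass accepts it. It inherits the identity reduce().
    public static class CountWordReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
    }

    // Builds a job using only org.apache.hadoop.mapreduce classes.
    public static Job configure(Configuration conf, String outputDir) throws IOException {
        Job job = Job.getInstance(conf, "hbase to hdfs");
        job.setJarByClass(NewApiDriver.class);
        job.setReducerClass(CountWordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // This overload takes the new-API Job rather than the old JobConf.
        FileOutputFormat.setOutputPath(job, new Path(outputDir));
        return job;
    }
}
```

With every import coming from org.apache.hadoop.mapreduce, both setReducerClass and setOutputPath resolve against the same API generation.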
