I am working on a MapReduce program. I'm trying to set parameters on the context configuration in the reduce method using setLong, and then read them in main after the job completes.
In the reducer:
context.getConfiguration().setLong(key, someLong);
In main, after the job completes, I try to read the value using:
long val = job.getConfiguration().getLong(key, -1);
but I always get -1.
When I read the value inside the reducer, I see that it is set and I get the correct answer.
Am I missing something?
Thank you
You can use counters: set and update their values in the reducers, and then read them in your client application (main) once the job has finished.
You can pass configuration from main to a map or reduce task, but you cannot pass it back. Configuration is propagated as follows:
A configuration file is generated on the MapReduce client from the configuration you set in main, and it is pushed to an HDFS path used only by that job. The file is read-only.
When a map or reduce task is launched, the configuration file is pulled from that HDFS path, and the task initializes its configuration from the file.
If you want to pass values back, you can use another HDFS file: write it in the reducer, and read it after the job completes.
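The one-way flow of configuration described above can be imitated without a cluster. The following is only a toy model, not Hadoop code: java.util.Properties stands in for the job Configuration, a byte array stands in for the job file on HDFS, and the class and method names (ConfigSnapshotDemo, submit, launchTask) as well as the property keys are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Toy model only: Properties plays the role of the job configuration and a
// byte array plays the role of the read-only job file stored on HDFS.
class ConfigSnapshotDemo {

    // "Client side": serialize the configuration once, at submission time.
    static byte[] submit(Properties clientConf) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        clientConf.store(out, "job configuration snapshot");
        return out.toByteArray();
    }

    // "Task side": every task builds its own private copy from the snapshot.
    static Properties launchTask(byte[] jobFile) throws IOException {
        Properties taskConf = new Properties();
        taskConf.load(new ByteArrayInputStream(jobFile));
        return taskConf;
    }

    public static void main(String[] args) throws IOException {
        Properties clientConf = new Properties();
        clientConf.setProperty("input.threshold", "10"); // hypothetical key

        byte[] jobFile = submit(clientConf);
        Properties taskConf = launchTask(jobFile);

        // The task sees what the client set before submission ...
        System.out.println("task reads threshold = " + taskConf.getProperty("input.threshold"));

        // ... but a value set inside the task only changes the task's own copy.
        taskConf.setProperty("my.result", "42"); // hypothetical key
        System.out.println("task reads result    = " + taskConf.getProperty("my.result"));
        System.out.println("client reads result  = " + clientConf.getProperty("my.result", "-1"));
    }
}
```

In this model the client falls back to its default of -1 for my.result, which is the same symptom as in the question: the value was only ever written to the task's private copy.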
How do I get, inside my program (which runs a Spark Streaming job), the time taken for each RDD job?
For example:
val streamrdd = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topicsSet)
val processrdd = streamrdd.map(some operations...).savetoxyz
In the above code, a job is run for the map and saveto operations on each micro-batch RDD.
I want to get the time taken for each streaming job. I can see the jobs in the UI on port 4040, but I want to get the timing in the Spark code itself.
Pardon if my question is not clear.
You can use a StreamingListener in your Spark app. This interface provides a method, onBatchCompleted, that can give you the total time taken by the batch's jobs.
context.addStreamingListener(new StatusListenerImpl());
StatusListenerImpl is a class that you write yourself by implementing StreamingListener.
The listener has other callback methods as well; they are worth exploring.
My Spark application processes files (average size 20 MB) with a custom Hadoop input format and stores the result in HDFS.
The following is the code snippet.
Configuration conf = new Configuration();
JavaPairRDD<Text, Text> baseRDD = ctx
.newAPIHadoopFile(input, CustomInputFormat.class,Text.class, Text.class, conf);
JavaRDD<myClass> mapPartitionsRDD = baseRDD
.mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
//my logic goes here
//few more transformations
This application creates one task/partition per file, processes it, and stores the corresponding part file in HDFS.
I.e., for 10,000 input files, 10,000 tasks are created and 10,000 part files are stored in HDFS.
Both mapPartitions and map operations on baseRDD create one task per file.
The SO question
How to set the number of partitions for newAPIHadoopFile?
suggests setting
conf.setInt("mapred.max.split.size", 4); to configure the number of partitions.
But when this parameter is set, CPU utilization is at maximum and no stage starts even after a long time.
If I don't set this parameter, the application completes successfully as described above.
How to set number of partitions with newAPIHadoopFile and increase the efficiency?
What happens with mapred.max.split.size option?
Regarding what happens with the mapred.max.split.size option: in this use case the files are small, so changing the split size options is not relevant here.
More info in this SO question: Behavior of the parameter "mapred.min.split.size" in HDFS
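One possible explanation for the stall reported in the question is the size of the value itself: mapred.max.split.size is expressed in bytes, so 4 asks for splits of at most 4 bytes. The estimate below is a rough sketch, assuming CustomInputFormat extends FileInputFormat and honors this setting (the question does not say so). The class and method names are made up, and the real split calculation also takes block size, minimum split size, and a small slop factor into account.

```java
// Back-of-the-envelope estimate only; not the actual FileInputFormat logic.
class SplitCountEstimate {

    // Approximate number of splits for one file: ceil(fileBytes / maxSplitBytes).
    static long splitsPerFile(long fileBytes, long maxSplitBytes) {
        return (fileBytes + maxSplitBytes - 1) / maxSplitBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 20L * 1024 * 1024; // 20 MB average file, from the question
        long files = 10_000L;               // number of input files, from the question

        long perFile = splitsPerFile(fileBytes, 4L); // maximum split size of 4 bytes
        System.out.println("splits per file: " + perFile);
        System.out.println("splits in total: " + perFile * files);

        // With a cap larger than the file, each file still yields one split,
        // so the cap alone cannot merge several small files into one partition.
        System.out.println("splits per file with a 64 MB cap: "
                + splitsPerFile(fileBytes, 64L * 1024 * 1024));
    }
}
```

Under these assumptions a single 20 MB file would be cut into millions of splits, which would keep the driver busy computing splits before any stage can start. The last line also illustrates why split size settings do not reduce the partition count for small files: a split does not span more than one file.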
Just use baseRDD.repartition(<a sane amount>).mapPartitions(...). That will move the resulting operation to fewer partitions, especially if your files are small.
It's not clear to me how one should configure Hadoop MapReduce log4j at the job level. Can someone help me answer these questions?
1) How do I add log4j logging support from a client machine? I.e., I want to use a log4j properties file on the client machine, so that I don't disturb the Hadoop log4j setup on the cluster. I would think that having the properties file in the project/jar should suffice, and that Hadoop's distributed cache should do the rest when transferring the map-reduce jar.
2) How do I log messages to a custom file in the $HADOOP_HOME/logs/userlogs/job_/ dir?
3) Will a map-reduce task use both log4j properties files, the one supplied by the client job and the one present on the Hadoop cluster? If yes, would log4j.rootLogger combine both property values?
Srivatsan Nallazhagappan
You can configure log4j directly in your code. For example, you can call PropertyConfigurator.configure(properties), e.g. in the mapper/reducer setup method.
This is an example with the properties stored on HDFS:
InputStream is = fs.open(log4jPropertiesPath);
Properties properties = new Properties();
// Load the log4j settings from the HDFS stream
properties.load(is);
where fs is a FileSystem object and log4jPropertiesPath is a path on HDFS.
With this you can also write logs to a directory named after the job ID. For example, you can modify the properties before calling PropertyConfigurator.configure(properties):
Enumeration<?> propertiesNames = properties.propertyNames();
while (propertiesNames.hasMoreElements()) {
    String propertyKey = (String) propertiesNames.nextElement();
    String propertyValue = properties.getProperty(propertyKey);
    // Replace the job ID placeholder, if present, with the actual job ID
    if (propertyValue.indexOf(JOB_ID_PATTERN) != -1) {
        properties.setProperty(propertyKey, propertyValue.replace(JOB_ID_PATTERN, context.getJobID().toString()));
    }
}
PropertyConfigurator.configure(properties);
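To try the placeholder replacement without HDFS or a running job, the same idea can be exercised on an in-memory properties text. This is a minimal sketch under stated assumptions: the placeholder token @JOB_ID@, the appender name FILE, the log path, and the job ID string are all made up for illustration, and the call to PropertyConfigurator is left out so that the sketch has no log4j dependency.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class JobIdPlaceholderDemo {
    // Hypothetical placeholder token; the answer only names the constant JOB_ID_PATTERN.
    static final String JOB_ID_PATTERN = "@JOB_ID@";

    // Returns a copy of the properties with the placeholder replaced by jobId.
    static Properties withJobId(Properties source, String jobId) {
        Properties result = new Properties();
        for (String key : source.stringPropertyNames()) {
            result.setProperty(key, source.getProperty(key).replace(JOB_ID_PATTERN, jobId));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Made-up log4j settings, inlined so the sketch runs without HDFS.
        String text = "log4j.rootLogger=INFO, FILE\n"
                + "log4j.appender.FILE=org.apache.log4j.FileAppender\n"
                + "log4j.appender.FILE.File=/tmp/logs/@JOB_ID@/custom.log\n";
        Properties properties = new Properties();
        properties.load(new StringReader(text));

        // In a real task the ID would come from context.getJobID().toString().
        Properties resolved = withJobId(properties, "job_201301010000_0001");
        System.out.println(resolved.getProperty("log4j.appender.FILE.File"));
    }
}
```

Building a copy instead of editing the Properties object while iterating over it is a design choice of this sketch; it keeps the original template reusable.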
There is no straightforward way to override the log4j properties at the job level.
A MapReduce job itself doesn't store its logs in HDFS; it writes them to the local file system (${hadoop.log.dir}/userlogs) of the nodes where its tasks run. A separate YARN process called log aggregation collects and combines those logs.
Use yarn logs -applicationId <appId> to fetch the full log, then use Unix commands to parse it and extract the part you need.
I am running the Mahout in Action example for chapter 6 using the command:
"hadoop jar target/mia-0.1-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output --usersFile input/users.txt --booleanData"
But the mappers and reducers in the chapter 6 example are not working.
You have to change the code to use the custom Mapper and Reducer classes you have in mind. Otherwise yes of course it runs the ones that are currently in the code. Add them, change the caller, recompile, and run it all on Hadoop. I am not sure what you refer to that is not working.
How does the task tracker get the data for a map task from another node when the data is not local?
Does it talk directly to the data node of the machine containing the data, or does it talk to its own data node, which in turn talks to the other one?
The task tracker itself doesn't get the data - it launches (or reuses) a JVM to run a Map task. The map task uses the DFS File System client to query the name node for the block locations of the file it is to process. The client then connects to the data node where one of the blocks is replicated to actually acquire the file contents (as a stream).
If you want to delve deeper, the source is an excellent place to get a good understanding. Check out DFSClient and its inner class DFSInputStream (especially the bestNode method):
Class starts around line 1443
openInfo() method: line 1494
chooseDataNode() method: line 1800