In Hadoop I have a Reducer that looks like this; it transforms data from a prior mapper into a series of files of a type that is not InputFormat compatible.
private LocalDatabase ld;

protected void setup(Context context) {
    ld = new LocalDatabase("localFilePath");
}

protected void reduce(BytesWritable key, Text value, Context context) {
    ld.addValue(key, value);
}

protected void cleanup(Context context) {
    saveLocalDatabaseInHDFS(ld);
}
I am rewriting my application in Pig and can't figure out how this would be done in a Pig UDF, as there's no cleanup function or anything else to denote when the UDF has finished running. How can this be done in Pig?
I would say you'd need to write a StoreFunc UDF, wrapping your own custom OutputFormat - then you'd have the ability to close out in the OutputFormat's RecordWriter.close() method.
This will create a database in HDFS for each reducer, however, so if you want everything in a single file you'd need to run with a single reducer or run a secondary step to merge the databases together.
http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
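As a rough, untested sketch of that shape (LocalDatabaseStorage and LocalDatabaseOutputFormat are made-up names, and the LocalDatabase / saveLocalDatabaseInHDFS calls are the placeholders from your own code):

public class LocalDatabaseStorage extends StoreFunc {

    // Custom OutputFormat whose RecordWriter builds the local database and
    // flushes it to HDFS once per task in close().
    public static class LocalDatabaseOutputFormat extends FileOutputFormat<WritableComparable, Tuple> {
        @Override
        public RecordWriter<WritableComparable, Tuple> getRecordWriter(TaskAttemptContext ctx) {
            return new RecordWriter<WritableComparable, Tuple>() {
                private final LocalDatabase ld = new LocalDatabase("localFilePath");

                @Override
                public void write(WritableComparable key, Tuple value) throws IOException {
                    ld.addValue(value.get(0), value.get(1)); // adapt to your tuple schema
                }

                @Override
                public void close(TaskAttemptContext context) throws IOException {
                    saveLocalDatabaseInHDFS(ld); // runs once per task, like Reducer.cleanup()
                }
            };
        }
    }

    private RecordWriter writer;

    @Override
    public OutputFormat getOutputFormat() {
        return new LocalDatabaseOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            writer.write(null, t);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}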
If you want something to run at the end of your UDF, use the finish() call. This will be called after all records have been processed by your UDF. It will be called once per mapper or reducer, the same as the cleanup call in your reducer.
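A minimal sketch of where that hook sits, reusing the placeholder names from the question (not a complete UDF):

public class LocalDatabaseEval extends EvalFunc<String> {
    private LocalDatabase ld; // placeholder class from the question

    @Override
    public String exec(Tuple input) throws IOException {
        if (ld == null) {
            ld = new LocalDatabase("localFilePath");
        }
        ld.addValue(input.get(0), input.get(1)); // adapt to your schema
        return null;
    }

    @Override
    public void finish() {
        // called once, after the last record this mapper/reducer processes
        saveLocalDatabaseInHDFS(ld);
    }
}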
I'm having trouble understanding how Spark interacts with storage.
I would like to make a Spark cluster that fetches data from a RocksDB database (or any other key-value store). However, at this moment, the best I can do is fetch the whole dataset from the database into memory in each of the cluster nodes (into a map for example) and build an RDD from that object.
What do I have to do to fetch only the necessary data (like Spark does with HDFS)? I've read about Hadoop's InputFormat and RecordReader, but I'm not completely grasping what I should implement.
I know this is kind of a broad question, but I would really appreciate some help to get me started. Thank you in advance.
Here is one possible solution. I assume you have a client library for the key-value store (RocksDB in your case) that you want to access.
KeyValuePair is a bean class representing one key-value pair from your key-value store.
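For example, a minimal version of such a bean might look like this (the field names are just an illustration, not part of any library):

class KeyValuePair implements java.io.Serializable {
    private byte[] key;
    private byte[] value;

    public KeyValuePair() { }  // no-arg constructor for the dummy instance used below
    public KeyValuePair(byte[] key, byte[] value) {
        this.key = key;
        this.value = value;
    }

    public byte[] getKey()   { return key; }
    public byte[] getValue() { return value; }
}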
Classes
/* Lazy iterator to read from the key-value store */
class KeyValueIterator implements Iterator<KeyValuePair> {
    public KeyValueIterator() {
        //TODO initialize your custom reader using the java client library
    }

    @Override
    public boolean hasNext() {
        //TODO
    }

    @Override
    public KeyValuePair next() {
        //TODO
    }
}

class KeyValueReader implements FlatMapFunction<KeyValuePair, KeyValuePair> {
    @Override
    public Iterator<KeyValuePair> call(KeyValuePair keyValuePair) throws Exception {
        // ignore the empty 'keyValuePair' object
        return new KeyValueIterator();
    }
}
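To make the TODOs concrete, here is one way the iterator could be backed by the RocksDB Java client (a hedged sketch; the database path is a placeholder and error handling / closing of the store is omitted):

class RocksDBKeyValueIterator implements Iterator<KeyValuePair> {
    private RocksIterator it;

    public RocksDBKeyValueIterator() {
        try {
            RocksDB.loadLibrary();
            // open the store read-only on the executor; "/path/to/rocksdb" is a placeholder
            RocksDB db = RocksDB.openReadOnly("/path/to/rocksdb");
            it = db.newIterator();
            it.seekToFirst();
        } catch (RocksDBException e) {
            throw new RuntimeException("Could not open RocksDB", e);
        }
    }

    @Override
    public boolean hasNext() {
        return it.isValid();
    }

    @Override
    public KeyValuePair next() {
        KeyValuePair pair = new KeyValuePair(it.key(), it.value());
        it.next();
        return pair;
    }
}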
Create KeyValue RDD
/*list with a dummy KeyValuePair instance*/
ArrayList<KeyValuePair> keyValuePairs = new ArrayList<>();
keyValuePairs.add(new KeyValuePair());
JavaRDD<KeyValuePair> keyValuePairRDD = javaSparkContext.parallelize(keyValuePairs);
/*Read one key-value pair at a time lazily*/
keyValuePairRDD = keyValuePairRDD.flatMap(new KeyValueReader());
Note:
The above solution creates an RDD with two partitions by default (one of them will be empty). Increase the partitions before applying any transformation on keyValuePairRDD to distribute the processing across executors.
Different ways to increase partitions:
keyValuePairRDD.repartition(partitionCounts)
//OR
keyValuePairRDD.partitionBy(...)
I want to store some values from a map task on the local disk of each data node. For example,
public void map (...) {
//Process
List<Object> cache = new ArrayList<Object>();
//Add value to cache
//Serialize cache to local file in this data node
}
How can I store this cache object on the local disk of each data node? If I store the cache inside the map function like above, the performance will be terrible because of the I/O.
What I mean is: is there any way to wait until the map task on this data node has finished completely and only then store the cache to local disk? Or does Hadoop have a function to solve this issue?
Please see the example below; the created file will be somewhere under the directories used by the NodeManager for containers. This is the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or the default inherited from yarn-default.xml, which is under /tmp.
Please see @Chris Nauroth's answer,
which says that it's just for debugging purposes and is not recommended as a permanent production configuration. It clearly describes why it is not recommended.
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
// do some hadoop stuff, like counting words
String path = "newFile.txt";
try {
File f = new File(path);
f.createNewFile();
} catch (IOException e) {
System.out.println("Message easy to look up in the logs.");
System.err.println("Error easy to look up in the logs.");
e.printStackTrace();
throw e;
}
}
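Regarding the second part of the question (waiting until the map task has finished before writing): this is not covered by the answer above, but a common pattern is to buffer in memory during map() and flush once in cleanup(), which Hadoop calls after the last record of the task. A rough sketch (the input/output types and the serialization step are placeholders):

public class CachingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final List<Object> cache = new ArrayList<Object>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // process the record and add whatever you need to the in-memory cache
        cache.add(value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // runs once per map task, after all records have been processed
        // TODO serialize 'cache' to a local file (or to HDFS) here
    }
}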
I have a Hive UDF which is supposed to extract the device from a UA string. It uses the ua-parser library:
https://github.com/tobie/ua-parser
The UDF is rather simple:
public class DeviceTypeExtractTest extends UDF{
private Text result = new Text();
private static final Parser uaParser;
static {
try {
uaParser = new Parser();
}
catch(IOException e) {
throw new RuntimeException("Could not instantiate User-Agent parser.");
}
}
public Text evaluate( Text uaField){
if (uaField == null ) {
return null;
}
try
{
String uaString = uaField.toString();
Client client = uaParser.parse(uaString);
result.set(client.device.family);
return result;
}
catch(Exception e)
{
return null;
}
}
}
And it works just fine when run on a small dataset.
create table categories(
cat string);
insert overwrite table categories select DEVICE_TYPE_EXTRACT(user_agent) from raw_logs;
However, when testing this on a larger dataset of over 10 million rows, I get this LeaseExpiredException on every attempt:
http://pastebin.com/yK6Qmx6r
And my map and reduce processes remain stuck at 0% for hours. Note that if I take out this UDF and use some built-in Hive UDFs just for testing, this behavior does not occur.
I am running this on an Amazon EMR cluster with AMI version 2.4.5 (Hive 0.11.0.2 and Hadoop 1.0.3).
I tried increasing the performance of the cluster by deploying better hardware, but I get the same problem with any hardware scenario.
Any ideas?
Okay, scratch that. It seems that after upgrading my instance, things started to move along, but I was just not waiting long enough for the mapping to happen. And the LeaseExpiredException was actually thrown because of little ol' me killing the processes.
Still, the parsing is taking an immense amount of time and I would love some suggestions to further optimize this UDF.
Problem solved eventually; check my solution at the bottom.
Recently I have been trying to run the recommender example in chapter 6 (listings 6.1 ~ 6.4) of Mahout in Action. I encountered a problem, and I have googled around but can't find the solution.
Here is the problem: I have a mapper-reducer pair:
public final class WikipediaToItemPrefsMapper extends
Mapper<LongWritable, Text, VarLongWritable, VarLongWritable> {
private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
Matcher m = NUMBERS.matcher(line);
m.find();
VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
VarLongWritable itemID = new VarLongWritable();
while (m.find()) {
itemID.set(Long.parseLong(m.group()));
context.write(userID, itemID);
}
}
}
public class WikipediaToUserVectorReducer
extends
Reducer<VarLongWritable, VarLongWritable, VarLongWritable, VectorWritable> {
public void reduce(VarLongWritable userID,
Iterable<VarLongWritable> itemPrefs, Context context)
throws IOException, InterruptedException {
Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
for (VarLongWritable itemPref : itemPrefs) {
userVector.set((int) itemPref.get(), 1.0f);
}
context.write(userID, new VectorWritable(userVector));
}
}
The reducer outputs a userID and a userVector, and it looks like this:
98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}
Then I want to use another mapper-reducer pair to process this data:
public class UserVectorSplitterMapper
extends
Mapper<VarLongWritable, VectorWritable, IntWritable, VectorOrPrefWritable> {
public void map(VarLongWritable key, VectorWritable value, Context context)
throws IOException, InterruptedException {
long userID = key.get();
Vector userVector = value.get();
Iterator<Vector.Element> it = userVector.iterateNonZero();
IntWritable itemIndexWritable = new IntWritable();
while (it.hasNext()) {
Vector.Element e = it.next();
int itemIndex = e.index();
float preferenceValue = (float) e.get();
itemIndexWritable.set(itemIndex);
context.write(itemIndexWritable,
new VectorOrPrefWritable(userID, preferenceValue));
}
}
}
When I try to run the job, it throws an error saying
org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
The first mapper-reducer writes its output to HDFS, and the second mapper-reducer tries to read that output; its mapper can cast 98955 to VarLongWritable, but can't convert
{590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0} to VectorWritable. So I am wondering: is there a way to make the first mapper-reducer send its output directly to the second pair, so there is no need to do the data conversion? I have looked in Hadoop in Action and Hadoop: The Definitive Guide, and it seems there is no way to do that. Any suggestions?
Problem solved
Solution: By using SequenceFileOutputFormat, we can output and save the reduce result of the first MapReduce workflow on HDFS; then the second MapReduce workflow can read the temporary file as input by passing the SequenceFileInputFormat class as a parameter when creating the mapper. Since the vector is saved in a binary sequence file with a specific format, SequenceFileInputFormat can read it and transform it back into vector form.
Here is some example code:
confFactory ToItemPrefsWorkFlow = new confFactory
(new Path("/dbout"), //input file path
new Path("/mahout/output.txt"), //output file path
TextInputFormat.class, //input format
VarLongWritable.class, //mapper key format
Item_Score_Writable.class, //mapper value format
VarLongWritable.class, //reducer key format
VectorWritable.class, //reducer value format
SequenceFileOutputFormat.class //the reducer output format
);
ToItemPrefsWorkFlow.setMapper( WikipediaToItemPrefsMapper.class);
ToItemPrefsWorkFlow.setReducer(WikipediaToUserVectorReducer.class);
JobConf conf1 = ToItemPrefsWorkFlow.getConf();
confFactory UserVectorToCooccurrenceWorkFlow = new confFactory
(new Path("/mahout/output.txt"),
new Path("/mahout/UserVectorToCooccurrence"),
SequenceFileInputFormat.class, //notice that the input format of mapper of the second work flow is now SequenceFileInputFormat.class
//UserVectorToCooccurrenceMapper.class,
IntWritable.class,
IntWritable.class,
IntWritable.class,
VectorWritable.class,
SequenceFileOutputFormat.class
);
UserVectorToCooccurrenceWorkFlow.setMapper(UserVectorToCooccurrenceMapper.class);
UserVectorToCooccurrenceWorkFlow.setReducer(UserVectorToCooccurrenceReducer.class);
JobConf conf2 = UserVectorToCooccurrenceWorkFlow.getConf();
JobClient.runJob(conf1);
JobClient.runJob(conf2);
If you have any problem with this please feel free to contact me
You need to explicitly configure the output of the first job to use the SequenceFileOutputFormat and define the output key and value classes:
job.setOutputFormat(SequenceFileOutputFormat.class);
job.setOutputKeyClass(VarLongWritable.class);
job.setOutputValueClass(VectorWritable.class);
Without seeing your driver code, I'm guessing you're using TextOutputFormat as the output of the first job and TextInputFormat as the input to the second - and this input format sends pairs of <Text, Text> to the second mapper.
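Correspondingly, the second job needs to be told to read the sequence file back; with the old JobConf API this is a one-liner (job2 here is just a stand-in for whatever configures your second job):

job2.setInputFormat(SequenceFileInputFormat.class);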
I am a beginner in Hadoop, and this is just my guess at an answer, so please bear with it / point it out if it seems naive.
I think it's not reasonable to send data from the reducer to the next mapper without saving it on HDFS.
This is because "which split of data goes to which mapper" is elegantly designed to meet the locality criterion (a split goes to a mapper on a node that has the data stored locally).
If you don't store it on HDFS, most likely all the data will be transmitted over the network, which is slow and may cause bandwidth problems.
You have to temporarily save the output of the first map-reduce so that the second one can use it.
This might help you understand how the output of the first map-reduce is passed to the second one (this is based on Generator.java of Apache Nutch).
This is the temporary dir for the output of the first map-reduce:
Path tempDir =
new Path(getConf().get("mapred.temp.dir", ".")
+ "/job1-temp-"
+ Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
Setting up first map-reduce job:
JobConf job1 = getConf();
job1.setJobName("job 1");
FileInputFormat.addInputPath(...);
job1.setMapperClass(...);
FileOutputFormat.setOutputPath(job1, tempDir);
job1.setOutputFormat(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(...);
JobClient.runJob(job1);
Observe that the output dir is set in the job configuration. Use the same path as the input of the second job:
JobConf job2 = getConf();
FileInputFormat.addInputPath(job2, tempDir);
job2.setReducerClass(...);
JobClient.runJob(job2);
Remember to clean up the temp dirs after you are done:
// clean up
FileSystem fs = FileSystem.get(getConf());
fs.delete(tempDir, true);
Hope this helps.
I want to debug a MapReduce script, and without going to too much trouble I tried to put some print statements in my program. But I can't seem to find them in any of the logs.
Actually, stdout only shows the System.out.println() output of the non-map-reduce classes.
The System.out.println() output for the map and reduce phases can be seen in the task logs. An easy way to access the logs is:
http://localhost:50030/jobtracker.jsp->click on the completed job->click on map or reduce task->click on tasknumber->task logs->stdout logs.
Hope this helps
Another way is through the terminal:
1) Go into your Hadoop installation directory, then into "logs/userlogs".
2) Open your job_id directory.
3) Check directories with _m_ if you want the mapper output or _r_ if you're looking for reducers.
Example: in Hadoop 0.20.2:
> ls ~/hadoop-0.20.2/logs/userlogs/attempt_201209031127_0002_m_000000_0/
log.index stderr stdout syslog
The above means:
Hadoop_Installation: ~/hadoop-0.20.2
job_id: job_201209031127_0002
_m_: map task , "map number": _000000_
4) Open stdout if you used "System.out.println" or stderr if you used "System.err.append".
P.S. Other Hadoop versions might have a slightly different hierarchy, but they should all be under $Hadoop_Installation/logs/userlogs.
On a Hadoop cluster with YARN, you can fetch the logs, including stdout, with:
yarn logs -applicationId application_1383601692319_0008
For some reason, I've found this to be more complete than what I see in the web interface. The web interface did not list the output of System.out.println() for me.
To get your stdout output and log messages into the task logs, you can use the Apache Commons Logging framework in your mapper and reducer.
public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    // the type parameters above are just an example; use your own
    public static final Log log = LogFactory.getLog(MyMapper.class);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // log to the stdout file
        System.out.println("Map key " + key);
        // log to the syslog file
        log.info("Map key " + key);
        if (log.isDebugEnabled()) {
            log.debug("Map key " + key);
        }
        context.write(key, value);
    }
}
After most of the options above did not work for me, I realized that on my single node cluster, I can use this simple method:
static private PrintStream console_log;
static private boolean node_was_initialized = false;
private static void logPrint(String line){
if(!node_was_initialized){
try{
console_log = new PrintStream(new FileOutputStream("/tmp/my_mapred_log.txt", true));
} catch (FileNotFoundException e){
return;
}
node_was_initialized = true;
}
console_log.println(line);
}
Which, for example, can be used like:
public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
logPrint("map input: key-" + key.toString() + ", value-" + value.toString());
//actual impl of 'map'...
}
After that, the prints can be viewed with: cat /tmp/my_mapred_log.txt.
To get rid of prints from prior Hadoop runs, you can simply use rm /tmp/my_mapred_log.txt before running Hadoop again.
Notes:
The solution by Rajkumar Singh is likely better if you have the time to download and integrate a new library.
This could work for multi-node clusters if you have a way to access "/tmp/my_mapred_log.txt" on each worker node machine.
If for some strange reason you already have a file named "/tmp/my_mapred_log.txt", consider changing the name (just make sure to give an absolute path).