Map Reduce Design Patterns Book
You need to run it only once if the distribution of your data does not change quickly over time, because the value ranges it produces will continue to perform well.
I could not understand what is meant by this statement. Is it just a general observation, or is it something that can actually be implemented when using a TotalOrderPartitioner?
Can we somehow ask the TotalOrderPartitioner not to create a partition file and instead use one that has already been created?
Basically, can I skip the analyse phase when using a TotalOrderPartitioner?
It can easily be implemented when using a TotalOrderPartitioner:
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile); // use existing file!!!
// InputSampler.writePartitionFile(job, sampler); // Just comment out this line!!!
Pay attention, from the javadoc:
public static void setPartitionFile(Configuration conf, Path p)
Set the path to the SequenceFile storing the sorted partition keyset. It must be the case that for R reduces, there are R-1 keys in the SequenceFile.
If you re-run the sorting - and your data has changed only slightly, so the samples still represent it well - you can reuse the existing partition file, since creating it on the client via InputSampler is expensive. But you have to use the same number of reducers as in the job for which InputSampler created the partition file.
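For illustration, a minimal driver-side sketch of reusing a previously generated partition file (the path, reducer count and job name below are hypothetical; the only hard requirement is that the reducer count matches the job for which InputSampler originally wrote the file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

Job job = Job.getInstance(new Configuration(), "total order sort");
job.setNumReduceTasks(10);   // must equal the reducer count of the job that created the file
job.setPartitionerClass(TotalOrderPartitioner.class);

Path partitionFile = new Path("/user/me/_partitions");   // existing SequenceFile with R-1 keys
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
// no InputSampler.writePartitionFile(job, sampler) call - the analyse phase is skipped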
Related
I was wondering how I can configure HBase so that it stores just the first version of each cell. Suppose the following HTable:
row_key    cf1:c1    timestamp
------------------------------
1          x         t1
After putting ("1", "cf1:c1", "x", t2) in the scenario of HColumnDescriptor.DEFAULT_VERSIONS = 2, the mentioned HTable becomes:
row_key    cf1:c1    timestamp
------------------------------
1          x         t1
1          x         t2
where t2 > t1.
My question is: how can I change this so that the first version of a cell is the only version that can be stored and retrieved? In the provided example the only version kept would be the 't1' one. In other words, I want HBase to ignore inserts of duplicates.
I know that setting VERSIONS to 1 for the HTable and putting with Long.MAX_VALUE - System.currentTimeMillis() as the timestamp would solve my problem, but I don't know whether it is the best solution. What are the concerns with changing the timestamp to Long.MAX_VALUE - System.currentTimeMillis()? Does it have any performance issues?
There are two strategies that I can think of:
1. One version + inverted timestamp
Setting VERSIONS to 1 for the HTable and putting with Long.MAX_VALUE - System.currentTimeMillis() as the timestamp will generally work and does not have any major performance issues.
On write:
When multiple versions of the same cell are written to HBase, at any point in time, all versions will be written (without any impact on performance). After compaction, only the cell with the highest timestamp will survive.
The cell with the highest timestamp in this scheme is the one written by the client with the lowest value of System.currentTimeMillis(). It should be noted that this might not actually be the machine that tried to write to the cell first, since HBase clients might have clocks that are out of sync.
On read:
When multiple versions of the same cell are found at read time, pruning will occur then. This can happen at any point, since your writes can occur at any time, even after compaction. This has a very slight impact on read performance.
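For illustration, a minimal sketch of a write under strategy 1, reusing the row, family, qualifier and value from the question (the table name and the older HTable-style client API are assumptions here):

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

HTable table = new HTable(conf, "mytable");             // hypothetical table name
long invertedTs = Long.MAX_VALUE - System.currentTimeMillis();
Put put = new Put(Bytes.toBytes("1"));                  // row key "1"
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("c1"), invertedTs, Bytes.toBytes("x"));
table.put(put);
// with the column family limited to one version, later writes carry a smaller
// inverted timestamp and are eventually pruned, so only the earliest write survives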
2. checkAndPut
To get true ordering through atomicity, meaning only the first write to reach the region server will succeed, you can use the checkAndPut operation:
From the docs:
public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Atomically checks if a row/family/qualifier value matches the expected value. If it does, it adds the put. If the passed value is null, the check is for the lack of column (ie: non-existence).
So by setting value to null your Put will only succeed if the cell did not exist. If your Put succeeded then the return value will be true. This gives true atomicity, but at a write performance cost.
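A small sketch of that call, reusing the row/family/qualifier from the question (table setup as assumed above); passing null as the expected value makes the Put conditional on the cell not existing yet:

Put put = new Put(Bytes.toBytes("1"));
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("c1"), Bytes.toBytes("x"));
boolean wrote = table.checkAndPut(
        Bytes.toBytes("1"),     // row
        Bytes.toBytes("cf1"),   // family
        Bytes.toBytes("c1"),    // qualifier
        null,                   // expected value: null means "the cell must not exist"
        put);
// wrote is true only for the first writer to reach the region server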
On write:
A row lock is taken and a Get is issued internally before existence is checked. Once non-existence is confirmed, the Put is issued. As you can imagine, this has a pretty big performance impact for each write, since each write now also involves a read and a lock.
During compaction nothing needs to happen, because only one Put will ever make it to HBase: the first Put to reach the region server.
It should be noted that there is no way to batch these checkAndPut operations using checkAndMutate, since each Put needs its own check. This means each Put has to be a separate request, so you will also be paying a latency cost when writing in batches.
On read:
Only one version will ever make it to HBase, so there is no impact here.
Picking between strategies:
If true ordering really matters, or if you need to read each row before or after writing to HBase anyway (for example, to find out whether your write succeeded), you are better off with strategy 2. In all other cases I would recommend strategy 1, since its write performance is much better; in that case just make sure your clients are properly time-synced.
You can insert the Put with Long.MAX_VALUE - timestamp as the timestamp and configure the table to store only one version (max versions => 1). This way only the first (earliest) Put will be returned by a Scan, because all subsequent Puts will have a smaller timestamp value.
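For completeness, a sketch of limiting the column family to a single version via the Java admin API (the table name is hypothetical; the family name follows the example above):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
cf1.setMaxVersions(1);          // the equivalent of VERSIONS => 1 in the shell
desc.addFamily(cf1);
admin.createTable(desc);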
I am trying to create a MapFile from a Spark RDD, but can't find enough information. Here are my steps so far:
I started with,
rdd.saveAsNewAPIHadoopFile(....MapFileOutputFormat.class)
which threw an Exception as the MapFiles must be sorted.
So I modified to:
rdd.sortByKey().saveAsNewAPIHadoopFile(....MapFileOutputFormat.class)
which worked fine and my MapFile was created. So the next step was accessing the file. Using the directory name where the part files were created failed, saying that it cannot find the data file. Back to Google, I found that in order to access the MapFile parts I needed to use:
Object ret = new Object(); // my actual WritableComparable impl
Reader[] readers = MapFileOutputFormat.getReaders(new Path(file), new Configuration());
Partitioner<K,V> p = new HashPartitioner<>();
Writable e = MapFileOutputFormat.getEntry(readers, p, key, ret);
Naively, I ignored the HashPartitioner bit and expected that this would find my entry, but no luck. So my next step was to loop over the readers and do a get(..). This solution did work, but it was extremely slow, as the files were created by 128 tasks, resulting in 128 part files.
So I investigated the importance of the HashPartitioner and found that internally it is used to identify which reader to use, but it seems that Spark is not using the same partitioning logic. So I modified to:
rdd.partitionBy(new org.apache.spark.HashPartitioner(128)).sortByKey().saveAsNewAPIHadoopFile(....MapFileOutputFormat.class)
But again the two HashPartitioners did not match. So on to the questions...
Is there a way to combine the MapFiles efficiently (as this would ignore the partitioning logic)?
MapFileOutputFormat.getReaders(new Path(file), new Configuration()) is very slow. Can I identify the reader more efficiently?
I am using MapR-FS as the underlying DFS. Will this use the same HashPartitioner implementation?
Is there a way to avoid repartitioning, or should the data be sorted over the whole file? (In contrast to being sorted within the partition)
I am also getting an exception _SUCCESS/data does not exist. Do I need to manually delete this file?
Any links about this would be greatly appreciated.
PS. If entries are sorted, then how is it possible to use the HashPartitioner to locate the correct Reader? That would imply that the data parts are hash-partitioned and then sorted by key. So I also tried rdd.repartitionAndSortWithinPartitions(new HashPartitioner(280)), but again without any luck.
Digging into the issue, I found that the Spark HashPartitioner and Hadoop HashPartitioner have different logic.
So the "brute force" solution I tried and works is the following.
Save the MapFile using:
rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(num_of_partitions)).saveAsNewAPIHadoopFile(....MapFileOutputFormat.class);
Lookup using:
Reader[] readers = MapFileOutputFormat.getReaders(new Path(file), new Configuration());
org.apache.spark.HashPartitioner p = new org.apache.spark.HashPartitioner(readers.length);
readers[p.getPartition(key)].get(key, val);
This is "dirty" as the MapFile access is now bound to the Spark partitioner rather than the intuitive Hadoop HashPartitioner. I could implement a Spark partitioner that uses Hadoop HashPartitioner to improve on though.
This also does not address the problem with slow access to the relatively large number of reducers. I could make this even 'dirtier' by generating the file part number from the partitioner but I am looking for a clean solution, so please post if there is a better approach to this problem.
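For what it's worth, a rough sketch of such a partitioner (untested against the original setup; it simply reproduces the Hadoop HashPartitioner computation so that the layout Spark writes matches what a Hadoop HashPartitioner expects on the read side):

import org.apache.spark.Partitioner;

public class HadoopHashPartitioner extends Partitioner {
    private final int partitions;

    public HadoopHashPartitioner(int partitions) {
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return partitions;
    }

    @Override
    public int getPartition(Object key) {
        // same computation as org.apache.hadoop.mapreduce.lib.partition.HashPartitioner,
        // so MapFileOutputFormat.getEntry() picks the right reader with the stock partitioner
        return (key.hashCode() & Integer.MAX_VALUE) % partitions;
    }
}

It could then be passed to repartitionAndSortWithinPartitions in place of the Spark HashPartitioner, letting the read side keep using the plain Hadoop HashPartitioner with MapFileOutputFormat.getEntry().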
I have a situation where I need to go through the key/value pairs of my OutputFormat twice. In essence:
OutputFormat.getRecordWriter() // returns RecordWriterType1
... and when all those are complete across all machines
OutputFormat.getRecordWriter() // returns RecordWriterType2
The typing of both RecordWriterType1/2 are the same. Is there a way to do this?
Thank you,
Marko.
Unfortunately you cannot simply run over the reducer data twice.
You do have some options to possibly work around:
Use an identity reducer to output the sorted data to HDFS, then run two jobs over the data with identity mappers - wasteful but simple if you don't have that much data
As above, but you could use map-only jobs and the key comparator to emulate the reducer function, since you know the input is already sorted (you'll need to make sure the split size is set sufficiently large to ensure all data from the first reducer output file is processed in a single mapper and not split over 2+ mapper instances).
You could write the reducer key/values to local disk in your reducer, and then, in the reducer's cleanup method, open the local file and process it as detailed in the second option (using the group comparator to determine key boundaries) - see the sketch after this list.
If you dig through the source for ReduceTask, you may even be able to 'abuse' the merged sorted segments on local disk and run over the data again, but this option is pure unadulterated hackery...
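As a rough illustration of the third option (not a drop-in implementation: the key/value types, temp-file handling and per-pass work are placeholders), the reducer spills each pair to a local file as it sees it and re-reads that file in cleanup():

import java.io.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TwoPassReducer extends Reducer<Text, Text, Text, Text> {
    private File spillFile;
    private BufferedWriter spill;

    @Override
    protected void setup(Context context) throws IOException {
        spillFile = File.createTempFile("two-pass-spill", ".tmp");   // local disk on the task node
        spill = new BufferedWriter(new FileWriter(spillFile));
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // first pass: do whatever RecordWriterType1 would have done here,
            // then remember the pair for the second pass
            spill.write(key + "\t" + value + "\n");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        spill.close();
        try (BufferedReader in = new BufferedReader(new FileReader(spillFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                // second pass: the pairs come back in the same sorted order they were written
                String[] parts = line.split("\t", 2);
                context.write(new Text(parts[0]), new Text(parts[1]));   // RecordWriterType2 work
            }
        } finally {
            spillFile.delete();
        }
    }
}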
I have a file in which a set of every four lines represents a record.
e.g., the first four lines represent record 1, the next four represent record 2, and so on.
How can I ensure that the mapper gets these four lines at a time?
Also, I want the file splitting in Hadoop to happen at record boundaries (the line number should be a multiple of four), so that records don't span multiple split files.
How can this be done?
A few approaches, some dirtier than others:
The right way
You may have to define your own RecordReader, InputSplit, and InputFormat. Depending on exactly what you are trying to do, you may be able to reuse some of the existing implementations of the three above. You will likely have to write your own RecordReader to define the key/value pair, and you will likely have to write your own InputSplit to help define the boundary.
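To give a feel for what is involved, here is a rough, untested sketch of such an InputFormat (all names are made up). It reuses LineRecordReader internally and, for simplicity, marks files as non-splittable so that a record can never straddle a split boundary; a production version would implement the boundary handling described above instead:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class FourLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // one split per file, so a 4-line record can never be cut in half
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
            private final LineRecordReader lines = new LineRecordReader();
            private LongWritable key;
            private Text value;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
                lines.initialize(split, ctx);
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                StringBuilder record = new StringBuilder();
                for (int i = 0; i < 4; i++) {
                    if (!lines.nextKeyValue()) {
                        return false;   // end of file (an incomplete trailing record is dropped)
                    }
                    if (i == 0) {
                        key = new LongWritable(lines.getCurrentKey().get());   // offset of the record's first line
                    } else {
                        record.append(';');
                    }
                    record.append(lines.getCurrentValue().toString());
                }
                value = new Text(record.toString());
                return true;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() throws IOException { return lines.getProgress(); }
            @Override public void close() throws IOException { lines.close(); }
        };
    }
}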
Another right way, which may not be possible
The above task is quite daunting. Do you have any control over your data set? Can you preprocess it in some way (either while it is coming in or at rest)? If so, you should strongly consider trying to transform your dataset into something that is easier to read out of the box in Hadoop.
Something like:
ALine1
ALine2
ALine3
ALine4    ->    ALine1;ALine2;ALine3;ALine4
BLine1
BLine2
BLine3
BLine4    ->    BLine1;BLine2;BLine3;BLine4
Down and Dirty
Do you have any control over the file sizes of your data? If you manually split your data on the block boundary, you can force Hadoop to not care about records spanning splits. For example, if your block size is 64MB, write your files out in 60MB chunks.
Without worrying about input splits, you could do something dirty: In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
The reason why you have to manually split the data is that you are not going to be guaranteed that an entire 4-row record will be given to the same map task.
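As a small illustration of that idea (the key/value types and the emitted format are made up; it assumes the data has already been laid out so that no record straddles a split):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FourLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final List<String> buffer = new ArrayList<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        buffer.add(line.toString());
        if (buffer.size() == 4) {
            // a complete record has been collected: emit it as one semicolon-joined line
            context.write(new Text(String.join(";", buffer)), NullWritable.get());
            buffer.clear();
        }
        // otherwise emit nothing and wait for the rest of the record
    }
}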
Another way (easy but possibly inefficient in some cases) is to override FileInputFormat#isSplitable(). Then the input files are not split and are processed one per map.
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
And as orangeoctopus said
In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
This has some overhead, for the following reasons:
Time to process the largest file drags the job completion time.
A lot of data may be transferred between the data nodes.
The cluster is not properly utilized, since # of maps = # of files.
(The above code is from Hadoop: The Definitive Guide.)
I have a mapper that, while processing data, classifies output into 3 different types (type is the output key). My goal is to create 3 different csv files via the reducers, each with all of the data for one key with a header row.
The key values can change and are text strings.
Now, ideally, I would like to have 3 different reducers and each reducer would get only one key with its entire list of values.
Except, this doesn't seem to work because the keys don't get mapped to specific reducers.
The answer to this in other places has been to write a custom partitioner class that would map each desired key value to a specific reducer. This would be great, except that I need to use streaming with Python and I am not able to include a custom streaming jar in my job, so that seems not to be an option.
I see in the Hadoop docs that there is an alternate partitioner class available that can enable secondary sorts, but it isn't immediately obvious to me whether it is possible, using either the default or the key-field-based partitioner, to ensure that each key ends up on its own reducer without writing a Java class and using a custom streaming jar.
Any suggestions would be much appreciated.
Examples:
mapper output:
csv2\tfieldA,fieldB,fieldC
csv1\tfield1,field2,field3,field4
csv3\tfieldRed,fieldGreen
...
The problem is that if I have 3 reducers I end up with a key distribution like this:
reducer1    reducer2    reducer3
csv1        csv3
csv2
One reducer gets two different key types and one reducer gets no data sent to it at all. This is because hash(key csv1) mod 3 and hash(key csv2) mod 3 result in the same value.
I'm pretty sure MultipleOutputFormat [1] can be used under streaming. That'll solve most of your problems.
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
If you are stuck with streaming, and can't include any external jars for a custom partitioner, then this is probably not going to work the way you want it to without some hacks.
If these are absolute requirements, you can get around this, but it's messy.
Here's what you can do:
Hadoop, by default, uses a hashing partitioner, like this:
key.hashCode() % numReducers
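More precisely, the stock Hadoop HashPartitioner computes the reducer index roughly as follows (the mask just strips the sign bit so the result is never negative):

// essentially what org.apache.hadoop.mapreduce.lib.partition.HashPartitioner does
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}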
So you can pick keys such that their hash codes mod 3 give three distinct values (0, 1 and 2), one per reducer. This is a nasty hack, and I wouldn't suggest it unless you have no other options.
If you want custom output to different CSV files, you can write directly to HDFS (with the FileSystem API). As you know, Hadoop passes each key with its associated list of values to a single reduce task. In your reduce code, while the key stays the same, write to the same file; when another key arrives, manually create a new file and write to it. It does not matter how many reducers you have.
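As a rough sketch of that idea (shown in Java only for illustration, since the original setup uses streaming; the output path, types and header line are all made up):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PerKeyFileReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // each distinct key (csv1, csv2, csv3, ...) gets its own file on HDFS
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path out = new Path("/output/" + key.toString() + ".csv");   // hypothetical output path
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fs.create(out)))) {
            writer.write("col1,col2,col3\n");   // the per-file header row the question asks for
            for (Text value : values) {
                writer.write(value.toString());
                writer.write('\n');
            }
        }
    }
}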