How to create a MapFile with Spark and access it? - hadoop

I am trying to create a MapFile from a Spark RDD, but can't find enough information. Here are my steps so far:
I started with,
rdd.saveAsNewAPIHadoopFile(....MapFileOutputFormat.class)
which threw an Exception as the MapFiles must be sorted.
So I modified to:
rdd.sortByKey().saveAsNewAPIHadoopFile(....MapFileOutputFormat.class)
which worked fine and my MapFile was created. So the next step was accessing the file. Using the directory name where parts were created failed saying that it cannot find the data file. Back to Google, I found that in order to access the MapFile parts I needed to use:
Object ret = new Object();//My actual WritableComparable impl
Reader[] readers = MapFileOutputFormat.getReaders(new Path(file), new Configuration());
Partitioner<K,V> p = new HashPartitioner<>();
Writable e = MapFileOutputFormat.getEntry(readers, p, key, ret);
Naively, I ignored the HashPartitioner bit and expected that this would find my entry, but no luck. So my next step was to loop over the readers and do a get(..). This solution did work, but it was extremely slow, because the files were created by 128 tasks, resulting in 128 part files.
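Roughly, that fallback loop looked like this (a sketch only; Text and IntWritable stand in for my actual key and value types):
Text key = new Text("someKey");
IntWritable value = new IntWritable();
Reader[] readers = MapFileOutputFormat.getReaders(new Path(file), new Configuration());
Writable found = null;
for (Reader reader : readers) {
    found = reader.get(key, value); // returns null when the key is not in this part
    if (found != null) {
        break;
    }
}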
So I investigated the importance of the HashPartitioner and found that getEntry uses it internally to identify which reader to use, but it seems that Spark is not using the same partitioning logic. So I modified to:
rdd.partitionBy(new org.apache.spark.HashPartitioner(128)).sortByKey().saveAsNewAPIHadoopFile(....MapFileOutputFormat.class)
But again the two HashPartitioners did not match. So on to the questions...
Is there a way to combine the MapFiles efficiently (as this would sidestep the partitioning logic)?
MapFileOutputFormat.getReaders(new Path(file), new Configuration()); is very slow. Can I identify the reader more efficiently?
I am using MapR-FS as the underlying DFS. Will this use the same HashPartitioner implementation?
Is there a way to avoid repartitioning, or should the data be sorted across the whole file (in contrast to being sorted within each partition)?
I am also getting an exception saying that _SUCCESS/data does not exist. Do I need to delete this file manually?
Any links about this would be greatly appreciated.
PS. If the entries are sorted, then how is it possible to use the HashPartitioner to locate the correct Reader? This would imply that the data parts are hash partitioned and then sorted by key within each part. So I also tried rdd.repartitionAndSortWithinPartitions(new HashPartitioner(280)), but again without any luck.

Digging into the issue, I found that the Spark HashPartitioner and Hadoop HashPartitioner have different logic.
So the "brute force" solution I tried and works is the following.
Save the MapFile using rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(num_of_partitions)).saveAsNewAPIHadoopFile(....MapFileOutputFormat.class);
Lookup using:
Reader[] readers = MapFileOutputFormat.getReaders(new Path(file), new Configuration());
org.apache.spark.HashPartitioner p = new org.apache.spark.HashPartitioner(readers.length);
readers[p.getPartition(key)].get(key, val);
This is "dirty" as the MapFile access is now bound to the Spark partitioner rather than the intuitive Hadoop HashPartitioner. I could implement a Spark partitioner that uses Hadoop HashPartitioner to improve on though.
This also does not address the problem of slow access caused by the relatively large number of reducers. I could make this even 'dirtier' by generating the file part number from the partitioner, but I am looking for a clean solution, so please post if there is a better approach to this problem.
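A minimal sketch of that Partitioner idea (the class name and partition count are placeholders; it simply mirrors the Hadoop HashPartitioner formula so that both sides agree):
import org.apache.spark.Partitioner;

// A Spark Partitioner that places keys the way Hadoop's HashPartitioner would,
// so that MapFileOutputFormat.getEntry() can pick the right part file afterwards.
public class HadoopHashPartitioner extends Partitioner {
    private final int partitions;

    public HadoopHashPartitioner(int partitions) {
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return partitions;
    }

    @Override
    public int getPartition(Object key) {
        // Same formula as org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
        return (key.hashCode() & Integer.MAX_VALUE) % partitions;
    }
}
With something like this, rdd.repartitionAndSortWithinPartitions(new HadoopHashPartitioner(128)).saveAsNewAPIHadoopFile(....MapFileOutputFormat.class) should produce parts that MapFileOutputFormat.getEntry(readers, new org.apache.hadoop.mapreduce.lib.partition.HashPartitioner<>(), key, val) can resolve without scanning every reader.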

Related

One file database with HDFS and MapReduce

Let's imagine I want to store a large number of URLs with associated metadata
URL => Metadata
in a file
hdfs://db/urls.seq
I would like this file to grow (if new URLs are found) after every run of MapReduce.
Would that work with Hadoop? As I understand it, MapReduce outputs data to a new directory. Is there any way to take that output and append it to the file?
The only idea that comes to my mind is to create a temporary urls.seq and then replace the old one. It works, but it feels wasteful. Also, from my understanding, Hadoop likes the "write once" approach, and this idea seems to conflict with that.
As blackSmith has explained, you can easily append to an existing file in HDFS, but it would bring down your performance because HDFS is designed with a "write once" strategy. My suggestion is to avoid this approach until you have no other option left.
One approach you may consider is to make a new file for every MapReduce output. If the size of every output is large enough, this technique will benefit you most, because writing a new file does not hurt performance the way appending does. And if you are reading the output of each MapReduce job in the next MapReduce job, reading a new file won't affect your performance much either.
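As a sketch of the new-file-per-run idea (the paths, key/value types and job names below are placeholders, not anything from the question):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Each run writes its URL => metadata pairs to a fresh subdirectory.
Job job = Job.getInstance(new Configuration(), "collect-urls");
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);   // URL
job.setOutputValueClass(Text.class); // metadata
FileOutputFormat.setOutputPath(job, new Path("hdfs://db/urls/run-" + System.currentTimeMillis()));

// The next job (or any other reader) picks up every run at once with a glob,
// so nothing ever has to be appended in place.
Job nextJob = Job.getInstance(new Configuration(), "process-urls");
FileInputFormat.addInputPath(nextJob, new Path("hdfs://db/urls/run-*"));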
So there is a trade off it depends what you want whether performance or simplicity.
(Anyways, Merry Christmas!)

Hadoop Spark (Mapr) - AddFile how does it work

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files which I want to process with Spark.
In the book - Fast Data Processing with Spark - it says:
This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this: will Spark create a copy of the file on each node?
What I want is for it to read the files that are present in a given directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
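One way to resolve the shipped copy on the executors is SparkFiles.get; a minimal sketch in Java (the app name and paths are placeholders):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("geoip-example"));
sc.addFile("/path/to/GeoIP.dat");               // ships the file to every executor
// ... later, inside code running on an executor:
String localPath = SparkFiles.get("GeoIP.dat"); // absolute path of the local copy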
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Analyse phase of Total Order partitioning

Map Reduce Design Patterns Book
You need to run it only once if the distribution of your data does not change quickly over time, because the value ranges it produces will continue to perform well.
I could not get what is meant by the statement. Is this a general observation, or can it actually be implemented when using a TotalOrderPartitioner?
Can we somehow ask the TotalOrderPartitioner to not create a partition file and only use one which has already been created?
Basically, can I skip the analyse phase when using a TotalOrderPartitioner?
It can easily be implemented when using a TotalOrderPartitioner:
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile); // use existing file!!!
// InputSampler.writePartitionFile(job, sampler); // Just comment out this line!!!
Pay attention, from the javadoc:
public static void setPartitionFile(Configuration conf, Path p)
Set the path to the SequenceFile storing the sorted partition keyset. It must be the case that for R reduces, there are R-1 keys in the SequenceFile.
If you re-run the sorting - and your data has changed only slightly, so the samples still represent it well - you can reuse the existing partition file with the samples, since creating it on the client via InputSampler is expensive. But you have to use the same number of reducers as in the job for which InputSampler created the partition file.
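To make that concrete, a minimal sketch of a sort job that reuses an existing partition file (the paths, types and reducer count here are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

Job job = Job.getInstance(new Configuration(), "total-order-sort");
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setNumReduceTasks(42); // must match the reducer count used when the partition file was sampled
// Reuse the partition file written by a previous InputSampler.writePartitionFile run;
// InputSampler is not invoked at all in this job.
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/tmp/partitions.lst"));
FileInputFormat.addInputPath(job, new Path("/data/in"));
FileOutputFormat.setOutputPath(job, new Path("/data/out"));
job.waitForCompletion(true);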

Hadoop Map-Reduce OutputFormat for assigning result to in-memory variable (not files)?

(from a Hadoop newbie)
I want to avoid files where possible in a toy Hadoop proof-of-concept example. I was able to read data from non-file-based input (thanks to http://codedemigod.com/blog/?p=120) - which generates random numbers.
I want to store the result in memory so that I can do some further (non-Map-Reduce) business logic processing on it. Essentially:
conf.setOutputFormat(InMemoryOutputFormat)
JobClient.runJob(conf);
Map result = conf.getJob().getResult(); // ?
The closest thing that seems to do what I want is store the result in a binary file output format and read it back in with the equivalent input format. That seems like unnecessary code and unnecessary computation (am I misunderstanding the premises which Map Reduce depends on?).
The problem with this idea is that Hadoop has no notion of "distributed memory". If you want the result "in memory" the next question has to be "which machine's memory?" If you really want to access it like that, you're going to have to write your own custom output format, and then also either use some existing framework for sharing memory across machines, or again, write your own.
My suggestion would be to simply write to HDFS as normal, and then for the non-MapReduce business logic just start by reading the data from HDFS via the FileSystem API, i.e.:
FileSystem fs = new JobClient(conf).getFs();
Path outputPath = new Path("/foo/bar");
FSDataInputStream in = fs.open(new Path(outputPath, "part-r-00000")); // open one of the job's part files rather than the directory itself
// read data and store in memory
fs.delete(outputPath, true);
Sure, it does some unnecessary disk reads and writes, but if your data is small enough to fit in memory, why are you worried about it anyway? I'd be surprised if that were a serious bottleneck.
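For example, if the job wrote its result as a SequenceFile, reading the whole output back into a map could look roughly like this (a sketch; the path and the key and value types are assumptions):
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Map<String, Integer> result = new HashMap<>();
// The job's output directory contains one part file per reducer.
for (FileStatus status : fs.globStatus(new Path("/foo/bar/part-*"))) {
    SequenceFile.Reader reader =
            new SequenceFile.Reader(conf, SequenceFile.Reader.file(status.getPath()));
    Text key = new Text();
    IntWritable value = new IntWritable();
    while (reader.next(key, value)) {
        result.put(key.toString(), value.get());
    }
    reader.close();
}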

how to output to HDFS from mapper directly?

In certain cases we want the mapper to do all the work and output directly to HDFS; we don't want the data transmitted to the reducer (that would use extra bandwidth - please correct me if there are cases where this is wrong).
a pseudo code would be:
def mapper(k, v_list):
    for v in v_list:
        if criteria:
            write to HDFS
        else:
            emit
I found it hard because the only thing we can play with is OutputCollector.
One thing I can think of is to extend OutputCollector, override OutputCollector.collect, and do the work there.
Is there any better ways?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
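A minimal sketch of such a map-only job with the old mapred API (MyJob, MyMapper and the paths are placeholders):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
conf.setMapperClass(MyMapper.class);  // hypothetical mapper that does all the work
conf.setNumReduceTasks(0);            // map-only: mapper output goes straight to HDFS, unsorted
FileInputFormat.setInputPaths(conf, new Path("/data/in"));
FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
JobClient.runJob(conf);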
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a Java Mapper. For streaming you'd need to amend the PipeMapper java file, or, like you say, write your own output collector - but if you're going to that much trouble, you might as well just write a Java mapper.
Not sending something to the Reducer may not actually save bandwidth if you are still going to write it to HDFS: HDFS data is replicated to other nodes regardless, so that replication traffic happens anyway.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more helpful if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to the Reducer at the same time. I understand that you are using Hadoop Streaming; I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation, after all the business logic for processing the input data, you can write the output that should go straight to HDFS with multipleOutputs.write("NamedOutputFileName", outputKey, outputValue), and write the data you want to pass on to the reducer with context.write(outputKey, outputValue).
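A minimal sketch of such a Mapper in Java MapReduce (the named output "hdfsOnly", the key/value types and the filter condition are placeholders):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (meetsCriteria(value)) {
            // Written straight to an HDFS output file, bypassing the reducer.
            multipleOutputs.write("hdfsOnly", new Text("mapperOnly"), value);
        } else {
            // Sent through the normal shuffle to the reducer.
            context.write(new Text("toReducer"), value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }

    private boolean meetsCriteria(Text value) {
        return value.getLength() > 100; // placeholder condition
    }
}
On the driver side the named output has to be registered once, e.g. MultipleOutputs.addNamedOutput(job, "hdfsOnly", TextOutputFormat.class, Text.class, Text.class).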
I think if you can find something to write the data from mapper to a named output file in the language you are using (Eg: Python) - this will definitely work.
I hope this helps.
