Hadoop new API - Set OutputFormat - hadoop

I'm trying to set the OutputFormat of my job to MapFileOutputFormat using:
jobConf.setOutputFormat(MapFileOutputFormat.class);
I get this error: mapred.output.format.class is incompatible with new reduce API mode
I suppose I should use the set setOutputFormatClass() of the new Job class but the problem is that when I try to do this:
job.setOutputFormatClass(MapFileOutputFormat.class);
it expects me to use this class: org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.
In hadoop 1.0.X there is no such class. It only exists in earlier versions (e.g 0.x)
How can I solve this problem ?
Thank you!

This problem has no decently easily implementable solution.
I gave up and used Sequence files which fit my requirements too.

Have you tried the following?
import org.apache.hadoop.mapreduce.lib.output;
...
LazyOutputFormat.setOutputFormatClass(job, MapFileOutputFormat.class);

Related

Elastic Search and Spark

I am trying to set up spark and Eleastic search using the elasticsearch-spark library with the sbt artifact: "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2". When I try to configure eleastic search with this code:
val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
.set("es.index.auto.create", "true")
.set("es.resource", "test")
.set("es.nodes", "test.com:9200")
I keep getting the error: illegal character for all of the set statements above for elastic search. Anyone know the issue?
You must have copied the code from any website or any other blog. It contains unreadable characters that are actually giving you trouble.
Simple solution: Delete all the content. Type one by one manually, and run it.. Let me know if you faced any problems again, i will help you out.
You might want to set the http.publish_host in your elasticsearch.yml to HOST_NAME. The es-hadoop connector is sniffing the nodes from the _nodes/transport API so it checks what the published http address is.

sparkR 1.4.0 : how to include jars

I'm trying to hook SparkR 1.4.0 up to Elasticsearch using the elasticsearch-hadoop-2.1.0.rc1.jar jar file (found here). It's requiring a bit of hacking together, calling the SparkR:::callJMethod function. I need to get a jobj R object for a couple of Java classes. For some of the classes, this works:
SparkR:::callJStatic('java.lang.Class',
'forName',
'org.apache.hadoop.io.NullWritable')
But for others, it does not:
SparkR:::callJStatic('java.lang.Class',
'forName',
'org.elasticsearch.hadoop.mr.LinkedMapWritable')
Yielding the error:
java.lang.ClassNotFoundException:org.elasticsearch.hadoop.mr.EsInputFormat
It seems like Java isn't finding the org.elasticsearch.* classes, even though I've tried including them with the command line --jars argument, and the sparkR.init(sparkJars = ...) function.
Any help would be greatly appreciated. Also, if this is a question that more appropriately belongs on the actual SparkR issue tracker, could someone please point me to it? I looked and was not able to find it. Also, if someone knows an alternative way to hook SparkR up to Elasticsearch, I'd be happy to hear that as well.
Thanks!
Ben
Here's how I've achieved it:
# environments, packages, etc ----
Sys.setenv(SPARK_HOME = "/applications/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# connecting Elasticsearch to Spark via ES-Hadoop-2.1 ----
spark_context <- sparkR.init(master = "local[2]", sparkPackages = "org.elasticsearch:elasticsearch-spark_2.10:2.1.0")
spark_sql_context <- sparkRSQL.init(spark_context)
spark_es <- read.df(spark_sql_context, path = "index/type", source = "org.elasticsearch.spark.sql")
printSchema(spark_es)
(Spark 1.4.1, Elasticsearch 1.5.1, ES-Hadoop 2.1 on OS X Yosemite)
The key idea is to link to the ES-Hadoop package and not the jar file, and to use it to create a Spark SQL context directly.

How to get all versions of an hbase cell in a spark newAPIHadoopRDD?

I know when you use the Get API you can set MAX_VERSION_COUNT to get all versions of a cell. But I didn' t find any documentation on how to get all versions of cell with a map operation of spark newAPIHadoopRDD. I' ve tried with a naive result.getColumnCells() and it returns only 1 result. How can I set MAX_VERSION_COUNT in spark?
After taking a look at source code of TableInputFormat I found it reads configuration from hbase.mapreduce.scan.maxversions. So setting it like this works:
val conf = HBaseConfiguration.create()
conf.set("hbase.mapreduce.scan.maxversions", "VERSION_YOU_WANT")
val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])

Scalding + LZO +Protobuf

Are there any pointers to get Scalding to work with LZO Protobuf data on HDFS?
I am trying to read files that are stored in binary Protobuf and compressed in LZO using Scalding.
Can we use Elephantbird to read those files? Any pointers will be appreciated!
I have looked at the LzoTraits and LzoProtobufScheme? But I am not sure how I should be using it to read the data? Any examples would be great!
Here is an example:
case class SomeProto() extends FixedPathSource("/my/greatData/*")
with LzoProtobuf[MyProtoClassHere] {
override def column = classOf[MyProtoClassHere]
}
You can mix with other types of abstract base Sources (like TimePathedSource, or MostRecentGoodSource) in a similar way. You can mix in with LocalTapSource if you want to use the Hadoop-inside-cascading-local trick (if you don't run in cascading local mode, you don't need this).

How to use Hadoop API copyMerge function? What is the addString parameter?

Does anyone know or have used copyMerge function in Hadoop API - FileUtil?
copyMerge(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String addString);
In the function, what is the addString parameter? How do I set how those files are merged? Example I have part number 1,2,3,4,5..., I want to combine them into one file in ascending order, how can I do it?
Detail about the API: http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/api/org/apache/hadoop/fs/FileUtil.html
Thanks!
Looks like the the addString is just written to the OutputStream in the FileUtil class
if (addString!=null)
out.write(addString.getBytes("UTF-8"));
}
When there is no documentation, source code is the true and best source for details. I have written a few articles on how to setup Git here and here. Git helps for faster and easier access to the code.

Resources