The following dependency is in the pom:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.3.0</version>
</dependency>
I expect the jar to contain the following class:
org.apache.spark.sql.api.java.JavaSQLContext
but while it contains the package org.apache.spark.sql.api.java, all that package appears to contain is a set of interfaces named UDF1 through UDF22.
Which is the correct dependency to get JavaSQLContext?
Thanks.
The JavaSQLContext class was removed as of version 1.3.0. You should use the org.apache.spark.sql.SQLContext class instead. The documentation states the following:
Prior to Spark 1.3 there were separate Java compatible classes (JavaSQLContext and JavaSchemaRDD) that mirrored the Scala API. In Spark 1.3 the Java API and Scala API have been unified. Users of either language should use SQLContext and DataFrame. In general these classes try to use types that are usable from both languages (i.e. Array instead of language specific collections). In some cases where no common type exists (e.g., for passing in closures or Maps) function overloading is used instead.
Additionally the Java specific types API has been removed. Users of both Scala and Java should use the classes present in org.apache.spark.sql.types to describe schema programmatically.
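For example, here is a minimal sketch of the unified Java usage against Spark 1.3.x; the application name, input path and field names are illustrative, not taken from your project:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Minimal sketch for Spark 1.3.x: the unified SQLContext replaces JavaSQLContext,
// and DataFrame replaces JavaSchemaRDD.
SparkConf conf = new SparkConf().setAppName("sql-example").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

// DataFrames can be created from JSON, Parquet, etc. (the path is illustrative).
DataFrame people = sqlContext.jsonFile("examples/src/main/resources/people.json");
people.printSchema();

// Schemas are described programmatically via org.apache.spark.sql.types.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, false),
        DataTypes.createStructField("age", DataTypes.IntegerType, true)));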
As an aside, if you want to find out which JARs contain a specific class, you can use the Advanced Search on Maven Central and search "By Classname". Here is the search for JavaSQLContext:
http://search.maven.org/#search|ga|1|fc%3A%22org.apache.spark.sql.api.java.JavaSQLContext%22
From a cursory search, it appears that the class org.apache.spark.sql.api.java.JavaSQLContext only appears in the 1.2 versions and earlier of the spark-sql JAR file. It is likely that the code with which you are working is also using this older dependency. You have two choices at this point: you can either upgrade your code usage, or you can downgrade the spark-sql JAR. You probably want to go with the former option.
If you insist on keeping your code the same, then including the following dependency in your POM should fix the problem:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.2.2</version>
</dependency>
If you want to upgrade your code, see the answer given by DB5.
I had the same problem, and it was because I was looking at the wrong version of the documentation.
My understanding from the latest documentation - https://spark.apache.org/docs/latest/sql-programming-guide.html#loading-data-programmatically - is to use something like this (copied from the doc):
SQLContext sqlContext = null; // an existing SQLContext
DataFrame schemaPeople = null; // the DataFrame from the previous example
// DataFrames can be saved as Parquet files, maintaining the schema information.
schemaPeople.write().parquet("people.parquet");
// Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a parquet file is also a DataFrame.
DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile");
DataFrame teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>() {
public String call(Row row) {
return "Name: " + row.getString(0);
}
}).collect();
A few months ago I started working on a project that requires OPC UA integration to communicate with an automatic machine. Since I work with Spring Boot, I looked for a library that integrates well with that framework, and in several posts and theses I found Eclipse Milo, in the version integrated with Apache Camel. Not knowing either Camel or Milo, I had to study both.
Camel has extensive documentation, but the Milo integration only covers the parameterization and configuration of the nodes for reading and writing. That seems like it should be enough, but in practice, since there are no concrete examples, I repeatedly had to search through posts to understand where I was going wrong, which took a lot of time.
Reads and writes now work correctly, but method calls behave strangely: every time I call the test method, the value returned to me is the parameter I passed as input. With TRACE enabled on Camel and Milo I can see that the method is called correctly and that OutputArguments contains the result I expect, yet Camel keeps returning the InputArguments. It is surely my mistake, but I cannot find anything that helps me understand where I am going wrong. Was the choice I made the right one? I don't know what else to try.
Here is the simplified test code:
Variant[] params = new Variant[1];
params[0] = new Variant(13);
String endpointUri = "milo-client:opc.tcp://milo.digitalpetri.com:62541/milo?node=RAW(ns=2;s=Methods)&method=RAW(ns=2;s=Methods/sqrt(x))";
return producerTemplate.requestBodyAndHeader(endpointUri, params, "await", true, Variant.class);
The returned object is the same as the one I passed in, even though the log shows that the method call is executed correctly:
2021-mag-20 11:14:07.613 TRACE [milo-netty-event-loop-1] o.e.m.o.s.c.t.t.OpcTcpTransport - Write succeeded for request=PublishRequest, requestHandle=16
2021-mag-20 11:14:07.598 DEBUG [milo-shared-thread-pool-1] o.a.c.c.m.c.i.SubscriptionManager - Call to node=ExpandedNodeId{ns=2, id=Methods, serverIndex=0}, method=ExpandedNodeId{ns=2, id=Methods/sqrt(x), serverIndex=0} = [Variant{value=13.0}]-> CallMethodResult{StatusCode=StatusCode{name=Good, value=0x00000000, quality=good}, InputArgumentResults=[StatusCode{name=Good, value=0x00000000, quality=good}], InputArgumentDiagnosticInfos=[], OutputArguments=[Variant{value=3.605551275463989}]}
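For reference, here is a minimal sketch of the same call made directly with the Milo client SDK (assuming the Milo 0.6.x client that camel-milo 3.9.0 pulls in), using the same endpoint and node ids as the route above, to confirm the output argument is reachable outside Camel:
import org.eclipse.milo.opcua.sdk.client.OpcUaClient;
import org.eclipse.milo.opcua.stack.core.types.builtin.NodeId;
import org.eclipse.milo.opcua.stack.core.types.builtin.Variant;
import org.eclipse.milo.opcua.stack.core.types.structured.CallMethodRequest;
import org.eclipse.milo.opcua.stack.core.types.structured.CallMethodResult;

// Sketch only: same public demo endpoint and node ids as the Camel route above.
OpcUaClient client = OpcUaClient.create("opc.tcp://milo.digitalpetri.com:62541/milo");
client.connect().get();

CallMethodRequest request = new CallMethodRequest(
        NodeId.parse("ns=2;s=Methods"),           // object node
        NodeId.parse("ns=2;s=Methods/sqrt(x)"),   // method node
        new Variant[]{ new Variant(13.0) });      // input arguments

CallMethodResult result = client.call(request).get();
Variant[] outputs = result.getOutputArguments();  // expect [Variant{value=3.605...}]
client.disconnect().get();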
These are my dependencies:
<dependency>
    <groupId>org.apache.camel.springboot</groupId>
    <artifactId>camel-spring-boot-starter</artifactId>
    <version>3.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.camel.springboot</groupId>
    <artifactId>camel-milo-starter</artifactId>
    <version>3.9.0</version>
</dependency>
This is my repository on GitHub: https://github.com/joedayz/lazybones-templates/
I used processTemplates according to the documentation:
processTemplates 'build.gradle', props
processTemplates 'gradle.properties', props
processTemplates 'src/main/java/*.java', props
processTemplates 'settings.gradle', props
I ask the user for this information:
props.project_megaproceso = ask("Define value for 'megaproceso' [megaproceso]: ", "megaproceso", "megaproceso")
props.project_macroproceso = ask("Define value for 'macroproceso' [macroproceso]: ", "macroproceso", "macroproceso")
props.project_proceso = ask("Define value for 'proceso' [proceso]: ", "proceso", "proceso")
megaproceso2, macroproceso, and proceso are directories or parts of file names in my template.
How do I change the names of the unpacked directories and files? The code is in my GitHub repository.
The post-install scripts for Lazybones currently have full access to both the standard JDK classes and the Apache Commons IO library, specifically to aid with file manipulation.
In this specific case, you can either use File.renameTo() or FileUtils.moveFile/Directory(). For example:
def prevPath = new File(projectDir, "megaproceso2-macroproceso-proceso.ear")
prevPath.renameTo(new File(
        projectDir,
        "${props.project_megaproceso}-${props.project_macroproceso}-${props.project_proceso}.ear"))
The projectDir variable is one of several properties injected into the post-install script. You can find a list of them in the Template Developers Guide.
I think the main advantage of FileUtils.moveFile() is that it works even if you're moving files across devices, but that's not necessary here. Also note that you have to explicitly import the classes from Commons IO if you want to use them.
I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines, which is why I am writing my own standalone conversion tool with an (Avro|Parquet|JSON) switch instead of using Drill, Spark, or other tools as converters, as I probably would if this were a one-time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    boolean enableDictionary,
    boolean enableValidation,
    WriterVersion writerVersion,
    Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it, by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." over SpecificData.class seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I can't see the forest for the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is no more than a start; it is modeled after the examples I found and uses the deprecated constructor, so it will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ... // an Avro-generated object
ParquetWriter<MyData> pw = AvroParquetWriter.<MyData>builder(file)
        .withSchema(obj.getSchema())
        .build();
pw.write(obj);
pw.close();
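If you also want the writer to use your generated classes rather than GenericRecord, newer parquet-avro versions (org.apache.parquet coordinates) let you pass the data model on the builder. A hedged sketch, assuming MyData is the class generated from your IDL and that withDataModel(...) is available in your parquet-avro version:
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Sketch only: MyData is assumed to be an Avro-generated SpecificRecord class.
Path parquetOutput = new Path("mydata.parquet");
ParquetWriter<MyData> parquetWriter = AvroParquetWriter.<MyData>builder(parquetOutput)
        .withSchema(MyData.getClassSchema())          // schema generated from the IDL
        .withDataModel(SpecificData.get())            // write specific records instead of GenericRecord
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();

parquetWriter.write(myDataInstance);                  // myDataInstance: a populated MyData object
parquetWriter.close();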
Thanks.
I'm trying to hook SparkR 1.4.0 up to Elasticsearch using the elasticsearch-hadoop-2.1.0.rc1.jar jar file (found here). It's requiring a bit of hacking together, calling the SparkR:::callJMethod function. I need to get a jobj R object for a couple of Java classes. For some of the classes, this works:
SparkR:::callJStatic('java.lang.Class',
'forName',
'org.apache.hadoop.io.NullWritable')
But for others, it does not:
SparkR:::callJStatic('java.lang.Class',
'forName',
'org.elasticsearch.hadoop.mr.LinkedMapWritable')
Yielding the error:
java.lang.ClassNotFoundException:org.elasticsearch.hadoop.mr.EsInputFormat
It seems like Java isn't finding the org.elasticsearch.* classes, even though I've tried including them with the command line --jars argument, and the sparkR.init(sparkJars = ...) function.
Any help would be greatly appreciated. Also, if this is a question that more appropriately belongs on the actual SparkR issue tracker, could someone please point me to it? I looked and was not able to find it. Also, if someone knows an alternative way to hook SparkR up to Elasticsearch, I'd be happy to hear that as well.
Thanks!
Ben
Here's how I've achieved it:
# environments, packages, etc ----
Sys.setenv(SPARK_HOME = "/applications/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# connecting Elasticsearch to Spark via ES-Hadoop-2.1 ----
spark_context <- sparkR.init(master = "local[2]", sparkPackages = "org.elasticsearch:elasticsearch-spark_2.10:2.1.0")
spark_sql_context <- sparkRSQL.init(spark_context)
spark_es <- read.df(spark_sql_context, path = "index/type", source = "org.elasticsearch.spark.sql")
printSchema(spark_es)
(Spark 1.4.1, Elasticsearch 1.5.1, ES-Hadoop 2.1 on OS X Yosemite)
The key idea is to link to the ES-Hadoop package and not the jar file, and to use it to create a Spark SQL context directly.
For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the "elephant bird" library. However, it is not totally clear to me how to write a file to HDFS so that it can be consumed by the ProtobufPigLoader class.
This is what I have:
Pig script:
register ../fs-c/lib/*.jar; -- this includes the elephant bird library
register ../fs-c/*.jar;
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
Import tool (parts of it):
def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true)
  val writer = new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
  return writer
}
val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
The import tool runs fine. I had a few problems with ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader doesn't work. The problem I am stuck on now is that DUMP raw_data; returns "Unable to open iterator for alias raw_data" and ILLUSTRATE raw_data; returns "No (valid) input data found!".
To me it looks like the data written by ProtobufBlockWriter cannot be read by ProtobufPigLoader. But what should I use instead? How can an external tool write data to HDFS so that it can be processed by ProtobufPigLoader?
Alternative question: what should I use instead? How can I write fairly large objects to Hadoop and consume them with Pig? The objects are not very complex, but they contain a large list of sub-objects (a repeated field in Protobuf).
I want to avoid any text format or JSON because they are simply too large for my data. I expect they would bloat the data by a factor of 2 or 3 (lots of integers and lots of byte strings that I would need to encode as Base64).
I want to avoid normalizing the data so that the id of the main object is attached to each of the sub-objects (this is what is done now), because this also blows up the space consumption and makes joins necessary in later processing.
Updates:
I am not using generated protobuf loader classes; I use the reflection-based loader instead.
The protobuf classes are in a jar that is registered. DESCRIBE correctly shows the types.