Why does job submission from Java fail? - spring

I submit a Spark job from Java as a RESTful service. I keep getting the following error:
Application application_1446816503326_0098 failed 2 times due to AM Container for appattempt_1446816503326_0098_000002 exited with exitCode: -1000
For more detailed output, check application tracking page: http://ip-172-31-34-108.us-west-2.compute.internal:8088/proxy/application_1446816503326_0098/ Then, click on links to logs of each attempt.
Diagnostics: java.io.FileNotFoundException: File file:/opt/apache-tomcat-8.0.28/webapps/RESTfulExample/WEB-INF/lib/spark-yarn_2.10-1.3.0.jar does not exist
Failing this attempt. Failing the application.
The spark-yarn_2.10-1.3.0.jar file is present in that lib folder.
Here is my program.
package SparkSubmitJava;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import java.io.IOException;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;
#Path("/spark")
public class JavaRestService {
#GET
#Path("/{param}/{param2}/{param3}")
public Response getMsg(#PathParam("param") String bedroom,#PathParam("param2") String bathroom,#PathParam("param3")String area) throws IOException {
String[] args = new String[] {
"--name",
"JavaRestService",
"--driver-memory",
"1000M",
"--jar",
"/opt/apache-tomcat-8.0.28/webapps/scalatest-0.0.1-SNAPSHOT.jar",
"--class",
"ScalaTest.ScalaTest.ScalaTest",
"--arg",
bedroom,
"--arg",
bathroom,
"--arg",
area,
"--arg",
"yarn-cluster",
};
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
ClientArguments cArgs = new ClientArguments(args, sparkConf);
Client client = new Client(cArgs, config, sparkConf);
client.run();
return Response.status(200).entity(client).build();
}
}
Any help will be appreciated.

Related

Writing data from JavaDStream<String> in Apache spark to elasticsearch

I am working on a program to process data from Apache Kafka into Elasticsearch, using Apache Spark. I have gone through many links but have been unable to find an example that writes data from a JavaDStream in Apache Spark to Elasticsearch.
Below is the sample Spark code which gets data from Kafka and prints it.
import org.apache.log4j.Logger;
import org.apache.log4j.Level;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import scala.Tuple2;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import com.google.common.collect.ImmutableMap;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.List;
public class SparkStream {
public static JavaSparkContext sc;
public static List<Map<String, ?>> alldocs;
public static void main(String args[])
{
if(args.length != 2)
{
System.out.println("SparkStream <broker1-host:port,broker2-host:port><topic1,topic2,...>");
System.exit(1);
}
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
SparkConf sparkConf=new SparkConf().setAppName("Data Streaming");
sparkConf.setMaster("local[2]");
sparkConf.set("es.index.auto.create", "true");
sparkConf.set("es.nodes","localhost");
sparkConf.set("es.port","9200");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Set<String> topicsSet=new HashSet<>(Arrays.asList(args[1].split(",")));
Map<String,String> kafkaParams=new HashMap<>();
String brokers=args[0];
kafkaParams.put("metadata.broker.list",brokers);
kafkaParams.put("auto.offset.reset", "largest");
kafkaParams.put("offsets.storage", "zookeeper");
JavaPairDStream<String, String> messages=KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
lines.print();
jssc.start();
jssc.awaitTermination();
}
}
One way to save to Elasticsearch is to use the saveToEs method inside a foreachRDD call. Any other method you wish to use would still require the foreachRDD call on your DStream.
For example (Python syntax):
lines.foreachRDD(lambda rdd: rdd.saveToEs("ESresource"))
See the elasticsearch-hadoop documentation for more.
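Since the asker's code is Java, here is a minimal Java sketch of the same foreachRDD approach, assuming Spark 1.x streaming and the elasticsearch-spark (elasticsearch-hadoop) artifact on the classpath; the EsStreamSink class name and the "spark/docs" index/type are placeholders:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsStreamSink {
    // Save each micro-batch of a DStream of JSON strings to Elasticsearch.
    public static void saveLines(JavaDStream<String> lines) {
        lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
            @Override
            public Void call(JavaRDD<String> rdd) {
                if (!rdd.isEmpty()) {
                    // "spark/docs" is a placeholder index/type; the strings must be valid JSON
                    JavaEsSpark.saveJsonToEs(rdd, "spark/docs");
                }
                return null;
            }
        });
    }
}
If the stream elements are not JSON strings, JavaEsSpark.saveToEs can be used instead with a JavaRDD of Maps or JavaBeans.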
dstream.foreachRDD{rdd=>
val es = sqlContext.createDataFrame(rdd).toDF("use headings suitable for your dataset")
import org.elasticsearch.spark.sql._
es.saveToEs("wordcount/testing")
es.show()
}
In this code block, "dstream" is the data stream which receives data from a source like Kafka. Inside the brackets of toDF() you have to pass headings suitable for your dataset. In saveToEs() you have to pass the Elasticsearch index. Before this you have to create a SQLContext:
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
If you are using Kafka to send data, you have to add the dependency mentioned below:
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
To run the full example, first create a Kafka producer for the topic "test" and start Elasticsearch, then run the program. The complete sbt build and code are in the linked example project.
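For completeness, here is a Java sketch of the same DataFrame route, assuming Spark 1.5+ (for SQLContext.getOrCreate) and elasticsearch-hadoop's Spark SQL support; the WordCount bean, the EsDataFrameSink class, and the "wordcount/testing" index are illustrative placeholders:
import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

public class EsDataFrameSink {
    // Bean whose getters become the DataFrame columns (the "headings").
    public static class WordCount implements Serializable {
        private String word;
        private long count;
        public WordCount() { }
        public WordCount(String word, long count) { this.word = word; this.count = count; }
        public String getWord() { return word; }
        public void setWord(String word) { this.word = word; }
        public long getCount() { return count; }
        public void setCount(long count) { this.count = count; }
    }

    public static void save(JavaRDD<WordCount> rdd) {
        // Reuse (or create) the SQLContext tied to this SparkContext.
        SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
        DataFrame df = sqlContext.createDataFrame(rdd, WordCount.class);
        // Same index/type as the Scala example above.
        JavaEsSparkSQL.saveToEs(df, "wordcount/testing");
    }
}
Passing the bean class to createDataFrame derives the column names (the "headings") from the bean's getters.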

I am getting the error "The method addCacheFile(URI) is undefined for the type Job" with CDH4.0

I am getting the error
The method addCacheFile(URI) is undefined for the type Job
with CDH4.0 when trying to call the addCacheFile(URI uri) method, as shown below:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class DistributedCacheDriver {
public static void main(String[] args) throws Exception {
String inputPath = args[0];
String outputPath = args[1];
String fileName = args[2];
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "TestingDistributedCache");
job.setJarByClass(DistributedCacheDriver.class);
job.addCacheFile(new URI(fileName)); //Getting error here -The method addCacheFile(URI) is undefined for the type Job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
Any suggestions/hints to get rid of this error?
If you have chosen to install MapReduce version 1, then you should replace the job.addCacheFile() call with DistributedCache.addCacheFile() and change the setup() method accordingly (call it configure()).
Find some official documentation and examples here.
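A minimal sketch of the MRv1-style call, assuming older client jars where Job has no addCacheFile(); the DistributedCacheMRv1Driver class name is just a placeholder:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheMRv1Driver {
    public static void main(String[] args) throws Exception {
        String fileName = args[2];
        Configuration conf = new Configuration();
        // MRv1 equivalent of job.addCacheFile(...): register the file on the Configuration
        DistributedCache.addCacheFile(new URI(fileName), conf);
        Job job = Job.getInstance(conf, "TestingDistributedCache");
        // or: new Job(conf, "TestingDistributedCache") on older client jars
        job.setJarByClass(DistributedCacheMRv1Driver.class);
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
Inside the task, the cached paths can then be retrieved with DistributedCache.getLocalCacheFiles(conf), from setup() with the new API or configure() with the old mapred API.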

Load Data into Hbase outside Client Node

Thanks in advance.
We are loading data into HBase using Java. It's pretty straightforward and works fine when we run the program on the client node (edge node). But we want to run this program remotely (outside the Hadoop cluster), within our network, to load the data.
Is there anything required in terms of security on the Hadoop cluster to do this? When I run the program outside the cluster, it hangs.
Please advise. Greatly appreciate your help.
Thanks
Code here
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import com.dev.stp.cvsLoadEventConfig;
import com.google.protobuf.ServiceException;
public class LoadData {
static String ZKHost;
static String ZKPort;
private static Configuration config = null;
private static String tableName;
public LoadData (){
//Set Application Config
LoadDataConfig conn = new LoadDataConfig();
ZKHost = conn.getZKHost();
ZKPort = conn.getZKPort();
config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", ZKHost);
config.set("hbase.zookeeper.property.clientPort", ZKPort);
config.set("zookeeper.znode.parent", "/hbase-unsecure");
tableName = "E_DATA";
}
//Insert record (eventId is the HBase row key; the try block is wrapped in a method here so it compiles)
public void insertRecord(String eventId) {
try {
HTable table = new HTable(config, tableName);
Put put = new Put(Bytes.toBytes(eventId));
put.add(Bytes.toBytes("E_DETAILS"), Bytes.toBytes("E_NAME"), Bytes.toBytes("test data 1"));
put.add(Bytes.toBytes("E_DETAILS"), Bytes.toBytes("E_TIMESTAMP"), Bytes.toBytes("test data 2"));
table.put(put);
table.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

IndexOutOfBound when trying Sphinx4

I have created all the models required by Sphinx4 (language model, dictionary and acoustic model). I created a Maven project in Eclipse and all the libraries were downloaded, but when I run the program as shown on the official website (http://cmusphinx.sourceforge.net/wiki/tutorialsphinx4), an IndexOutOfBoundsException is thrown.
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 768, Size: 768
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at edu.cmu.sphinx.linguist.acoustic.tiedstate.Pool.get(Pool.java:55)
at edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader.createSenonePool(Sphinx3Loader.java:403)
at edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader.loadModelFiles(Sphinx3Loader.java:341)
at edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader.load(Sphinx3Loader.java:278)
at edu.cmu.sphinx.frontend.AutoCepstrum.newProperties(AutoCepstrum.java:118)
at edu.cmu.sphinx.util.props.PropertySheet.getOwner(PropertySheet.java:508)
at edu.cmu.sphinx.util.props.ConfigurationManager.lookup(ConfigurationManager.java:165)
at edu.cmu.sphinx.util.props.PropertySheet.getComponentList(PropertySheet.java:422)
at edu.cmu.sphinx.frontend.FrontEnd.newProperties(FrontEnd.java:160)
at edu.cmu.sphinx.util.props.PropertySheet.getOwner(PropertySheet.java:508)
at edu.cmu.sphinx.util.props.PropertySheet.getComponent(PropertySheet.java:290)
at edu.cmu.sphinx.decoder.scorer.SimpleAcousticScorer.newProperties(SimpleAcousticScorer.java:46)
at edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer.newProperties(ThreadedAcousticScorer.java:130)
at edu.cmu.sphinx.util.props.PropertySheet.getOwner(PropertySheet.java:508)
at edu.cmu.sphinx.util.props.PropertySheet.getComponent(PropertySheet.java:290)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.newProperties(WordPruningBreadthFirstSearchManager.java:201)
at edu.cmu.sphinx.util.props.PropertySheet.getOwner(PropertySheet.java:508)
at edu.cmu.sphinx.util.props.PropertySheet.getComponent(PropertySheet.java:290)
at edu.cmu.sphinx.decoder.AbstractDecoder.newProperties(AbstractDecoder.java:70)
at edu.cmu.sphinx.decoder.Decoder.newProperties(Decoder.java:37)
at edu.cmu.sphinx.util.props.PropertySheet.getOwner(PropertySheet.java:508)
at edu.cmu.sphinx.util.props.PropertySheet.getComponent(PropertySheet.java:290)
at edu.cmu.sphinx.recognizer.Recognizer.newProperties(Recognizer.java:89)
at edu.cmu.sphinx.util.props.PropertySheet.getOwner(PropertySheet.java:508)
at edu.cmu.sphinx.util.props.ConfigurationManager.lookup(ConfigurationManager.java:165)
at edu.cmu.sphinx.api.Context.<init>(Context.java:73)
at edu.cmu.sphinx.api.Context.<init>(Context.java:44)
at edu.cmu.sphinx.api.AbstractSpeechRecognizer.<init>(AbstractSpeechRecognizer.java:37)
at edu.cmu.sphinx.api.LiveSpeechRecognizer.<init>(LiveSpeechRecognizer.java:33)
at Main.main(Main.java:26)
The source code which I am running is as follows:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;
public class Main {
public static void main(String[] args) {
Configuration configuration = new Configuration();
configuration.setAcousticModelPath("Alphabets/acoustic");
configuration.setDictionaryPath("Alphabets/alphabets.dic");
configuration.setLanguageModelPath("Alphabets/alphabets.lm.dmp");
LiveSpeechRecognizer recognizer = null;
try {
recognizer = new LiveSpeechRecognizer(configuration);
} catch (IOException e) {
e.printStackTrace();
}
recognizer.startRecognition(true);
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();
System.out.println(result.getHypothesis());
result.getLattice().dumpDot("lattice.dot", "lattice");
}
}
I highly appreciate the assistance.
You are trying to use Sphinx4 with a semi-continuous model. You need to train a continuous model to use with Sphinx4; see the following for details:
http://cmusphinx.sourceforge.net/wiki/tutorialam
You need to set
$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, Pocketsphinx

How do I convert my Java Hadoop code to run on EC2?

I wrote a Driver, Mapper, and Reducer class in Java that runs the k-nearest neighbor algorithm on test data, and pulls in the training set using Distributed Cache. I used a Cloudera virtual machine to test the code, and it works in pseudo-distributed mode.
I'm trying to get through Amazon's EC2/EMR documentation ... it seems like there should be a way to easily convert working Java Hadoop code into something that will work on EC2, but I'm seeing a whole bunch of custom Amazon import statements and methods that I've never seen before.
Here's my driver code for an example:
import java.net.URI;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class KNNDriverEC2 extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.setInt("rows",1000);
conf.setInt("columns",613);
DistributedCache.createSymlink(conf);
// might have to start next line with ./!!!
DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_sample.csv#train_sample.csv"),conf);
DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_labels.csv#train_labels.csv"),conf);
//DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv"),conf);
//DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv"),conf);
Job job = new Job(conf);
job.setJarByClass(KNNDriverEC2.class);
job.setJobName("KNN");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(KNNMapperEC2.class);
job.setReducerClass(KNNReducerEC2.class);
// job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new KNNDriverEC2(), args);
System.exit(exitCode);
}
}
I've gotten my instance running, but an exception is thrown at the line "FileInputFormat.setInputPaths(job, new Path(args[0]));". I'm going to try to work through the documentation on handling arguments, but I've run into so many errors so far that I'm wondering whether I'm far from making this work. Any help appreciated.
