Not possible to set job description in spark streamingContext? - spark-streaming

I am trying to do
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("appName")
val sc: SparkContext = new SparkContext(sparkConf)
sc.setJobGroup("jobName", "job description")
val ssc: StreamingContext = new StreamingContext(sc, Seconds(2))
And it seems that setJobGroup never takes effect. I saw https://issues.apache.org/jira/browse/SPARK-10649, which seems to say that the StreamingContext overwrites this information in the SparkContext. However, StreamingContext doesn't provide any method to set the Spark job description. I am wondering: is there any way to customize the job description for a Spark Streaming job?
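A possible workaround (an untested sketch, shown with the Java API; the Scala version is analogous) is to re-set the group from inside foreachRDD, i.e. on the scheduler thread that submits each batch's jobs, so that whatever the streaming scheduler sets per batch is overridden for the actions you trigger yourself:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingJobDescription {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("appName");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(2));

        // Hypothetical source; any DStream behaves the same way here.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(rdd -> {
            // Re-set the group/description on the thread that submits this batch's jobs,
            // so the per-batch description set by the streaming scheduler does not win.
            JavaSparkContext.fromSparkContext(rdd.context())
                .setJobGroup("jobName", "job description for this batch");
            rdd.foreach(line -> System.out.println(line));
        });

        ssc.start();
        ssc.awaitTermination();
    }
}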

Related

Ignite cache is empty after save?

My data pipeline is the following: Kafka => perform some calculations => load the resulting pairs into an Ignite cache => print them out
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("MainApplication");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext streamingContext = new JavaStreamingContext(sc, Durations.seconds(10));
JavaIgniteContext<String, Float> igniteContext = new JavaIgniteContext<>(sc, PATH, false);
JavaDStream<Message> dStream = KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, Message>Subscribe(Collections.singletonList(TOPIC), kafkaParams))
    .map(ConsumerRecord::value);
JavaPairDStream<String, Message> pairDStream =
    dStream.mapToPair(message -> new Tuple2<>(message.getName(), message));
JavaPairDStream<String, Float> pairs = pairDStream
    .combineByKey(new CreateCombiner(), new MergeValue(), new MergeCombiners(), new HashPartitioner(10))
    .mapToPair(new ToPairTransformer());
JavaIgniteRDD<String, Float> myCache = igniteContext.fromCache(new CacheConfiguration<>());
// I know that we put something here:
pairDStream.foreachRDD((VoidFunction<JavaPairRDD<String, Float>>) myCache::savePairs);
// But I can't see anything here:
myCache.foreach(tuple2 -> System.out.println("In cache: " + tuple2._1() + " = " + tuple2._2()));
streamingContext.start();
streamingContext.awaitTermination();
streamingContext.stop();
sc.stop();
But this code prints nothing. Why?
Why is the Ignite cache empty even after savePairs?
What can be wrong here?
Thanks in advance!
For me, it looks like pairDStream.foreachRDD(...) is a lazy operation and has no effect, at least until you start the streaming context with streamingContext.start().
On the other hand, myCache.foreach(...) is an eager operation, and you are performing it on a cache that is still empty.
So try putting myCache.foreach(...) after the streaming context start, or even after termination.
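A minimal reordering sketch, reusing the names from the question (note it calls foreachRDD on pairs, the JavaPairDStream<String, Float>, since savePairs expects String/Float pairs; untested against this exact pipeline):

pairs.foreachRDD(rdd -> {
    myCache.savePairs(rdd);
    // Inspect the cache here, after this batch has actually been saved:
    myCache.foreach(tuple2 -> System.out.println("In cache: " + tuple2._1() + " = " + tuple2._2()));
});

streamingContext.start();            // nothing above runs until the context is started
streamingContext.awaitTermination();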

Unable to create cluster in AWS EMR

I was trying to create an EMR cluster from Eclipse using Java. I was able to output the job flow id, but when I viewed the EMR web console there wasn't any cluster created. What's wrong?
My code:
AWSCredentials credentials = new BasicAWSCredentials("xxx", "xxx");
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);
StepFactory stepFactory = new StepFactory();
StepConfig enableDebugging = new StepConfig()
.withName("Enable Debugging")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(stepFactory.newEnableDebuggingStep());
StepConfig installHive = new StepConfig()
.withName("Install Hive")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(stepFactory.newInstallHiveStep());
StepConfig hiveScript = new StepConfig().withName("Hive Script")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(stepFactory.newRunHiveScriptStep("s3://mywordcountbuckett/binary/WordCount2.jar"));
RunJobFlowRequest request = new RunJobFlowRequest()
.withName("Hive Interactive")
.withReleaseLabel("emr-4.1.0")
.withSteps(enableDebugging, installHive)
.withJobFlowRole("EMR_DefaultRole")
.withServiceRole("EMR_EC2_DefaultRole")
//.withSteps(enableDebugging, installHive)
.withLogUri("s3://mywordcountbuckett/")
.withInstances(new JobFlowInstancesConfig()
.withEc2KeyName("mykeypair")
.withHadoopVersion("2.4.0")
.withInstanceCount(5)
.withKeepJobFlowAliveWhenNoSteps(true)
.withMasterInstanceType("m3.xlarge")
.withSlaveInstanceType("m1.large"));//.withAmiVersion("3.10.0");
RunJobFlowResult result = emr.runJobFlow(request);
System.out.println(result.toString());
My output was:
{JobFlowId: j-3T7H65FOSKHDQ}
What could be the reason that I was unable to create the cluster?
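Since the printed RunJobFlowResult only shows that the request was accepted, one quick follow-up (a sketch using the same v1 Java SDK classes; a diagnostic suggestion, not a confirmed fix) is to describe the returned id and print its state and state-change reason. It is also worth confirming that the EMR console is showing the region the client actually submitted to, since the code above never sets a region or endpoint.

import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterResult;

DescribeClusterResult described = emr.describeCluster(
        new DescribeClusterRequest().withClusterId(result.getJobFlowId()));
// A TERMINATED_WITH_ERRORS state plus the change reason usually explains why nothing shows up.
System.out.println(described.getCluster().getStatus().getState());
System.out.println(described.getCluster().getStatus().getStateChangeReason());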

Create HbaseConfiguration once in MapReduce job

I am writing a map-reduce job in Java.
I want to use an external table for writing HBase Increment objects.
For that, I am creating a new HBaseConfiguration.
I want to be able to create it once and use it in all mappers.
Any idea?
The Job configuration is already passed to the mappers (and to the reducers) as a part of the context. You can access any HBase table with it.
Job job = new Job(HBaseConfiguration.create(), ...);
/* ... Rest of the job setup ... */
job.waitForCompletion(true);
Within your mapper setup method:
Configuration config = context.getConfiguration();
HTable mytable = new HTable(config, "my_table_name");
....
You can even send your custom parameters or arguments so you can instantiate any kind of object you could need on your mappers:
Configuration config = HBaseConfiguration.create();
config.set("myStringParam", "customValue");
config.setStrings("myStringsParam", "customValue1", "customValue2", "customValue3");
config.setBoolean("myBooleanParam", true);
Job job = new Job(config, ...);
/* ... Rest of the job setup ... */
job.waitForCompletion(true);
Within your mapper setup method:
Configuration config = context.getConfiguration();
String myStringParam = config.get("myStringParam");
boolean myBooleanParam = config.getBoolean("myBooleanParam", false);
BTW: I don't know why this question has so many downvotes, it's not a bad question.
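Tying this back to the original use case (writing HBase Increment objects to an external table from every mapper), here is a minimal sketch assuming the classic HTable client API; the table name parameter, column family and row-key choice are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IncrementMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private HTable counters; // one table handle per mapper, built from the job configuration

    @Override
    protected void setup(Context context) throws IOException {
        Configuration config = context.getConfiguration();
        // "myTableParam" and "counters_table" are hypothetical; set them in the driver as shown above.
        counters = new HTable(config, config.get("myTableParam", "counters_table"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical row key: the input value itself.
        Increment inc = new Increment(Bytes.toBytes(value.toString()));
        inc.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);
        counters.increment(inc);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        counters.close();
    }
}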

how to get job counters with sqoop 1.4.4 java api?

I'm using Sqoop 1.4.4 and its Java API to run an import job, and I'm having trouble figuring out how to access the job counters once the import has completed. I see suitable methods in the ConfigurationHelper class, like getNumMapOutputRecords, but I'm not sure how to pass the job to them.
Is there a way to get at the job from the SqoopTool or Sqoop objects?
My code looks something like this:
SqoopTool sqoopTool = new ImportTool();
SqoopOptions options = new SqoopOptions();
options.setConnectString(connectString);
options.setUsername(username);
options.setPassword(password);
options.setTableName(table);
options.setColumns(columns);
options.setWhereClause(whereClause);
options.setTargetDir(targetDir);
options.setNumMappers(1);
options.setFileLayout(FileLayout.TextFile);
options.setFieldsTerminatedBy(delimiter);
Configuration config = new Configuration();
config.set("oracle.sessionTimeZone", timezone.getID());
System.setProperty(Sqoop.SQOOP_RETHROW_PROPERTY, "1");
Sqoop sqoop = new Sqoop(sqoopTool, config, options);
String[] nullArgs = new String[0];
Sqoop.runSqoop(sqoop, nullArgs);

Files not put correctly into distributed cache

I am adding a file to distributed cache using the following code:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
Then I read the file into the mappers:
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFile = DistributedCache.getCacheFiles(conf);
    FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
    BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
    String line;
    try {
        while ((line = joinReader.readLine()) != null) {
            String[] s = line.split("\t");
            // do stuff with s
        }
    } finally {
        joinReader.close();
    }
}
The problem is that I only read in one line, and it is not the file I was putting into the cache. Rather it is: cm9vdA==, or root in base64.
Has anyone else had this problem, or can you see how I'm using the distributed cache incorrectly? I am using Hadoop 0.20.2, fully distributed.
Common mistake in your job configuration:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
After you create your Job object, you need to pull back the Configuration object, as Job makes a copy of it; configuring values in conf2 after you create the job will have no effect on the job itself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
You should also check the number of files in the distributed cache; there is probably more than one, and you're opening a random file, which is giving you the value you are seeing.
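A quick way to check that from the mapper's setup method (a small sketch reusing the same DistributedCache.getCacheFiles call as in the question):

URI[] cacheFiles = DistributedCache.getCacheFiles(context.getConfiguration());
System.out.println("files in distributed cache: " + (cacheFiles == null ? 0 : cacheFiles.length));
if (cacheFiles != null) {
    for (URI uri : cacheFiles) {
        System.out.println("  " + uri);
    }
}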
I suggest you use symlinking, which will make the file available in the local working directory with a known name:
DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);
// then in your mapper setup:
BufferedReader joinReader = new BufferedReader(new FileReader("myfile"));
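Putting the symlink approach together in the mapper's setup method, here is a sketch that assumes the first tab-separated field is the join key (adjust for your actual record layout; it needs java.io.BufferedReader, java.io.FileReader, java.util.HashMap and java.util.Map imports):

private Map<String, String> joinData = new HashMap<String, String>();

protected void setup(Context context) throws IOException, InterruptedException {
    // "myfile" is the symlink name given in the cache URI fragment above.
    BufferedReader joinReader = new BufferedReader(new FileReader("myfile"));
    try {
        String line;
        while ((line = joinReader.readLine()) != null) {
            String[] fields = line.split("\t", 2);
            joinData.put(fields[0], fields.length > 1 ? fields[1] : "");
        }
    } finally {
        joinReader.close();
    }
}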
