Ignite cache is empty after save? - spark-streaming

My data pipeline is the following: Kafka => perform some calculations => load the resulting pairs into an Ignite cache => print them out
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("MainApplication");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext streamingContext = new JavaStreamingContext(sc, Durations.seconds(10));
JavaIgniteContext<String, Float> igniteContext = new JavaIgniteContext<>(sc, PATH, false);
JavaDStream<Message> dStream = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, Message>
Subscribe(Collections.singletonList(TOPIC), kafkaParams)
)
.map(ConsumerRecord::value);
JavaPairDStream<String, Message> pairDStream =
dStream.mapToPair(message -> new Tuple2<>(message.getName(), message));
JavaPairDStream<String, Float> pairs = pairDStream
.combineByKey(new CreateCombiner(), new MergeValue(), new MergeCombiners(), new HashPartitioner(10))
.mapToPair(new ToPairTransformer());
JavaIgniteRDD<String, Float> myCache = igniteContext.fromCache(new CacheConfiguration<>());
// I know that we put something here:
pairs.foreachRDD((VoidFunction<JavaPairRDD<String, Float>>) myCache::savePairs);
// But I can't see anything here:
myCache.foreach(tuple2 -> System.out.println("In cache: " + tuple2._1() + " = " + tuple2._2()));
streamingContext.start();
streamingContext.awaitTermination();
streamingContext.stop();
sc.stop();
But this code prints nothing. Why?
Why is the Ignite cache empty even after savePairs?
What could be wrong here?
Thanks in advance!

To me it looks like pairDStream.foreachRDD(...) is a lazy operation that has no effect, at least until you start the streaming context with streamingContext.start().
On the other hand, myCache.foreach(...) is an eager operation, and you perform it on a cache that is still empty at that point.
So, try to put myCache.foreach(...) after the streaming context starts, or even after it terminates.
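For illustration, a minimal sketch of that reordering, reusing the variables from the question (pairs, myCache, streamingContext); printing inside foreachRDD also works, because it runs per batch after the context has started:
// Save each batch and immediately read the cache back; both run only after start()
pairs.foreachRDD((VoidFunction<JavaPairRDD<String, Float>>) rdd -> {
    myCache.savePairs(rdd);
    myCache.foreach(tuple2 ->
            System.out.println("In cache: " + tuple2._1() + " = " + tuple2._2()));
});

streamingContext.start();
streamingContext.awaitTermination();

// Or read the cache once, after the streaming context has terminated
myCache.foreach(tuple2 ->
        System.out.println("In cache: " + tuple2._1() + " = " + tuple2._2()));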

Related

How to extract and manipulate data within a Nifi processor

I'm trying to write a custom Nifi processor which will take in the contents of the incoming flow file, perform some math operations on it, then write the results into an outgoing flow file. Is there a way to dump the contents of the incoming flow file into a string or something? I've been searching for a while now and it doesn't seem that simple. If anyone could point me toward a good tutorial that deals with doing something like that it would be greatly appreciated.
The Apache NiFi Developer Guide documents the process of creating a custom processor very well. In your specific case, I would start with the Component Lifecycle section and the Enrich/Modify Content pattern. Any other processor which does similar work (like ReplaceText or Base64EncodeContent) would be good examples to learn from; all of the source code is available on GitHub.
Essentially you need to implement the #onTrigger() method in your processor class, read the flowfile content and parse it into your expected format, perform your operations, and then re-populate the resulting flowfile content. Your source code will look something like this:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    final ComponentLog logger = getLogger();
    final AtomicBoolean error = new AtomicBoolean();
    final AtomicReference<String> result = new AtomicReference<>(null);
    // This uses a lambda function in place of a callback for InputStreamCallback#process()
    session.read(flowFile, in -> {
        long start = System.nanoTime();
        // Read the flowfile content into a String
        // TODO: May need to buffer this if the content is large
        try {
            final String contents = IOUtils.toString(in, StandardCharsets.UTF_8);
            result.set(new MyMathOperationService().performSomeOperation(contents));
            long stop = System.nanoTime();
            if (logger.isDebugEnabled()) {
                final long durationNanos = stop - start;
                DecimalFormat df = new DecimalFormat("#.###");
                logger.debug("Performed operation in " + durationNanos + " nanoseconds (" + df.format(durationNanos / 1_000_000_000.0) + " seconds).");
            }
        } catch (Exception e) {
            error.set(true);
            logger.error(e.getMessage() + " Routing to failure.", e);
        }
    });
    if (error.get()) {
        session.transfer(flowFile, REL_FAILURE);
    } else {
        // Again, a lambda takes the place of the OutputStreamCallback#process()
        FlowFile updatedFlowFile = session.write(flowFile, out -> {
            final String resultString = result.get();
            final byte[] resultBytes = resultString.getBytes(StandardCharsets.UTF_8);
            // TODO: This can use a while loop for performance
            out.write(resultBytes, 0, resultBytes.length);
            out.flush();
        });
        session.transfer(updatedFlowFile, REL_SUCCESS);
    }
}
Daggett is right that the ExecuteScript processor is a good place to start, because it will shorten the development lifecycle (no building NARs, deploying, and restarting NiFi to use it), and once you have the correct behavior, you can easily copy/paste it into the generated skeleton and deploy it once.

Accessing Tivoli Performance Module

I am trying to extract PMI data using a Java application. I have already been able to access the performance module, but unfortunately I cannot access a submodule, as in the example below.
I extracted the ThreadPool module data using this code:
StatDescriptor mysd = new StatDescriptor(new String[] { PmiConstants.THREADPOOL_MODULE });
MBeanStatDescriptor mymsd = new MBeanStatDescriptor(nodeAgent, mysd);
Object[] params = new Object[]{mymsd, new Boolean(false)};
String[] signature = new String[] { "com.ibm.websphere.pmi.stat.MBeanStatDescriptor", "java.lang.Boolean" };
com.ibm.ws.pmi.stat.StatsImpl myStats = (StatsImpl) adminClient.invoke(perfOn, "getStatsObject", params, signature);
//System.out.println("myStats Size = " + myStats.dataMembers().size()+ "\n" + myStats.toString());
but I cannot access the ThreadPool submodules and their counters, such as AriesThreadPool.
Any recommended suggestions?
I solved the problem.
I just enabled recursive searching by changing the parameter from false to true:
Object[] params = new Object[]{mymsd, new Boolean(true)};
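For completeness, a minimal sketch of the invocation with the recursive flag turned on, reusing the objects from the question (mymsd, perfOn, adminClient); with recursion enabled the returned stats also contain the submodule statistics (e.g. AriesThreadPool), which toString() will print:
// 'true' asks PMI to fetch submodule statistics recursively
Object[] params = new Object[] { mymsd, Boolean.TRUE };
String[] signature = new String[] { "com.ibm.websphere.pmi.stat.MBeanStatDescriptor", "java.lang.Boolean" };
com.ibm.ws.pmi.stat.StatsImpl myStats =
        (StatsImpl) adminClient.invoke(perfOn, "getStatsObject", params, signature);
// The recursive result now includes submodules and their counters
System.out.println("myStats = " + myStats.toString());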

AWS Elasticsearch Avg aggregation Java API generates error while trying to retrieve results using Jest client during a certain time period

I am using the Elasticsearch SDK 2.3 and Jest 2.0.0 to create a client. I am trying to implement an average aggregation (and other aggregations) that retrieves results from a certain time period.
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
//Queries builder
searchSourceBuilder.query(QueryBuilders.matchAllQuery());
GregorianCalendar gC = new GregorianCalendar();
gC.set(2016, Calendar.OCTOBER, 18, 0, 0, 0);
long from = gC.getTimeInMillis();
gC.add(Calendar.MINUTE, 15);
long to = gC.getTimeInMillis();
searchSourceBuilder.postFilter(QueryBuilders.rangeQuery("timestamp").from(from).to(to));
JestClient client = getJestClient();
AvgBuilder aggregation2 = AggregationBuilders
.avg(AvgAggregation.TYPE)
.field("backend_processing_time");
AggregationBuilder ag = AggregationBuilders.dateRange(aggregation2.getName())
        .addRange("timestamp", from, to)
        .subAggregation(aggregation2);
searchSourceBuilder.aggregation(ag);
String query = searchSourceBuilder.toString();
Search search = new Search.Builder(query)
.addIndex(INDEX)
.addType(TYPE)
// .addSort(new Sort("code"))
.setParameter(Parameters.SIZE, 5)
// .setParameter(Parameters.SCROLL, "5m")
.build();
SearchResult result = client.execute(search);
System.out.println("ES Response with aggregation:\n" + result.getJsonString());
And the error that I am getting is the following:
{"error":{"root_cause":[{"type":"aggregation_execution_exception","reason":"could not find the appropriate value context to perform aggregation [avg]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"elbaccesslogs_2016_10","node":"5ttEmYcTTsie-z23OpKY0A","reason":{"type":"aggregation_execution_exception","reason":"could not find the appropriate value context to perform aggregation [avg]"}}]},"status":500}
Looks like the reason is 'could not find the appropriate value context to perform aggregation [avg]', but I don't really understand what is going on.
Can anyone suggest anything please? Or if need more info on this before responding, please let me know.
Thanks,
Khurram
I found the solution myself, so I am explaining it below and closing the issue.
Basically the code
searchSourceBuilder.postFilter(QueryBuilders.rangeQuery("timestamp").from(from).to(to));
does not work with aggregations. We need to remove the postFilter call, as well as the following code:
AggregationBuilder ag = AggregationBuilders.dateRange(aggregation2.getName())
        .addRange("timestamp", from, to)
        .subAggregation(aggregation2);
And change the following code too:
searchSourceBuilder.query(QueryBuilders.matchAllQuery());
to
searchSourceBuilder.query(
        QueryBuilders.boolQuery()
                .must(QueryBuilders.matchAllQuery())
                .filter(QueryBuilders.rangeQuery("timestamp").from(from).to(to))
);
So here is the entire code again:
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
GregorianCalendar gC = new GregorianCalendar();
gC.set(2016, Calendar.OCTOBER, 18, 0, 0, 0);
long from = gC.getTimeInMillis();
gC.add(Calendar.MINUTE, 15);
long to = gC.getTimeInMillis();
searchSourceBuilder.query(
        QueryBuilders.boolQuery()
                .must(QueryBuilders.matchAllQuery())
                .filter(QueryBuilders.rangeQuery("timestamp").from(from).to(to))
);
JestClient client = getJestClient(); // just a private method creating a client the way Jest client is created
AvgBuilder avg_agg = AggregationBuilders
.avg(AvgAggregation.TYPE)
.field("backend_processing_time");
searchSourceBuilder.aggregation(avg_agg);
String query = searchSourceBuilder.toString();
Search search = new Search.Builder(query)
.addIndex(INDEX)
.addType(TYPE)
.setParameter(Parameters.SIZE, 10)
.build();
SearchResult result = client.execute(search);
System.out.println("ES Response with aggregation:\n" + result.getJsonString());
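To read the computed average back out of the Jest response, something like the following sketch should work (it assumes Jest 2.0.0's aggregation accessors, i.e. SearchResult#getAggregations() and MetricAggregation#getAvgAggregation(), and that the name passed in matches the one the aggregation builder was created with):
// Look the aggregation up under the same name the builder registered it with
AvgAggregation avgResult = result.getAggregations().getAvgAggregation(avg_agg.getName());
if (avgResult != null && avgResult.getAvg() != null) {
    System.out.println("Average backend_processing_time = " + avgResult.getAvg());
}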

Get the total mapping and reducing times in hadoop programmatically

I am trying to calculate the individual total times spent mapping, shuffling, and reducing across all tasks in my MR job.
I need help retrieving that information for each MapReduce job.
Can someone post a code snippet that does that calculation?
You need to use the JobClient API, as shown below.
There are, however, some quirks to the API. Try it and let me know; I will help you out.
JobClient client = null;
Configuration configuration = new Configuration();
configuration.set("mapred.job.tracker", jobTrackerURL);
client = new JobClient(new JobConf(configuration));
while (true) {
    // getTrackerEntries(...) is a helper that returns the tracker's jobs matching jobName
    List<JobStatus> jobEntries = getTrackerEntries(jobName, client);
    for (JobStatus jobStatus : jobEntries) {
        JobID jobId = jobStatus.getJobID();
        String trackerJobName = client.getJob(jobId).getJobName();
        // One TaskReport per map/reduce task, with per-task start and finish times
        TaskReport[] mapReports = client.getMapTaskReports(jobId);
        TaskReport[] reduceReports = client.getReduceTaskReports(jobId);
        client.getJob(jobId).getJobStatus().getStartTime();
        int jobMapper = mapReports.length;
        mapNumber += jobMapper;
        int jobReducers = reduceReports.length;
        reduceNumber += jobReducers;
    }
}
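Building on the reports above, a rough sketch of how the total map and reduce wall-clock times could be summed (it assumes TaskReport#getStartTime() and TaskReport#getFinishTime(), which return epoch milliseconds; shuffle time is not broken out on TaskReport as far as I know, so it would have to be derived from the job's counters instead):
// Sum per-task durations to approximate the total mapping and reducing time
long totalMapTimeMs = 0L;
for (TaskReport mapReport : mapReports) {
    totalMapTimeMs += mapReport.getFinishTime() - mapReport.getStartTime();
}
long totalReduceTimeMs = 0L;
for (TaskReport reduceReport : reduceReports) {
    totalReduceTimeMs += reduceReport.getFinishTime() - reduceReport.getStartTime();
}
System.out.println("Total map time (ms): " + totalMapTimeMs);
System.out.println("Total reduce time (ms): " + totalReduceTimeMs);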

How can I know when the amazon mapreduce task is complete?

I am trying to run a MapReduce task on Amazon EC2.
I set all the configuration params and then call the runJobFlow method of the AmazonElasticMapReduce service.
I wonder whether there is any way to know whether the job has completed and what its status was.
(I need it to know when I can pick up the MapReduce results from S3 for further processing.)
Currently the code just keeps executing, because the call to runJobFlow is non-blocking.
public void startMapReduceTask(String accessKey, String secretKey
,String eC2KeyPairName, String endPointURL, String jobName
,int numInstances, String instanceType, String placement
,String logDirName, String bucketName, String pigScriptName) {
log.info("Start running MapReduce");
// config.set
ClientConfiguration config = new ClientConfiguration();
AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonElasticMapReduce service = new AmazonElasticMapReduceClient(credentials, config);
service.setEndpoint(endPointURL);
JobFlowInstancesConfig conf = new JobFlowInstancesConfig();
conf.setEc2KeyName(eC2KeyPairName);
conf.setInstanceCount(numInstances);
conf.setKeepJobFlowAliveWhenNoSteps(true);
conf.setMasterInstanceType(instanceType);
conf.setPlacement(new PlacementType(placement));
conf.setSlaveInstanceType(instanceType);
StepFactory stepFactory = new StepFactory();
StepConfig enableDebugging = new StepConfig()
.withName("Enable Debugging")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(stepFactory.newEnableDebuggingStep());
StepConfig installPig = new StepConfig()
.withName("Install Pig")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(stepFactory.newInstallPigStep());
StepConfig runPigScript = new StepConfig()
.withName("Run Pig Script")
.withActionOnFailure("TERMINATE_JOB_FLOW")
.withHadoopJarStep(stepFactory.newRunPigScriptStep("s3://" + bucketName + "/" + pigScriptName, ""));
RunJobFlowRequest request = new RunJobFlowRequest(jobName, conf)
.withSteps(enableDebugging, installPig, runPigScript)
.withLogUri("s3n://" + bucketName + "/" + logDirName);
try {
RunJobFlowResult res = service.runJobFlow(request);
log.info("Mapreduce job with id[" + res.getJobFlowId() + "] completed successfully");
} catch (Exception e) {
log.error("Caught Exception: ", e);
}
log.info("End running MapReduce");
}
thanks,
aviad
From the AWS documentation:
Once the job flow completes, the cluster is stopped and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.
It goes on to say:
If the JobFlowInstancesDetail : KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the job flow will transition to the WAITING state rather than shutting down once the steps have completed.
A maximum of 256 steps are allowed in each job flow.
For long running job flows, we recommend that you periodically store your results.
So it looks like there is no way of knowing when it is done. Instead, you need to save your data as part of the job.
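Note that the code in the question calls conf.setKeepJobFlowAliveWhenNoSteps(true), so per the quoted documentation the job flow will sit in the WAITING state after the steps finish rather than shutting down. If you instead want the cluster to terminate once the Pig step completes, a minimal change (keeping the rest of the configuration from the question) would be:
JobFlowInstancesConfig conf = new JobFlowInstancesConfig();
conf.setEc2KeyName(eC2KeyPairName);
conf.setInstanceCount(numInstances);
// false: let the job flow shut down automatically when the last step completes
conf.setKeepJobFlowAliveWhenNoSteps(false);
conf.setMasterInstanceType(instanceType);
conf.setPlacement(new PlacementType(placement));
conf.setSlaveInstanceType(instanceType);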
