Flowfiles stuck in queue (Apache NiFi) - apache-nifi

I have the following flow:
ListFTP -> RouteOnAttribute -> FetchFTP -> UnpackContent -> ExecuteScript.
Some of the files are stuck in the queue between UnpackContent and ExecuteScript.
ExecuteScript consumed some flowfiles and they simply disappeared: the failure and success relationships are empty, and the processor only showed some activity in the Tasks/Time field. All of them are stuck in the queue before ExecuteScript. I tried to empty the queue, but not all of the flowfiles were deleted from it; about a third of them are still stuck. I then disabled all processors and tried to empty the queue again, but it returned: 0 FlowFiles (0 bytes) were removed from the queue.
When I try to change the connection's destination, it returns:
Cannot change destination of Connection because FlowFiles from this Connection are currently held by ExecuteScript[id=d33c9b73-0177-1000-5151-83b7b938de39]
The ExecuteScript script is taken from this answer (it uses Python).
So I can't empty the queue, because it always reports that there are no flowfiles to remove, and I can't remove the connection. This has been going on for several hours.
Connection configuration:
Scheduling is set to 0 sec, there are no penalties for flowfiles, etc.
Is this a problem with the script?
UPDATE
Changed script to:
flowFile = session.get()
if (flowFile != None):
    # All processing code starts at this indent
    if errorOccurred:
        session.transfer(flowFile, REL_FAILURE)
    else:
        session.transfer(flowFile, REL_SUCCESS)
# implicit return at the end
Same result.
UPDATE v2
I set concurrent tasks to 50, then ran ExecuteScript again and terminated it. I got this error:
UPDATE v3
I created an additional ExecuteScript processor with the same script and it works fine. But after I stopped this new processor and created new flowfiles, it now has the same problem: it just gets stuck.
Hilarious. Is ExecuteScript for single use?

You need to modify the code in NiFi 1.13.2, because NIFI-8080 caused these bugs. Alternatively, just use NiFi 1.12.1.
JythonScriptEngineConfigurator:
@Override
public Object init(ScriptEngine engine, String scriptBody, String[] modulePaths) throws ScriptException {
    // Always compile when first run
    if (engine != null) {
        // Add prefix for import sys and all jython modules
        prefix = "import sys\n"
                + Arrays.stream(modulePaths).map((modulePath) -> "sys.path.append(" + PyString.encode_UnicodeEscape(modulePath, true) + ")")
                .collect(Collectors.joining("\n"));
    }
    return null;
}

@Override
public Object eval(ScriptEngine engine, String scriptBody, String[] modulePaths) throws ScriptException {
    Object returnValue = null;
    if (engine != null) {
        returnValue = ((Compilable) engine).compile(prefix + scriptBody).eval();
    }
    return returnValue;
}

Related

KTable & LogAndContinueExceptionHandler

I have a very simple consumer from which I create a materialized view. I have enabled validation on my value object (throwing ConstraintViolationException for invalid JSON data). When I receive a value on which the validation fails, I expect the value to be logged and the consumer to read the next offset, as I have LogAndContinueExceptionHandler enabled.
However, LogAndContinueExceptionHandler is never invoked and the consumePojo state transitions from PENDING_ERROR to ERROR.
Code
@Bean
public Consumer<KTable<String, Pojo>> consume() {
    return values ->
            values
                    .filter((key, value) -> Objects.nonNull(key))
                    .mapValues(value -> value, Materialized.<String, Pojo>as(Stores.inMemoryKeyValueStore("POJO_STORE_NAME"))
                            .withKeySerde(Serdes.String())
                            .withValueSerde(SerdeUtil.pojoSerde())
                            .withLoggingDisabled())
                    .toStream()
                    .peek((key, value) -> log.debug("Receiving Pojo from topic with key: {}, and UUID: {}", key, value == null ? 0 : value.getUuid()));
}
Why is LogAndContinueExceptionHandler not invoked in the case of a KTable?
Note: If the code is changed to use KStream, I see the logging and the records being skipped, but not with KTable.
To handle exceptions that Kafka Streams does not handle itself, use the KafkaStreams.setUncaughtExceptionHandler method with a StreamsUncaughtExceptionHandler implementation. The handler must return one of 3 available enum values:
REPLACE_THREAD
SHUTDOWN_CLIENT
SHUTDOWN_APPLICATION
In your case REPLACE_THREAD is the best option, as described in KIP-671:
REPLACE_THREAD:
The current thread is shutdown and transits to state DEAD.
A new thread is started if the Kafka Streams client is in state RUNNING or REBALANCING.
For the Global thread this option will log an error and revert to shutting down the client until the option had been added
In Spring Kafka you can replace the default StreamsUncaughtExceptionHandler through StreamsBuilderFactoryBean:
@Autowired
void setMyStreamsUncaughtExceptionHandler(StreamsBuilderFactoryBean streamsBuilderFactoryBean) {
    streamsBuilderFactoryBean.setStreamsUncaughtExceptionHandler(exception -> StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
}
I was able to solve the problem after looking at the logs carefully. I found that the valueSerde for the Pojo was showing useNativeDecoding (the default being JsonSerde); because of this, the DeserializationExceptionHandler wasn't invoked and the thread terminated.
The problem went away when I fixed the valueSerde in application.properties.

Spring Batch Best Architecture to Read XML

What is the best-performing architecture for reading XML in Spring Batch? Each XML file is approximately 300 KB and we are processing 1 million of them.
Our current approach:
30 partitions and 30 grids; each slave gets 166 XML files
Commit chunk of 100
Application start memory is 8 GB
Using JAXB in the reader, with the default bean scope
@StepScope
@Qualifier("xmlItemReader")
public IteratorItemReader<BaseDTO> xmlItemReader(
        @Value("#{stepExecutionContext['fileName']}") List<String> fileNameList) throws Exception {
    String readingFile = "File Not Found";
    logger.info("----StaxEventItemReader----fileName--->" + fileNameList.toString());
    List<BaseDTO> fileList = new ArrayList<BaseDTO>();
    for (String filePath : fileNameList) {
        try {
            readingFile = filePath.trim();
            Invoice bill = (Invoice) getUnMarshaller().unmarshal(new File(filePath));
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO(bill, environment);
            unifiedDTO.setFileName(filePath);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        } catch (Exception e) {
            // Record the failure as an item so the error travels with the file name
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO();
            unifiedDTO.setFileName(readingFile);
            unifiedDTO.setErrorMessage(e);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        }
    }
    return new IteratorItemReader<>(fileList);
}
Our questions:
Is this architecture correct?
Is there any performance or architecture advantage in using StaxEventItemReader and XStreamMarshaller over JAXB?
How should we handle memory properly to avoid slowdowns?
I would create a job per xml file by using the file name as a job parameter. This approach has many benefits:
Restartability: If a job fails, you only restart the failed file (from where it left off)
Scalability: This approach allows you to run multiple jobs in parallel. If a single machine is not enough, you can distribute the load on multiple machines
Logging: Logs are separate by design, you don't need to use an MDC or any other technique to separate logs
We are receiving XML filepath in a *.txt file
You can create a script that iterates over these lines and launches a job per line (that is, per file). GNU Parallel (or a similar tool) is a good option for launching jobs in parallel.
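As a rough illustration of such a script, here is a minimal Python sketch. It assumes the batch application is packaged as an executable jar that accepts job parameters as key=value arguments; the jar name invoice-batch.jar and the parameter name fileName are hypothetical, as are the function names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_paths(text):
    """Return the non-empty, trimmed lines of the *.txt content that lists the XML paths."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_command(xml_path, jar="invoice-batch.jar"):
    """Build the command for one job; jar name and fileName parameter are hypothetical."""
    return ["java", "-jar", jar, "fileName=" + xml_path]


def launch_all(xml_paths, max_parallel=4):
    """Launch one job per file, at most max_parallel at a time, and return the exit codes."""
    commands = [build_command(path) for path in xml_paths]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Each worker thread blocks on one external job process
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

Calling launch_all(parse_paths(open("files.txt").read())) would then start one job per listed file; a non-zero exit code identifies the file whose job needs a restart.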

Why is my Storm topology not acking when I send tuples to ElasticSearch

I'm new to using Storm, I've just started a Data Architect training course and it's in this context that I'm facing the problem that brings me to you today.
I'm receiving messages from Kafka via a KafkaSpout named CurrentPriceSpout. So far, everything is working. Then, in my CurrentPriceBolt, I re-emit a tuple so that my data is written to ElasticSearch by the EsCurrentPriceBolt. This is where the problem is: my data is not written to ElasticSearch right away, it is only written when I delete my topology.
Is there a Storm parameter that can force the tuples to be written and acknowledged?
I tried adding the option ".addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5)". The tuples are then written to ElasticSearch but not acknowledged, so Storm rewrites them indefinitely.
Thanks for your help
Thierry
I managed to find the answer to my problem.
The main problem was that ES is not designed to ingest as little data as a study project generates. By default, ES writes data in batches of 1000 entries. In this project I generate one entry every 30 seconds, which means one batch of 1000 every 500 minutes (or 8h20).
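The arithmetic behind that figure, as a small Python sketch (the function name is only for illustration):

```python
def minutes_to_fill_batch(batch_size, seconds_per_entry):
    """Minutes that pass before a batch of batch_size entries is full."""
    return batch_size * seconds_per_entry / 60


minutes = minutes_to_fill_batch(1000, 30)
hours, remainder = divmod(minutes, 60)
print(f"{minutes:.0f} minutes = {hours:.0f} h {remainder:.0f} min")  # 500 minutes = 8 h 20 min
```

With a batch size of 1, the same formula gives half a minute, so every entry is flushed as soon as it arrives.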
So I reviewed the configuration of my topology in detail and adjusted the following options:
es.batch.size.entries: 1
es.storm.bolt.flush.entries.size: 1
topology.producer.batch.size: 1
topology.transfer.batch.size: 1
And now it goes like this:
...
...
public class App
{
    ...
    ...
    public static void main( String[] args ) throws AlreadyAliveException, InvalidTopologyException, AuthorizationException
    {
        ...
        ...
        StormTopology topology = topologyBuilder.createTopology(); // create my Storm topology
        String topologyName = properties.getProperty("storm.topology.name"); // name my topology
        StormSubmitter.submitTopology(topologyName, getTopologyConfig(properties), topology); // start my topology on my Storm cluster
        System.out.println( "Topology on remote cluster : Started!" );
    }

    private static Config getTopologyConfig(Properties properties)
    {
        Config stormConfig = new Config();
        stormConfig.put("topology.workers", Integer.parseInt(properties.getProperty("topology.workers")));
        stormConfig.put("topology.enable.message.timeouts", Boolean.parseBoolean(properties.getProperty("topology.enable.message.timeouts")));
        stormConfig.put("topology.message.timeout.secs", Integer.parseInt(properties.getProperty("topology.message.timeout.secs")));
        stormConfig.put("topology.transfer.batch.size", Integer.parseInt(properties.getProperty("topology.transfer.batch.size")));
        stormConfig.put("topology.producer.batch.size", Integer.parseInt(properties.getProperty("topology.producer.batch.size")));
        return stormConfig;
    }
    ...
    ...
    ...
}
And now it works!!!

Spark Streaming: Micro batches Parallel Execution

We are receiving data in Spark Streaming from Kafka. Once execution has started in Spark Streaming, it executes only one batch at a time and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We tried multiple configurations with multiple executors, cores, back pressure and other settings, but nothing has worked so far. A lot of messages are queued, only one micro batch is processed at a time, and the rest remain in the queue.
We want to achieve maximum parallelism so that no micro batch is queued, as we have enough resources available. How can we reduce the processing time by making maximum use of our resources?
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
        getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe("TOPIC_NAME",
                sparkServiceConf.getKafkaConsumeParams()));
ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());
JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
    private static final long serialVersionUID = 1L;

    @Override
    public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
        return kafkaRecord.value();
    }
});
// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
    private static final long serialVersionUID = 1L;

    @Override
    public String call(byte[] asn1Data) throws Exception {
        if (asn1Data.length > 0) {
            try (InputStream inputStream = new ByteArrayInputStream(asn1Data);
                 Writer writer = new StringWriter(); ) {
                ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
                GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
                byte[] buffer = new byte[1024];
                ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
                int len;
                while ((len = gzipInputStream.read(buffer)) != -1) {
                    byteArrayOutputStream.write(buffer, 0, len);
                }
                return new String(byteArrayOutputStream.toByteArray());
            } catch (Exception e) {
                //
                producer.flush();
                throw e;
            }
        }
        return null;
    }
});
// publish generated json gzip to kafka
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
        if (!jsonRdd4DF.isEmpty()) {
            //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            airMainJsonProcessor.processAIRData(json, sparkSession);
        }
    }
});
getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technology that we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics that we got from different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can maximize processing so that nothing is queued.
I was facing the same issue, tried a few things to resolve it, and came to the following findings:
First of all, intuition suggests that one batch is processed per executor. In fact, only one batch is processed at a time, although its jobs and tasks run in parallel.
Multiple batches can be processed concurrently by setting spark.streaming.concurrentJobs, but it is not documented and still needs a few fixes. One of the problems is saving Kafka offsets. Suppose we set this parameter to 4 and 4 batches are processed in parallel: if the 3rd batch finishes before the 4th one, which Kafka offsets should be committed? This parameter is quite useful if batches are independent.
Because of its name, spark.default.parallelism is sometimes assumed to make things run in parallel, but its real benefit is in distributed shuffle operations. Try different numbers and find the optimum for your workload. You will see a considerable difference in processing time, depending on the shuffle operations in your jobs. Setting it too high decreases performance, which is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach on the RDD. But I think foreachPartition is better, because foreachPartitionAsync queues up the jobs: batches appear to be processed while their jobs are still queued or running. Maybe I didn't use it correctly, but it behaved the same way in my 3 services.
spark.scheduler.mode should be set to FAIR for jobs with lots of tasks, because round-robin assignment of tasks to jobs gives smaller tasks a chance to start receiving resources while bigger tasks are processing.
Try to tune your batch duration and input size so that the processing time always stays below the batch duration; otherwise you are going to see a long backlog of batches.
These are my findings and suggestions. However, there are many configurations and ways to do streaming, and a set of operations that works for one job often doesn't work for others. Spark Streaming is all about learning and combining your experience and anticipation to arrive at an optimal configuration.
Hope it helps. It would be a great relief if someone could explain specifically how batches can legitimately be processed in parallel.
We want to achieve parallelism at maximum, so that not any micro batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data at a rate slower than it arrives, it will be queued. Also, don't expect the processing of one record to suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36s to process 2 records and 70s to process 17 records. Clearly, there is some per-batch overhead. If this dependency is linear, it would take only 4:18 to process all 100 records in a single mini-batch, beating your best result so far.
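The 4:18 estimate follows from fitting a straight line through the two measurements, as in this small Python sketch (the function name is only for illustration, and the linearity is the assumption stated above):

```python
def fit_line(point_a, point_b):
    """Return (seconds_per_record, fixed_overhead_seconds) for the line through two (records, seconds) points."""
    (n1, t1), (n2, t2) = point_a, point_b
    slope = (t2 - t1) / (n2 - n1)
    intercept = t1 - slope * n1
    return slope, intercept


slope, overhead = fit_line((2, 36), (17, 70))
total = overhead + slope * 100  # estimated seconds for 100 records in one batch
minutes, seconds = divmod(round(total), 60)
print(f"{slope:.2f} s/record, {overhead:.1f} s overhead, 100 records in {minutes}:{seconds:02d}")
# 2.27 s/record, 31.5 s overhead, 100 records in 4:18
```

Under this model, roughly 31 seconds of every batch are fixed overhead, which is why many small batches take far longer than one large batch.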
Since your code is not complete, it's hard to tell what exactly takes so much time. Transformations in the code look fine but probably the action (or subsequent transformations) are the real bottlenecks. Also, what's with producer.flush() which wasn't mentioned anywhere in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
messages.foreachRDD { rdd =>
  val f = Future {
    // sleep(100)
    val newRDD = rdd.map { message =>
      val req_message = message.value()
      (message.value())
    }
    println("Request messages: " + newRDD.count())
    var resultrows = newRDD.collect() //.collectAsList()
    processMessage(resultrows, mlFeatures: MLFeatures, conf)
    println("Inside scala future")
    1
  }
  f.onComplete {
    case Success(messages) => println("yay!")
    case Failure(exception) => println("On no!")
  }
}
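The same idea, sketched in Python for readers less familiar with Scala: each batch is handed to a worker as soon as it arrives, so a slow batch does not hold up the next one. Plain lists stand in for RDDs here, no Spark is involved, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch):
    """Stand-in for the per-batch work (processMessage in the snippet above)."""
    return len(batch)


def submit_batches(batches, max_workers=4):
    """Submit every batch without waiting for the previous one, then collect the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_batch, batch) for batch in batches]
        return [future.result() for future in futures]


print(submit_batches([["a", "b"], ["c"], []]))  # [2, 1, 0]
```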
It's hard to tell without having all the details, but the general advice for tackling issues like this is to start with a very simple application of the "Hello world" kind. Just read from the input stream and print the data to a log file. If this works, you have shown that the problem is in your application, and you can gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know that the problem is in the configuration or in the Spark cluster itself. Hope this helps.

What happens when all the datanodes fail in hadoop?

I read in Hadoop Operations that if a datanode fails during the write process,
A new replication pipeline containing the remaining datanodes is
opened and the write resumes. At this point, things are mostly back to
normal and the write operation continues until the file is closed. The
namenode will notice that one of the blocks in the file is
under-replicated and will arrange for a new replica to be created
asynchronously. A client can recover from multiple failed datanodes
provided at least a minimum number of replicas are written (by
default, this is one).
But what happens if all the datanodes fail, i.e., the minimum number of replicas is not written?
Will the client ask the namenode for a new list of datanodes, or will the job fail?
Note: My question is NOT what happens when all the datanodes in the cluster fail. The question is what happens if all the datanodes to which the client was supposed to write fail during the write operation.
Suppose the namenode told the client to write block B1 to datanodes D1 in Rack1, D2 in Rack2 and D3 in Rack1. There might be other racks in the cluster as well (Rack 4, 5, 6, ...). If Rack1 and Rack2 fail during the write process, the client knows that the data was not written successfully, since it didn't receive the ACK from the datanodes. At this point, will it ask the namenode for a new set of datanodes, perhaps in the racks that are still alive?
OK, I understand what you are asking. DFSClient gets from the namenode a list of datanodes to which it is supposed to write a block (say A) of a file. DFSClient iterates over that list of datanodes and writes block A to those locations. If the block write fails on the first datanode, it abandons the block write and asks the namenode for a new set of datanodes where it can attempt to write again.
Here is the sample code from DFSClient that shows this:
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
    //----- other code ------
    do {
        hasError = false;
        lastException = null;
        errorIndex = 0;
        retry = false;
        nodes = null;
        success = false;
        long startTime = System.currentTimeMillis();
        lb = locateFollowingBlock(startTime);
        block = lb.getBlock();
        accessToken = lb.getBlockToken();
        nodes = lb.getLocations();
        //
        // Connect to first DataNode in the list.
        //
        success = createBlockOutputStream(nodes, clientName, false);
        if (!success) {
            LOG.info("Abandoning block " + block);
            namenode.abandonBlock(block, src, clientName);
            // Connection failed. Let's wait a little bit and retry
            retry = true;
            try {
                if (System.currentTimeMillis() - startTime > 5000) {
                    LOG.info("Waiting to find target node: " + nodes[0].getName());
                }
                Thread.sleep(6000);
            } catch (InterruptedException iex) {
            }
        }
    } while (retry && --count >= 0);
    if (!success) {
        throw new IOException("Unable to create new block.");
    }
    return nodes;
}
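The control flow of that loop can be summarized in a short Python sketch. The three callables are hypothetical stand-ins for the namenode and datanode calls in the Java code, the default of 3 retries is an assumption, and the 6-second wait between attempts is left out.

```python
def next_block_pipeline(allocate_block, open_pipeline, abandon_block, retries=3):
    """Ask for a new block until a pipeline opens or the retries run out.

    allocate_block() returns (block, nodes); open_pipeline(nodes) returns True on
    success; abandon_block(block) tells the namenode to discard the block.
    """
    for _ in range(retries + 1):
        block, nodes = allocate_block()
        if open_pipeline(nodes):
            return nodes
        abandon_block(block)  # give the block back before asking for new datanodes
    raise IOError("Unable to create new block.")


attempts = iter([("b1", ["D1", "D2", "D3"]), ("b2", ["D4", "D5", "D6"])])
abandoned = []
nodes = next_block_pipeline(
    allocate_block=lambda: next(attempts),
    open_pipeline=lambda candidates: "D1" not in candidates,  # pretend the first set is down
    abandon_block=abandoned.append,
)
print(nodes, abandoned)  # ['D4', 'D5', 'D6'] ['b1']
```

So the write only fails once every retry has been used up; until then the client keeps asking the namenode for a fresh set of datanodes.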

Resources