Can the following design be accomplished in Storm?
Let's take the word count example from the following link:
https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/WordCountTopology.java
I am changing the word generator spout to a file reader spout
The design for this Word Count Topology is
1. Spout to read file and create sentences line by line
2. Bolt to split sentences to words
3. Bolt to add unique words and give a word and its corresponding count
So, in a way, the topology describes the flow a file goes through to count its unique words.
If I have two files, file 1 and file 2, I should be able to call the same topology and create two instances of it to run the same word count.
To track whether the word count has finished, each instance of the word count topology should report a completed status once its file has been processed.
In the current design of Storm, I find that the topology is the actual instance, so it is like a task.
One needs to make two different calls with different topology names, like
for file 1
StormSubmitter.submitTopology("WordCountTopology1", conf,builder.createTopology());
for file 2
StormSubmitter.submitTopology("WordCountTopology2", conf,builder.createTopology());
not to mention uploading the same jar again using the storm client
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount1Main.App "server" "filepath1"
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount2Main.App "server" "filepath2"
The other issue is that the topologies don't complete once the file is processed. They stay alive until we issue a kill on the topology
storm kill "WordCountTopology"
I understand that in a streaming world, where messages come from a message queue like Kafka, there is no end of messages. But how is that relevant in the file world, where the set of entities/messages is fixed?
Is there an API that does the following?
// creates the topology; this is done once, using the storm client to upload the respective jars
StormSubmitter.submitTopology("WordCountTopology", conf, builder.createTopology());
Once uploaded, the application code just instantiates the topology with the arguments
// creates an instance of the topology and returns a status tracker
JobTracker tracker = StormSubmitter.runTopology("WordCountTopology", conf, args);
// can query Storm to check whether the current job is complete
JobStatus status = StormSubmitter.getTopologyStatus(conf, tracker);
For reusing the same topology twice, you have two possibilities:
1) Use a constructor parameter for your file spout and instantiate the same topology twice with different parameters:
// static, so that it can be called from main()
private static StormTopology createMyTopology(String filename) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("File Spout", new FileSpout(filename));
    // add further spouts and bolts etc.
    return builder.createTopology();
}
public static void main(String[] args) throws Exception {
    String file1 = "/path/to/file1";
    String file2 = "/path/to/file2";
    Config c = new Config();
    // useFile1 is a flag you define yourself, e.g. parsed from args
    if(useFile1) {
        StormSubmitter.submitTopology("T1", c, createMyTopology(file1));
    } else {
        StormSubmitter.submitTopology("T1", c, createMyTopology(file2));
    }
}
2) As an alternative, you could configure your file spout in its open() method.
public class FileSpout implements IRichSpout {
    @Override
    public void open(Map conf, ...) {
        String filename = (String) conf.get("FILENAME");
        // ...
    }
    // other methods omitted
}
public static void main(String[] args) throws Exception {
String file1 = "/path/to/file1";
String file2 = "/path/to/file2";
Config c = new Config();
if(useFile1) {
c.put("FILENAME", file1);
} else {
c.put("FILENAME", file2);
}
// assemble topology...
StormSubmitter.submitTopology("T", c, builder.createTopology());
}
For your second question: there is no API in Storm that terminates a topology automatically. You could use TopologyInfo to monitor the number of tuples emitted by the spout. If that number does not change for some time, you can assume that the whole file has been read and then kill the topology.
Config cfg = new Config();
// set NIMBUS_HOST and NIMBUS_THRIFT_PORT in cfg
Nimbus.Client client = NimbusClient.getConfiguredClient(cfg).getClient();
// note: getTopologyInfo expects the topology id (see client.getClusterInfo()), not the name
TopologyInfo info = client.getTopologyInfo(topologyId);
// get emitted tuples...
client.killTopology("topologyName");
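The "does not change for some time" check can be kept separate from the Nimbus calls. Below is a minimal sketch, assuming you already have some way of reading the spout's emitted count out of TopologyInfo on each poll. The class name IdleDetector and its methods are made up for illustration and are not part of Storm.

```java
/** Hypothetical helper (not part of Storm): flags a counter that has stopped changing. */
final class IdleDetector {
    private final int requiredStablePolls;
    private long lastValue = -1;   // last emitted count seen (-1 = nothing seen yet)
    private int stablePolls = 0;   // consecutive polls without a change

    IdleDetector(int requiredStablePolls) {
        this.requiredStablePolls = requiredStablePolls;
    }

    /** Records one poll; returns true once the count was unchanged for enough consecutive polls. */
    boolean record(long emittedCount) {
        if (emittedCount == lastValue) {
            stablePolls++;
        } else {
            lastValue = emittedCount;
            stablePolls = 0;
        }
        return stablePolls >= requiredStablePolls;
    }

    public static void main(String[] args) {
        IdleDetector detector = new IdleDetector(2);
        long[] polls = {100, 250, 400, 400, 400};
        for (long emitted : polls) {
            System.out.println(emitted + " -> idle=" + detector.record(emitted));
        }
    }
}
```

In a polling loop you would call record(...) every few seconds and only call client.killTopology(...) once it returns true. A slow source can look idle too, so the threshold should probably be generous.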
The word count topology mentioned in the post doesn't do justice to the power of Storm. Storm is a stream processor, so it requires a stream, period. Files, by definition, are static. I empathize with the Storm developers: it is hard to give a simple hello-world example that showcases the topology concepts, and so a non-stream technology like a file was chosen. For newbies learning Storm, as I was at the time, it was difficult to understand how to develop real applications from the example. The example is just a way to show how Storm concepts work, not a real-world application of how files would arrive or need to be processed.
So here is my take on what one solution could look like.
Since topologies run all the time, they can compute the word count for as long as one wants, i.e. within a file or across all files, for any period of time.
To allow different files to come in, we need a streaming spout, so naturally you would need a Kafka message broker or similar to receive files as a stream. Depending on the size of the file and the limits the message broker imposes (Kafka's default maximum message size is about 1 MB), we can either send the file itself as the payload or send a reference to the file, in which case you need a distributed file system such as Hadoop DFS, or a NAS, to store the file.
We then read these files using a Kafka spout instead of the FileSpout.
We now have the following issues
1. Word Count Across Files
2. Word Count per File
3. Running Status on the word count till it is processed
4. When do we know if a file is processed or complete
Word Count Across Files
This is the use case the provided example targets. If we keep streaming files and, for each file, read the lines, split them into words, and send the words to other bolts, the bolts count the words regardless of which file they came from.
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
Field Grouping
quick
brown
fox
...
Once
upon
fox (no new group needed; it already appeared in File1)
...
Word Count Per File
To do this, the fields grouping key needs to combine the word with the fileId. So the example needs to change to include a fileId with each word it splits.
So, for
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
the fields grouping key would be (ignoring noise words)
File1_quick
File1_brown
File1_fox
File2_once
File2_upon
File2_fox
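The composite key idea can be tried out without a cluster. This is only a sketch in plain Java, with no Storm dependency; the class name PerFileWordCount is made up, and lower-casing the words is an assumption taken from the File2_once example above. In a topology the same key would be the field used in the fields grouping.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: counts words per file using a composite "fileId_word" key. */
final class PerFileWordCount {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Splits a line on whitespace and counts each word under its file's key. */
    void addLine(String fileId, String line) {
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank line
            }
            counts.merge(fileId + "_" + word.toLowerCase(), 1, Integer::sum);
        }
    }

    /** Returns how often the word was seen in the given file (0 if never). */
    int count(String fileId, String word) {
        return counts.getOrDefault(fileId + "_" + word.toLowerCase(), 0);
    }

    public static void main(String[] args) {
        PerFileWordCount counter = new PerFileWordCount();
        counter.addLine("File1", "A quick brown fox jumped");
        counter.addLine("File2", "Once upon a time a fox");
        System.out.println("File1_fox = " + counter.count("File1", "fox"));
        System.out.println("File2_fox = " + counter.count("File2", "fox"));
    }
}
```

Because the file id is part of the key, "fox" in File1 and "fox" in File2 are counted separately, which is exactly the difference from the across-files case.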
Running Status on the word count till it is processed
Since all these counts live in the bolt's memory and we don't know the EOF, there is no way to get the status unless someone peeks into the bolt or we periodically send the counts to another data store where we can query them. This is exactly what we need to do: at periodic intervals, persist the in-memory bolt counts to a data store such as HBase, Elasticsearch, or MongoDB.
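One possible shape for the periodic persistence is sketched below. It assumes the data store can apply increments; the store is represented here by a plain callback, and the name FlushingCounter is hypothetical. How flush() gets triggered (a timer, a tick tuple, every N tuples) is left open.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Illustrative only: keeps counts in memory and hands the deltas to a sink on flush. */
final class FlushingCounter {
    private final Map<String, Long> pending = new HashMap<>();

    /** Counts one occurrence of the key (for example "File1_fox"). */
    void increment(String key) {
        pending.merge(key, 1L, Long::sum);
    }

    /** Sends every pending delta to the sink, then clears the in-memory state. */
    void flush(BiConsumer<String, Long> sink) {
        pending.forEach(sink);
        pending.clear();
    }

    public static void main(String[] args) {
        Map<String, Long> store = new HashMap<>(); // stand-in for HBase, Elasticsearch, etc.
        FlushingCounter counter = new FlushingCounter();
        counter.increment("File1_fox");
        counter.increment("File1_fox");
        counter.flush((key, delta) -> store.merge(key, delta, Long::sum));
        System.out.println(store);
    }
}
```

Sending deltas rather than absolute totals keeps the bolt's memory small, at the cost of needing a store that can add to an existing value.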
When do we know if a file is processed or complete
This is perhaps the toughest question to answer in the streaming world. The stream processor doesn't know that the stream has finished: from its perspective, files keep coming in, and it needs to split each file into words and count them in the corresponding bolts. The bolts don't know what happened before or after a tuple reached them.
All of this needs to be done by the app developer.
One way to do this is to count the total words as each file is read and send a message such as
File 1 : Total Words : 1000
File 2 : Total Words : 2000
Now, when we do the word count and collect the different words per file (File1_*), the sum of the individual word counts should match the total words before we say a file is complete. All of this is custom logic we need to write before we can say it's complete.
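The matching rule can be written down in a few lines. This is a sketch under the assumption that the "total words" message and the individual word counts end up in the same place; the class FileCompletionTracker and its method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a file is complete when the counted words equal the announced total. */
final class FileCompletionTracker {
    private final Map<String, Long> expectedTotals = new HashMap<>();
    private final Map<String, Long> countedSoFar = new HashMap<>();

    /** Called when the "File 1 : Total Words : 1000" style message arrives. */
    void announceTotal(String fileId, long totalWords) {
        expectedTotals.put(fileId, totalWords);
    }

    /** Called once for every word of the file that has been counted. */
    void wordCounted(String fileId) {
        countedSoFar.merge(fileId, 1L, Long::sum);
    }

    /** True only if a total was announced and the counted words match it. */
    boolean isComplete(String fileId) {
        Long expected = expectedTotals.get(fileId);
        if (expected == null) {
            return false; // total not known yet
        }
        long counted = countedSoFar.getOrDefault(fileId, 0L);
        return counted == expected.longValue();
    }

    public static void main(String[] args) {
        FileCompletionTracker tracker = new FileCompletionTracker();
        tracker.announceTotal("File1", 2);
        tracker.wordCounted("File1");
        System.out.println(tracker.isComplete("File1"));
        tracker.wordCounted("File1");
        System.out.println(tracker.isComplete("File1"));
    }
}
```

If tuples can be replayed, the counted number may overshoot the total, so a real implementation would need to decide how to treat that case.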
In essence, Storm provides a framework for doing stream processing in a variety of ways. It's the application developer's job to work with that design and implement their own logic depending on the use case. It doesn't provide application use cases out of the box, or a good reference implementation, which I think we need to build, as it's not a commercial product and depends on the community to champion it.
Related
I am trying to run a script or a custom processor to group data by given attributes every hour. The queue size is up to 30-40k on a single run, and it might go up to 200k depending on the case.
MergeContent does not fit since there is no limit on min-max counts.
RouteOnAttribute does not fit since there are too many combinations.
Solution 1: Consume all flow files, group them by attributes, then create and push the new flow file. Not ideal, but I gave it a try.
I ran this with 33k flow files waiting in the queue:
session.getQueueSize().getObjectCount()
This number returns 10k every time, even though I increased the queue threshold numbers on the output flows.
Solution 2: A better approach is to consume one flow file and then filter the flow files matching the provided attributes
final List<FlowFile> flowFiles = session.get(file -> {
if (correlationId.equals(Arrays.stream(keys).map(file::getAttribute).collect(Collectors.joining(":"))))
return FlowFileFilter.FlowFileFilterResult.ACCEPT_AND_CONTINUE;
return FlowFileFilter.FlowFileFilterResult.REJECT_AND_CONTINUE;
});
Again, with 33k waiting in the queue, I was expecting around 200 new grouped flow files, but 320 were created. It looks like a similar issue to the one above: the filter query does not scan all waiting flow files.
Questions:
Is there a parameter to change so this getObjectCount can take up to 300k?
Is there a way to filter all waiting flow files again by changing a parameter or by changing the processor?
I tried setting the default queue threshold to 300k in nifi.properties, but it didn't help.
In nifi.properties there is a parameter that affects batching behavior:
nifi.queue.swap.threshold=20000
Here is my test flow:
1. GenerateFlowFile with "batch size = 50K"
2. ExecuteGroovyScript with the script below
3. LogAttribute (disabled) - just to have a queue after the Groovy script
Groovy script:
def ffList = session.get(100000) // get batch with maximum 100K files from incoming queue
if(!ffList)return
def ff = session.create() // create new empty file
ff.batch_size = ffList.size() // set attribute to real batch size
session.remove(ffList) // drop all incoming batch files
REL_SUCCESS << ff // transfer new file to success
With the parameters above, there are 4 files generated in the output:
1. batch_size = 20000
2. batch_size = 10000
3. batch_size = 10000
4. batch_size = 10000
According to the documentation:
There is also the notion of "swapping" FlowFiles. This occurs when the number of FlowFiles in a connection queue exceeds the value set in the nifi.queue.swap.threshold property. The FlowFiles with the lowest priority in the connection queue are serialized and written to disk in a "swap file" in batches of 10,000.
This explains why, of the 50K incoming files, 20K are kept in memory and the rest are swapped out in batches of 10K.
I don't know how increasing the nifi.queue.swap.threshold property will affect your system performance and memory consumption, but I set it to 100K on my local NiFi 1.16.3 and it looks good with multiple small files; the first batch increased to 100K as a result.
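The batch sizes seen in the test can be reproduced with a very simple model: one in-memory batch of at most the swap threshold, followed by swap files of 10,000 each. This is only a toy calculation that matches the numbers observed above, not NiFi's actual implementation; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not NiFi code) of the batch sizes observed in the test flow. */
final class SwapBatchModel {
    // Swap file size taken from the quoted documentation.
    static final int SWAP_FILE_SIZE = 10_000;

    /** Returns the expected batch sizes for a queue of the given length. */
    static List<Integer> batchSizes(int queued, int swapThreshold) {
        List<Integer> sizes = new ArrayList<>();
        int inMemory = Math.min(queued, swapThreshold);
        if (inMemory > 0) {
            sizes.add(inMemory); // first batch: whatever is held in memory
        }
        int remaining = queued - inMemory;
        while (remaining > 0) {
            int batch = Math.min(remaining, SWAP_FILE_SIZE);
            sizes.add(batch); // one batch per swap file
            remaining -= batch;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(batchSizes(50_000, 20_000));
        System.out.println(batchSizes(50_000, 100_000));
    }
}
```

With 50K queued files, a threshold of 20000 gives the four batches listed above, while a threshold of 100K leaves everything in one batch.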
I am building a Storm topology and I'm dealing with strings in this format: "x-x-x-x", where x is some digit. I want the stream of strings to be split equally between 4 bolts.
The problem is that with the following code, all the bolts get all the tuples, instead of each tuple being sent to exactly one bolt:
builder.setSpout("digits-spout", new ReaderSpout());
builder.setBolt("level-1", new SomeBolt(1)).shuffleGrouping("digits-spout");
builder.setBolt("level-2", new SomeBolt(2)).shuffleGrouping("digits-spout");
builder.setBolt("level-3", new SomeBolt(3)).shuffleGrouping("digits-spout");
builder.setBolt("level-4", new SomeBolt(4)).shuffleGrouping("digits-spout");
As you can see, I use the same bolt class but with different constructor arguments.
Thanks!
From what I understand of your question, I would suggest an extra bolt, as in the following example:
builder.setSpout("digits-spout", new ReaderSpout());
builder.setBolt("stringSplitterBoltName", new StringSplitterBolt(1)).shuffleGrouping("digits-spout");
builder.setBolt("level-1", new SomeBolt(1)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-2", new SomeBolt(2)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-3", new SomeBolt(3)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-4", new SomeBolt(4)).shuffleGrouping("stringSplitterBoltName");
If you want the bolts to have different processing logic, you can just run 4 tasks of the same bolt. In this case, the messages are distributed randomly between the bolt instances. You can check the string value within that bolt and take the appropriate execution path. This avoids a separate codebase for 4 bolts.
Alternatively, if you want separate bolt code for the strings, go with the suggestion above by zackeriya.
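The difference between four separately subscribed bolts and one bolt with four tasks can be seen in a small simulation. This is plain Java with no Storm dependency, and the random choice is only a stand-in for shuffle grouping. In Storm terms, the second variant would correspond to a single call such as builder.setBolt("level", new SomeBolt(1), 4).shuffleGrouping("digits-spout"), where the third argument is the parallelism hint.

```java
import java.util.Arrays;
import java.util.Random;

/** Toy simulation (no Storm dependency) of how many tuples each receiver sees. */
final class GroupingDemo {
    /** Several bolts, each subscribed to the spout: every bolt sees every tuple. */
    static int[] separateSubscriptions(int tuples, int bolts) {
        int[] received = new int[bolts];
        for (int t = 0; t < tuples; t++) {
            for (int b = 0; b < bolts; b++) {
                received[b]++;
            }
        }
        return received;
    }

    /** One bolt with several tasks: each tuple goes to exactly one task. */
    static int[] oneBoltWithTasks(int tuples, int tasks, Random random) {
        int[] received = new int[tasks];
        for (int t = 0; t < tuples; t++) {
            received[random.nextInt(tasks)]++;
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(separateSubscriptions(1000, 4)));
        System.out.println(Arrays.toString(oneBoltWithTasks(1000, 4, new Random(42))));
    }
}
```

In the first case every receiver counts all 1000 tuples; in the second the four counts add up to 1000, which is the behaviour the question asks for.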
I am using Storm with a Trident topology, but I am not able to understand how the parallelism is derived: my calculation differs from what I see in the Storm UI.
Here's the code that assigns the number of workers:
public Config getTopologyConfiguration() {
Config conf = new Config();
//conf.setDebug(true);
conf.setNumWorkers(6);
conf.setMessageTimeoutSecs(100);
return conf;
}
And here is the stream processing code:
s.name("aggregation_stream")
.parallelismHint(invoiceAggregationConfig.getSpoutParallelism())
.partitionBy(groupedFields)
.partitionAggregate(aggregateInputFields,
new GenericAggregator(groupedFields, aggregatedFieldsList, aggregateFieldsOperationList),
aggregatorOutputFields)
.parallelismHint(invoiceAggregationConfig.getAggregationParallelism())
.shuffle()
.each(aggregatorOutputFields,
new CreatePaymentFromInvoices(paymentType, groupMap, aggMap, paymentExtraParams),
Const.PAYMENT_FIELD)
.each(TridentUtils.fieldsConcat(aggregatorOutputFields, Const.PAYMENT_FIELD),
new CreateApplicationFromPaymentAndInvoices(invoiceType),
Const.APPLICATIONS_FIELD)
.each(TridentUtils.fieldsConcat(aggregatorOutputFields, Const.PAYMENT_FIELD, Const.APPLICATIONS_FIELD),
new RestbusFilterForPaymentAndApplications(environment, bu, serviceConfiguration))
.parallelismHint(invoiceAggregationConfig.getPersistenceParallelism());
And the parallelism attributes used in the code above are:
spoutParallelism: 3
aggregationParallelism: 6
persistenceParallelism: 6
Now, according to my calculation, the number of executors should be
3*6 + 6 = 24
But the Storm UI shows 23. How?
EDITED
Adding a new screenshot with information about the individual components.
Here I can see that the number of executors and tasks is 50, but I didn't set any configuration for this. Does Storm set this itself?
Secondly, the number of emitted tuples is huge. I am not producing this much data; it is more than 100 times more tuples. Why are this many tuples showing in the UI?
The number of emitted tuples can be huge.
Reason: when a spout emits a tuple it expects an ack. If the ack is not received, the tuple is re-emitted, so the emitted and transferred counts can be much higher. (Check the ack count: it is small compared to the emitted count.)
I'm currently writing a Scala application made of a Producer and a Consumer. The Producer gets some data from an external source and writes it to Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The Consumer is based on Spark Streaming; every 5 seconds it fetches new messages from Kafka and writes them to Elasticsearch. The problem is that I'm not able to write to ES, because I get a lot of errors like the one below:
[ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]: Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [3/26560] (maybe ES was overloaded?). Bailing out...
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57) ~[spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [na:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [na:na]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.scheduler.Task.run(Task.scala:58) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) [spark-core_2.10-1.2.1.jar:1.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer writes 6 messages every 15 seconds, so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages, since I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code:
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1 ,AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending them to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES: elasticsearch-spark_2.10, latest beta version.
I found, after many retries, a way to write to Elasticsearch without getting any error. Basically, passing the parameter "es.batch.size.entries" -> "1" to the saveToEs method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error, since I would expect an error message when trying to write more than the allowed max batch size, not less.
Moreover, I noticed that I was actually writing to ES, just not all my messages: I was losing between 1 and 3 messages per batch.
When I pushed a DataFrame to ES from Spark, I got the same error message. Even with the "es.batch.size.entries" -> "1" configuration, I got the same error.
Once I increased the thread pool in ES, the issue went away.
For example,
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
As already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. Writing them to Elasticsearch as part of the same batch produces the error above.
Possible solutions:
1. Generate a unique id for each record. Depending on your use case, this can be done in a few different ways. For example, one common solution is to create a new field by combining the id and lastModifiedDate fields and use that field as the id when writing to Elasticsearch.
2. De-duplicate records based on id: select only one record with a particular id and discard the other duplicates. Depending on your use case, this could be the most current record (based on a timestamp field), the most complete (most of the fields contain data), etc.
Solution #1 will store all records that you receive in the stream.
Solution #2 will store only the unique records for a specific id based on your de-duplication logic. The result would be the same as setting "es.batch.size.entries" -> "1", except that you will not limit performance by writing one record at a time.
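Both options can be sketched on a plain list before the batch is handed to the writer. The record shape below (id, lastModified, json) and the names Dedup, Doc, latestPerId, and compositeId are assumptions made for the example, and "most current wins" is just one of the de-duplication rules mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: two ways of avoiding duplicate ids within one batch. */
final class Dedup {
    /** Hypothetical record shape: document id, last-modified timestamp, JSON payload. */
    static final class Doc {
        final String id;
        final long lastModified;
        final String json;

        Doc(String id, long lastModified, String json) {
            this.id = id;
            this.lastModified = lastModified;
            this.json = json;
        }
    }

    /** Option 1: build a unique id from the original id and the modification time. */
    static String compositeId(Doc doc) {
        return doc.id + "_" + doc.lastModified;
    }

    /** Option 2: keep only the most recently modified record per id. */
    static List<Doc> latestPerId(List<Doc> batch) {
        Map<String, Doc> byId = new LinkedHashMap<>();
        for (Doc doc : batch) {
            byId.merge(doc.id, doc,
                    (old, cur) -> cur.lastModified >= old.lastModified ? cur : old);
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<Doc> batch = Arrays.asList(
                new Doc("a", 1L, "{\"v\":1}"),
                new Doc("b", 5L, "{\"v\":2}"),
                new Doc("a", 3L, "{\"v\":3}"));
        for (Doc doc : latestPerId(batch)) {
            System.out.println(doc.id + " " + doc.lastModified + " " + doc.json);
        }
    }
}
```

In the Spark job the same rule would be applied per batch before saveToEs, for instance by reducing on the id.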
One possibility is that the cluster/shard status is RED. Please address this issue, which may be due to unassigned replicas. Once the status turned GREEN, the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
The result is Elasticsearch receiving multiple updates for a single document at once: from multiple sources, through multiple nodes, and containing different data.
"I was losing between 1 and 3 messages per batch."
Fluctuating number of failures when batch size > 1
Success if batch write size "1"
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents, then:
If you are using a custom routing field (not _id), then according to the documentation the uniqueness of the documents is not guaranteed. This might cause issues while updating from Spark.
If you are using the standard _id, uniqueness will be preserved; however, you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing
I have a Hadoop job that has ~60k S3 input paths. This job takes about 45 minutes to start. The same job, with only ~3k S3 input paths starts almost instantly.
Why does having a large number of input paths cause the job to take so long to start?
The answer has to do with how FileInputFormat.addInputPath(...) is implemented. If you take a look at the source, you'll see that it's actually doing a string concatenation to save all of these paths in the job configuration. Calling addInputPaths(...) just calls addInputPath, so there are no savings there. I ended up calling FileInputFormat.setInputPaths(Job, Path[]). This skips the 60k+ string concatenations by building that part of the configuration once.
As climbage mentioned, there will need to be 60k+ calls to S3 to build the splits. It turns out that the S3 calls were taking less time than the string concatenation. My jobs went from taking 45 minutes to start down to less than 20.
For those who don't want to go combing through the source, here's the implementation of FileInputFormat.addInputPath() in Hadoop 2.5.1:
public static void addInputPath(Job job,
Path path) throws IOException {
Configuration conf = job.getConfiguration();
path = path.getFileSystem(conf).makeQualified(path);
String dirStr = StringUtils.escapeString(path.toString());
String dirs = conf.get(INPUT_DIR);
conf.set(INPUT_DIR, dirs == null ? dirStr : dirs + "," + dirStr);
}
and FileInputFormat.setInputPaths() in Hadoop 2.5.1:
public static void setInputPaths(Job job,
Path... inputPaths) throws IOException {
Configuration conf = job.getConfiguration();
Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);
StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
for(int i = 1; i < inputPaths.length;i++) {
str.append(StringUtils.COMMA_STR);
path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);
str.append(StringUtils.escapeString(path.toString()));
}
conf.set(INPUT_DIR, str.toString());
}
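The difference between the two methods can be isolated from Hadoop in a small sketch. It only models the string handling, not the conf.get/conf.set or makeQualified calls that the real methods also make, and the class and method names are invented for the example.

```java
/** Illustrative only: two ways of building a comma-separated path list. */
final class InputDirJoin {
    /** Mimics repeated addInputPath calls: every call rebuilds the whole string. */
    static String appendOneByOne(String[] paths) {
        String dirs = null;
        for (String path : paths) {
            dirs = (dirs == null) ? path : dirs + "," + path;
        }
        return dirs;
    }

    /** Total characters written by appendOneByOne: the sum of all intermediate lengths. */
    static long charsWrittenOneByOne(String[] paths) {
        long total = 0;
        long current = 0;
        for (int i = 0; i < paths.length; i++) {
            current += paths[i].length() + (i == 0 ? 0 : 1); // +1 for the comma
            total += current;
        }
        return total;
    }

    /** Mimics setInputPaths: one buffer, each path appended exactly once. */
    static String joinOnce(String[] paths) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) {
                joined.append(',');
            }
            joined.append(paths[i]);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String[] paths = {"s3://b/a", "s3://b/c", "s3://b/d"};
        System.out.println(appendOneByOne(paths).equals(joinOnce(paths)));
        System.out.println(charsWrittenOneByOne(paths) + " vs " + joinOnce(paths).length());
    }
}
```

Both variants produce the same string, but the one-by-one version copies the ever-growing prefix on each call, so the work grows roughly with the square of the number of paths.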
One of the first things that FileInputFormat does during MapReduce initialization is determine the input splits. This is done by creating a list of every input file and its information (such as file size). I imagine that 60k API calls to S3 for file information isn't fast. 45 minutes seems extraordinarily slow - there may be some rate limiting going on as well?
Sorry for reopening an old question, but I recently came across a similar issue.
The core of it is that in your case Hadoop will make 60K calls to AWS.
To work around this, one can use wildcards:
FileInputFormat.addInputPath(job, new Path("path_to_a_folder/prefix*"));
This will generate only 1 AWS call to list the directory path_to_a_folder and then filter by the prefix.
I hope this helps whoever finds this question.