How to provide values to storm for calculation - apache-storm

I'm having a hard time understanding how to provide values to Storm, since I am new to it.
I started with the starter kit. I went through TestWordSpout, where the following code provides new values:
public void nextTuple() {
Utils.sleep(100);
final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
final Random rand = new Random();
final String word = words[rand.nextInt(words.length)];
_collector.emit(new Values(word));
}
So I see it's taking one word at a time: _collector.emit(new Values(word));
How can I provide a collection of words directly? Is this possible?
TestWordSpout.java
What I mean is that when nextTuple is called, a new word is selected at random from the list and emitted. After a certain time interval, the emitted sequence may look like this:
#100ms: nathan
#200ms: golda
#300ms: golda
#400ms: jackson
#500ms: mike
#600ms: nathan
#700ms: bertels
What if I already have this list as a collection and just want to feed it to Storm?

Storm is designed and built to process a continuous stream of data. Please see Rationale for the Storm. It's very unlikely that input data is fed into the Storm cluster directly. Generally, the input to Storm comes from JMS queues, Apache Kafka, Twitter feeds, and so on. I assume you would like to pass in a few configuration values. In that case, the following applies.
Given Storm's design purpose, only limited configuration details are usually passed to it, such as RDBMS connection details (Oracle/DB2/MySQL etc.), JMS provider details (IBM MQ/RabbitMQ etc.), or Apache Kafka/HBase details.
For your particular question, or for providing configuration details for the products above, there are three ways that I can think of:
1. Set the configuration details on the instance of the Spout or Bolt
For example, declare instance variables and assign their values in the Spout/Bolt constructor, as below:
public class TestWordSpout extends BaseRichSpout {
List<String> listOfValues;
public TestWordSpout(List<String> listOfValues) {
this.listOfValues=listOfValues;
}
}
In the topology submission class, create an instance of the Spout with the list of values:
List<String> listOfValues=new ArrayList<String>();
listOfValues.add("nathan");
listOfValues.add("golda");
listOfValues.add("mike");
builder.setSpout("word", new TestWordSpout(listOfValues), 3);
These values are then available as instance variables in the nextTuple() method.
Please look at the Storm integrations in Storm contrib to see how configurations are set for RDBMS/Kafka etc. in this way.
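For the first option, here is a minimal sketch of how nextTuple() could walk through the supplied list, emitting one element per call. Storm's classes are deliberately left out so the example runs on its own: the Consumer<String> stands in for _collector.emit(new Values(word)), and the class name ListFeeder is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a spout that is constructed with a fixed list.
final class ListFeeder {
    private final List<String> listOfValues;
    private int index = 0; // position of the next value to emit

    ListFeeder(List<String> listOfValues) {
        // Copy so later changes to the caller's list do not affect the feeder.
        this.listOfValues = new ArrayList<>(listOfValues);
    }

    // Mirrors nextTuple(): emits at most one value per call.
    // Returns false once the list is exhausted and nothing was emitted.
    boolean nextTuple(Consumer<String> emit) {
        if (index >= listOfValues.size()) {
            return false;
        }
        emit.accept(listOfValues.get(index++));
        return true;
    }
}
```

Note that in a real topology with a parallelism hint of 3, as in the setSpout call above, each spout task would hold its own copy of the list, so every value would likely be emitted once per task.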
2. Set the configurations in getComponentConfiguration(). This method is meant for overriding topology configurations, but you can also pass in a few details of your own, as below:
@Override
public Map<String, Object> getComponentConfiguration() {
Map<String, Object> ret = new HashMap<String, Object>();
if(!_isDistributed) {
ret.put(Config.TOPOLOGY_MAX_TASK_PARALLELISM, 1);
return ret;
} else {
List<String> listOfValues=new ArrayList<String>();
listOfValues.add("nathan");
listOfValues.add("golda");
listOfValues.add("mike");
ret.put("listOfValues", listOfValues);
}
return ret;
}
The configuration details are then available in the open() method of a Spout or the prepare() method of a Bolt:
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
_collector = collector;
this.listOfValues=(List<String>)conf.get("listOfValues");
}
3. Declare the configurations in a properties file and package it in the jar that is submitted to the Storm cluster. The Nimbus node copies the jar to the worker nodes and makes it available to the executor threads. The open()/prepare() method can then read the properties file and assign the values to instance variables.
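As a rough sketch of the third option, the helper below turns a comma-separated property into a list using only java.util.Properties. The property key "words" and the class name WordListConfig are assumptions made for illustration. Inside open() or prepare(), the Reader would typically wrap a stream obtained with getClass().getClassLoader().getResourceAsStream(...) for whatever file name you packaged in the jar.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper; not part of Storm.
final class WordListConfig {
    // Reads the assumed "words" key and splits its value on commas.
    static List<String> loadWords(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> words = new ArrayList<>();
        for (String w : props.getProperty("words", "").split(",")) {
            String trimmed = w.trim();
            if (!trimmed.isEmpty()) {
                words.add(trimmed);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the packaged properties file.
        System.out.println(loadWords(new StringReader("words=nathan, golda, mike\n")));
    }
}
```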

The Values type accepts any kind of object, and any number of them.
So you can simply send a List, for instance, from the execute method of a Bolt or the nextTuple method of a Spout:
List<String> words = new ArrayList<>();
words.add("one word");
words.add("another word");
_collector.emit(new Values(words));
You can add a new field too; just be sure to declare it in the declareOutputFields method:
_collector.emit(new Values(words, "a new field value!"));
And in your declareOutputFields method:
@Override
public void declareOutputFields(final OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("collection", "newField"));
}
You can get the fields in the next Bolt in the topology from the tuple object given by the execute method:
List<String> collection = (List<String>) tuple.getValueByField("collection");
String newFieldValue = tuple.getStringByField("newField");

Related

Send data to Spring Batch Item Reader (or Tasklet)

I have the following requirement:
An endpoint http://localhost:8080/myapp/jobExecution/myJobName/execute which receives a CSV and uses univocity to apply some validations and generate a List of some POJO.
Send that list to a Spring Batch Job for some processing.
Multiple users could do this.
I want to know whether I can achieve this with Spring Batch.
I was thinking of using a queue: put the data in it and execute a Job that pulls objects from that queue. But if another person calls the endpoint while another Job is executing, how can I be sure that Spring Batch knows which item belongs to which execution?
You can use a queue, or you can take the list of values generated by the validation step and store it in the job execution context.
Below is a snippet that stores the list in the job context and reads it back using an ItemReader.
The first snippet implements StepExecutionListener in a Tasklet step to store the list that was constructed:
@Override
public ExitStatus afterStep(StepExecution stepExecution) {
//tenantNames is a List<String> which was constructed as an output of an evaluation logic
stepExecution.getJobExecution().getExecutionContext().put("listOfTenants", tenantNames);
return ExitStatus.COMPLETED;
}
The "listOfTenants" list is then read in a step that has a Reader (synchronized so that only one thread reads at a time), a Processor and a Writer. You could also store it in a queue and fetch it in the Reader. Snippet for reference:
public class ReaderStep implements ItemReader<String>, StepExecutionListener {
private List<String> tenantNames;
@Override
public void beforeStep(StepExecution stepExecution) {
try {
tenantNames = (List<String>)stepExecution.getJobExecution().getExecutionContext()
.get("listOfTenants");
logger.debug("Successfully fetched the tenant list from the context");
} catch (Exception e) {
// Exception block
}
}
@Override
public synchronized String read() throws Exception {
String tenantName = null;
if(tenantNames.size() > 0) {
tenantName = tenantNames.get(0);
tenantNames.remove(0);
return tenantName;
}
logger.info("Completed reading all tenant names");
return null;
}
// Rest of the overridden methods of this class..
}
Yes. Each job launch gets its own job execution with its own execution context, and Spring Boot would execute these jobs in different threads, so Spring knows which items belong to which execution.
Note: You can also use something like a logging correlation ID. This will help you filter the logs for a particular request. https://dzone.com/articles/correlation-id-for-logging-in-microservices
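The read() method in the snippet above removes the head of a list under synchronized. One possible alternative, sketched here in plain Java without the Spring Batch interfaces, keeps the items in a ConcurrentLinkedQueue: its poll() returns null when the queue is empty, which matches the ItemReader convention of returning null at the end of input. The class name QueueBackedReader is hypothetical.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical reader; in Spring Batch it would implement ItemReader<String>.
final class QueueBackedReader {
    private final Queue<String> pending;

    QueueBackedReader(List<String> tenantNames) {
        // Copy the items into a thread-safe queue owned by this reader.
        this.pending = new ConcurrentLinkedQueue<>(tenantNames);
    }

    // Returns the next item, or null once every item has been handed out.
    String read() {
        return pending.poll();
    }
}
```

Because each reader instance owns its own queue, two executions that build their readers from different lists would not see each other's items.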

Updating global store from data within transform

I currently have a simple topology:
KStream<String, Event> eventsStream = builder.stream(sourceTopic);
eventsStream.transformValues(processorSupplier, "nameCache")
.to(destinationTopic);
My events sometimes have a key/value pair and other times have just the key. I want to be able to add the value to those events that are missing the value. I have this working fine with a local state store but when I add more tasks, sometimes the key/value events and the value events are in different threads and so they aren't updated correctly.
I'd like to use a global state store for this but I'm having difficulty figuring out how to update the global store when new key/value pairs come in. I've created a global state store with the following code:
builder.addGlobalStore(stateStore, "global_store", Consumed.with(Serdes.String(), Serdes.String()), new ProcessorSupplier<String, String>() {
@Override
public Processor<String, String> get() {
return new Processor<String, String>() {
private ProcessorContext context;
@Override
public void init(final ProcessorContext processorContext) {
this.context = processorContext;
}
@Override
public void process(final String key, final String value) {
context.forward(key, value);
}
@Override
public void close() {
}
};
}
});
As far as I can tell, it is working but since there is no data in the topic, I'm not sure.
So my question is how do I update the global store from inside of the transformValues? store.put() fails with an error that global store is read only.
I found Write to GlobalStateStore on Kafka Streams but the accepted answer just says to update the underlying topic but I don't see how I can do that since the topic isn't in my stream.
---Edited---
I updated the code per #1 in the accepted answer. I see the new key/value pairs show up in global_store. But the globalStore doesn't seem to see the new keys. If I restart the application, it fills the cache with the data in the topic but new keys aren't visible until after I stop/start the application.
I added logging to the process(String, String) in the global store processor and it shows new keys being processed. Any ideas?
You only get read-only access to a global state store inside transformValues. If you want to update a global state store, then yes, you have to send the update to the store's underlying input topic, and the store will pick up the new value when that update message is consumed. The reason is that global state stores are populated on all application instances and use this input topic for fault tolerance. You can do this by branching your topology:
KStream<String, Event> eventsStream = builder.stream(sourceTopic);
//processing message as normal
eventsStream.transformValues(processorSupplier, "nameCache")
.to(destinationTopic);
// transform the event into an update message for the global store's topic
eventsStream.transform(updateGlobalStateProcessorSupplier, "nameCache")
.to("global_store");
Alternatively, use the low-level API to construct your topology manually, so that you can forward to both your destinationTopic topic and the global_store topic by calling ProcessorContext.forward with the name of the corresponding sink processor node.
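To make the update path concrete, here is a toy model in plain Java. It does not use the Kafka Streams API: an in-memory queue stands in for the topic behind the global store, a HashMap stands in for the store, and the class and method names are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model only; none of these names come from Kafka Streams.
final class GlobalStoreModel {
    // Stand-in for the topic that backs the global store.
    private final Queue<String[]> topic = new ArrayDeque<>();
    // Stand-in for the store itself; only consume() writes to it.
    private final Map<String, String> store = new HashMap<>();

    // What a stream task may do: send an update to the topic.
    void sendUpdate(String key, String value) {
        topic.add(new String[] {key, value});
    }

    // What a stream task sees: a read-only view of the store.
    Map<String, String> readOnlyView() {
        return Collections.unmodifiableMap(store);
    }

    // What the global store's processor does: apply each record to the store.
    void consume() {
        String[] record;
        while ((record = topic.poll()) != null) {
            store.put(record[0], record[1]);
        }
    }
}
```

In this model an update only becomes visible after consume() has applied it to the store. If new keys reach the global store's processor but never appear in the store, it may be worth checking whether that processor actually writes each record into the store rather than only forwarding it.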

How can I see the current output of a running Storm topology?

I am currently learning how to use Storm (version 2.1.0), and I am a bit confused about a specific aspect of this data stream processing (DSP) engine: how is output data handled? Tutorials provide good explanations of system setup and of running a first application. Unfortunately, I didn't find a page providing details on the results generated by a topology.
With DSP applications, there is no final output, because the input is a continuously incoming stream of data (or maybe we can say there is a final output when the application is stopped). What I would like is to be able to see the current output (the actual output data generated so far) of a running topology.
I'm able to run WordCountTopology. I understand the output of this topology is generated by the following snippet of code:
public static class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<String, Integer>();
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null) {
count = 0;
}
count++;
counts.put(word, count);
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
What I don't understand is where the <"word":string, "count":int> output ends up. Is it only in memory, written to a database somewhere, or written to a file?
Going further with this question: what are the existing possibilities for storing in-progress output data? What is the "good way" of handling such data?
I hope my question is not too naive. And thanks to the StackOverflow community for always providing good help.
A few days have passed since I posted this question. I am back to share what I have tried. Although I cannot tell whether it is the right way of doing it, the following two approaches answer my question.
Simple System.out.println()
The first thing I tried was to call System.out.println("Hello World!") directly within the prepare() method of my BaseBasicBolt. This method is called only once, at the beginning of each Bolt's thread execution.
public void prepare(Map topoConf, TopologyContext context) {
System.out.println("Hello World!");
}
The big challenge was to figure out where the log is written. By default, it is written within <storm installation folder>/logs/workers-artifacts/<topology name>/<worker-port>/worker.log where <worker-port> is the port of a requested worker/slot.
For instance, with conf.setNumWorkers(3), the topology requests access to 3 workers (3 slots). Therefore, the values of <worker-port> will be 6700, 6701 and 6702. Those values are the port numbers of the 3 slots (defined in storm.yaml under supervisor.slots.ports).
Note: you will have as many "Hello World!" as the parallel size of your BaseBasicBolt. When the split bolt is instantiated with builder.setBolt("split", new SplitSentence(), 8), it results in 8 parallel threads, each one writing its own log.
Writing to a file
For research purposes I have to analyse large amounts of logs that I need in a specific format. The solution I found is to append the logs to a specific file managed by each bolt.
Below is my own implementation of this file logging solution for the count bolt.
public static class WordCount extends BaseBasicBolt {
private String workerName;
private FileWriter fw;
private BufferedWriter bw;
private PrintWriter out;
private String logFile = "/var/log/storm/count.log";
private Map<String, Integer> counts = new HashMap<String, Integer>();
public void prepare(Map topoConf, TopologyContext context) {
this.workerName = this.toString();
try {
this.fw = new FileWriter(logFile, true);
this.bw = new BufferedWriter(fw);
this.out = new PrintWriter(bw);
} catch (Exception e) {
System.out.println(e);
}
}
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null) {
count = 0;
}
count++;
counts.put(word, count);
collector.emit(new Values(word, count));
out.println(this.workerName + ": Hello World!");
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
In this code, my log file is located at /var/log/storm/count.log, and calling out.println(text) appends the text at the end of this file. As I am not sure whether this is thread-safe, all parallel threads writing into the same file at the same time might result in data loss.
Note: if your bolts are distributed across multiple machines, each machine is going to have its own log file. During my testing, I configured a simple cluster with 1 machine (running Nimbus + Supervisor + UI), therefore I had only 1 log file.
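Regarding the thread-safety concern, one way to serialize appends within a single worker JVM is sketched below in plain Java. The class name SharedLog and its methods are hypothetical, and the lock only coordinates threads inside one JVM: several worker processes on the same machine writing to the same file would still not be coordinated by it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of Storm.
final class SharedLog {
    // One lock per JVM: serializes appends from all threads in this worker.
    private static final Object LOCK = new Object();

    // Appends one line, creating the file if it does not exist yet.
    static void appendLine(Path file, String line) {
        synchronized (LOCK) {
            try {
                Files.write(file, Collections.singletonList(line), StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Reads the file back, e.g. to inspect the output written so far.
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each call opens, writes and closes the file, so lines reach the disk immediately at the cost of some throughput compared with a long-lived buffered writer.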
Conclusion
There are multiple ways to deal with output data and, more generally, with logging anything in Storm. I didn't find any official way of doing it, and the documentation is very light on this subject.
While some of us would be satisfied with a simple System.out.println(), others might need to push large quantities of data into specific files, or maybe into a specialized database engine. Anything you can do with Java is possible with Storm because it's simple Java programming.
Any advice or additional comments to complete this answer would be greatly appreciated.

Creating Custom Processor In Apache Nifi

I am building a custom processor to process a flow file. To process the flow file I need to read a CSV file from my local file system. I created a property descriptor CSV_PATH as follows:
public static final PropertyDescriptor CSV_PATH = new
PropertyDescriptor
.Builder().name("CSV Path")
.displayName("CSV Path")
.description("CSV Path Reader")
.required(true)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.build();
@Override
protected void init(final ProcessorInitializationContext context) {
final List<PropertyDescriptor> descriptors = new
ArrayList<PropertyDescriptor>();
descriptors.add(JSON_PATH);
descriptors.add(CSV_PATH);
this.descriptors = Collections.unmodifiableList(descriptors);
final Set<Relationship> relationships = new HashSet<Relationship>();
relationships.add(SUCCESS);
this.relationships = Collections.unmodifiableSet(relationships);
}
Now I want to get the value of the CSV_PATH property set in the UI while configuring the processor. I am not able to get the CSV_PATH value. Also, if I hardcode the file path in the code, I am still not able to read the CSV from the local file system.
You want to use the following code to retrieve the value of the PropertyDescriptor from the ProcessContext:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) {
FlowFile flowFile = session.get();
if (flowFile == null) {
return;
}
final String csvPath = context.getProperty(CSV_PATH).getValue();
// Do something with csvPath
}
If you decide to support NiFi Expression Language in that property descriptor, you will also want to evaluate for that:
final String csvPath = context.getProperty(CSV_PATH).evaluateAttributeExpressions().getValue();
There are additional method overrides for that, including flowfile attributes, variable registry, custom decorators, etc.
This is documented in the Apache NiFi Developer's Guide. I recently did a presentation at Dataworks Summit Barcelona 2019 covering custom processor development with some best practices included and examples that may be helpful. You can also look at any existing processor in the NiFi codebase to see examples.
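Once csvPath is available, reading the file itself is ordinary Java. The sketch below uses only java.nio and a naive comma split, so it assumes a simple CSV without quoted fields or embedded commas; the class name SimpleCsv is hypothetical. If the read still fails with a hardcoded path, it may be worth checking that the path exists on the machine running NiFi and is readable by the user NiFi runs as.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of NiFi.
final class SimpleCsv {
    // Splits each non-blank line on commas, keeping trailing empty fields.
    static List<String[]> parseLines(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(line.split(",", -1));
        }
        return rows;
    }

    // Reads the whole file into memory; suitable for small lookup files only.
    static List<String[]> readCsv(Path csvPath) {
        try {
            return parseLines(Files.readAllLines(csvPath, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```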

Spring batch repeat step ending up in never ending loop

I have a spring batch job that I'd like to do the following...
Step 1 -
Tasklet - Create a list of dates, store the list of dates in the job execution context.
Step 2 -
JDBC Item Reader - Get list of dates from job execution context.
Get element(0) in dates list. Use it as input for jdbc query.
Store element(0) date in job execution context
Remove element(0) date from list of dates
Store element(0) date in job execution context
Flat File Item Writer - Get element(0) date from job execution context and use for file name.
Then using a job listener repeat step 2 until no remaining dates in the list of dates.
I've created the job and it works okay for the first execution of step 2. But step 2 is not repeating as I want it to. I know this because when I debug through my code it only breaks for the initial run of step 2.
It does however continue to give me messages like below as if it is running step 2 even when I know it is not.
2016-08-10 22:20:57.842 INFO 11784 --- [ main] o.s.batch.core.job.SimpleStepHandler : Duplicate step [readStgDbAndExportMasterListStep] detected in execution of job=[exportMasterListCsv]. If either step fails, both will be executed again on restart.
2016-08-10 22:20:57.846 INFO 11784 --- [ main] o.s.batch.core.job.SimpleStepHandler : Executing step: [readStgDbAndExportMasterListStep]
This ends up in a never ending loop.
Could someone help me figure out, or give a suggestion as to, why my step 2 is only running once?
Thanks in advance.
I've added links to PasteBin for my code so as not to pollute this post.
http://pastebin.com/QhExNikm (Job Config)
http://pastebin.com/sscKKWRk (Common Job Config)
http://pastebin.com/Nn74zTpS (Step execution listener)
From your question and your code I deduce that you want to execute the step once for each date you retrieve (the dates being retrieved before the actual job starts).
I suggest a design change. Create a java class that will get you the dates as a list and based on that list you will dynamically create your steps. Something like this:
@EnableBatchProcessing
public class JobConfig {
@Autowired
private JobBuilderFactory jobBuilderFactory;
@Autowired
private StepBuilderFactory stepBuilderFactory;
@Autowired
private JobDatesCreator jobDatesCreator;
@Bean
public Job executeMyJob() {
List<Step> steps = new ArrayList<Step>();
for (String date : jobDatesCreator.getDates()) {
steps.add(createStep(date));
}
return jobBuilderFactory.get("executeMyJob")
.start(createParallelFlow(steps))
.end()
.build();
}
private Step createStep(String date){
return stepBuilderFactory.get("readStgDbAndExportMasterListStep" + date)
.chunk(your_chunksize)
.reader(your_reader)
.processor(your_processor)
.writer(your_writer)
.build();
}
private Flow createParallelFlow(List<Step> steps) {
SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
// max multithreading = -1, no multithreading = 1, smart size = steps.size()
taskExecutor.setConcurrencyLimit(1);
List<Flow> flows = steps.stream()
.map(step -> new FlowBuilder<Flow>("flow_" + step.getName()).start(step).build())
.collect(Collectors.toList());
return new FlowBuilder<SimpleFlow>("parallelStepsFlow")
.split(taskExecutor)
.add(flows.toArray(new Flow[flows.size()]))
.build();
}
}
EDIT: added "jobParameter" input (slightly different approach also)
Somewhere on your classpath add the following example .properties file:
sql.statement="select * from awesome"
and add the following annotation to your JobDatesCreator class
@PropertySource("classpath:example.properties")
You can provide specific sql statements as a command line argument as well. From the spring documentation:
you can launch with a specific command line switch (e.g. java -jar
app.jar --name="Spring").
For more info on that see http://docs.spring.io/spring-boot/docs/current/reference/html/boot-features-external-config.html
The class that gets your dates (why use a tasklet for this?):
@PropertySource("classpath:example.properties")
public class JobDatesCreator {
@Value("${sql.statement}")
private String sqlStatement;
@Autowired
private CommonExportFromStagingDbJobConfig commonJobConfig;
private List<String> dates;
@PostConstruct
private void init(){
// Execute your logic here for getting the data you need.
JdbcTemplate jdbcTemplate = new JdbcTemplate(commonJobConfig.onlineStagingDb);
// access to your sql statement provided in a property file or as a command line argument
System.out.println("This is the sql statement I provided in my external property: " + sqlStatement);
// for now..
dates = new ArrayList<>();
dates.add("date 1");
dates.add("date 2");
}
public List<String> getDates() {
return dates;
}
public void setDates(List<String> dates) {
this.dates = dates;
}
}
I also noticed that you have a lot of duplicate code that you can quite easily refactor. Currently, for each writer you have something like this:
@Bean
public FlatFileItemWriter<MasterList> division10MasterListFileWriter() {
FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
writer.setResource(new FileSystemResource(new File(outDir, MerchHierarchyConstants.DIVISION_NO_10 )));
writer.setHeaderCallback(masterListFlatFileHeaderCallback());
writer.setLineAggregator(masterListFormatterLineAggregator());
return writer;
}
Consider using something like this instead:
public FlatFileItemWriter<MasterList> divisionMasterListFileWriter(String divisionNumber) {
FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
writer.setResource(new FileSystemResource(new File(outDir, divisionNumber )));
writer.setHeaderCallback(masterListFlatFileHeaderCallback());
writer.setLineAggregator(masterListFormatterLineAggregator());
return writer;
}
As not all code is available to correctly replicate your issue, this answer is a suggestion/indication to solve your problem.
Based on our discussion on Spring batch execute dynamically generated steps in a tasklet, I'm trying to answer the question of how to access a jobParameter before the job is actually executed.
I assume that there is a REST call which will execute the batch. In general, this will require the following steps:
1. a piece of code that receives the REST call with its parameters
2. creation of a new Spring context (there are ways to reuse an existing context and launch the job again, but there are some issues when it comes to reusing steps, readers and writers)
3. launch the job
The simplest solution would be to store the job parameter received from the service as a system property and then access this property when you build the job in step 3. But this could lead to a problem if more than one user starts the job at the same moment.
There are other ways to pass parameters into the Spring context when it is loaded, but that depends on how you set up your context.
For instance, if you are using Spring Boot directly for step 2, you could write a method like:
private int startJob(Properties jobParamsAsProps) {
SpringApplication springApp = new SpringApplication(.. my config classes ..);
springApp.setDefaultProperties(jobParamsAsProps);
ConfigurableApplicationContext context = springApp.run();
ExitCodeGenerator exitCodeGen = context.getBean(ExitCodeGenerator.class);
int code = exitCodeGen.getExitCode();
context.close();
return code;
}
This way, you can access the properties as usual with the standard @Value or @ConfigurationProperties annotations.

Resources