Spring Batch - Loop reader, processor and writer for N times - spring

In Spring Batch, how do I loop the reader, processor and writer N times?
My requirement is:
I have N customers/clients.
For each customer/client, I need to fetch the records from the database (Reader), process all records for that customer/client (Processor), and then write the records to a file (Writer).
How do I loop the Spring Batch job N times?

AFAIK there's no framework support for this scenario, at least not the way you want to solve it.
I'd suggest solving the problem differently:
Option 1
Read/process/write all records from all customers at once. You can only do this if they are all in the same DB. I would not recommend it otherwise, because you'd have to configure JTA/XA transactions and it's not worth the trouble.
Option 2
Run your job once for each client (best option in my opinion). Save the necessary info for each client in a separate properties file (DB connection data, values to filter records by client, whatever other client-specific data you may need) and pass the job a parameter telling it which client to use. This way you can control which client is processed, and when, using bash scripts and/or cron. If you use Spring Boot + Spring Batch you can store the client configuration in profiles (application-clientX.properties) and run the process like:
$> java -Dspring.profiles.active="clientX" \
-jar "yourBatch-1.0.0-SNAPSHOT.jar" \
-next
Bonus - Option 3
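If none of the above fits your needs, or you insist on solving the problem the way you presented it, then you can configure the job dynamically depending on parameters, creating one step for each client using Java config:
@Bean
public Job job() {
    // Chain one step per client: start(...) for the first, next(...) for the rest.
    SimpleJobBuilder builder = null;
    for (Client c : clientsToProcess) {
        Step step = buildStepByClient(c);
        builder = (builder == null) ? jobBuilders.get("job").start(step) : builder.next(step);
    }
    return builder.build(); // assumes clientsToProcess is not empty
}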
Again, I strongly advise you not to go this way: it's ugly, goes against the framework's philosophy, is hard to maintain and debug, and you'll probably have to use JTA/XA here as well.
I hope I've been of any help!

Local Partitioning will solve your problem.
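In your partitioner, you put all of your client IDs into a map as shown below (just pseudo code):
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class PartitionByClient implements Partitioner {
    private final List<String> allClients; // client IDs, loaded elsewhere

    public PartitionByClient(List<String> allClients) {
        this.allClients = allClients;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // One ExecutionContext per client; each one becomes its own step execution.
        Map<String, ExecutionContext> result = new HashMap<>();
        int partitionNumber = 1;
        for (String client : allClients) {
            ExecutionContext value = new ExecutionContext();
            value.putString("client", client);
            result.put("Client [" + client + "] : THREAD " + partitionNumber, value);
            partitionNumber++;
        }
        return result;
    }
}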
This is only a sketch. You will have to look at the detailed documentation on partitioning.
You will have to mark your reader, processor and writer with @StepScope (i.e. whichever parts need the value of your client). The reader will use this client in the WHERE clause of its SQL. To inject the value, use @Value("#{stepExecutionContext['client']}") String client in the reader (etc.) definition.
Now the final piece: you will need a task executor. Provided you set this task executor in your master partitioner step configuration, as many clients as concurrencyLimit allows will run in parallel.
@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor simpleTaskExecutor = new SimpleAsyncTaskExecutor();
    simpleTaskExecutor.setConcurrencyLimit(concurrencyLimit);
    return simpleTaskExecutor;
}
concurrencyLimit will be 1 if you wish only one client running at a time.
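To make the moving parts concrete without pulling in Spring Batch, here is a minimal plain-Java sketch of the same shape: one unit of work per client, run through a pool whose size stands in for concurrencyLimit. Every name in it (ClientLoopSketch, runForClients, the fake file names) is made up for illustration and is not part of Spring Batch; a real job would do the read/process/write inside a partitioned step instead.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java model of "one partition per client, at most N at a time".
// None of these names come from Spring Batch; they are illustrative only.
final class ClientLoopSketch {
    // Highest number of clients observed running at the same moment in the last run.
    static final AtomicInteger MAX_OBSERVED = new AtomicInteger();

    static Map<String, String> runForClients(List<String> clients, int concurrencyLimit) {
        Map<String, String> results = new ConcurrentHashMap<>();
        AtomicInteger running = new AtomicInteger();
        MAX_OBSERVED.set(0);
        // The pool size plays the role of concurrencyLimit on the task executor.
        ExecutorService pool = Executors.newFixedThreadPool(concurrencyLimit);
        for (String client : clients) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                MAX_OBSERVED.accumulateAndGet(now, Math::max);
                // Stand-in for read -> process -> write scoped to this one client.
                results.put(client, "records-of-" + client + ".csv");
                running.decrementAndGet();
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("clients did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return results;
    }
}
```

With the limit set to 1, as in ClientLoopSketch.runForClients(List.of("a", "b", "c"), 1), the clients run strictly one after another, which is the "loop N times" behaviour the question asks for.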

Spring @StreamListener process(KStream<?,?> stream) Partition

I have a topic with multiple partitions. In my stream processor I want to stream from just one partition, and I could not figure out how to configure this.
spring.cloud.stream.kafka.streams.bindings.input.consumer.application-id=s-processor
spring.cloud.stream.bindings.input.destination=uinput
spring.cloud.stream.bindings.input.group=r-processor
spring.cloud.stream.bindings.input.contentType=application/java-serialized-object
spring.cloud.stream.bindings.input.consumer.header-mode=raw
spring.cloud.stream.bindings.input.consumer.use-native-decoding=true
spring.cloud.stream.bindings.input.consumer.partitioned=true
@StreamListener(target = "input")
// @SendTo(value = { "uoutput" })
public void process(KStream<UUID, AModel> ustream) {
I want only one partition's data to be processed by this processor; there will be other processors for the other partition(s).
So far my finding is that it has something to do with https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/StreamsConfig.html#PARTITION_GROUPER_CLASS_CONFIG, but I could not find how to set this property in the Spring application.properties.
I think the partition grouper is for grouping partitions into tasks within a single processor. If you want to ensure that only a single partition is processed by a processor, then you need to provide at least as many processor instances as there are topic partitions. For example, if your topic has 4 partitions, you need 4 instances of the stream application to ensure that each instance processes only a single partition.
Kafka Streams does not allow reading a single partition. If you subscribe to a topic, all partitions are consumed and distributed over the available instances. Thus, you can't know in advance which partition is assigned to which instance, and all instances execute the same code.
But each partition linked to a processor has a different kind of data and hence requires a different processor application.
For this case, the processor (or transformer) must be able to process data for all partitions. Kafka Streams exposes the partition number via the ProcessorContext object that is handed to a processor via the init() method: https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/Transformer.html#init-org.apache.kafka.streams.processor.ProcessorContext-
Thus, you need to "branch" within your transformer to apply different processing logic based on the partition:
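ustream.transform(() -> new MyTransformer());

class MyTransformer implements Transformer<K, V, R> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context; // keep the context so transform() can read the partition
    }

    // other methods omitted

    @Override
    public R transform(K key, V value) {
        switch (context.partition()) {
            case 0:
                // your processing logic
                break;
            case 1:
                // your processing logic
                break;
            // ...
        }
        return null; // return the transformed record here, or null to emit nothing
    }
}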
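If the switch grows unwieldy, the same branching can be expressed as a lookup table from partition number to handler. The following is only a plain-Java sketch of that idea, independent of Kafka Streams: PartitionDispatchSketch, process and the two example handlers are hypothetical names, and in a real transformer the partition argument would come from context.partition().

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the body of transform(): pick the logic by partition number.
final class PartitionDispatchSketch {
    // Partition number -> processing logic; both handlers are made-up examples.
    private static final Map<Integer, UnaryOperator<String>> HANDLERS = Map.of(
            0, value -> value.toUpperCase(Locale.ROOT),
            1, value -> new StringBuilder(value).reverse().toString());

    static String process(int partition, String value) {
        UnaryOperator<String> handler = HANDLERS.get(partition);
        if (handler == null) {
            // Fail loudly for a partition nobody registered logic for.
            throw new IllegalArgumentException("No handler for partition " + partition);
        }
        return handler.apply(value);
    }
}
```

Failing on an unknown partition is a deliberate choice here: if the topic later gains partitions, records are not silently processed with the wrong logic.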

sending input from single spout to multiple bolts with Fields grouping in Apache Storm

builder.setSpout("spout", new TweetSpout());
builder.setBolt("bolt", new TweetCounter(), 2).fieldsGrouping("spout",
new Fields("field1"));
I have an input field "field1" added in the fields grouping. By definition of fields grouping, all tweets with the same "field1" should go to a single task of TweetCounter. The number of executors set for the TweetCounter bolt is 2.
However, if "field1" is the same in all the tuples of incoming stream, does this mean that even though I specified 2 executors for TweetCounter, the stream would only be sent to one of them and the other instance remains empty?
To go further with my particular use case, how can I use a single spout and send data to different bolts based on a particular value of an input field (field1)?
It seems one way to solve this problem is to use direct grouping, where the source decides which component will receive the tuple:
This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
You can see an example of its use here:
collector.emitDirect(getWordCountIndex(word),new Values(word));
where getWordCountIndex returns the index of the component where this tuple will be processed.
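On the first part of the question: since a fields grouping sends equal field values to the same task, a stream in which "field1" never varies will indeed keep only one of the two TweetCounter tasks busy. The sketch below is a simplified plain-Java model of that behaviour (hash the grouping values, take the result modulo the task count). It is an assumption-level illustration, not Storm's actual code, and FieldsGroupingSketch and chooseTaskIndex are made-up names.

```java
import java.util.List;

// Simplified model of how a fields grouping maps a tuple to a task:
// hash the grouping field values, then reduce the hash modulo the task count.
final class FieldsGroupingSketch {
    static int chooseTaskIndex(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }
}
```

Calling chooseTaskIndex(List.of("same"), 2) any number of times returns the same index, so the other task receives nothing; only distinct "field1" values can spread the load.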
An alternative to using emitDirect as described in this answer is to implement your own stream grouping. The complexity is about the same, but it allows you to reuse grouping logic across multiple bolts.
For example, the shuffle grouping in Storm is implemented as a CustomStreamGrouping as follows:
public class ShuffleGrouping implements CustomStreamGrouping, Serializable {
    private ArrayList<List<Integer>> choices;
    private AtomicInteger current;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        choices = new ArrayList<List<Integer>>(targetTasks.size());
        for (Integer i : targetTasks) {
            choices.add(Arrays.asList(i));
        }
        current = new AtomicInteger(0);
        Collections.shuffle(choices, new Random());
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        int rightNow;
        int size = choices.size();
        while (true) {
            rightNow = current.incrementAndGet();
            if (rightNow < size) {
                return choices.get(rightNow);
            } else if (rightNow == size) {
                current.set(0);
                return choices.get(0);
            }
            // race condition with another thread, and we lost. try again
        }
    }
}
Storm will call prepare to tell you the task ids your grouping is responsible for, as well as some context on the topology. When Storm emits a tuple from a bolt/spout where you're using this grouping, Storm will call chooseTasks which lets you define which tasks the tuple should go to. You would then use the grouping when building your topology as shown:
TopologyBuilder tp = new TopologyBuilder();
tp.setSpout("spout", new MySpout(), 1);
tp.setBolt("bolt", new MyBolt())
.customGrouping("spout", new ShuffleGrouping());
Be aware that groupings need to be Serializable and thread safe.

Spring data Neo4j Affected row count

Considering a Spring Boot, neo4j environment with Spring-Data-neo4j-4 I want to make a delete and get an error message when it fails to delete.
My problem is that, since Repository.delete() returns void, I have no idea whether the delete modified anything or not.
First question: is there any way to get the number of rows affected by the last query? For example, in PL/SQL I could use SQL%ROWCOUNT.
So anyway, I tried the following code:
public void deletesomething(Long somethingId) {
somethingRepository.delete(getExistingsomething(somethingId).getId());
}
private something getExistingsomething(Long somethingId, int depth) {
return Optional.ofNullable(somethingRepository.findOne(somethingId, depth))
.orElseThrow(() -> new somethingNotFoundException(somethingId));
}
In the code above I query the database to check whether the value exists before I delete it.
Second question: do you recommend any different approach?
So now, just to add some complexity, I have a cluster database: db1 can only Create, Update and Delete, and db2 and db3 can only Read (this is ensured by the cluster sockets). db2 and db3 receive the data from db1 through the replication process.
From what I have seen so far, replication can take up to 90 s, which means that for up to 90 s the databases can be in different states.
Looking again to the code above:
public void deletesomething(Long somethingId) {
somethingRepository.delete(getExistingsomething(somethingId).getId());
}
in debug that means:
getExistingsomething(somethingId).getId() // will hit db2
somethingRepository.delete(...) // will hit db1
So if replication has not yet inserted the value into db2, this code will throw the exception.
Third question: without changing those sockets, is there any way for me to delete and give the correct response?
This is not currently supported in Spring Data Neo4j; if you wish, please open a feature request.
In the meantime, perhaps the easiest workaround is to drop down to the OGM level of abstraction.
Create a class that is injected with org.neo4j.ogm.session.Session
Use the following method on Session
Example: (example is in Kotlin, which was on hand)
fun deleteProfilesByColor(color : String)
{
var query = """
MATCH (n:Profile {color: {color}})
DETACH DELETE n;
"""
val params = mutableMapOf(
"color" to color
)
val result = session.query(query, params)
val statistics = result.queryStatistics() //Use these!
}

Performing bulk load in cassandra with map reduce

I haven't got much experience working with cassandra, so please excuse me if I have put in a wrong approach.
I am trying to do bulk load in cassandra with map reduce
Basically the word count example
Reference : http://henning.kropponline.de/2012/11/15/using-cassandra-hadoopbulkoutputformat/
I have put the simple Hadoop Wordcount Mapper Example and slightly modified the driver code and the reducer as per the above example.
I have successfully generated the output file as well. Now my question is: how do I perform the actual loading into Cassandra? Is there anything wrong with my approach?
Please advise.
This is a part of the driver code
Job job = new Job();
job.setJobName(getClass().getName());
job.setJarByClass(CassaWordCountJob.class);
Configuration conf = job.getConfiguration();
conf.set("cassandra.output.keyspace", "test");
conf.set("cassandra.output.columnfamily", "words");
conf.set("cassandra.output.partitioner.class", "org.apache.cassandra.dht.RandomPartitioner");
conf.set("cassandra.output.thrift.port","9160"); // default
conf.set("cassandra.output.thrift.address", "localhost");
conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "400");
job.setMapperClass(CassaWordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setReducerClass(CassaWordCountReducer.class);
FileOutputFormat.setOutputPath(job, new Path("/home/user/Desktop/test/cassandra"));
MultipleOutputs.addNamedOutput(job, "reducer", BulkOutputFormat.class, ByteBuffer.class, List.class);
return job.waitForCompletion(true) ? 0 : 1;
Mapper is the same as the normal wordcount mapper that just tokenizes and emits Word, 1
The reducer class is of the form
public class CassaWordCountReducer extends
Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
List<Mutation> columnsToAdd = new ArrayList<Mutation>();
Integer wordCount = 0;
for(IntWritable value : values) {
wordCount += value.get();
}
Column countCol = new Column(ByteBuffer.wrap("count".getBytes()));
countCol.setValue(ByteBuffer.wrap(wordCount.toString().getBytes()));
countCol.setTimestamp(new Date().getTime());
ColumnOrSuperColumn wordCosc = new ColumnOrSuperColumn();
wordCosc.setColumn(countCol);
Mutation countMut = new Mutation();
countMut.column_or_supercolumn = wordCosc;
columnsToAdd.add(countMut);
context.write(ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
}
}
To do bulk loads into Cassandra, I would advise looking at this article from DataStax. Basically you need to do 2 things for bulk loading:
1. Your output data won't natively fit into Cassandra; you need to transform it into SSTables.
2. Once you have your SSTables, you need to be able to stream them into Cassandra. Of course you don't simply want to copy each SSTable to every node; you want to copy only the relevant part of the data to each node.
In your case when using the BulkOutputFormat, it should do all that as it's using the sstableloader behind the scenes. I've never used it with MultipleOutputs, but it should work fine.
I think the error in your case is that you're not using MultipleOutputs correctly: you're still doing a context.write, when you should really be writing to your MultipleOutputs object. The way you're doing it right now, since you're writing to the regular Context, it will get picked up by the default output format of TextOutputFormat and not the one you defined in your MultipleOutputs. More information on how to use the MultipleOutputs in your reducer here.
Once you write to the correct output format of BulkOutputFormat like you defined, your SSTables should get created and streamed to Cassandra from each node in your cluster - you shouldn't need any extra step, the output format will take care of it for you.
Also I would advise looking at this post, where they also explain how to use BulkOutputFormat, but they're using a ConfigHelper which you might want to take a look at to more easily configure your Cassandra endpoint.

Hbase - Hadoop : TableInputFormat extension

I am using an HBase table as my input. I have pre-processed its keys so that each one consists of a number concatenated with the respective row ID, and I want to be sure that all rows whose keys start with the same number are processed by the same mapper in an M/R job. I am aware that this could be achieved by extending TableInputFormat, and I have seen one or two posts about extending this class, but I am looking for the most efficient way to do this in particular.
If anyone has any ideas, please let me know.
You can use a PrefixFilter in your scan.
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
And parallelize the launch of your different mappers using Future
final Future<Boolean> newJobFuture = executor.submit(new Callable<Boolean>() {
@Override
public Boolean call() throws Exception {
Job mapReduceJob = MyJobBuilder.createJob(args, thePrefix,
...);
return mapReduceJob.waitForCompletion(true);
}
});
But I believe what you are really looking for is closer to what a reducer does.
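To illustrate the reducer idea: if the mapper emits the numeric prefix as its output key, the shuffle brings every row with the same prefix to the same reducer. Here is a minimal plain-Java model of that grouping, with no Hadoop or HBase involved. The key layout "<number>_<rowId>" and the "_" separator are assumptions made for the example (the question does not say how the number and row ID are joined), and PrefixGroupingSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrefixGroupingSketch {
    // Assumed key layout: "<number>_<rowId>"; the "_" separator is hypothetical.
    static String prefixOf(String rowKey) {
        int sep = rowKey.indexOf('_');
        if (sep < 0) {
            throw new IllegalArgumentException("No prefix in key: " + rowKey);
        }
        return rowKey.substring(0, sep);
    }

    // Models the shuffle: the mapper would emit (prefix, rowKey), and every row
    // with the same prefix ends up in the same group, as it would at one reducer.
    static Map<String, List<String>> groupByPrefix(List<String> rowKeys) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String rowKey : rowKeys) {
            groups.computeIfAbsent(prefixOf(rowKey), k -> new ArrayList<>()).add(rowKey);
        }
        return groups;
    }
}
```

In a real job the grouping is done by the framework; only the prefixOf step would live in the mapper.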
