Java 8 parallel stream with ForkJoinPool and ThreadLocal

We are using a Java 8 parallel stream to process a task, and we are submitting the task through ForkJoinPool#submit. We are not using the JVM-wide ForkJoinPool.commonPool; instead we create our own custom pool to specify the parallelism, and we store it in a static variable.
We have a validation framework in which we subject a list of tables to a list of Validators, and we submit this job through the custom ForkJoinPool as follows:
static ForkJoinPool forkJoinPool = new ForkJoinPool(4);

List<Table> tables = tableDAO.findAll();
ModelValidator<Table, ValidationResult> validator = ValidatorFactory
        .getInstance().getTableValidator();

List<ValidationResult> result = forkJoinPool.submit(
        () -> tables.stream()
                .parallel()
                .map(validator)
                // the lambda parameter must not shadow the local variable `result`
                .filter(r -> r.getValidationMessages().size() > 0)
                .collect(Collectors.toList())).get();
The problem we are having is that the downstream components, the individual validators that run on separate threads from our static ForkJoinPool, rely on a tenant_id, which is different for every request and is stored in an InheritableThreadLocal variable. Because the ForkJoinPool is static, its pooled threads inherit the parent thread's value only once, when they are first created; they never learn the tenant_id of the current request. So on subsequent executions the pooled threads keep using the old tenant_id.
I tried creating a custom ForkJoinPool, specifying a ForkJoinWorkerThreadFactory in the constructor, and overriding the onStart method to feed in the new tenant_id. But that doesn't work, since onStart is called only once at thread-creation time, not on each execution.
It seems we need something like ThreadPoolExecutor#beforeExecute, which is not available on ForkJoinPool. So what alternative do we have if we want to pass the current thread-local value to the statically pooled threads?
One workaround would be to create a ForkJoinPool per request rather than making it static, but we would like to avoid that because of the cost of thread creation.
What alternatives do we have?

I found the following solution, which works without changing any underlying code. Basically, the map method takes a functional interface, which I represent as a lambda expression. This expression adds a preExecution hook to set the new tenantId in the current ThreadLocal, and cleans it up in postExecution.
forkJoinPool.submit(() -> tables.stream()
        .parallel()
        .map(item -> {
            preExecution(tenantId);
            try {
                return validator.apply(item);
            } finally {
                postExecution();
            }
        })
        .filter(validationResult ->
                validationResult.getValidationMessages().size() > 0)
        .collect(Collectors.toList())).get();
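For reference, the preExecution and postExecution hooks are not shown above. A minimal sketch of what they could look like, assuming the tenant id lives in a static InheritableThreadLocal (the class and method names here are illustrative, not from the original code):

public final class TenantContext {
    private static final InheritableThreadLocal<String> TENANT = new InheritableThreadLocal<>();

    // Hypothetical helper matching the preExecution(tenantId) call in the lambda above.
    public static void preExecution(String tenantId) {
        TENANT.set(tenantId); // overwrite whatever value the pooled thread inherited
    }

    // Hypothetical helper matching the postExecution() call in the lambda above.
    public static void postExecution() {
        TENANT.remove(); // avoid leaking the tenant id to the next task on this thread
    }

    public static String currentTenant() {
        return TENANT.get();
    }
}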

The best option in my view would be to get rid of the thread local and pass it as an argument instead. I understand that this could be a massive undertaking though. Another option would be to use a wrapper.
Assuming that your validator has a validate method, you could do something like:
public class WrappingModelValidator implements ModelValidator<Table, ValidationResult> {
    private final ModelValidator<Table, ValidationResult> v;
    private final String tenantId;

    public WrappingModelValidator(ModelValidator<Table, ValidationResult> v, String tenantId) {
        this.v = v;
        this.tenantId = tenantId;
    }

    public ValidationResult validate(Table t) {
        String oldValue = YourThreadLocal.get();
        YourThreadLocal.set(tenantId);
        try {
            return v.validate(t);
        } finally {
            YourThreadLocal.set(oldValue);
        }
    }
}
Then you simply wrap your old validator and it will set the thread local on entry and restore it when done.
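Usage is then a small change at the call site; a sketch, assuming the factory from the question:

// Wrap the original validator once per request, with that request's tenant id.
ModelValidator<Table, ValidationResult> validator = new WrappingModelValidator(
        ValidatorFactory.getInstance().getTableValidator(), tenantId);
// ... then use `validator` in the parallel stream exactly as before.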

Related

Reactor Flux conditional emit

Is it possible to allow emitting values from a Flux conditionally, based on a global boolean variable?
I'm working with Flux delayUntil(...) but am not able to fully grasp its functionality, or my assumptions are wrong.
I have a global AtomicBoolean that represents the availability of a downstream connection, and I only want the upstream Flux to emit if the downstream is ready to process.
To represent the scenario, I created a (non-working) test sample:
// Randomly generates a boolean value every 5 seconds
private Flux<Boolean> signalGenerator() {
    return Flux.range(1, Integer.MAX_VALUE)
            .delayElements(Duration.ofMillis(5000))
            .map(integer -> new Random().nextBoolean());
}
and
Flux.range(1, Integer.MAX_VALUE)
        .delayElements(Duration.ofMillis(1000))
        .delayUntil(evt -> signalGenerator()) // ?? only proceed when signalGenerator returns true
        .subscribe(System.out::println);
I have another scenario where a downstream process can accept only x messages a second. In the current non-reactive implementation we have a Semaphore of x permits and the thread blocks if no more permits are available, with the permits resetting every second.
In both scenarios I want the upstream Flux to emit only when there is demand from the downstream process, and I do not want to buffer.
You might consider using Mono.fromRunnable() as an input to delayUntil(), like below.
Helper class:
public class FluxCondition {
    CountDownLatch latch = new CountDownLatch(10); // it depends, might be managed somehow

    public Mono<Void> lock() {
        // Block this element until release() is called; fromRunnable completes when the Runnable returns.
        return Mono.fromRunnable(() -> {
            try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
    }

    public void release() { latch.countDown(); }
}
Usage:
FluxCondition delayCondition = new FluxCondition();
Flux.range(1, 10).delayUntil(o -> delayCondition.lock()).subscribe();
// ...
delayCondition.release(); // must be called for each element
I guess there might be a better solution using sink.emitNext, but this might also require a condition variable for controlling the flow of the Flux.
To my understanding, in reactive programming your data should be considered at every operator step, so it might be better to design your consumer as a reactive processor. In my case I had no choice and followed the approach described above.
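For reference, a minimal sketch of the sink-based idea alluded to above, assuming the Sinks API from Reactor 3.4+ and a global readiness flag as in the question (the names and the gating logic are illustrative; note the unicast sink still buffers under backpressure):

// Producer side: emit only while the downstream is ready.
Sinks.Many<Integer> sink = Sinks.many().unicast().onBackpressureBuffer();
AtomicBoolean downstreamReady = new AtomicBoolean(false);

void tryPublish(int value) {
    if (downstreamReady.get()) {
        sink.tryEmitNext(value); // emit-result handling omitted for brevity
    }
}

// Consumer side:
sink.asFlux().subscribe(System.out::println);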

Kafka Streams - The state store may have migrated to another instance

I'm writing a basic application to test the Interactive Queries feature of Kafka Streams. Here is the code:
public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();
    KeyValueBytesStoreSupplier waypointsStoreSupplier = Stores.persistentKeyValueStore("test-store");
    StoreBuilder waypointsStoreBuilder = Stores.keyValueStoreBuilder(waypointsStoreSupplier, Serdes.ByteArray(), Serdes.Integer());

    final KStream<byte[], byte[]> waypointsStream = builder.stream("sample1");
    final KStream<byte[], TruckDriverWaypoint> waypointsDeserialized = waypointsStream
            .mapValues(CustomSerdes::deserializeTruckDriverWaypoint)
            .filter((k, v) -> v.isPresent())
            .mapValues(Optional::get);

    waypointsDeserialized.groupByKey().aggregate(
            () -> 1,
            (aggKey, newWaypoint, aggValue) -> aggValue + 1,
            Materialized.<byte[], Integer, KeyValueStore<Bytes, byte[]>>as("test-store")
                    .withKeySerde(Serdes.ByteArray())
                    .withValueSerde(Serdes.Integer()));

    final KafkaStreams streams = new KafkaStreams(builder.build(), new StreamsConfig(createStreamsProperties()));
    streams.cleanUp();
    streams.start();

    ReadOnlyKeyValueStore<byte[], Integer> keyValueStore = streams.store("test-store", QueryableStoreTypes.keyValueStore());
    KeyValueIterator<byte[], Integer> range = keyValueStore.all();
    while (range.hasNext()) {
        KeyValue<byte[], Integer> next = range.next();
        System.out.println(next.value);
    }

    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}

protected static Properties createStreamsProperties() {
    final Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "random167");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "client-id");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    streamsConfiguration.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, Serdes.Integer().getClass().getName());
    // streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);
    return streamsConfiguration;
}
So my problem is that every time I run this I get the same error:
Exception in thread "main" org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, test-store, may have migrated to another instance.
I'm running only 1 instance of the application, and the topic I'm consuming from has only 1 partition.
Any idea what I'm doing wrong?
Looks like you have a race condition. The Kafka Streams javadoc for KafkaStreams::start() says:
Start the KafkaStreams instance by starting all its threads. This function is expected to be called only once during the life cycle of the client.
Because threads are started in the background, this method does not block.
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html
You're calling streams.store() immediately after streams.start(), but I'd wager that you're in a state where the store hasn't fully initialized yet.
Since this code appears to be just for testing, add a Thread.sleep(5000) or something in there and give it a go. (This is not a solution for production.) Depending on your input rate into the topic, that will probably give the store a bit of time to start filling up with events so that your KeyValueIterator actually has something to process/print.
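A slightly less crude variant of the same test-only idea is to wait for the RUNNING state instead of sleeping for a fixed interval. A sketch, assuming the listener is registered before start() is called:

// Wait for the streams instance to reach RUNNING before querying the store.
CountDownLatch started = new CountDownLatch(1);
streams.setStateListener((newState, oldState) -> {
    if (newState == KafkaStreams.State.RUNNING) {
        started.countDown();
    }
});
streams.start();
started.await(30, TimeUnit.SECONDS); // give up after a timeout rather than hanging forever
// streams.store(...) is now much more likely to succeed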
Probably not applicable to OP but might help others:
In trying to retrieve a KTable's store, make sure the KTable's topic exists first or you'll get this exception.
I had failed to call the StoreBuilder before consuming the store.
Typically this happens for two reasons:
The local KafkaStreams instance is not yet ready (i.e., not yet in runtime state RUNNING; see Run-time Status Information), and thus its local state stores cannot be queried yet.
The local KafkaStreams instance is ready (e.g. in runtime state RUNNING), but the particular state store was just migrated to another instance behind the scenes. This may notably happen during the startup phase of a distributed application or when you are adding/removing application instances.
https://docs.confluent.io/platform/current/streams/faq.html#handling-invalidstatestoreexception-the-state-store-may-have-migrated-to-another-instance
The simplest approach is to guard against InvalidStateStoreException when calling KafkaStreams#store():
// Example: Wait until the store of type T is queryable. When it is, return a reference to the store.
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
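A usage sketch for the helper above, matching the store from the question:

ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        waitUntilStoreIsQueryable("test-store",
                QueryableStoreTypes.<byte[], Integer>keyValueStore(), streams);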

Passing data to dependencies registered with Execution Context Scope lifetime in Simple Injector

Is there a way to pass data to dependencies registered with either Execution Context Scope or Lifetime Scope in Simple Injector?
One of my dependencies requires a piece of data in order to be constructed in the dependency chain. During HTTP and WCF requests, this data is easy to get to. For HTTP requests, the data is always present in either the query string or as a Request.Form parameter (and thus is available from HttpContext.Current). For WCF requests, the data is always present in the OperationContext.Current.RequestContext.RequestMessage XML, and can be parsed out. I have many command handler implementations that depend on an interface implementation that needs this piece of data, and they work great during HTTP and WCF scoped lifestyles.
Now I would like to be able to execute one or more of these commands using the Task Parallel Library so that it will execute in a separate thread. It is not feasible to move the piece of data out into a configuration file, class, or any other static artifact. It must initially be passed to the application either via HTTP or WCF.
I know how to create a hybrid lifestyle using Simple Injector, and already have one set up as hybrid HTTP / WCF / Execution Context Scope (the command interfaces are async and return Task instead of void). I also know how to create a command handler decorator that will start a new Execution Context Scope when needed. The problem is, I don't know how or where (or whether I can) "save" this piece of data so that it is available when the dependency chain needs it to construct one of the dependencies.
Is it possible? If so, how?
Update
Currently, I have an interface called IProvideHostWebUri with two implementations: HttpHostWebUriProvider and WcfHostWebUriProvider. The interface and registration look like this:
public interface IProvideHostWebUri
{
    Uri HostWebUri { get; }
}
container.Register<IProvideHostWebUri>(() =>
{
    if (HttpContext.Current != null)
        return container.GetInstance<HttpHostWebUriProvider>();
    if (OperationContext.Current != null)
        return container.GetInstance<WcfHostWebUriProvider>();
    throw new NotSupportedException(
        "The IProvideHostWebUri service is currently only supported for HTTP and WCF requests.");
}, scopedLifestyle); // scopedLifestyle is the hybrid mentioned previously
So ultimately unless I gut this approach, my goal would be to create a third implementation of this interface which would then depend on some kind of context to obtain the Uri (which is just constructed from a string in the other 2 implementations).
@Steven's answer seems to be what I am looking for, but I am not sure how to make the ITenantContext implementation immutable and thread-safe. I don't think it will need to be made disposable, since it just contains a Uri value.
So what you are basically saying is that:
You have an initial request that contains some contextual information captured in the request 'header'.
During this request you want to kick off a background operation (on a different thread).
The contextual information from the initial request should stay available when running in the background thread.
The short answer is that Simple Injector does not contain anything that allows you to do so. The solution is in creating a piece of infrastructure that allows moving this contextual information along.
Say for instance you are processing command handlers (wild guess here ;-)), you can specify a decorator as follows:
public class BackgroundProcessingCommandHandlerDecorator<T> : ICommandHandler<T>
{
    private readonly ITenantContext tenantContext;
    private readonly Container container;
    private readonly Func<ICommandHandler<T>> decorateeFactory;

    public BackgroundProcessingCommandHandlerDecorator(ITenantContext tenantContext,
        Container container, Func<ICommandHandler<T>> decorateeFactory) {
        this.tenantContext = tenantContext;
        this.container = container;
        this.decorateeFactory = decorateeFactory;
    }

    public void Handle(T command) {
        // Capture the contextual info in a local variable.
        // NOTE: This object must be immutable and thread-safe.
        var tenant = this.tenantContext.CurrentTenant;

        // Kick off a new background operation.
        Task.Factory.StartNew(() => {
            using (container.BeginExecutionContextScope()) {
                // Load a service that allows setting contextual information.
                var context = this.container.GetInstance<ITenantContextApplier>();
                // Set the context for this thread, before resolving the handler.
                context.SetCurrentTenant(tenant);
                // Resolve the handler.
                var decoratee = this.decorateeFactory.Invoke();
                // And execute it.
                decoratee.Handle(command);
            }
        });
    }
}
Note that in the example I make use of an imaginary ITenantContext abstraction, assuming that you need to supply the commands with information about the current tenant, but any other sort of contextual information will obviously do as well.
The decorator is a small piece of infrastructure that allows you to process commands in the background and it is its responsibility to make sure all the required contextual information is moved to the background thread as well.
To be able to do this, the contextual information is captured and used as a closure in the background thread. I created an extra abstraction for this, namely ITenantContextApplier. Do note that the tenant context implementation can implement both the ITenantContext and the ITenantContextApplier interface. If however you define the ITenantContextApplier in your composition root, it will be impossible for the application to change the context, since it does not have a dependency on ITenantContextApplier.
Here's an example:
// Base library
public interface ITenantContext { }

// Business Layer
public class SomeCommandHandler : ICommandHandler<Some> {
    public SomeCommandHandler(ITenantContext context) { ... }
}

// Composition Root
public static class CompositionRoot {
    // Make the ITenantContextApplier private so nobody can see it.
    // Do note that this is optional; there's no harm in making it public.
    private interface ITenantContextApplier {
        void SetCurrentTenant(Tenant tenant);
    }

    private class AspNetTenantContext : ITenantContextApplier, ITenantContext {
        // Implement both interfaces
    }

    private class BackgroundProcessingCommandHandlerDecorator<T> { ... }

    public static Container Bootstrap(Container container) {
        container.RegisterPerWebRequest<ITenantContext, AspNetTenantContext>();
        container.Register<ITenantContextApplier>(() =>
            container.GetInstance<ITenantContext>() as ITenantContextApplier);
        container.RegisterDecorator(typeof(ICommandHandler<>),
            typeof(BackgroundProcessingCommandHandlerDecorator<>));
        return container;
    }
}
A different approach would be to just make the complete ITenantContext available to the background thread, but to be able to pull this off, you need to make sure that:
The implementation is immutable and thus thread-safe.
The implementation doesn't require disposing, because it will typically be disposed when the original request ends.

What's the best way to pass a huge collection to a Spring Batch Step?

Use case:
A one-time read of data set X (from database) into a Collection C. [Collection size could be say 5000]
Use Collection C to process/enrich items in a Spring Batch Step (say enrichStep)
If C is much greater than what can be passed via ExecutionContext, how can we make it available in the ItemProcessor of the enrichStep?
In your enrichStep, add a StepExecutionListener.beforeStep and load your huge collection into a HugeCollectionBeanHolder bean.
This way you will load the collection only once (when the step starts or restarts) and without persisting it into the execution context.
In your enrich processor, wire in the HugeCollectionBeanHolder to access the huge collection.
class HugeCollectionBeanHolder {
    private Collection<Item> hugeCollection;

    void setHugeCollection(Collection<Item> c) { this.hugeCollection = c; }
    Collection<Item> getHugeCollection() { return this.hugeCollection; }
}

class MyProcessor implements ItemProcessor<Input, Output> {
    private HugeCollectionBeanHolder hcbh;

    void setHugeCollectionBeanHolder(HugeCollectionBeanHolder bean) { this.hcbh = bean; }
    // other methods...
}
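For completeness, a minimal sketch of the loading side, assuming a DAO that reads data set X (the listener and DAO names are illustrative):

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;

class CollectionLoadingListener implements StepExecutionListener {
    private final HugeCollectionBeanHolder holder;
    private final ItemDAO itemDAO; // hypothetical DAO for data set X

    CollectionLoadingListener(HugeCollectionBeanHolder holder, ItemDAO itemDAO) {
        this.holder = holder;
        this.itemDAO = itemDAO;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // Runs once per step (re)start; nothing is persisted to the execution context.
        holder.setHugeCollection(itemDAO.findAll());
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return stepExecution.getExitStatus();
    }
}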
You can also look at Spring Batch: what is the best way to use, the data retrieved in one TaskletStep, in the processing of another step

Non-Blocking Endpoint: Returning an operation ID to the caller - Would like to get your opinion on my implementation?

Boot Pros,
I recently started programming in Spring Boot and I stumbled upon a question that I would like to get your opinion on.
What I am trying to achieve:
I created a Controller that exposes a GET endpoint, named nonBlockingEndpoint. This nonBlockingEndpoint executes a pretty long operation that is resource-heavy and can run between 20 and 40 seconds. (In the attached code it is mocked by a Thread.sleep().)
Whenever the nonBlockingEndpoint is called, the Spring application should register that call and immediately return an operation ID to the caller.
The caller can then use this ID to query the status of this operation on another endpoint, queryOpStatus. At the beginning it will be "started", and once the controller is done serving the request it will be set to a code such as SERVICE_OK. The caller then knows that his request was successfully completed on the server.
The solution that I found:
I have the following controller (note that it is explicitly not annotated with @Async).
It uses an APIOperationsManager to register that a new operation was started.
I use the CompletableFuture construct to run the long running code as a new asynchronous task, using CompletableFuture.supplyAsync(() -> {...}).
I immediately return a response to the caller, telling it that the operation is in progress.
Once the async task has finished, I use cf.thenRun() to update the operation status via the APIOperationsManager.
Here is the code:
@GetMapping(path = "/nonBlockingEndpoint")
public @ResponseBody ResponseOperation nonBlocking() {
    // Register a new operation
    APIOperationsManager apiOpsManager = APIOperationsManager.getInstance();
    final int operationID = apiOpsManager.registerNewOperation(Constants.OpStatus.PROCESSING);

    ResponseOperation response = new ResponseOperation();
    response.setMessage("Triggered non-blocking call, use the operation id to check status");
    response.setOperationID(operationID);
    response.setOpRes(Constants.OpStatus.PROCESSING);

    CompletableFuture<Boolean> cf = CompletableFuture.supplyAsync(() -> {
        try {
            // Here we will do the actual long running work
            Thread.sleep(10000L);
        } catch (InterruptedException e) {}
        // whatever the return value was
        return true;
    });

    cf.thenRun(() -> {
        // We are done with the super long process, so update our Operations Manager
        APIOperationsManager a = APIOperationsManager.getInstance();
        boolean asyncSuccess = false;
        try { asyncSuccess = cf.get(); }
        catch (Exception e) {}
        if (asyncSuccess) {
            a.updateOperationStatus(operationID, Constants.OpStatus.OK);
            a.updateOperationMessage(operationID, "success: The long running process has finished and this is your result: SOME RESULT");
        } else {
            a.updateOperationStatus(operationID, Constants.OpStatus.INTERNAL_ERROR);
            a.updateOperationMessage(operationID, "error: The long running process has failed.");
        }
    });

    return response;
}
Here is also the APIOperationsManager.java for completeness:
public class APIOperationsManager {
    private static APIOperationsManager instance = null;
    private Vector<Operation> operations;
    private int currentOperationId;

    private static final Logger log = LoggerFactory.getLogger(Application.class);

    protected APIOperationsManager() {}

    public static APIOperationsManager getInstance() {
        if (instance == null) {
            synchronized (APIOperationsManager.class) {
                if (instance == null) {
                    instance = new APIOperationsManager();
                    instance.operations = new Vector<Operation>();
                    instance.currentOperationId = 1;
                }
            }
        }
        return instance;
    }

    public synchronized int registerNewOperation(OpStatus status) {
        cleanOperationsList();
        currentOperationId = currentOperationId + 1;
        Operation newOperation = new Operation(currentOperationId, status);
        operations.add(newOperation);
        log.info("Registered new Operation to watch: " + newOperation.toString());
        return newOperation.getId();
    }

    public synchronized Operation getOperation(int id) {
        for (Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
            Operation op = iterator.next();
            if (op.getId() == id) {
                return op;
            }
        }
        Operation notFound = new Operation(-1, OpStatus.INTERNAL_ERROR);
        notFound.setCrated(null);
        return notFound;
    }

    public synchronized void updateOperationStatus(int id, OpStatus newStatus) {
        for (Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
            Operation op = iterator.next();
            if (op.getId() == id) {
                op.setStatus(newStatus);
                log.info("Updated Operation status: " + op.toString());
                break;
            }
        }
    }

    public synchronized void updateOperationMessage(int id, String message) {
        for (Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
            Operation op = iterator.next();
            if (op.getId() == id) {
                op.setMessage(message);
                log.info("Updated Operation message: " + op.toString());
                break;
            }
        }
    }

    private synchronized void cleanOperationsList() {
        Date now = new Date();
        for (Iterator<Operation> iterator = operations.iterator(); iterator.hasNext();) {
            Operation op = iterator.next();
            if ((now.getTime() - op.getCrated().getTime()) >= Constants.MIN_HOLD_DURATION_OPERATIONS) {
                log.info("Removed operation from watchlist: " + op.toString());
                iterator.remove();
            }
        }
    }
}
The questions that I have:
Is this concept a valid one that also scales? What could be improved?
Will I run into concurrency issues / race conditions?
Is there a better way to achieve the same in Spring Boot that I just haven't found yet? (Maybe with the @Async annotation?)
I would be very happy to get your feedback.
Thank you so much,
Peter P
It is a valid pattern to submit a long running task with one request, returning an id that allows the client to ask for the result later.
But there are some things I would suggest reconsidering:
Do not use an Integer as the id, as it allows an attacker to guess ids and fetch the results for those ids. Instead use a random UUID.
If you need to restart your application, all ids and their results will be lost. You should persist them to a database.
Your solution will not work in a cluster with many instances of your application, as each instance would only know its 'own' ids and results. This could also be solved by persisting them to a database or Redis store.
The way you are using CompletableFuture gives you no control over the number of threads used for the asynchronous operation. It is possible to do this with standard Java, but I would suggest using Spring to configure the thread pool.
Annotating the controller method with @Async is not an option; that simply does not work. Instead, put all asynchronous operations into a simple service and annotate that with @Async (a configuration sketch follows after this list). This has some advantages:
You can use this service also synchronously, which makes testing a lot easier
You can configure the thread pool with Spring
The /nonBlockingEndpoint should not return just the id, but a complete link to queryOpStatus, including the id. The client can then use this link directly without any additional information.
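A minimal configuration sketch for the thread pool and @Async service mentioned above (bean name, pool sizes, and the service itself are illustrative assumptions, not prescriptive):

import java.util.concurrent.Executor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.stereotype.Service;

@Configuration
@EnableAsync
class AsyncConfig {
    @Bean(name = "operationExecutor")
    public Executor operationExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);    // tune for your workload
        executor.setMaxPoolSize(8);
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("op-");
        executor.initialize();
        return executor;
    }
}

@Service
class LongRunningService {
    @Async("operationExecutor")
    public void runOperation(int operationId) {
        // do the long running work here, then update the operation status
    }
}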
Additionally there are some low level implementation issues which you may also want to change :
Do not use Vector; it synchronizes on every operation. Use a List instead. Iterating over a List is also much easier: you can use for-loops or streams.
If you need to look up a value, do not iterate over a Vector or List; use a Map instead (see the sketch after this list).
APIOperationsManager is a singleton. That makes no sense in a Spring application. Make it a normal POJO, create a bean of it, and have it autowired into the controller. Spring beans are singletons by default.
You should avoid doing complicated operations in a controller method. Instead, move everything into a service (which may be annotated with @Async). This makes testing easier, as you can test the service without a web context.
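A sketch combining the UUID and Map suggestions (class and method names are illustrative):

import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class OperationRegistry {
    private final ConcurrentMap<UUID, Operation> operations = new ConcurrentHashMap<>();

    public UUID register(Operation op) {
        UUID id = UUID.randomUUID(); // not guessable, unlike a sequential counter
        operations.put(id, op);
        return id;
    }

    public Operation get(UUID id) {
        return operations.get(id); // O(1) lookup instead of iterating a Vector
    }
}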
Hope this helps.
Do I need to make database access transactional?
As long as you write/update only one row, there is no need to make this transactional as this is indeed 'atomic'.
If you write/update many rows at once, you should make it transactional to guarantee that either all rows are updated or none.
However, if two operations (maybe from two clients) update the same row, the last one will always win.
