EHCACHE in-memory cache hit count is showing twice for every hit in Camel Fuse - ehcache

The in-memory hit count in EHCACHE is showing twice the expected value. Is this expected behavior? When I hit the cache for the first time, it correctly shows a cacheMiss count of 1, but for the subsequent hits, which are read from the cache, the hit count always increases by 2 for every hit.
I first try to GET the data from the cache and, if it is not found, ADD it to the cache. In this process the hit count is twice the miss count.
<route handleFault="true" streamCache="true" id="MainRoute">
<from uri="direct:start"/>
<setHeader headerName="CamelCacheOperation">
<constant>CamelCacheGet</constant>
</setHeader>
<setHeader headerName="CamelCacheKey">
<simple>${property.employeeId}</simple>
</setHeader>
<to uri="cache://EmployeeCache"/>
<choice>
<when>
<simple>${header.CamelCacheElementWasFound} == null</simple>
<to uri="direct:AddDataRoute"/>
<setHeader headerName="CamelCacheOperation">
<constant>CamelCacheAdd</constant>
</setHeader>
<setHeader headerName="CamelCacheKey">
<simple>${property.employeeId}</simple>
</setHeader>
<to uri="cache://EmployeeCache"/>
</when>
</choice>
</route>
=====
Fuse Ehcache statistics for the above scenario

I suggest looking into the code accessing the cache. There is no known bug in the latest releases where each cache hit is counted twice.

Related

Determine the end of a cyclic workflow in spring integration (inbound-channel => service-activator)

We have the following simple int-jpa based workflow:
[inbound-channel-adapter] -> [service-activator]
The config is like this:
<int:channel id="inChannel"> <int:queue/> </int:channel>
<int:channel id="outChannel"> <int:queue/> </int:channel>
<int-jpa:inbound-channel-adapter id="inChannelAdapter" channel="inChannel"
jpa-query="SOME_COMPLEX_POLLING_QUERY"
max-results="2">
<int:poller max-messages-per-poll="2" fixed-rate="20" >
<int:advice-chain synchronization-factory="txSyncFactory" >
<tx:advice transaction-manager="transactionManager" >
<tx:attributes>
<tx:method name="*" timeout="30000" />
</tx:attributes>
</tx:advice>
<int:ref bean="pollerAdvice"/>
</int:advice-chain>
</int:poller>
</int-jpa:inbound-channel-adapter>
<int:service-activator input-channel="inChannel" ref="myActivator"
method="pollEntry" output-channel="outChannel" />
<bean id="myActivator" class="com.company.myActivator" />
<bean id="pollerAdvice" class="com.company.myPollerAdvice" />
The entry point for processing is a constantly growing table against which SOME_COMPLEX_POLLING_QUERY is run. The current flow is:
[Thread-1] SOME_COMPLEX_POLLING_QUERY only returns entries that have busy set to false (we set busy to true as soon as polling is done, using txSyncFactory).
[Thread-2] These entries pass through myActivator, where processing might take anywhere from 1 minute to 30 minutes.
[Thread-2] Once the processing is done, we set busy back from true to false.
Problem: We need to trigger a notification event when the processing of all the entries that were present in the table is done.
Approach tried: We used the afterReturning of pollerAdvice to find out whether SOME_COMPLEX_POLLING_QUERY returned any results. However, this method starts returning "No Entries" well before Thread-2 is done processing all the entries.
Note:
The same entries will be processed again after 24 hours, but by then the table will have more entries.
We are not using an outbound-channel-adapter, since we don't have any requirement for it. However, we are open to using it if it is part of the proposed solution.
Not sure if this will work for you, but since you still need to hold the notification until Thread-2 is done, I would suggest having an AtomicBoolean bean. In the afterReturning() you mentioned, when no data is polled from the DB, you just change the state of the AtomicBoolean to true. When Thread-2 finishes its work, it can pass through a <filter> that checks the state of the AtomicBoolean, and only then use an <int-event:outbound-channel-adapter> to emit a notification event.
So the final decision whether to emit the event is made on Thread-2, not in the polling channel adapter.
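A minimal plain-Java sketch of that flag idea follows. It is not Spring Integration code: the class and method names are hypothetical, and the in-flight counter is an assumption added on top of the answer so that the notification is not emitted while other entries are still being processed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper bean; none of these names come from Spring Integration. */
class CycleCompletionTracker {
    // Set to true when the polling query returns no rows (the AtomicBoolean from the answer).
    private final AtomicBoolean pollerDrained = new AtomicBoolean(false);
    // Assumption beyond the answer: number of polled entries not yet fully processed.
    private final AtomicInteger inFlight = new AtomicInteger(0);
    // Ensures that one polling cycle produces at most one notification.
    private final AtomicBoolean notificationPending = new AtomicBoolean(false);

    /** Intended to be called from afterReturning() of the poller advice. */
    boolean onPollResult(int rowsReturned) {
        if (rowsReturned > 0) {
            inFlight.addAndGet(rowsReturned);
            pollerDrained.set(false);
            notificationPending.set(true);
            return false;
        }
        pollerDrained.set(true);
        return tryComplete();
    }

    /** Intended to be called on the processing thread after an entry is finished. */
    boolean onEntryProcessed() {
        inFlight.decrementAndGet();
        return tryComplete();
    }

    // True exactly once per cycle: poller drained, nothing in flight, notification still pending.
    private boolean tryComplete() {
        return pollerDrained.get()
                && inFlight.get() == 0
                && notificationPending.compareAndSet(true, false);
    }
}
```

In the flow above, a <filter> after the service activator could call onEntryProcessed() and let the message through to the event adapter only when it returns true.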

records being processed twice

We have a Spring Batch job which reads a bunch of data in the reader, processes it and writes it. It all happens as a batch.
I noticed that the processor and writer are going over the same data twice, once as a batch and once as individual records.
For example: the reader reads 1000 records, sends 1000 records to the processor, which sends 1000 records to the writer.
After this the records get processed again, individually, but only the processor and writer are called.
We have log statements in the reader, processor, and writer, and I can see this in the logs.
Is there any condition that can cause the records to be processed individually after they have been processed as a list?
<batch:job id="feeder-job">
<batch:step id="regular">
<tasklet>
<chunk reader="feeder-reader" processor="feeder-processor"
writer="feeder-composite-writer" commit-interval="#{stepExecutionContext['query-fetch-size']}"
skip-limit="1000000">
<skippable-exception-classes>
<include class="java.lang.Exception" />
<exclude class="org.apache.solr.client.solrj.SolrServerException"/>
<exclude class="org.apache.solr.common.SolrException"/>
<exclude class="com.batch.feeder.record.RecordFinderException"/>
</skippable-exception-classes>
</chunk>
<listeners>
<listener ref="feeder-reader" />
</listeners>
</tasklet>
</batch:step>
</batch:job>
You should read up on a feature before using it. The second pass over the same records happens only after an error occurs.
Basically, you have defined a chunk / step which is fault tolerant to certain specified exceptions; see Configuring Skip Logic.
Your step will not fail as long as the total exception count stays below skip-limit, but on an error the chunk's items are processed a second time, one by one, and the bad records are skipped in that second pass.
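The behaviour described above can be illustrated with a small plain-Java simulation. This is a simplified sketch, not Spring Batch's actual implementation: the class and method names are hypothetical, and transactions, rollback and the skip-limit check are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical simulation of a fault-tolerant chunk; not Spring Batch code. */
class ChunkScanSimulation {

    /** Processes and writes one chunk, returning the items that ended up skipped. */
    static <I, O> List<I> runChunk(List<I> chunk,
                                   Function<I, O> processor,
                                   Consumer<List<O>> writer) {
        List<I> skipped = new ArrayList<>();
        try {
            // First pass: the whole chunk goes through the processor, then the writer.
            List<O> processed = new ArrayList<>();
            for (I item : chunk) {
                processed.add(processor.apply(item));
            }
            writer.accept(processed);
        } catch (RuntimeException chunkFailure) {
            // Second pass: the same items again, one by one, to find the bad record.
            for (I item : chunk) {
                try {
                    writer.accept(Collections.singletonList(processor.apply(item)));
                } catch (RuntimeException itemFailure) {
                    skipped.add(item); // only the failing item is skipped
                }
            }
        }
        return skipped;
    }
}
```

With a chunk of three items where the writer rejects one of them, the processor in this sketch is invoked six times in total, which matches the "once as a batch and once individually" pattern seen in the logs.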

Batch processing in jdbc gateway

my setup (simplified for clarity) is the following:
<int:inbound-channel-adapter channel="in" expression="0">
<int:poller cron="0 0 * * * *"/>
<int:header name="snapshot_date" expression="new java.util.Date()"/>
<int:header name="correlationId" expression="T(java.util.UUID).randomUUID()"/>
<!-- more here -->
</int:inbound-channel-adapter>
<int:recipient-list-router input-channel="in" apply-sequence="true">
<int:recipient channel="data.source.1"/>
<int:recipient channel="data.source.2"/>
<!-- more here -->
</int:recipient-list-router>
<int:chain input-channel="data.source.1" output-channel="save">
<int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
<int-jdbc:query>
select * from large_dataset
</int-jdbc:query>
</int-jdbc:outbound-gateway>
<int:header-enricher>
<int:header name="source" value="data.source.1"/>
</int:header-enricher>
</int:chain>
<int:chain input-channel="data.source.2" output-channel="save">
<int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
<int-jdbc:query>
select * from another_large_dataset
</int-jdbc:query>
</int-jdbc:outbound-gateway>
<int:header-enricher>
<int:header name="source" value="data.source.2"/>
</int:header-enricher>
</int:chain>
<int:chain input-channel="save" output-channel="process">
<int:splitter expression="T(com.google.common.collect.Lists).partition(payload, 1000)"/>
<int:transformer>
<int-groovy:script location="transform.groovy"/>
</int:transformer>
<int:service-activator expression="#db2.insertData(payload, headers)"/>
<int:aggregator/>
</int:chain>
<int:chain input-channel="process" output-channel="nullChannel">
<int:aggregator/>
<int:service-activator expression="#finalProcessing.doSomething()"/>
</int:chain>
let me explain the steps a little bit:
poller is triggered by cron. message is enriched with some information about this run.
message is sent to multiple data-source chains.
each chain extracts data from large dataset (100+k rows). resultset message is marked with source header.
resultset is split into smaller chunks, transformed and inserted into db2.
after all data sources have been polled, some complex processing is initiated, using the information about the run.
this configuration does the job so far, but is not scalable. main problem is that i have to load full dataset into memory first and pass it along the pipeline, which might cause memory issues.
my question is - what is the simplest way to have resultset extracted from db1, pushed through the pipeline and inserted into db2 in small batches?
First of all, since version 4.0.4 Spring Integration's <splitter> supports an Iterator as payload to avoid memory overhead.
We have a test case for JDBC which shows that behaviour. As you will see, it is based on the Spring Integration Java DSL and Java 8 lambdas. (Yes, it can be done for older Java versions without lambdas.) Even if this case is appropriate for you, your <aggregator> should not be in-memory, because it collects all messages in the MessageStore.
That's the first case.
Another option is based on a paging algorithm, where your SELECT accepts a pair of WHERE params in your DB dialect. For Oracle it can be like: Paging with Oracle.
Here the pageNumber is some message header - :headers[pageNumber]
After that you do a trick with <recipient-list-router> to send the SELECT result to the save channel and also to another channel which increments the pageNumber header value and sends a message back to the data.source.1 channel, and so on. When the pageNumber goes beyond the data scope, the <int-jdbc:outbound-gateway> stops producing results.
Something like that.
I'm not saying it is easy, but it should at least be a starting point for you.
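To make the Iterator idea concrete, here is a plain-Java sketch of an iterator that hands out rows in fixed-size batches without ever holding the full result set. BatchingIterator is a hypothetical helper, not a Spring Integration class, and how the underlying row iterator is obtained from JDBC is left open.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper: groups the elements of a source iterator into small lists. */
class BatchingIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final int batchSize;

    BatchingIterator(Iterator<T> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        // Pull at most batchSize elements; only this batch is held in memory.
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }
}
```

Compared with Lists.partition(payload, 1000), which needs the complete list up front, a splitter that returns such an iterator would only materialize one batch at a time, provided the source iterator itself streams rows.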

Prevent duplicates across restarts in spring integration

I have to poll a directory and write entries to an RDBMS.
I wired up a Redis metadata store for the duplicates check. I see that the framework updates the Redis store with entries for all files in the folder [~ 140 files] well before the RDBMS entries get written. At the time of application termination, the RDBMS had logged only 90 files. On application restart no more files are picked up from the folder.
Properties: msgs.per.poll=10, polling.interval=2000
How can I ensure entries to redis are made after writing to db, so that both are in sync and I don't miss any files?
<code>
<task:executor id="executor" pool-size="5" />
<int-file:inbound-channel-adapter channel="filesIn" directory="${input.Dir}" scanner="dirScanner" filter="compositeFileFilter" prevent-duplicates="true">
<int:poller fixed-delay="${polling.interval}" max-messages-per-poll="${msgs.per.poll}" task-executor="executor">
</int:poller>
</int-file:inbound-channel-adapter>
<int:channel id="filesIn" />
<bean id="dirScanner" class="org.springframework.integration.file.RecursiveLeafOnlyDirectoryScanner" />
<bean id="compositeFileFilter" class="org.springframework.integration.file.filters.CompositeFileListFilter">
<constructor-arg ref="persistentFilter" />
</bean>
<bean id="persistentFilter" class="org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter">
<constructor-arg ref="metadataStore" />
</bean>
<bean name="metadataStore" class="org.springframework.integration.redis.metadata.RedisMetadataStore">
<constructor-arg name="connectionFactory" ref="redisConnectionFactory"/>
</bean>
<bean id="redisConnectionFactory" class="org.springframework.data.redis.connection.jedis.JedisConnectionFactory" p:hostName="localhost" p:port="6379" />
<int-jdbc:outbound-channel-adapter channel="filesIn" data-source="dataSource" query="insert into files values (:path,:name,:size,:crDT,:mdDT,:id)"
sql-parameter-source-factory="spelSource">
</int-jdbc:outbound-channel-adapter>
....
</code>
Artem is correct. You might as well extend the RedisMetadataStore and, at initialization time, flush the entries that are not in your database; this way you could use Redis and stay in sync with the DB. But this couples things a little.
How can I ensure entries to redis are made after writing to db
It isn't possible, because FileSystemPersistentAcceptOnceFileListFilter works before any message is sent and only once, when FileReadingMessageSource.toBeReceived is empty. Of course, it tries to refetch the files on the next application restart, but it can't, because your RedisMetadataStore already contains entries for those files.
I think in your case there is no choice other than to use a custom JdbcFileListFilter based on your files table. Fortunately, your logic ends up with a file entry there anyway.

Spring batch admin remote partition steps running maximum 8 threads even though concurrency is 10?

I am using Spring Batch remote partitioning for a batch process. I am launching jobs using Spring Batch Admin.
I have the inbound gateway consumer concurrency set to 10, but the maximum number of partitions running in parallel is 8.
I want to increase the consumer concurrency to 15 later on.
Below is my configuration:
<task:executor id="taskExecutor" pool-size="50" />
<rabbit:template id="computeAmqpTemplate"
connection-factory="rabbitConnectionFactory" routing-key="computeQueue"
reply-timeout="${compute.partition.timeout}">
</rabbit:template>
<int:channel id="computeOutboundChannel">
<int:dispatcher task-executor="taskExecutor" />
</int:channel>
<int:channel id="computeInboundStagingChannel" />
<amqp:outbound-gateway request-channel="computeOutboundChannel"
reply-channel="computeInboundStagingChannel" amqp-template="computeAmqpTemplate"
mapped-request-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS"
mapped-reply-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS" />
<beans:bean id="computeMessagingTemplate"
class="org.springframework.integration.core.MessagingTemplate"
p:defaultChannel-ref="computeOutboundChannel"
p:receiveTimeout="${compute.partition.timeout}" />
<beans:bean id="computePartitionHandler"
class="org.springframework.batch.integration.partition.MessageChannelPartitionHandler"
p:stepName="computeStep" p:gridSize="${compute.grid.size}"
p:messagingOperations-ref="computeMessagingTemplate" />
<int:aggregator ref="computePartitionHandler"
send-partial-result-on-expiry="true" send-timeout="${compute.step.timeout}"
input-channel="computeInboundStagingChannel" />
<amqp:inbound-gateway concurrent-consumers="${compute.consumer.concurrency}"
request-channel="computeInboundChannel"
reply-channel="computeOutboundStagingChannel" queue-names="computeQueue"
connection-factory="rabbitConnectionFactory"
mapped-request-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS"
mapped-reply-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS" />
<int:channel id="computeInboundChannel" />
<int:service-activator ref="stepExecutionRequestHandler"
input-channel="computeInboundChannel" output-channel="computeOutboundStagingChannel" />
<int:channel id="computeOutboundStagingChannel" />
<beans:bean id="computePartitioner"
class="org.springframework.batch.core.partition.support.MultiResourcePartitioner"
p:resources="file:${spring.tmp.batch.dir}/#{jobParameters[batch_id]}/shares_rics/shares_rics_*.txt"
scope="step" />
<beans:bean id="computeFileItemReader"
class="org.springframework.batch.item.file.FlatFileItemReader"
p:resource="#{stepExecutionContext[fileName]}" p:lineMapper-ref="stLineMapper"
scope="step" />
<beans:bean id="computeItemWriter"
class="com.st.batch.foundation.writers.ComputeItemWriter"
p:symfony-ref="symfonyStepScoped" p:timeout="${compute.item.timeout}"
p:batchId="#{jobParameters[batch_id]}" scope="step" />
<step id="computeStep">
<tasklet transaction-manager="transactionManager">
<chunk reader="computeFileItemReader" writer="computeItemWriter"
commit-interval="${compute.commit.interval}" />
</tasklet>
</step>
<flow id="computeFlow">
<step id="computeStep.master">
<partition partitioner="computePartitioner"
handler="computePartitionHandler" />
</step>
</flow>
<job id="computeJob" restartable="true">
<flow id="computeJob.computeFlow" parent="computeFlow" />
</job>
compute.grid.size = 112
compute.consumer.concurrency = 10
Input files are split into 112 equal parts = compute.grid.size = total number of partitions
Number of servers = 4.
There are 2 problems:
i) Even though I have set concurrency to 10, the maximum number of threads running is 8.
ii) Some servers are slower because other processes run on them, and some are faster, so I want to make sure step executions are distributed fairly, i.e. if the faster servers are done with their executions, the remaining executions in the queue should go to them. They should not be distributed in round-robin fashion.
I know that in RabbitMQ there are the prefetch count setting and the ack mode to distribute work fairly. For Spring Integration, the prefetch count is 1 by default and the ack mode is AUTO by default. But still some servers keep running more partitions even though other servers have been done for a long time. Ideally no server should be sitting idle.
Update:
One more thing I have now observed is that some steps which run in parallel using split (not distributed using remote partitioning) also run at most 8 in parallel. It looks like a thread pool limit issue, but as you can see taskExecutor has pool-size set to 50.
Is there anything in spring-batch/spring-batch-admin which limits the number of concurrently running steps?
2nd Update:
Also, if there are 8 or more threads running in parallel processing items, Spring Batch Admin doesn't load. It just hangs. If I reduce the concurrency, Spring Batch Admin loads. I even tested it by setting concurrency to 4 on one server and 8 on another: Spring Batch Admin doesn't load if I use the URL of the server where 8 threads are running, but it works on the server where 4 threads are running.
The Spring Batch Admin manager has the jobLauncher configuration below:
<bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
<property name="jobRepository" ref="jobRepository" />
<property name="taskExecutor" ref="jobLauncherTaskExecutor" />
</bean>
<task:executor id="jobLauncherTaskExecutor" pool-size="6" rejection-policy="ABORT" />
The pool size is 6 there; does it have anything to do with the above problem?
Or is there anything in Tomcat 7 which restricts the number of running threads to 8?
Are you using a database for the JobRepository?
During execution, the batch framework persists step executions, and the number of connections to the JobRepository database can interfere with parallel step executions.
A concurrency of 8 makes me think you might be using BasicDataSource. If so, switch to something like DriverManagerDataSource and see.
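For context, Commons DBCP's BasicDataSource allows at most 8 active connections by default (maxActive in DBCP 1.x, maxTotal in DBCP 2), which would match the observed ceiling. The plain-Java sketch below only simulates that effect: a Semaphore stands in for the connection pool, and all names are hypothetical. Whether this is the actual cause in the setup above is an assumption.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo: worker threads throttled by a smaller "connection pool". */
class PoolLimitDemo {

    /** Runs the tasks and returns the highest number that held a "connection" at once. */
    static int peakConcurrency(int workers, int poolSize, int tasks) {
        Semaphore connections = new Semaphore(poolSize); // stands in for the data source pool
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < tasks; i++) {
            executor.submit(() -> {
                try {
                    connections.acquire(); // blocks while all "connections" are in use
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = active.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(20); // simulated work that holds the connection
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                    connections.release();
                }
            });
        }
        executor.shutdown();
        try {
            executor.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }
}
```

Calling PoolLimitDemo.peakConcurrency(10, 8, 40) never reports more than 8 concurrently active tasks, even though 10 worker threads are available.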
Confused - you said "I have set the concurrency to 10" but then show compute.consumer.concurrency = 8. So it is working as configured. It is impossible to have only 8 consumer threads if the property is set to 10.
From Rabbit's perspective, all consumers are equal - if there are 10 consumers on a slow box and 10 consumers on a fast box, and you only have 10 partitions, it is possible that all 10 partitions will end up on the slow box.
RabbitMQ does not distribute work across servers, it distributes the work across consumers only.
You might get better distribution by reducing the concurrency. You should also set the concurrency lower on the slower boxes.
