Is there a way to limit the size of a queue in ActiveMQ? For example, I have four queues: Q1, Q2, Q3, Q4, and I want producers to Q3 to block once Q3 holds 200 MB of messages, until messages are consumed, while Q1, Q2 and Q4 keep functioning normally.
You can do it, but you have to do it in steps.
There are different types of memory to consider, such as regular RAM and disk space in the persistent store. You have to configure them separately, since when the RAM allotment is exhausted, messages are swapped out to the store and fetched back from it later (depending a bit on the configuration).
However, you hopefully already have a system-wide limit, like this:
<systemUsage>
    <systemUsage>
        <memoryUsage>
            <memoryUsage percentOfJvmHeap="70" />
        </memoryUsage>
        <storeUsage>
            <storeUsage limit="100 gb"/>
        </storeUsage>
        <tempUsage>
            <tempUsage limit="50 gb"/>
        </tempUsage>
    </systemUsage>
</systemUsage>
Given these entries as a starting point, you can apply per-destination policies that limit specific queues. Those limits are expressed as percentages of the system-wide limits, so you need to do some calculations.
Use storeUsageHighWaterMark and/or cursorMemoryHighWaterMark, depending on the effect you want. Note that the store is not used for non-persistent messages.
For a basic memory limit, you can also use the memoryLimit setting on the destination policy. It acts as a child of the system-wide memoryUsage limit.
<policyEntry queue="ANOTHER.>" producerFlowControl="true" memoryLimit="12 kb"/>
This limits the memory available to each queue matching ANOTHER.> to 12 kb; with producerFlowControl="true", producers are blocked once that limit is reached (note it limits memory usage, not the total queue size on disk).
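As a minimal sketch for the original question (the queue name Q3 and the 200 MB figure are taken from the question; the exact values are up to you), a per-queue entry inside the broker's destinationPolicy could look like the following. Note that, as mentioned above, persistent messages are also governed by the store limits, so storeUsageHighWaterMark may matter as well:

<!-- Sketch only: caps Q3 at roughly 200 MB of broker memory and blocks its
     producers once that limit is reached; Q1, Q2 and Q4 are unaffected. -->
<destinationPolicy>
    <policyMap>
        <policyEntries>
            <policyEntry queue="Q3" producerFlowControl="true" memoryLimit="200 mb"/>
        </policyEntries>
    </policyMap>
</destinationPolicy>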
Apologies if this has already been covered here; I couldn't find anything closely related. I have a Kafka Streams app which reads from multiple topics, persists the records in a DB and then publishes an event to an output topic. Pretty straightforward; it is stateless in terms of Kafka local stores. (Topologies below.)
Topic1 (T1) has 5 partitions and Topic2 (T2) has a single partition. The issue is that, while consuming from the two topics, if I want to go "full speed" with T1 (5 consumers), there is no guarantee that I will have a dedicated consumer for each partition of T1. The assignment is spread across the partitions of both topics, so I might end up with unbalanced (and idle) consumers, something like this:
[c1: t1p1, t1p3], [c2: t1p2, t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: (idle consumer)]
[c1: t1p1, t1p2], [c2: t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: t1p3]
With that said:
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Which of the topologies below is best suited to what I want to achieve? Or is it completely unrelated?
Option A (Current topology)
Topologies:
  Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Processor: topic1-processor (stores: [])
      --> topic1-sink
      <-- topic1-source
    Sink: topic1-sink (topic: OUTPUT-TOPIC)
      <-- topic1-processor

  Sub-topology: 1
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic2-processor (stores: [])
      --> topic2-sink
      <-- topic2-source
    Sink: topic2-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor
Option B:
Topologies:
  Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic1-processor (stores: [])
      --> response-sink
      <-- topic1-source
    Processor: topic2-processor (stores: [])
      --> response-sink
      <-- topic2-source
    Sink: response-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor, topic1-processor
If I use two KafkaStreams instances, one for each topic, instead of a single instance with multiple topics, would that work for what I am trying to achieve?
config1.put("application.id", "app1");
KafkaStreams stream1 = new KafkaStreams(config1, topologyTopic1);
stream1.start();

config2.put("application.id", "app2");
KafkaStreams stream2 = new KafkaStreams(config2, topologyTopic2);
stream2.start();
The initial assignments you describe would never happen with Kafka Streams (and also not with any default consumer configuration). If there are 5 partitions and you have 5 consumers, each consumer gets 1 partition assigned (with a plain consumer and a custom PartitionAssignor you could do the assignment differently, but all default implementations ensure proper load balancing).
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
There is no issue with that.
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Depending on how you write your topology, this can be the assignment Kafka Streams uses out of the box. Of your two options, Option B would result in this assignment.
Which of the topologies below is best suited to what I want to achieve? Or is it completely unrelated?
As mentioned above, Option B would result in that assignment. With Option A you could actually even use a 6th instance, and each instance would process exactly one partition (because there are two sub-topologies you get 6 tasks, 5 for sub-topology-0 and 1 for sub-topology-1; sub-topologies are scaled out independently of each other). With Option B you only get 5 tasks, because there is only one sub-topology, and thus the maximum partition count over both input topics (that is, 5) determines the number of tasks.
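For reference, here is a minimal sketch (not the asker's actual code) of wiring an Option-B-style topology with the Processor API; Topic1Processor and Topic2Processor are stand-ins for whatever processors the app really uses:

import org.apache.kafka.streams.Topology;

// Sketch only: both processors share one sink, so everything collapses into a single
// sub-topology; Kafka Streams then creates max(5, 1) = 5 tasks, and the task for
// partition 1 owns both t1p1 and t2p1, matching the desired assignment.
Topology topology = new Topology();
topology.addSource("topic1-source", "TOPIC1");
topology.addSource("topic2-source", "TOPIC2");
topology.addProcessor("topic1-processor", Topic1Processor::new, "topic1-source");
topology.addProcessor("topic2-processor", Topic2Processor::new, "topic2-source");
topology.addSink("response-sink", "OUTPUT-TOPIC", "topic1-processor", "topic2-processor");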
If I use two KafkaStreams instances, one for each topic, instead of a single instance with multiple topics, would that work for what I am trying to achieve?
Yes, it would be basically the same as Option A; however, you get two consumer groups and thus "two applications" instead of one.
My setup (simplified for clarity) is the following:
<int:inbound-channel-adapter channel="in" expression="0">
    <int:poller cron="0 0 * * * *"/>
    <int:header name="snapshot_date" expression="new java.util.Date()"/>
    <int:header name="correlationId" expression="T(java.util.UUID).randomUUID()"/>
    <!-- more here -->
</int:inbound-channel-adapter>

<int:recipient-list-router input-channel="in" apply-sequence="true">
    <int:recipient channel="data.source.1"/>
    <int:recipient channel="data.source.2"/>
    <!-- more here -->
</int:recipient-list-router>

<int:chain input-channel="data.source.1" output-channel="save">
    <int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
        <int-jdbc:query>
            select * from large_dataset
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <int:header-enricher>
        <int:header name="source" value="data.source.1"/>
    </int:header-enricher>
</int:chain>

<int:chain input-channel="data.source.2" output-channel="save">
    <int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
        <int-jdbc:query>
            select * from another_large_dataset
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <int:header-enricher>
        <int:header name="source" value="data.source.2"/>
    </int:header-enricher>
</int:chain>

<int:chain input-channel="save" output-channel="process">
    <int:splitter expression="T(com.google.common.collect.Lists).partition(payload, 1000)"/>
    <int:transformer>
        <int-groovy:script location="transform.groovy"/>
    </int:transformer>
    <int:service-activator expression="#db2.insertData(payload, headers)"/>
    <int:aggregator/>
</int:chain>

<int:chain input-channel="process" output-channel="nullChannel">
    <int:aggregator/>
    <int:service-activator expression="#finalProcessing.doSomething()"/>
</int:chain>
Let me explain the steps a little:
The poller is triggered by cron, and the message is enriched with some information about this run.
The message is sent to multiple data-source chains.
Each chain extracts data from a large dataset (100k+ rows); the result-set message is marked with a source header.
The result set is split into smaller chunks, transformed, and inserted into db2.
After all data sources have been polled, some complex processing is initiated, using the information about the run.
This configuration does the job so far, but it is not scalable. The main problem is that I have to load the full dataset into memory first and pass it along the pipeline, which might cause memory issues.
My question is: what is the simplest way to have the result set extracted from db1, pushed through the pipeline, and inserted into db2 in small batches?
First of all, since version 4.0.4, Spring Integration's <splitter> supports an Iterator as payload, to avoid memory overhead.
We have a test case for JDBC which shows that behaviour. However, it is based on the Spring Integration Java DSL and Java 8 lambdas. (Yes, it can be done for older Java versions without lambdas, too.) Even if this case is appropriate for you, your <aggregator> should not be in-memory, because it collects all messages in its MessageStore.
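As a rough illustration of that first case (not the actual test case), assuming a hypothetical rowIterator bean whose open() method returns a java.util.Iterator over a streaming JDBC cursor, and with illustrative channel names:

<!-- Sketch only: because the payload is an Iterator, the default <splitter>
     (Spring Integration 4.0.4+) emits one message per row without first
     materializing the whole result set in memory. -->
<int:service-activator input-channel="data.source.1" output-channel="rows"
        expression="@rowIterator.open('select * from large_dataset')"/>
<int:splitter input-channel="rows" output-channel="save"/>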
That's the first case.
Another option is based on a paging algorithm, where your SELECT accepts a pair of WHERE parameters in your DB dialect. For Oracle it can look like "Paging with Oracle",
where the pageNumber is some message header: :headers[pageNumber].
After that you do a trick with the <recipient-list-router> to send the SELECT result to the save channel and to some other channel which increments the pageNumber header value and sends the message back to the data.source.1 channel, and so on (roughly sketched below). When the pageNumber runs past the end of the data, the <int-jdbc:outbound-gateway> stops producing results.
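A very rough sketch of that loop (the channel names, the 1000-row page size and the size() check are illustrative, not a drop-in config):

<!-- Sketch only: each page goes to "save"; while full pages keep coming, the
     pageNumber header is incremented and the message is sent around again. -->
<int:recipient-list-router input-channel="page.out">
    <int:recipient channel="save"/>
    <int:recipient channel="next.page" selector-expression="payload.size() == 1000"/>
</int:recipient-list-router>

<int:header-enricher input-channel="next.page" output-channel="data.source.1">
    <int:header name="pageNumber" overwrite="true" expression="headers['pageNumber'] + 1"/>
</int:header-enricher>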
Something like that.
I don't say that it is easy, but it should be a starting point for you, at least.
I have a network of brokers in a complete-graph topology with 3 nodes on different servers: A, B and C. Every broker has a producer attached and, for testing purposes, there is only one non-broker consumer, on broker C. Since I'm using the complete-graph topology, every broker also has a broker consumer for each of the other nodes.
The problem is: A receives a few messages. I expect it to forward those messages to broker C, which has a "real" consumer attached. This is not happening; broker A stores those messages until a "real" consumer connects to it.
What's wrong with my configuration (or understanding)?
I'm using ActiveMQ 5.9.0.
Here's my activemq.xml for broker A. It's the same for B and C, only changing names:
<beans
    xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
                        http://activemq.apache.org/schema/core http://activemq.apache.org/schema/core/activemq-core.xsd">

    <broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker-A" dataDirectory="${activemq.data}">

        <destinationPolicy>
            <policyMap>
                <policyEntries>
                    <policyEntry topic="tokio.>">
                        <subscriptionRecoveryPolicy>
                            <noSubscriptionRecoveryPolicy/>
                        </subscriptionRecoveryPolicy>
                        <pendingMessageLimitStrategy>
                            <constantPendingMessageLimitStrategy limit="1000"/>
                        </pendingMessageLimitStrategy>
                    </policyEntry>
                </policyEntries>
            </policyMap>
        </destinationPolicy>

        <managementContext>
            <managementContext createConnector="true"/>
        </managementContext>

        <persistenceAdapter>
            <kahaDB directory="${activemq.data}/kahadb"/>
        </persistenceAdapter>

        <systemUsage>
            <systemUsage>
                <memoryUsage>
                    <memoryUsage percentOfJvmHeap="70" />
                </memoryUsage>
                <storeUsage>
                    <storeUsage limit="40 gb"/>
                </storeUsage>
                <tempUsage>
                    <tempUsage limit="10 gb"/>
                </tempUsage>
            </systemUsage>
        </systemUsage>

        <networkConnectors>
            <networkConnector name="linkTo-broker-B"
                              uri="static:(tcp://SRVMSG01:61616)"
                              duplex="true"
                              />
            <networkConnector name="linkTo-broker-C"
                              uri="static:(tcp://SRVMSG03:61616)"
                              duplex="true"
                              />
        </networkConnectors>

        <transportConnectors>
            <transportConnector uri="tcp://localhost:0" discoveryUri="multicast://default"/>
            <transportConnector name="nio" uri="nio://0.0.0.0:61616" />
        </transportConnectors>

    </broker>
</beans>
By default, networkTTL is 1 (see documentation), so when a producer on B publishes a message, if it takes the path to A (which it will do 50% of the time in your configuration because you've got the broker set up to round-robin between consumers, more on that in a second), it's not allowed to be forwarded to C. You could fix the problem by increasing the value of networkTTL, but the better solution is to set decreaseNetworkConsumerPriority=true (see documentation at same link as above) to ensure that messages always go as directly as possible to the consumer to which they're destined.
Note, however, that if your consumers move around the mesh, this can strand messages, both because the networkTTL value won't allow additional forwards and because messages aren't allowed to be resent to a broker through which they've already passed. You can address those issues by setting networkTTL to a larger value (like 20, to be completely safe) and by applying the replayWhenNoConsumers=true policy setting described in the "Stuck Messages" section of that same documentation page; a sketch follows below. Neither of those settings is strictly necessary as long as you're sure your consumers will never move to another broker, or you're OK with losing a few messages when that does happen.
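Applied to the broker-A configuration above, the combined changes would look roughly like this (the networkTTL value and the catch-all policy entry are examples, not required values):

<networkConnectors>
    <networkConnector name="linkTo-broker-B"
                      uri="static:(tcp://SRVMSG01:61616)"
                      duplex="true"
                      decreaseNetworkConsumerPriority="true"
                      networkTTL="20"/>
    <networkConnector name="linkTo-broker-C"
                      uri="static:(tcp://SRVMSG03:61616)"
                      duplex="true"
                      decreaseNetworkConsumerPriority="true"
                      networkTTL="20"/>
</networkConnectors>

<!-- Only needed if consumers can move between brokers; see "Stuck Messages". -->
<policyEntry queue=">">
    <networkBridgeFilterFactory>
        <conditionalNetworkBridgeFilterFactory replayWhenNoConsumers="true"/>
    </networkBridgeFilterFactory>
</policyEntry>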
I am using JBoss Cache (I tried versions 3.2.7 and 3.1.0) to have a replicated Map for caching data between application servers. In the past I did some checks to see whether it works, and it did. My test environment was always 2 nodes in the same network segment.
Since IT departments sometimes have a problem with UDP, we use TCP (TCPPING for discovery).
Now a customer has reported problems with our nodes losing their sync and not replicating data.
They have 4 nodes in 2 subnets (2 and 2). They say that when they only use 2 nodes in either subnet it works. When they start the third, the problems begin.
The log files show a lot of "merge" problems, indicating a partitioning problem.
So I did my own tests at my company. My setup is my laptop running Windows and two virtual machines running Ubuntu. The virtual machines use bridged network interfaces. DHCP is used, and our IT department gives my 3 nodes IPs in different subnets. My host laptop is in a different subnet than the virtual machines. TCP communication between the nodes works. There should be no firewall involved.
So much for my setup.
I wrote a little program that just initializes JBoss Cache, gets the cache (a Map), changes values in the map at an interval, and afterwards shows the content of the whole map. Pretty simple, 2 classes involved.
My JBoss Cache setup is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<jbosscache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xmlns="urn:jboss:jbosscache-core:config:3.1">
    <clustering mode="replication" clusterName="${jgroups.clustername:DEFAULT}">
        <stateRetrieval timeout="20000" fetchInMemoryState="true" />
        <sync replTimeout="20000" />
        <jgroupsConfig>
            <TCP start_port="${jgroups.tcpping.start_port:7800}" loopback="true" recv_buf_size="20000000"
                 send_buf_size="640000" discard_incompatible_packets="true"
                 max_bundle_size="64000" max_bundle_timeout="30"
                 use_incoming_packet_handler="true" enable_bundling="false"
                 use_send_queues="false" sock_conn_timeout="3000"
                 skip_suspected_members="true" use_concurrent_stack="true"
                 thread_pool.enabled="true" thread_pool.min_threads="1"
                 thread_pool.max_threads="25" thread_pool.keep_alive_time="5000"
                 thread_pool.queue_enabled="false" thread_pool.queue_max_size="100"
                 thread_pool.rejection_policy="run" oob_thread_pool.enabled="true"
                 oob_thread_pool.min_threads="1" oob_thread_pool.max_threads="8"
                 oob_thread_pool.keep_alive_time="5000"
                 oob_thread_pool.queue_enabled="false"
                 oob_thread_pool.queue_max_size="100"
                 oob_thread_pool.rejection_policy="run" />
            <TCPPING timeout="3000"
                     initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"
                     port_range="3" num_initial_members="3" />
            <MERGE2 max_interval="100000" min_interval="20000" />
            <MERGE3 max_interval="100000" min_interval="20000" />
            <FD_SOCK />
            <FD timeout="10000" max_tries="5" shun="true" />
            <VERIFY_SUSPECT timeout="1500" />
            <BARRIER />
            <pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
                           retransmit_timeout="300,600,1200,2400,4800" discard_delivered_msgs="true" />
            <UNICAST timeout="300,600,1200" />
            <pbcast.STABLE stability_delay="1000"
                           desired_avg_gossip="50000" max_bytes="400000" />
            <VIEW_SYNC avg_send_interval="60000" />
            <pbcast.GMS print_local_addr="true" join_timeout="6000"
                        shun="true" view_bundling="true" />
            <FC max_credits="2000000" min_threshold="0.10" />
            <FRAG2 frag_size="60000" />
            <pbcast.STREAMING_STATE_TRANSFER />
        </jgroupsConfig>
    </clustering>
</jbosscache>
When starting my test nodes, I provide them with the following system properties:
-Djgroups.bind_addr=NODE1
-Djgroups.tcpping.initial_hosts=NODE1[7900],NODE2[7900],NODE3[7900]
-Djgroups.tcpping.start_port=7900
From the log messages (the GMS output) I can see that the node addresses are indeed NODE-X[7900], as specified, for all nodes.
NODE1-3 are given as IP addresses. Those IP addresses can be reached from the other nodes.
NODE1 and NODE2 are in the same subnet.
NODE3 is in a different subnet.
I ran a huge number of tests, changing the config, the JBoss Cache version, the combination of running nodes, etc.
Sometimes it works, sometimes it doesn't.
One factor that seems to influence whether the members find each other is the initial hosts info: the order of the hosts, whether there are hosts in the list beyond the 3 given, and whether a node's own IP is left out of the list all affect whether the setup works. It also depends on the order in which the nodes are started.
I am sure it has to do with JGroups group membership. Maybe parameters need to be added to make it more robust.
I really would appreciate some hints on what to try in order to get the nodes talking to each other reliably.
In addition to trying to figure out the problems with JBoss Cache (JGroups), I ran the same test using Hazelcast (TCP) instead. It works perfectly, without any problem, so the basic networking between the nodes should be fine.
I am considering switching to Hazelcast, but that would require redeployments at several of our customers' IT departments, and I would like to avoid that.
I am trying to experiment with Spring Integration on a simple task. I have a folder where I receive incoming files. The files are named after a group ID.
I want all files with the same groupId to be processed in sequence, while files with different groupIds can be processed in parallel.
I started putting together a configuration like this:
<int:service-activator input-channel="filesInChannel"
                       output-channel="outputChannelAdapter">
    <bean class="com.ingestion.FileProcessor" />
</int:service-activator>

<int:channel id="filesInChannel" />

<int-file:inbound-channel-adapter id="inputChannelAdapter"
        channel="filesInChannel" directory="${in.file.path}" prevent-duplicates="true"
        filename-pattern="${file.pattern}">
    <int:poller id="poller" fixed-rate="1" task-executor="executor"/>
</int-file:inbound-channel-adapter>

<int-file:outbound-channel-adapter id="outputChannelAdapter" directory="${ok.file.path}" delete-source-files="true"/>

<task:executor id="executor" pool-size="10"/>
This is processing all the incoming files with 10 threads. What steps do I need to take to split the files by groupId and have them processed with one thread per groupId?
Thanks.
Assuming a finite number of group IDs, you could use a different adapter for each group (each with a single thread, all feeding into the same channel), each with a different filename pattern.
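A minimal sketch of that first approach, assuming two groups whose files start with "groupA_" and "groupB_" (the patterns and bean names are illustrative):

<!-- Sketch only: one adapter and one single-threaded executor per group, so files
     within a group are processed sequentially (filesInChannel is a DirectChannel,
     so downstream processing stays on the adapter's thread) while groups run in parallel. -->
<int-file:inbound-channel-adapter id="groupAAdapter"
        channel="filesInChannel" directory="${in.file.path}"
        prevent-duplicates="true" filename-pattern="groupA_*">
    <int:poller fixed-rate="1" task-executor="groupAExecutor"/>
</int-file:inbound-channel-adapter>

<int-file:inbound-channel-adapter id="groupBAdapter"
        channel="filesInChannel" directory="${in.file.path}"
        prevent-duplicates="true" filename-pattern="groupB_*">
    <int:poller fixed-rate="1" task-executor="groupBExecutor"/>
</int-file:inbound-channel-adapter>

<task:executor id="groupAExecutor" pool-size="1"/>
<task:executor id="groupBExecutor" pool-size="1"/>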
Or you could create a custom FileListFilter and use some kind of thread affinity to assign files from each group to a specific thread, with the filter only returning this thread's file(s).
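And here is a rough Java sketch of that second approach; the filename convention ("<groupId>_..."), the slot/slotCount wiring and the class itself are assumptions for illustration, not an existing Spring Integration class:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.springframework.integration.file.filters.FileListFilter;

// Sketch only: each inbound adapter (one per "slot", each backed by a single-threaded
// executor) gets its own instance, so a given group is always picked up by the same thread.
public class GroupAffinityFileListFilter implements FileListFilter<File> {

    private final int slot;       // index of this adapter/thread
    private final int slotCount;  // total number of adapters/threads

    public GroupAffinityFileListFilter(int slot, int slotCount) {
        this.slot = slot;
        this.slotCount = slotCount;
    }

    @Override
    public List<File> filterFiles(File[] files) {
        List<File> accepted = new ArrayList<>();
        if (files == null) {
            return accepted;
        }
        for (File file : files) {
            String groupId = file.getName().split("_")[0];              // assumed naming convention
            if (Math.floorMod(groupId.hashCode(), slotCount) == slot) { // stable group -> slot mapping
                accepted.add(file);
            }
        }
        return accepted;
    }
}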