Kafka Streams memory allocation issues with RocksDB - apache-kafka-streams

I am trying to build a simple Kafka Streams application (v2.3.1, initially 2.3.0) that gathers statistics over specified time intervals (e.g. per minute, i.e. tumbling windows).
To do so I follow a textbook implementation like the one below:
events
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ZERO))
    .aggregate(..., Materialized.as("agg-metric").withRetention(Duration.ofMinutes(5)))
    .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
    .toStream((key, value) -> key.key())
Everything seems to work normally, except that my memory footprint is constantly growing. I can see that my heap memory is stable, so I assume the issue is related to the RocksDB instances that get created.
I have 10 partitions, and since each windowed store creates 3 segments per partition by default, I expect 30 RocksDB instances in total. Since the default RocksDB configuration values are rather large for my app, I chose to change them by implementing RocksDBConfigSetter as shown below, essentially trying to impose a limit on off-heap memory consumption.
private static final long BLOCK_CACHE_SIZE = 8 * 1024 * 1024L;
private static final long BLOCK_SIZE = 4096L;
private static final long WRITE_BUFFER_SIZE = 2 * 1024 * 1024L;
private static final int MAX_WRITE_BUFFERS = 2;

private org.rocksdb.Cache cache = new org.rocksdb.LRUCache(BLOCK_CACHE_SIZE); // 8 MB
private org.rocksdb.Filter filter = new org.rocksdb.BloomFilter();

@Override
public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
    BlockBasedTableConfig tableConfig = new org.rocksdb.BlockBasedTableConfig();
    tableConfig.setBlockCache(cache);                    // 8 MB block cache
    tableConfig.setBlockSize(BLOCK_SIZE);                // 4 KB blocks
    tableConfig.setCacheIndexAndFilterBlocks(true);
    tableConfig.setPinTopLevelIndexAndFilter(true);
    tableConfig.setFilter(filter);
    options.setMaxWriteBufferNumber(MAX_WRITE_BUFFERS);  // 2 memtables
    options.setWriteBufferSize(WRITE_BUFFER_SIZE);       // 2 MB per memtable
    options.setInfoLogLevel(InfoLogLevel.INFO_LEVEL);
    options.setTableFormatConfig(tableConfig);
}
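As a side note, Kafka Streams 2.3 added a close() hook to RocksDBConfigSetter (KIP-453); a minimal sketch of releasing the native objects created above, assuming one config setter instance per store, would be:

@Override
public void close(final String storeName, final Options options) {
    // Free the off-heap allocations behind the cache and bloom filter created in this setter.
    // If these objects were shared across all stores, they should instead be closed only on
    // application shutdown.
    cache.close();
    filter.close();
}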
Based on the configuration values above and https://docs.confluent.io/current/streams/sizing.html I expect that the total memory allocated to RocksDB would be
(write_buffer_size_mb * write_buffer_count) + block_cache_size_mb => 2*2 + 8 => 12MB
Therefore for 30 instances the total allocated off-heap memory would account to 12 * 30 = 360 MB.
When I run this app on a VM with 2 GB of memory, I assign 512 MB to the heap of the Kafka Streams app,
so based on my understanding the total memory allocated should plateau at a value below 1 GB (512 + 360).
Unfortunately this does not seem to be the case: memory never stops growing, slowly after a point but steadily at almost ~2% per day, so at some point it inevitably consumes all the VM memory and the process gets killed.
Even more concerning, I never see any of the off-heap memory released, even when my traffic gets very low.
As a result, I am wondering what I am doing wrong in such a simple and common use case.
Am I missing something when calculating the memory consumption of my app?
Is it possible to limit the memory consumption on my VM, and which settings do I need to change to limit the memory allocated to the Kafka Streams app and its RocksDB instances?
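For reference, the Kafka Streams memory-management docs suggest bounding all stores with a single shared block cache and write buffer manager; a rough sketch of that approach (the limits below are illustrative, and it assumes the bundled RocksDB version exposes WriteBufferManager) looks like:

public static class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    private static final long TOTAL_OFF_HEAP_MEMORY = 256 * 1024 * 1024L; // illustrative budget
    private static final long TOTAL_MEMTABLE_MEMORY = 64 * 1024 * 1024L;  // illustrative budget

    // Shared across all stores so the limits apply to the whole application.
    private static final org.rocksdb.Cache cache =
            new org.rocksdb.LRUCache(TOTAL_OFF_HEAP_MEMORY, -1, false, 0.1);
    private static final org.rocksdb.WriteBufferManager writeBufferManager =
            new org.rocksdb.WriteBufferManager(TOTAL_MEMTABLE_MEMORY, cache);

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new org.rocksdb.BlockBasedTableConfig();
        tableConfig.setBlockCache(cache);
        // Count memtable memory against the shared cache so everything draws from one budget.
        options.setWriteBufferManager(writeBufferManager);
        // Keep index and filter blocks in the cache as well, so they are bounded too.
        tableConfig.setCacheIndexAndFilterBlocks(true);
        tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true);
        tableConfig.setPinTopLevelIndexAndFilter(true);
        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // cache and writeBufferManager are shared across stores, so they are not closed per store.
    }
}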

Related

EhCache 3 : Too many items getting evicted from off heap store even though there is enough space off heap

I have an Ehcache 3 configuration that stores items off-heap with a 4G size. I see a lot of items start to get evicted at around 100,000 items in the cache, even though the cache's occupied size is around 2.9G (out of the 4G allocated). The constant stream of evictions at > 200 items per second drops my hit rate to 65%.
Any idea what could be the cause of these evictions? Is the off-heap store also limited by the size or percentage of the direct memory allocated to the JVM? These are not expired items but evicted items for some reason.
CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
    .using(statisticsService)
    .withCache("cache", CacheConfigurationBuilder.newCacheConfigurationBuilder(String.class, String.class,
            ResourcePoolsBuilder.newResourcePoolsBuilder().offheap(4, MemoryUnit.GB).build())
        .withExpiry(ExpiryPolicyBuilder.timeToIdleExpiration(Duration.ofMinutes(10)))
        .withExpiry(ExpiryPolicyBuilder.timeToLiveExpiration(Duration.ofMinutes(10)))
        .withService(WriteBehindConfigurationBuilder.newUnBatchedWriteBehindConfiguration().queueSize(10).concurrencyLevel(10).build()))
    .build();

What is a better way to iterate over a Chronicle Map of size 3.5 million?

I have created a Chronicle Map as:
ChronicleMap<LongValue, ExceptionQueueVOInterface> map = ChronicleMapBuilder
    .of(LongValue.class, ExceptionQueueVOInterface.class)
    .name(MAP_NAME)
    .entries(2_000_000)
    .maxBloatFactor(3)
    .create();
The map has around 3.5 million records. I am using it as a cache for search operations, and many times, to search for a list of entries, we have to iterate over the whole map.
The 3.5 million records take 8.7 GB of RAM, my Java heap is allocated 36 GB, and the server has 49 GB of RAM in total.
When I iterate over the map using the code below:
map.forEachEntry(entry -> {
    if (exportSize.get() != MAX_EXPORT_SIZE && TeamQueueFilter.applyFilter(exceptionFilters, entry.value().get(), queType, uid)) {
        ExceptionQueue exceptionQueue = getExceptionQueue(entry.value().get());
        if (exceptionFilters.isRts23Selected() || exceptionQueue.getMessageType().equals(MessageType.RTS23.toString())) {
            exceptionQueue.setForsId(Constant.BLANK_STRING);
            rts23CaseIdMap.put(String.valueOf(entry.key().get().getValue()), exceptionQueue);
        } else {
            handleMessage(exceptionQueue, entry.key().get(), caseDetailsList);
        }
        exportSize.incrementAndGet();
    }
});
it always gives me a memory error. The VM array size is less than the available memory.
Any tips on how to iterate over this large map? Thanks for the help.
Since you haven't provided the actual error, I have to guess that the message you are getting says something like java.lang.OutOfMemoryError: Direct buffer memory.
Chronicle Map does not allocate its contents on the heap; instead it memory-maps off-heap storage, and you are likely running out of off-heap space. Check the size of the Chronicle Map's file on disk: the same amount of off-heap memory will be used for mmapping it.
Off-heap memory can be adjusted using the -XX:MaxDirectMemorySize flag.
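For example (the value and jar name are purely illustrative; the limit should be at least the size of the map file plus any other direct memory the application needs):
java -Xmx36g -XX:MaxDirectMemorySize=16g -jar search-app.jar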

How to interpret stats/show in Rebol 3

I wanted to do some profiling of an R3 script and was looking at the stats command.
But what does this information mean?
How can it be used to monitor memory usage?
>> stats/show
Series Memory Info:
node size = 16
series size = 20
5 segs = 409640 bytes - headers
4888 blks = 812448 bytes - blocks
1511 strs = 86096 bytes - byte strings
2 unis = 86016 bytes - unicode strings
4 odds = 39216 bytes - odd series
6405 used = 1023776 bytes - total used
0 free / 14075 bytes - free headers / node-space
Pool[ 0] 8B 202/ 3328: 256 ( 6%) 13 segs, 26728 total
Pool[ 1] 16B 178/ 512: 256 (34%) 2 segs, 8208 total
Pool[ 2] 32B 954/ 2560: 512 (37%) 5 segs, 81960 total
...
Pool[26] 64B 0/ 0: 128 ( 0%) 0 segs, 0 total
Pools used 654212 of 1906200 (34%)
System pool used 497664
== 1023776
It shows internal memory-management information; I'm not sure how useful it would be for profiling your script.
Anyway, here are some explanations of the memory pools.
Most pools are for series (there is a dedicated pool for GOB!s, and some others if you're looking at the Atronix source code); to keep it simple, I will focus on the series pools here.
Internally, a series has a header and its data, which is a chunk of contiguous memory. The header holds the width and length info about the series, and the data holds its actual content. In R3, series are used extensively to implement block!, port!, string!, object!, etc., so managing memory in R3 is essentially managing (allocating and destroying) series. Because series differ in width and length, pools are introduced to reduce fragmentation.
When a new series is needed, its header is allocated in a special pool, and another pool is chosen for its data: the pool whose width is closest to the size of the series. E.g. a block with 3 elements will probably be allocated in a pool with a width of 128 bytes (on 32-bit systems, a block is a series with 4 (3 + 1 terminator) elements). As a pool can grow as the program runs, it is implemented as a list of segments; new segments are allocated and appended to the list as needed (but never released back to the system).
Another special pool is the system pool, which is chosen when the required memory is large; R3 doesn't actually manage this pool other than collecting some statistics.
When it collects garbage, it sweeps the root context and marks everything that is reachable, then goes through the series header pool, finds all unneeded series, and destroys them.
If you use stats without a refinement, you can see the actual memory usage, so by comparing the usage before and after each of your implementations you can see which one uses less memory.
>> stats
== 1129824
>> s: make string! 1024
== ""
>> stats
== 1132064

Why do we need external sort?

The main reason for external sort is that the data may be larger than the main memory we have. However, we use virtual memory now, and the virtual memory takes care of swapping between main memory and disk. Why do we still need external sort?
An external sort algorithm makes sorting large amounts of data efficient (even when the data does not fit into physical RAM).
While using an in-memory sorting algorithm and virtual memory satisfies the functional requirements for an external sort (that is, it will sort the data), it fails to achieve the non-functional requirement of being efficient. A good external sort minimises the amount of data read and written to external storage (and historically also seek times), and a general-purpose virtual memory implementation on top of a sort algorithm not designed for this will not be competitive with an algorithm designed to minimise IO.
In addition to @Anonymous's answer that external sort is better optimized for less disk IO, sometimes using an in-memory sort on top of virtual memory is simply infeasible, because the virtual address space is smaller than the file.
For example, on a 32-bit system (there are still a lot of these), if you want to sort a 20 GB file, the 2^32 ~= 4 GB of virtual addresses cannot hold the file you are trying to sort.
This used to be a real issue when 64-bit systems were not yet common, and it is still an issue today for old 32-bit systems and some embedded devices.
However, even on a 64-bit system, as explained in the previous answers, an external sort algorithm is optimized for the nature of sorting and will require significantly less disk IO than letting the OS "take care of things".
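To make the difference concrete, here is a minimal sketch of a two-pass external merge sort over lines of text (illustrative code, not taken from any answer here; names like ExternalSort and writeSortedRun are my own): sorted runs are spilled to temporary files and then merged back with one sequential pass per run, which is exactly the IO pattern that virtual-memory swapping cannot guarantee.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class ExternalSort {

    // Pass 1: read chunks that fit in memory, sort each one, spill it to a temp file.
    public static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>(maxLinesInMemory);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == maxLinesInMemory) {
                    runs.add(writeSortedRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(writeSortedRun(chunk));
            }
        }
        mergeRuns(runs, output);
    }

    private static Path writeSortedRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);                              // in-memory sort of one chunk
        Path run = Files.createTempFile("sort-run-", ".tmp");
        Files.write(run, chunk, StandardCharsets.UTF_8);      // sequential write of the sorted run
        return run;
    }

    // Pass 2: k-way merge of the sorted runs; each run is read sequentially exactly once.
    private static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Map.Entry<String, BufferedReader>> heap =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(first, reader));
                }
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> smallest = heap.poll();
                writer.write(smallest.getKey());
                writer.newLine();
                String next = smallest.getValue().readLine();
                if (next != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
            for (Path run : runs) {
                Files.deleteIfExists(run);
            }
        }
    }
}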
I'm using Windows; in the command-line shell you can run "systeminfo", which gives my laptop's memory usage information:
Total Physical Memory: 8,082 MB
Available Physical Memory: 2,536 MB
Virtual Memory: Max Size: 11,410 MB
Virtual Memory: Available: 2,686 MB
Virtual Memory: In Use: 8,724 MB
I just wrote an app to test the maximum size of an array I could initialize on my laptop.
public static void BurnMemory()
{
    for (var i = 1; i <= 1024; i++)
    {
        long size = 1L << i;
        long t = 4 * size / (1L << 30);   // size of the array in GB (one int32 takes 4 bytes)
        try
        {
            // 1 int32 takes 32 bits (4 bytes) of memory
            var arr = new int[size];
            Console.WriteLine("Test passed: initialized an array with size = 2^" + i.ToString());
        }
        catch (OutOfMemoryException err)
        {
            Console.WriteLine("Reached memory limit when initializing an array with size = 2^{0} int32 = 4 x {1} B = {2} GB", i, size, t);
            break;
        }
    }
}
It terminates when it tries to initialize an array of size 2^29:
Reached memory limit when initializing an array with size = 2^29 int32 = 4 x 536870912 B = 2 GB
What I get from the test:
It is not hard to reach the memory limit.
We need to understand our server's capabilities, then decide whether to use an in-memory sort or an external sort.

Which is faster to process a 1TB file: a single machine or 5 networked machines?

Which is faster to process a 1TB file: a single machine or 5 networked
machines? ("To process" refers to finding the single UTF-16 character
with the most occurrences in that 1TB file). The rate of data
transfer is 1Gbit/sec, the entire 1TB file resides in 1 computer, and
each computer has a quad core CPU.
Below is my attempt at the question, using an array of longs (with array size 2^16) to keep track of the character counts. This should fit into the memory of a single machine, since 2^16 x 2^3 bytes (size of a long) = 2^19 bytes = 0.5 MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency numbers cited by Jeff Dean, and I tried to use the best approximations I knew of. The final answer is:
Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
2) 5 Networked Machines
a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8bits/B x 1s/Gbit
= 8,192s = 137m = 2.3hr
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill other machine's array w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8bits/B x 1s/Gbit
= 2^19B x 2^2 arrays x 2^3 bits/B x 1s/2^30bits
= 2^24 bits / 2^30 bits/s = (2^-6)s
e) Time to merge arrays
--> 0 sec (since they can be merged while being received)
f) Total time
Total = a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
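For what it's worth, the counting step described in 1b/2b could look roughly like the sketch below (my own illustrative Java, not part of the original question): stream the file in 1 MB buffers, treat every 2 bytes as a little-endian UTF-16 code unit, and keep a long[65536] histogram that easily fits in the CPU cache.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf16Histogram {

    public static int mostFrequentCodeUnit(Path file) throws IOException {
        long[] counts = new long[1 << 16];    // 2^16 longs = 0.5 MB, small enough for the L2 cache
        byte[] buffer = new byte[1 << 20];    // 1 MB read buffer, filled sequentially from disk
        int carry = -1;                       // leftover byte when a code unit straddles two reads
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int read;
            while ((read = in.read(buffer)) > 0) {
                int i = 0;
                if (carry >= 0) {
                    counts[carry | ((buffer[0] & 0xFF) << 8)]++;
                    carry = -1;
                    i = 1;
                }
                for (; i + 1 < read; i += 2) {
                    // Little-endian UTF-16 code unit; surrogate halves are simply counted as code units.
                    counts[(buffer[i] & 0xFF) | ((buffer[i + 1] & 0xFF) << 8)]++;
                }
                if (i < read) {
                    carry = buffer[i] & 0xFF;
                }
            }
        }
        int best = 0;
        for (int c = 1; c < counts.length; c++) {   // the 0.5 MB scan from step 1c
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;                                // the most frequent UTF-16 code unit
    }
}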
This is not an answer but just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains about 550 billion symbols (UTF-16 code units), and because nothing is said about their expected frequency, you would need a count array of at least 64-bit integers (that is, 8 bytes per element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes, or just 512 KiB, and not 4 GiB as you have miscalculated. It would only take ≈4.3 ms to send this data over a 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but they are not widely supported). Also, this array size fits perfectly in the CPU cache.
You have grossly overestimated the time it would take to process the data and extract the frequencies, and you have also overlooked the fact that it can overlap the disk reads. In fact, updating the frequency array, which resides in the CPU cache, is so fast that the computation time is negligible, as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive, and hence you would still need the full 5.8 hrs to read the data in the single-machine case.
In fact, this is an example of the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write speed.
You only need to send 0.8 TB if your source machine is part of the 5.
It may not even make sense to send the data to the other machines. Consider this:
For the source machine to send the data, it must first hit the disk to read the data into main memory before sending it over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.
So, under the assumption that loading data into the CPU cache is much less expensive than disk-to-memory or network transfer (which is true, unless you are dealing with alien hardware), you are better off just doing it on the source machine; splitting up the task only makes sense if the "file" is somehow created/populated in a distributed way to start with.
So you should only count the disk read time of a 1 TB file, plus a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern is optimal since it is sequential, so you only take one cache miss per piece of data.
The main point here is that the disk is the primary bottleneck, and it overshadows everything else.
