Redis: How can I check how much memory used in real time? - memory-management

I want check how much memory used in real time, for instance, each time I set or insert some data, I want to know how much memory increased and how much used totally.
I try to use INFO command, and check the used_memory or used_memory_* property would work or not, but sorry I found it only show the memory allotted by the system, cuz each time I check it after I insert new data, they still stay the same
Is there any way I can check the real time memory used in Redis?

The used_memory field is what you are looking for. It is not the memory allocated by the system as you said, this is the memory given by the process memory allocator to Redis.
Example:
> info memory
...
used_memory:541368
...
> set y "titi"
OK
> info memory
...
used_memory:541448 # i.e. +80 bytes
...
> del y
(integer) 1
> info memory
...
used_memory:541368
...
Please note that Redis does a number of memory related optimizations. For instance, it is able to factorize values containing small integers. Or, if you append data to an existing string, the corresponding buffer will not grow at each append operation. So depending on these optimizations, the memory usage increase/decrease for a given set of operation is not always consistent.

Related

What's the maximum number of elements a scheme list can have?

I'm working with Chicken Scheme, I wonder how many elements a list can have.
There is no hard limit – it can have as many as there's room for in memory.
The documentation, under the option -:hmNUMBER, it mentions that there is a default max limit of 2GB heap size, which gives you about 45 million pairs. You can increase these with several options, but the simplest for a set default memory limit is -heap-size. Here is how to double the default:
csc -heap-size 4000M <file>
It says under the documentation for -heap-size that it only uses half of the allocated memory at every given time. It might be using the lonely hearts garbage collection algorithm where when memory is full it moves used memory to the unused segment making the old segment the unused one.

TensorFlow object detection limiting memory and cpu usage

I manage to run tensorflow pet example from the tutorial. I decided to use the slowest model (because I want to use for my own data). However, when I start the training it gets killed after running a bit. It used all my cpus (4) and all my memory 8GB. Do you know anyway I can limit the number of CPUs to 2 and limit the amount of memory used ? If I reduce the batch size ? My batch size is already 1.
I managed to run by reducing the resize:
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 300
max_dimension: 612
}
Thanks in advance.
Another idea to reduce memory usage is to reduce the queue sizes for input data. Specifically, in the object_detection/protos/train.proto file, you will see entries for batch_queue_capacity and prefetch_queue_capacity --- consider setting these fields explicitly in your config file to smaller numbers.

Is it okay to use dictionary memory without 'allot'?

I am doing a programming exercise where I'm trying to do the same thing in different ways. (I happen to be adding two 3 element vectors together in Forth). In one of my revisions I used the return stack to store temporary values (so I am using that feature), but in addition to that I am considering using un-allocated memory as temporary storage.
I created two words to access this memory:
: front! here + ! ;
: front# here + # ;
I tried it in my experiment, and it seemed to work for what I was doing. I don't have any intention to use this memory after my routines are done. And I am living in dictionary, of which memory has already been given to the program.
But, my gut still tells me that this is a bad thing to do. Is this such a bad thing?
If it matters, I'm using Gforth.
Language-lawyer strictly speaking, no. ANS Forth 3.3.3.2 states:
A program may perform address arithmetic within contiguously allocated regions.
You are performing address arithmetic outside any allocated region.
However, it might be perfectly fine in some particular implementation. Such as gforth.
Note that there is a word called PAD, which returns an address to a temporary memory region.
It's okay if you know what you are doing, bud PAD is a better place than HERE to do it. There is also the alternative ALLOCATE and FREE:
ALLOCATE ( u -- a-addr ior )
Allocate u address units of contiguous data space. The data-space
pointer is unaffected by this operation. The initial content of the
allocated space is undefined.
If the allocation succeeds, a-addr is the aligned starting address of
the allocated space and ior is zero.
If the operation fails, a-addr does not represent a valid address and
ior is the implementation-defined I/O result code.
FREE ( a-addr -- ior )
Return the contiguous region of data space indicated by a-addr to the
system for later allocation. a-addr shall indicate a region of data
space that was previously obtained by ALLOCATE or RESIZE. The
data-space pointer is unaffected by this operation.
If the operation succeeds, ior is zero. If the operation fails, ior is
the implementation-defined I/O result code. American National Standard for Information Systems
: front! here + ! ;
What's the stack diagram? I guess ( n offset_in_cells -- )?

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() noticing the speed drops dramatically at i*4KB. The result is as follow: the Y-axis is the speed(MB/second) and the X-axis is the size of buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the part of 1KB-150KB and 1KB-32KB.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 # 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Since the performance degradation of these two cases are caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of a cache line.
Here is my code:
void memcpy_speed(unsigned long buf_size, unsigned long iters){
struct timeval start, end;
unsigned char * pbuff_1;
unsigned char * pbuff_2;
pbuff_1 = malloc(buf_size);
pbuff_2 = malloc(buf_size);
gettimeofday(&start, NULL);
for(int i = 0; i < iters; ++i){
memcpy(pbuff_2, pbuff_1, buf_size);
}
gettimeofday(&end, NULL);
printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
free(pbuff_1);
free(pbuff_2);
}
UPDATE
Considering suggestions from #usr, #ChrisW and #Leeor, I redid the test more precisely and the graph below shows the results. The buffer size is from 26KB to 38KB, and I tested it every other 64B(26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 second. The interesting thing is the drop not only occurs exactly in 4KB boundary, but also comes out in 4*i+2 KB, with a much less falling amplitude.
PS
#Leeor offered a way to fill the drop, adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS would just build the pages to the pagemap, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stores into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up) so you'll see the cache sizes reflect on your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iteration (just make sure you don't time that)
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, i'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet, they can all coexist without overflowing the associativity of the cache. However, everytime you perform an iteration you have a load and a store accessing the same set - i'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iteration to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set, I'm pretty sure there's are a bunch of collisions hiding there.
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
Usually when I want a performance test like yours I code it as:
// Run in once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat test 3 times to ensure that results are consistent
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion what you are seeing is the effect of hardware prefetcher not willing to cross page boundary in order not to cause (potentially unnecessary) page faults.

Redis 10x more memory usage than data

I am trying to store a wordlist in redis. The performance is great.
My approach is of making a set called "words" and adding each new word via 'sadd'.
When adding a file thats 15.9 MB and contains about a million words, the redis-server process consumes 160 MB of ram. How come I am using 10x the memory, is there any better way of approaching this problem?
Well this is expected of any efficient data storage: the words have to be indexed in memory in a dynamic data structure of cells linked by pointers. Size of the structure metadata, pointers and memory allocator internal fragmentation is the reason why the data take much more memory than a corresponding flat file.
A Redis set is implemented as a hash table. This includes:
an array of pointers growing geometrically (powers of two)
a second array may be required when incremental rehashing is active
single-linked list cells representing the entries in the hash table (3 pointers, 24 bytes per entry)
Redis object wrappers (one per value) (16 bytes per entry)
actual data themselves (each of them prefixed by 8 bytes for size and capacity)
All the above sizes are given for the 64 bits implementation. Accounting for the memory allocator overhead, it results in Redis taking at least 64 bytes per set item (on top of the data) for a recent version of Redis using the jemalloc allocator (>= 2.4)
Redis provides memory optimizations for some data types, but they do not cover sets of strings. If you really need to optimize memory consumption of sets, there are tricks you can use though. I would not do this for just 160 MB of RAM, but should you have larger data, here is what you can do.
If you do not need the union, intersection, difference capabilities of sets, then you may store your words in hash objects. The benefit is hash objects can be optimized automatically by Redis using zipmap if they are small enough. The zipmap mechanism has been replaced by ziplist in Redis >= 2.6, but the idea is the same: using a serialized data structure which can fit in the CPU caches to get both performance and a compact memory footprint.
To guarantee the hash objects are small enough, the data could be distributed according to some hashing mechanism. Assuming you need to store 1M items, adding a word could be implemented in the following way:
hash it modulo 10000 (done on client side)
HMSET words:[hashnum] [word] 1
Instead of storing:
words => set{ hi, hello, greetings, howdy, bonjour, salut, ... }
you can store:
words:H1 => map{ hi:1, greetings:1, bonjour:1, ... }
words:H2 => map{ hello:1, howdy:1, salut:1, ... }
...
To retrieve or check the existence of a word, it is the same (hash it and use HGET or HEXISTS).
With this strategy, significant memory saving can be done provided the modulo of the hash is
chosen according to the zipmap configuration (or ziplist for Redis >= 2.6):
# Hashes are encoded in a special way (much more memory efficient) when they
# have at max a given number of elements, and the biggest element does not
# exceed a given threshold. You can configure this limits with the following
# configuration directives.
hash-max-zipmap-entries 512
hash-max-zipmap-value 64
Beware: the name of these parameters have changed with Redis >= 2.6.
Here, modulo 10000 for 1M items means 100 items per hash objects, which will guarantee that all of them are stored as zipmaps/ziplists.
As for my experiments, It is better to store your data inside a hash table/dictionary . the best ever case I reached after a lot of benchmarking is to store inside your hashtable data entries that are not exceeding 500 keys.
I tried standard string set/get, for 1 million keys/values, the size was 79 MB. It is very huge in case if you have big numbers like 100 millions which will use around 8 GB.
I tried hashes to store the same data, for the same million keys/values, the size was increasingly small 16 MB.
Have a try in case if anybody needs the benchmarking code, drop me a mail
Did you try persisting the database (BGSAVE for example), shutting the server down and getting it back up? Due to fragmentation behavior, when it comes back up and populates its data from the saved RDB file, it might take less memory.
Also: What version of Redis to you work with? Have a look at this blog post - it says that fragmentation has partially solved as of version 2.4.

Resources