Assuming all keys in a Redis instance have an expire set, volatile-lru and allkeys-lru behave similarly. But is there a significant performance difference between the two when a key is evicted?
Bonus question:
Between two distinct instances configured with the allkeys-lru policy, having the same content and the same configuration, except:
Instance A has an expire set on all of its keys (with varying expire values)
Instance B has no keys with an expire set
Aside from the memory overhead in instance A due to the expires, is there a performance difference between the two when a key is evicted by the allkeys-lru algorithm?
In both cases, I'm talking about Redis 2.4.x instances on 64-bit Linux with maxmemory = 3 GB and 4,000-5,000 keys by the time maxmemory is reached (most of the keys are hashes).
Thanks
redis.c, line 2311, unstable branch:
/* volatile-lru and allkeys-lru policy */
else if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
         server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
{
    for (k = 0; k < server.maxmemory_samples; k++) {
        sds thiskey;
        long thisval;
        robj *o;

        de = dictGetRandomKey(dict);
        thiskey = dictGetKey(de);
        /* When policy is volatile-lru we need an additional lookup
         * to locate the real key, as dict is set to db->expires. */
        if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
            de = dictFind(db->dict, thiskey);
        o = dictGetVal(de);
        thisval = estimateObjectIdleTime(o);

        /* Higher idle time is better candidate for deletion */
        if (bestkey == NULL || thisval > bestval) {
            bestkey = thiskey;
            bestval = thisval;
        }
    }
}
It seems like, all things being equal, allkeys-lru would strictly speaking be faster, but not by a significant margin: the only extra work under volatile-lru is the additional dictFind() lookup per sampled key shown above. Chances are we are talking about no more than a fraction of a microsecond per eviction.
The second question has pretty much been answered already, but just in case: it looks like it makes no difference to allkeys-lru how many keys are set to expire, or whether any are. Both instance A and instance B in your example would see the same performance when a key is purged by the LRU algorithm.
The code below "works", but I am confused about the meaning of the values passed to Stores.persistentWindowStore(). I found the documentation (https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/state/Stores.html#persistentWindowStore-java.lang.String-long-int-long-boolean-) but the definition of the arguments is not clear to me.
Should the windowedBy() value always match windowSize in persistentWindowStore()?
What should the retention period be set to? The retention policy of the source topic?
What does the number of segments do?
What is retaining duplicates for? The documentation seems to indicate setting it to true for joins?
long windowSize = TimeUnit.MINUTES.toMillis(15);
long retentionPeriod = windowSize * 4 * 6; // 6 hours
int numSegments = 2;
boolean retainDuplicates = false;

bdrStream.groupByKey().windowedBy(TimeWindows.of(windowSize))
    .aggregate(() -> Lists.newArrayList(),
        (aggKey, newValue, aggValue) -> {
            BdrData d = new BdrData();
            d.setCharge(newValue.getBdr().getCost());
            aggValue.add(d);
            return aggValue;
        },
        Materialized.<String, ArrayList<BdrData>>as(
                Stores.persistentWindowStore("store5",
                    retentionPeriod,
                    numSegments,
                    windowSize,
                    retainDuplicates))
            .withKeySerde(Serdes.String())
            .withValueSerde(listBdrDataSerde))
    .toStream()
    .process(() -> new WindowAggregatorProcessor());
Should the windowedBy() value always match windowSize in persistentWindowStore()?
Yes.
What should the retention period be set to? The retention policy of the source topic?
It should match the retention period of the windows, which you can specify via Windows#until() (the default is one day).
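A minimal sketch of that alignment, using the variables from your code (the wrapper class is mine; until() is the Windows method mentioned above):

import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.kstream.TimeWindows;

class WindowConfig {
    static TimeWindows windows() {
        long windowSize = TimeUnit.MINUTES.toMillis(15);
        long retentionPeriod = windowSize * 4 * 6; // 6 hours, as in the question
        // until() controls how long old windows are retained; pass the same
        // value as retentionPeriod to Stores.persistentWindowStore().
        return TimeWindows.of(windowSize).until(retentionPeriod);
    }
}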
What does the number of segments do?
The number of segments determines how coarsely or finely old windows are expired. The segment size is "retention-period / (#segments + 1)"; with the values from your code, that is 6 hours / (2 + 1) = 2 hours per segment. Note that more segments give you finer-grained data expiration but increase overhead (each segment uses its own RocksDB instance).
What is retaining duplicates for? The documentation seems to indicate setting it to true for joins?
By default, keys must be unique. If you enable retain duplicates, you can store the same key multiple times. Enabling duplicates comes with a performance hit.
Note:
This part of the API was reworked and simplified in the upcoming 2.1 release. See KIP-319 and KIP-328 for details.
A couple months ago, we were using UUIDs to generate random string IDs that needed to be unique across the board. I then changed the algorithm in order to save some data and index space in our database. I tested a few ways to generate unique string IDs, and I decided to use this function:
function generateToken($length) {
    $characters = '0123456789abcdefghijklmnopqrstuvwxyz';
    $max = strlen($characters) - 1;
    $token = '';
    for ($i = 0; $i < $length; $i++) {
        $token .= $characters[mt_rand(0, $max)];
    }
    return $token;
}
I'm using this function to generate IDs that are 20 characters long using digits and lowercase letters; you could say these IDs are numbers in base 36. The probability of any two IDs colliding should be 1/36^20, but due to the birthday paradox, a collision can be expected after about 36^10 records - that's 3.6 quadrillion records. Yet just a few hours ago a collision occurred, when there were only 5.3 million records in the database. Am I extremely unlucky, or is my ID-generating function flawed with respect to randomness? I know mt_rand() isn't truly random, but it's random enough, isn't it?
I would've written a loop that checks if the generated ID is unique and generates a new one if it isn't, but I thought that the chance of getting a collision was so small that the performance cost of such a loop wasn't worth it. I will now include such a loop in the code, but I'm still interested in perfecting the ID generation function if it is indeed flawed.
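For what it's worth, a rough birthday-paradox check with the numbers above (n = 5.3 million IDs drawn uniformly from N = 36^20 possibilities) puts the collision probability at

p \approx 1 - e^{-n^2/(2N)} \approx \frac{n^2}{2N} = \frac{(5.3 \times 10^6)^2}{2 \cdot 36^{20}} \approx 1 \times 10^{-18}

so if the IDs were truly uniform, a collision at 5.3 million records would be essentially impossible.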
The implementation of mt_rand() in PHP is rather fluid, so it may differ from one version to the next. However, here are some excerpts from the code used in PHP version 5:
php_rand.h:
/* MT Rand */
#define PHP_MT_RAND_MAX ((long) (0x7FFFFFFF)) /* (1<<31) - 1 */
#ifdef PHP_WIN32
#define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * GetCurrentProcessId())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
#else
#define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * getpid())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
#endif
PHPAPI void php_srand(long seed TSRMLS_DC);
PHPAPI long php_rand(TSRMLS_D);
PHPAPI void php_mt_srand(php_uint32 seed TSRMLS_DC);
PHPAPI php_uint32 php_mt_rand(TSRMLS_D);
rand.c:
PHP_FUNCTION(mt_rand)
{
    long min;
    long max;
    long number;
    int argc = ZEND_NUM_ARGS();

    if (argc != 0) {
        if (zend_parse_parameters(argc TSRMLS_CC, "ll", &min, &max) == FAILURE) {
            return;
        } else if (max < min) {
            php_error_docref(NULL TSRMLS_CC, E_WARNING, "max(%ld) is smaller than min(%ld)", max, min);
            RETURN_FALSE;
        }
    }

    if (!BG(mt_rand_is_seeded)) {
        php_mt_srand(GENERATE_SEED() TSRMLS_CC);
    }
From the last three lines above, you can see that mt_rand() is automatically seeded the first time it is called. However, the php_mt_srand() function takes an argument of type php_uint32. This means there are only 2^32 possible seeded states for mt_rand(). So if your script runs roughly 2^16 times, it is quite likely that two runs will seed mt_rand() identically and produce the exact same sequence of random numbers.
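That is the birthday bound again, applied to the seed space: with N = 2^32 equally likely seeds, the probability that n script runs include a repeated seed is approximately

p(n) \approx 1 - e^{-n^2/(2N)}

which crosses 1/2 at n \approx 1.18\sqrt{N} = 1.18 \cdot 2^{16} \approx 77{,}000 runs.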
As suggested by rossum, it would be a much better idea to apply AES encryption to an incrementing 128-bit value. If you base64-encode the encrypted results and discard the trailing ==, then the resulting strings will only be 22 characters long.
Addendum
I left the following script running while I was out this afternoon:
for i in $(seq 1 100000) ; do
    php -r 'for ($n=0; $n<32; $n++) echo chr(mt_rand(97,122)); echo chr(10);' >>out
done &
As expected, the first collision occurred after about 2^16 iterations (which is nowhere near 26^16):
$ sort <out | uniq -d
vnexqclzkaluntglgadgwzjnjfsvqfhp
$ grep -n vnexqclzkaluntglgadgwzjnjfsvqfhp out
34417:vnexqclzkaluntglgadgwzjnjfsvqfhp
52159:vnexqclzkaluntglgadgwzjnjfsvqfhp
If you want guaranteed-unique 16-byte IDs then I would use encryption. AES uses 16-byte (128-bit) blocks, and as long as the inputs are unique, the outputs are guaranteed unique too.
Set up AES in ECB mode (which is simpler and faster) and encrypt the numbers 0, 1, 2, 3, 4, ... Your inputs are unique, so the outputs will be unique as well.
Crypto sites will tell you that ECB mode has security problems, but those problems only arise when inputs repeat. For generating unique 'random' IDs, as you require, every input is distinct, so they do not apply.
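A minimal Java sketch of this idea (the class and method names are mine; any fixed 16-byte key works, but it must never change once IDs have been issued, and the counter must be persisted so it is never reused):

import java.nio.ByteBuffer;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class UniqueIdGenerator {
    private final Cipher cipher;
    private long counter = 0; // must be persisted and never reused

    public UniqueIdGenerator(byte[] key16) throws Exception {
        // ECB is acceptable here precisely because every plaintext block is unique.
        cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key16, "AES"));
    }

    public synchronized String next() throws Exception {
        // One 128-bit block: high 8 bytes zero, low 8 bytes the counter.
        byte[] block = ByteBuffer.allocate(16).putLong(8, counter++).array();
        byte[] enc = cipher.doFinal(block); // unique input -> unique 16-byte output
        // 16 bytes Base64-encode to 24 chars ending in "=="; drop the padding.
        return Base64.getEncoder().encodeToString(enc).substring(0, 22);
    }
}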
I want to create a random string of a fixed length (8 chars in my use case), and the generated string has to be case-sensitive and unique against a blacklist. I know this sounds like a UUID, but I have a specific requirement that prevents me from utilizing them:
some characters are disallowed, i.e. I, l and 1 are lookalikes, as are O and 0
My initial implementation is solid and solves the task, but it performs poorly. And by poorly I mean it is doomed to get slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
    $chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
    $uuid = null;
    while (true) {
        $uuid = substr(str_shuffle($chars), 0, 8);
        if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first()) {
            break;
        }
    }
    return $uuid;
}
Please spare me the critique; we live in an agile world, and this implementation is functional and was quick to code.
With a small data set it works beautifully. However, if I have 10 million entries in the blacklist and try to create 1,000 more, it falls flat, taking 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community could point me in the right direction.
Best,
Nikola
Two options:
Just use a hash of something unique, and truncate so it fits in the bandwidth of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(h)
Generate five of the digits from an incrementing counter and three randomly. A thief will have a worse than 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
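A hypothetical Java sketch of the second option, reusing the lookalike-free alphabet from the question (the class is mine, and the counter is assumed to come from the database's incrementing sequence):

import java.security.SecureRandom;

public class VoucherCodes {
    // The question's alphabet, with lookalike characters removed.
    private static final String ALPHABET =
        "ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789";
    private static final SecureRandom RNG = new SecureRandom();

    static String next(long counter) {
        StringBuilder sb = new StringBuilder(8);
        // First five characters: the counter encoded in base-N (N = alphabet size).
        for (int i = 0; i < 5; i++) {
            sb.append(ALPHABET.charAt((int) (counter % ALPHABET.length())));
            counter /= ALPHABET.length();
        }
        // Last three characters: random, so codes can't be predicted from the counter.
        for (int i = 0; i < 3; i++) {
            sb.append(ALPHABET.charAt(RNG.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}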
I ended up modifying the approach: instead of checking for uuid existence on every loop, e.g. 50K DB checks, I now split the generated codes into multiple chunks of 1000 codes and issue an INSERT IGNORE batch query within a transaction.
If the affected rows are as many as the items (1,000 in this case), I know there wasn't a collision and I can commit the transaction. Otherwise I need to roll back the chunk and generate another 1,000 codes.
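In case it helps, here is a hypothetical JDBC sketch of that pattern (MySQL's INSERT IGNORE; the codes table and uuid column come from the question; note that some drivers report Statement.SUCCESS_NO_INFO instead of exact counts when batch rewriting is enabled):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class CodeInserter {
    /** Returns true if the whole chunk was inserted without collisions. */
    static boolean insertChunk(Connection conn, List<String> chunk) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT IGNORE INTO codes (uuid) VALUES (?)")) {
            for (String code : chunk) {
                ps.setString(1, code);
                ps.addBatch();
            }
            int inserted = 0;
            for (int n : ps.executeBatch()) {
                if (n > 0) inserted += n; // ignored duplicates report 0 rows
            }
            if (inserted == chunk.size()) {
                conn.commit();   // no collisions in this chunk
                return true;
            }
            conn.rollback();     // at least one collision: regenerate and retry
            return false;
        }
    }
}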
I am trying to find a good solution for this question -
Implement two functions that assign/release unique IDs from a pool. Memory usage should be minimized, and assign/release should be fast, even under high contention.
alloc() returns available ID
release(id) releases previously assigned ID
The first thought was to maintain a map of IDs and availability (as a boolean). Something like this:
Map<Integer, Boolean> availabilityMap = new HashMap<>();

public Integer alloc() {
    for (Map.Entry<Integer, Boolean> es : availabilityMap.entrySet()) {
        if (es.getValue() == false) {        // false means the ID is free
            Integer key = es.getKey();
            availabilityMap.put(key, true);  // mark it as allocated
            return key;
        }
    }
    return null; // no free ID
}

public void release(Integer id) {
    availabilityMap.put(id, false);
}
However, this is not ideal for multiple threads, nor does it satisfy "Memory usage should be minimized and the assign/release should be fast, even under high contention."
What would be a good way to optimize both memory usage and speed?
For memory usage, I think the map should be replaced with some other data structure, but I am not sure which one. Something like a bitmap or bit set? How can I maintain ID and availability in that case?
For concurrency I will have to use locks, but I am not sure how to handle contention effectively. Maybe put available IDs in separate chunks so that each of them can be accessed independently? Any good suggestions?
First of all, you do not want to run over the entire map in order to find an available ID.
So you can maintain two sets of IDs: the first for available IDs, and the second for allocated IDs.
That makes allocation and release pretty easy and fast.
You can also use concurrent sets (backed by ConcurrentMap) for both containers, which will reduce the contention; a sketch follows.
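A minimal sketch of that layout, assuming sets backed by ConcurrentHashMap (the class name and pre-seeded ID range are mine):

import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TwoSetIdPool {
    private final Set<Integer> available = ConcurrentHashMap.newKeySet();
    private final Set<Integer> allocated = ConcurrentHashMap.newKeySet();

    public TwoSetIdPool(int poolSize) {
        for (int id = 1; id <= poolSize; id++) available.add(id);
    }

    public Integer alloc() {
        Iterator<Integer> it = available.iterator();
        while (it.hasNext()) {
            Integer id = it.next();
            // remove() is atomic, so only one thread wins any given ID.
            if (available.remove(id)) {
                allocated.add(id);
                return id;
            }
        }
        return null; // pool exhausted
    }

    public void release(Integer id) {
        // Only give back IDs that were actually handed out.
        if (allocated.remove(id)) available.add(id);
    }
}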
Edit: Changed bottom sentinel, fixed a bug
First, don't iterate the entire map to find an available ID. You should only need constant time to do it.
Here is what you could do to make it fast:
Create an int index = 1; for your counter. This is technically the number of IDs generated + 1, and is always > 0.
Create an ArrayDeque<Integer> free = new ArrayDeque<>(); to hold the free IDs. It guarantees constant-time access.
When you allocate an ID, if the free ID queue is empty, you can just return the counter and increment it (i.e. return index++;). Otherwise, grab its head and return that.
When you release an ID, push the previously used ID to the free deque.
Remember to synchronize your methods.
This guarantees O(1) allocation and release, and it also keeps allocation quite low (literally once per free). Although it's synchronized, it's fast enough that it shouldn't be a problem.
An implementation might look like this:
import java.util.ArrayDeque;

public class IDPool {
    int index = 1;
    ArrayDeque<Integer> free = new ArrayDeque<>();

    public synchronized int acquire() {
        if (free.isEmpty()) return index++;
        else return free.pop();
    }

    public synchronized void release(int id) {
        free.push(id);
    }
}
Additionally, if you want to ensure the free ID list is unique (as you should for anything important) as well as persistent, you can do the following:
Use a HashMap<Integer, Integer> mapping each ID to the previous free ID, holding all generated IDs. Remember, it doesn't need to be ordered or even iterated.
This is technically going to be a stack encoded inside a hash map.
Highly efficient implementations of this exist.
In reality, any unordered int -> int map will do here.
Track the top ID of the free ID stack. Remember that the value 1 can represent "nothing" and 0 "used", so you don't have to box anything. (Issued IDs are always ≥ 2.) Initially, this would just be int top = 1;
When allocating an ID, if there are free IDs (i.e. top >= 2), do the following:
Set the new top to the old top's value in the free map.
Set the old top's value in the map to 0, marking it used.
Return the old top.
When releasing an old ID, do this instead:
If the ID is already free, return early so we don't corrupt the stack.
Set the ID's value in the map to the old top.
Set the new top to the ID, since it's always the last one to use.
The optimized implementation would end up looking like this:
import java.util.HashMap;

public class IDPool {
    int index = 2;                 // next never-used ID; IDs start at 2
    int top = 1;                   // top of the free stack; 1 means "empty"
    HashMap<Integer, Integer> pool = new HashMap<>();

    public synchronized int acquire() {
        int id = top;
        if (id == 1) return index++;    // free stack empty: mint a new ID
        top = pool.replace(id, 0);      // pop: new top = previous link; mark id used
        return id;
    }

    public synchronized void release(int id) {
        if (pool.getOrDefault(id, 0) != 0) return; // already free: avoid corruption
        pool.put(id, top);              // push: link to the old top
        top = id;
    }
}
If need be, you could use a growable integer array instead of the hash map (it's always contiguous), and realize significant performance gains. Matter of fact, that is how I'd likely implement it. It'd just require a minor amount of bit twiddling to do so, because I'd maintain the array's size to be rounded up to the next power of 2.
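A hypothetical sketch of that array-backed variant, with the capacity rounded up to the next power of two (same acquire/release contract as above):

import java.util.Arrays;

public class ArrayIDPool {
    int index = 2;            // next never-used ID; IDs start at 2
    int top = 1;              // top of the free stack; 1 means "empty"
    int[] pool = new int[16]; // pool[id] = previous free ID, or 0 while in use

    public synchronized int acquire() {
        int id = top;
        if (id == 1) {
            id = index++;
            if (id >= pool.length) {
                // Round the capacity up to the next power of two.
                pool = Arrays.copyOf(pool, Integer.highestOneBit(id) << 1);
            }
            return id;
        }
        top = pool[id];  // pop: new top is the previous link
        pool[id] = 0;    // mark as in use
        return id;
    }

    public synchronized void release(int id) {
        if (id < 2 || id >= index || pool[id] != 0) return; // unknown or already free
        pool[id] = top;  // push: link to the old top
        top = id;
    }
}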
Yeah... I had to write a similar pool in JavaScript, because I needed moderately fast IDs in Node.js for potentially high-frequency, long-lived IPC communication.
The good thing about this is that it generally avoids allocations (worst case being once per acquired ID when none are released), and it's very amenable to later optimization where necessary.
I would like to get the running maximum by writing Stata code.
I think I am quite close:
gen ctrhigh`iv' = max(ctr, L1.ctr, L2.ctr, L3.ctr, ..., L`iv'.ctr)
As you can see, my data are time series and `iv' represents the window (e.g. 5, 10 or 200 days)
The only problem is that you cannot pass a varlist or string containing numbers to max. E.g. the following is not possible:
local ivs 5 10 50 100 200
foreach iv in `ivs' {
    local vals
    local i = 1
    while (`i' <= `iv') {
        local vals "`vals' `i'"
        local ++i
    }
    gen ctrhigh`iv' = max(varlist vals) // not possible
}
How would I achieve this instead?
Example of quickly computing a running standard deviation
* standard deviation of ctr, see http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods *
gen ctr_sq = ctr^2
by tid: gen ctr_cum = sum(ctr) if !missing(ctr)
by tid: gen ctr_sq_cum = sum(ctr_sq) if !missing(ctr_sq)

foreach iv in $ivs {
    if `iv' == 1 continue
    by tid: gen ctr_sum = ctr_cum - L`iv'.ctr_cum if !missing(ctr_cum) & !missing(L`iv'.ctr_cum)
    by tid: gen ctr_sq_sum = ctr_sq_cum - L`iv'.ctr_sq_cum if !missing(ctr_sq_cum) & !missing(L`iv'.ctr_sq_cum)
    by tid: gen ctrsd`iv' = sqrt((`iv' * ctr_sq_sum - ctr_sum^2) / (`iv'*(`iv'-1))) if !missing(ctr_sq_sum) & !missing(ctr_sum)
    label variable ctrsd`iv' "Rolling std dev of close ticker rank by `iv' days."
    drop ctr_sum ctr_sq_sum
}

drop ctr_sq ctr_cum ctr_sq_cum
Note: this is not an exact sd, it's an approximation. I realize that this is very different from a maximum, but it may serve as an illustration of how to deal with large-data computations.
Your example is time series data and implies that you have tsset the data. You don't say whether you also have panel or longitudinal structure. I will assume the worst and assume the latter as it doesn't make the code much worse. So, suppose tsset id date. In fact, that's irrelevant to the code here except to make explicit my assumption that id is an identifier and date a time variable.
An unattractive way to do this is to loop over observations. Suppose window is set to 42.
local window = 42
gen max = .
tsset id date

quietly forval i = 1/`=_N' {
    su ctr if inrange(date, date[`i'] - `window', date[`i']) & id == id[`i'], meanonly
    replace max = r(max) in `i'
}
So, in words as well: summarize the values of ctr whose date falls within the window and which are in the same panel (same id), and put the maximum in the current observation.
The meanonly option is not well named. It calculates some other quantities besides the mean, and the maximum is one. But you do want the meanonly option to make summarize go as fast as possible.
See my 2007 paper on events in intervals, freely available at http://www.stata-journal.com/sjpdf.html?articlenum=pr0033
I say unattractive, but this approach does have the advantage that it is easy to work with once you understand it.
I am not setting up an expression with many arguments to max(). You gave 200 as an example, and nothing says you might not ask for more; as far as I can see there may be no upper limit on window length, but there will be a limit on how complicated that expression can be.
If I think of a better way to do it, I'll post it. Or someone else will....
It seems like I can pass a string of arguments to max, like so:
* OPTION 1: compute running max by days *
foreach iv in $ivs {
    * does not make sense for less than two days *
    if `iv' < 2 continue
    di "computing running max for ctr interval `iv'"
    * set high for this amount of days *
    local vars "ctr"
    forval i = 1 / `iv' {
        local vars "`vars', L`i'.ctr"
    }
    by tid: gen ctrh`iv' = max(`vars')
}

* OPTION 2: compute running max by days, ensuring that the entire range is nonmissing *
foreach iv in $ivs {
    * does not make sense for less than two days *
    if `iv' < 2 continue
    di "computing running max for ctr interval `iv'"
    * set high for this amount of days *
    local vars "ctr"
    local condition "!missing(ctr)"
    forval i = 1 / `iv' {
        local vars "`vars', L`i'.ctr"
        local condition "`condition' & !missing(L`i'.ctr)"
    }
    by tid: gen ctrh`iv' = max(`vars') if `condition'
}
This computes very quickly and does exactly what I need.
However, if you need an arbitrarily large window I think you should resort to Nick's answer.