Opposite of Bloom filter? - data-structures

I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I already ran if I can avoid it efficiently.
So, I'm thinking about using a Bloom filter to store the tests that have already been run. However, the Bloom filter errs on the unsafe side for me. It gives false positives. That is, it may report that I've run a test which I haven't. Although this could be acceptable in the scenario I'm working on, I was wondering if there's an equivalent to a Bloom filter, but erring on the opposite side, that is, only giving false negatives.
I've skimmed through the literature without any luck.

Yes, a lossy hash table or an LRU cache is a data structure with fast O(1) lookup that will only give false negatives -- if you ask "Have I run test X?", it will tell you either "Yes, you definitely have" or "I can't remember".
Forgive the extremely crude pseudocode:
setup_test_table():
    create test_table( some large number of entries )
    clear each entry( test_table, NEVER )
    return test_table

has_test_been_run_before( new_test_details, test_table ):
    index = hash( new_test_details, test_table.length )
    old_details = test_table[index].details
    // unconditionally overwrite old details with new details, LRU fashion.
    // perhaps some other collision resolution technique might be better.
    test_table[index].details = new_test_details
    if ( old_details == new_test_details ) return YES
    else if ( old_details == NEVER ) return NEVER
    else return PERHAPS

main():
    test_table = setup_test_table()
    loop
        test_details = generate_random_test()
        status = has_test_been_run_before( test_details, test_table )
        case status of
            YES:     do nothing
            NEVER:   run test( test_details )
            PERHAPS: if ( rand() & 1 ) run test( test_details )
    next loop
end.
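For what it's worth, here is a minimal runnable sketch of the same idea (Python here, assuming a fixed table size and the built-in hash(); the class and names are illustrative). It folds NEVER and PERHAPS into a single "can't remember" answer:

    # A lossy, fixed-size "seen set": it can only give false negatives.
    class SeenTable:
        def __init__(self, size=1_000_003):
            self.slots = [None] * size          # None plays the role of NEVER

        def seen_before(self, test):
            # Returns True ("definitely ran") or False ("can't remember").
            index = hash(test) % len(self.slots)
            old = self.slots[index]
            self.slots[index] = test            # unconditionally overwrite, LRU fashion
            return old == test                  # a collision only ever yields False

    table = SeenTable()
    print(table.seen_before("test-42"))   # False: never stored
    print(table.seen_before("test-42"))   # True: definitely ran before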

The exact data structure that accomplishes this task is a Direct-mapped cache, and is commonly used in CPUs.
function set_member(set, item)
    set[hash(item) % set.length] = item

function is_member(set, item)
    return set[hash(item) % set.length] == item

Is it possible to store the tests that you did not run? This would invert the filter's behavior.

How about an LRUCache?

I think you're leaving out part of the solution; to avoid false positives entirely you will still have to track which tests have run, and essentially use the Bloom filter as a shortcut to determine that a test definitely has not been run.
That said, since you know the number of tests in advance, you can size the filter in such a way as to provide an acceptable error rate using some well-known formulae; for a 1% probability of returning a false positive you need ~9.5 bits/entry, so for one million entries 1.2 megabytes is sufficient. If you reduce the acceptable error rate to 0.1%, this only increases to 1.8 MB.
The Wikipedia article Bloom Filters gives a great analysis, and an interesting overview of the maths involved.
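To sanity-check those numbers: the standard sizing formula gives m/n = -ln(p) / (ln 2)^2 bits per entry, and a small throwaway script (Python here, purely for the arithmetic) reproduces the figures above:

    import math

    def bloom_bits_per_entry(p):
        # Optimal bits per stored item for a target false-positive rate p.
        return -math.log(p) / (math.log(2) ** 2)

    for p in (0.01, 0.001):
        bits = bloom_bits_per_entry(p)
        megabytes = bits * 1_000_000 / 8 / 1_000_000   # one million entries
        print(f"p={p}: {bits:.1f} bits/entry, ~{megabytes:.1f} MB for 1M entries")
    # prints roughly 9.6 bits/entry (~1.2 MB) and 14.4 bits/entry (~1.8 MB)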

Use a bit set, as mentioned above. If you know the number of tests you are going to run beforehand, you will always get correct results (present, not-present) from the data structure.
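A rough sketch of the bit set idea (Python, purely for illustration; it assumes every test can be assigned a small integer id in advance, which is what makes exact answers possible):

    class BitSet:
        def __init__(self, n_tests):
            self.bits = bytearray((n_tests + 7) // 8)

        def mark_run(self, test_id):
            self.bits[test_id >> 3] |= 1 << (test_id & 7)

        def has_run(self, test_id):
            return bool(self.bits[test_id >> 3] & (1 << (test_id & 7)))

    ran = BitSet(1_000_000)    # one million possible tests -> 125 KB, exact answers
    ran.mark_run(1234)
    print(ran.has_run(1234), ran.has_run(99))   # True False, no errors in either direction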
Do you know what keys you will be hashing? If so, you should run an experiment to see the distribution of the keys in the Bloom filter so you can fine-tune it to reduce false positives, or what have you.
You might want to check out HyperLogLog as well.

I'm sorry I'm not much help - I don't think it's possible. If test execution can't be ordered, maybe use a packed format (8 tests per byte!) or a good sparse array library for storing the outcomes in memory.

The data structure you expect does not exist, because such a data structure must be a many-to-one mapping, or in other words, have a limited set of internal states. There must be at least two different inputs mapping to the same internal state, so you can't tell whether both (or more) of them are in the set; you only know that at least one such input was seen.
EDIT: This statement is true only when you are looking for a memory-efficient data structure. If memory is unlimited, you can always get a data structure that gives 100% accurate results, by storing every member item.

No, and if you think about it, it wouldn't be very useful. In your case you couldn't be sure that your test run would ever stop, because if there are always 'false negatives' there will always be tests that need to be run...
I would say you just have to use a hash.

Related

Is there a big performance difference with Cassandra LWTs?

I have a similar question, but the existing ones aren't clear, so I'm asking it myself. It's simple: I want to update a row only when the target row exists; if it doesn't, nothing should happen (no new row should be added). The options I see are:
1. Select first, check the value, then Update.
2. Update directly with IF EXISTS.
For option 2, the official documentation warns of a clear performance issue. All things considered, how serious is that performance issue really? Or are options 1 and 2 similar, with option 2 only falling behind once there is a lot of data? Would it be a big problem to use IF EXISTS?
The first option, where you select and then update, is invalid because there is no guarantee that the data won't change between the time you read it and the time you update it -- the data isn't locked.
Lightweight transactions (LWTs), also known as compare-and-set (CAS) statements, are expensive because they require four round trips to work. The main difference between LWTs and your select-then-update method is that the row is locked for the duration of the "transaction" so no other threads can update its value.
Since you only want to perform an update if the row exists, then you must specify the IF EXISTS condition. Yes, there is a performance penalty in using lightweight transactions but it is unavoidable in your case where you have a conditional update. It isn't negotiable in your scenario. Cheers!
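For reference, such a conditional update could look like the following (a sketch using the DataStax Python driver; the contact point, keyspace, table and column names are made up for illustration):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")      # hypothetical keyspace

    # Lightweight transaction: only applied when the row already exists,
    # and no new row is ever inserted.
    result = session.execute(
        "UPDATE users SET email = %s WHERE user_id = %s IF EXISTS",
        ("new@example.com", 42),
    )
    print(result.was_applied)   # True only if the target row existed and was updated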

What is the most efficient Ruby data structure to track progress?

I am working on a small project which progressively grows a list of links and then processes them through a queue. There exists the likelihood that a link may be entered into the queue twice and I would like to track my progress so I can skip anything that has already been processed. I'm estimating around 10k unique links at most.
For larger projects I would use a database but that seems overkill for the amount of data I am working with and would prefer some form of in-memory solution that can potentially be serialized if I want to save progress across runs.
What data structure would best fit this need?
Update: I am already using a hash to track which links I have completed processing. Is this the most efficient way of doing it?
def process_link(link)
  return if @processed_links[link]
  # ... processing logic
  @processed_links[link] = Time.now # or other state
end
If you aren't concerned about memory, then just use a Hash to check inclusion; insert and lookup times are O(1) average case. Serialization is straightforward (Ruby's Marshal class should take care of that for you, or you could use a format like JSON). Ruby's Set is an array-like object that is backed with a Hash, so you could just use that if you're so inclined.
However, if memory is a concern, then this is a great problem for a Bloom filter! You can achieve set-inclusion testing in constant time, and the filter uses substantially less memory than a hash would. The tradeoff is that Bloom filters are probabilistic - you can get false positives on inclusion. You can make false positives very unlikely with the right Bloom filter parameters, but if duplicates are the exception rather than the rule, you could implement something like:
Check for set inclusion in the Bloom filter [O(1)]
If the bloom filter reports that the entry is found, perform an O(n) check of the input data, to see if this item has been found in the array of input data prior to now.
That would get you very fast and memory-efficient lookups for the common case, and you can choose either to accept the small chance of the filter wrongly flagging a new link as a duplicate (to keep the whole thing small and fast), or to verify set inclusion whenever a duplicate is reported (so the expensive work only happens when you absolutely have to); a sketch of the verified variant follows below.
https://github.com/igrigorik/bloomfilter-rb is a Bloom filter implementation I've used in the past; it works nicely. There are also redis-backed Bloom filters, if you need something that can perform set membership tracking and testing across multiple app instances.
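A sketch of that two-step check, written in Python for brevity (in Ruby, the bloomfilter-rb gem above would play the filter role); the filter size and helper names are illustrative only:

    import hashlib

    class TinyBloom:
        def __init__(self, m_bits=80_000, k=7):
            self.m, self.k = m_bits, k
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, item):
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.k):
                # derive k bit positions from slices of a single digest
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos >> 3] |= 1 << (pos & 7)

        def might_contain(self, item):
            return all(self.bits[pos >> 3] & (1 << (pos & 7))
                       for pos in self._positions(item))

    seen_filter, seen_exact = TinyBloom(), []

    def already_processed(link):
        if not seen_filter.might_contain(link):
            return False                  # definite miss: the cheap, common case
        return link in seen_exact         # possible hit: verify exactly, O(n)

    def mark_processed(link):
        seen_filter.add(link)
        seen_exact.append(link)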
How about a Set, converting your links to value objects (rather than reference objects) such as Structs? By creating a value object, the Set will be able to detect uniqueness. Alternatively, you could use a hash and store links by their primary key.
The data structure could be a hash:
current_status = { links: [link3, link4, link5], processed: [link1, link2, link3] }
To track your progress (in percent):
links_count = current_status[:links].length + current_status[:processed].length
progress = (current_status[:processed].length * 100) / links_count # Will give you percent of progress
To process your links:
push any new link you need to process to current_status[:links].
Use shift to take from current_status[:links] the next link to be processed.
After processing a link, push it to current_status[:processed]
EDIT
As I see it (and understand your question), the logic to process your links would be:
# Add any new link that needs to be processed to the queue unless it has been processed
def add_link_to_queue(link)
  current_status[:links].push(link) unless current_status[:processed].include?(link)
end

# Process the next link on the queue
def process_next_link
  link = current_status[:links].shift # returns the first link in the queue
  # ... logic to process the link
  current_status[:processed].push(link)
end
# The shift method will not only return but also remove the link from the original array, which avoids duplication.

Google Go Lang Assignment Order

Let's look at the following Go code:
package main

import "fmt"

type Vertex struct {
    Lat, Long float64
}

var m map[string]Vertex

func main() {
    m = make(map[string]Vertex)
    m["Bell Labs"] = Vertex{
        40.68433, 74.39967,
    }
    m["test"] = Vertex{
        12.0, 100,
    }
    fmt.Println(m["Bell Labs"])
    fmt.Println(m)
}
It outputs this:
{40.68433 74.39967}
map[Bell Labs:{40.68433 74.39967} test:{12 100}]
However, if I change one minor part of the test vertex declaration, by moving the right "}" 4 spaces, like so:
m["test"] = Vertex{
12.0, 100,
}
.. then the output changes to this:
{40.68433 74.39967}
map[test:{12 100} Bell Labs:{40.68433 74.39967}]
Why the heck does that little modification affect the order of my map?
Map "order" depends on the hash function used. The hash function is randomized to prevent denial of service attacks that use hash collisions. See the issue tracker for details:
http://code.google.com/p/go/issues/detail?id=2630
Map order is not guaranteed according to the specification. Although not done in current Go implementations, a future implementation could do some compacting during GC or some other operation that changes the order of a map without the map being modified by your code. It is unwise to assume a property not defined in the specification.
A map is an unordered group of elements of one type, called the element type, indexed by a set of unique keys of another type, called the key type.
A map is not expected to print its key/element pairs in any fixed order:
See "Go: what determines the iteration order for map keys?"
However, in the newest Go weekly release (and in Go1 which may be expected to be released this month), the iteration order is randomized (it starts at a pseudo-randomly chosen key, and the hashcode computation is seeded with a pseudo-random number).
If you compile your program with the weekly release (and with Go1), the iteration order will be different each time you run your program.
It isn't exactly spelled out like that in the spec though (Ref Map Type):
A map is an unordered group of elements of one type, called the element type, indexed by a set of unique keys of another type, called the key type.
Actually, the specs do spell it out, but in the For statement section:
The iteration order over maps is not specified and is not guaranteed to be the same from one iteration to the next.
If map entries that have not yet been reached are deleted during iteration, the corresponding iteration values will not be produced.
If map entries are inserted during iteration, the behavior is implementation-dependent, but the iteration values for each entry will be produced at most once.
If the map is nil, the number of iterations is 0.
This has been introduced by code review 5285042 in October 2011:
runtime: random offset for map iteration
The go-nuts thread points out:
The reason that "it's there to stop people doing bad stuff" seemed particularly weak.
Avoiding malicious hash collisions makes a lot more sense.
Further the pointer to the code makes it possible in development to revert that behaviour in the case that there is an intermittent bug that's difficult to work through.
To which Patrick Mylund Nielsen replies:
Dan's note was actually the main argument why Python devs were reluctant to adopt hash IV randomization--it broke their unit tests! PHP ultimately chose not to do it at all, and instead limited the size of the http.Request header, and Oracle and others didn't think it was a language problem at all.
Perl saw the problem and applied a fix similar to Go's which was included in Perl 5.8.1 in 2003.
I might be wrong, but I think they were the only ones to actually care then when this paper was presented: "Denial of Service via Algorithmic Complexity Attacks", attacking hash tables.
(Worst-case hash table collisions)
For others, this, which became very popular about a year ago, was a good motivator:
"28c3: Effective Denial of Service attacks against web application platforms (YouTube video, December 2011)", which shows how a common flaw in the implementation of most of the popular web
programming languages and platforms (including PHP, ASP.NET, Java, etc.) can be (ab)used to force web application servers to use 99% of CPU for several minutes to hours for a single HTTP request.
This attack is mostly independent of the underlying web application and just
relies on a common fact of how web application servers typically work..

What's the most efficient way to ignore code in lua?

I have a chunk of Lua code that I'd like to be able to (selectively) ignore. I don't have the option of not reading it in, and sometimes I'd like it to be processed, sometimes not, so I can't just comment it out (that is, there's a whole bunch of blocks of code and I either have the option of reading none of them or reading all of them).
I came up with two ways to implement this (there may well be more - I'm very much a beginner): either enclose the code in a function and then call or not call the function (and once I'm sure I'm past the point where I would call the function, I can set it to nil to free up the memory), or enclose the code in an if ... end block. The former has a slight advantage in that there are several of these blocks and it makes it easier for one block to load another even if the main program didn't request it, but the latter seems the more efficient. However, not knowing much, I don't know if the efficiency saving is worth it.
So how much more efficient is:
if false then
    -- a few hundred lines
end
than
throwaway = function ()
    -- a few hundred lines
end
throwaway = nil -- to ensure that both methods leave me in the same state after garbage collection
?
If it depends a lot on the Lua implementation, how big would the "few hundred lines" need to be to reliably spot the difference, and what sort of stuff should it include to best test (the main use of the blocks is to define a load of possibly useful functions)?
Lua's not smart enough to dump the code for the function, so you're not going to save any memory.
In terms of speed, you're talking about a difference of nanoseconds which happens once per program execution. It's harming your efficiency to worry about this, which has virtually no relevance to actual performance. Write the code that you feel expresses your intent most clearly, without trying to be clever. If you run into performance issues, it's going to be a million miles away from this decision.
If you want to save memory, which is understandable on a mobile platform, you could put your conditional code in its own module and never load it at all if not needed (if your framework supports it; e.g. MOAI does, Corona doesn't).
If there is really a lot of unused code, you can define it as a collection of strings and loadstring() it when needed. Storing functions as strings will reduce the initial compile time; however, for most functions the string representation probably takes up more memory than its compiled form, and what you save on compiling is probably not significant below a few thousand lines... Just saying.
If you put this code in a table, you could compile it transparently through a metatable for minimal performance impact on repeated calls.
Example code
local code_uncompiled = {
    f = [=[
        local x, y = ...;
        return x+y;
    ]=]
}

code = setmetatable({}, {
    __index = function(self, k)
        self[k] = assert(loadstring(code_uncompiled[k]));
        return self[k];
    end
});

local ff = code.f; -- code of f gets compiled here
ff = code.f; -- no compilation here

for i=1, 1000 do
    print( ff(2*i, -i) ); -- no compilation here either
    print( code.f(2*i, -i) ); -- no compilation either, but a table access (slower)
end
The beauty of it is that this compiles as needed and you don't really have to waste another thought on it, it's just like storing a function in a table and allows for a lot of flexibility.
Another advantage of this solution is that when the amount of dynamically loaded code gets out of hand, you could transparently change it to load code from external files on demand through the __index function of the metatable. Also, you can mix compiled and uncompiled code by populating the "code" table with "real" functions.
Try the one that makes the code more legible to you first. If it runs fast enough on your target machine, use that.
If it doesn't run fast enough, try the other one.
Lua can ignore multiple lines with a block comment:
function dostuff()
    blabla
    faaaaa
    --[[
    ignore this
    and this
    maybe this
    this as well
    ]]--
end

efficient serverside autocomplete

First of all, I know:
Premature optimization is the root of all evil
But I think a badly implemented autocomplete can really blow up your site.
I would like to know if there are any libraries out there which can do autocomplete efficiently (server-side), preferably ones that fit into RAM (for best performance). So no browser-side JavaScript autocomplete (YUI/jQuery/Dojo); I think there are enough topics about that on Stack Overflow already, but I could not find a good thread about the server side (maybe I did not look hard enough).
For example, autocompleting names:
names:[alfred, miathe, .., ..]
What I can think of:
A simple SQL LIKE query, for example: SELECT name FROM users WHERE name LIKE 'al%'.
I think this implementation will blow up with a lot of simultaneous users or a large data set, but maybe I am wrong, so numbers (for what it can handle) would be cool.
Using something like Solr terms, for example: http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=al&wt=json&omitHeader=true.
I don't know how this performs, so users of big sites, please tell me.
Maybe something like an in-memory Redis trie, whose performance I also haven't tested.
I also read in this thread about how to implement this in Java (Lucene and some library created by shilad).
What I would like to hear about are implementations used by real sites and numbers for how well they handle load, preferably with:
A link to the implementation or code.
Numbers for the scale you know it can handle.
It would be nice if it could be accessed over HTTP or sockets.
Many thanks,
Alfred
Optimising for Auto-complete
Unfortunately, the resolution of this issue will depend heavily on the data you are hoping to query.
LIKE queries will not put too much strain on your database, as long as you spend time using 'EXPLAIN' or the profiler to show you how the query optimiser plans to perform your query.
Some basics to keep in mind:
Indexes: Ensure that you have indexes set up. (Yes, in many cases LIKE does use the indexes. There is an excellent article on the topic at myitforum: SQL Performance - Indexes and the LIKE clause.)
Joins: Ensure your JOINs are in place and are optimized by the query planner. SQL Server Profiler can help with this. Look out for full index or full table scans.
Auto-complete sub-sets
Auto-complete queries are a special case, in that they usually work on ever-decreasing subsets.
'name' LIKE 'a%' (may return 10000 records)
'name' LIKE 'al%' (may return 500 records)
'name' LIKE 'ala%' (may return 75 records)
'name' LIKE 'alan%' (may return 20 records)
If you return the entire result set for query 1, then there is no need to hit the database again for the following queries, as their results are subsets of your original result set.
Depending on your data, this may open a further opportunity for optimisation.
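A sketch of that narrowing trick (Python, purely for illustration; fetch_from_db is a made-up stand-in for the real LIKE query, and it assumes the first query returned its entire result set):

    cached_prefix, cached_names = None, []

    def autocomplete(prefix):
        global cached_prefix, cached_names
        if cached_prefix and prefix.startswith(cached_prefix):
            # The user typed more characters: the answer is a subset of what
            # we already fetched, so filter in memory instead of re-querying.
            return [n for n in cached_names if n.startswith(prefix)]
        # New or shorter prefix: hit the database once and cache the full result.
        cached_prefix, cached_names = prefix, fetch_from_db(prefix)   # e.g. LIKE 'al%'
        return cached_names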
I won't be able to fully comply with your requirements; obviously the numbers for scale will depend on the hardware, the size of the DB, the architecture of the app, and several other factors. You must test it yourself.
But I will tell you the method I've used with success:
Use a simple SQL LIKE query, for example SELECT name FROM users WHERE name LIKE 'al%', but use TOP 100 to limit the number of results.
Cache the results and maintain a list of the terms that are cached.
When a new request comes in, first check that list to see whether you already have the term (or part of the term) cached.
Keep in mind that your cached result sets are limited to 100 rows, so you may still need to run a SQL query when a cached set was cut off at the limit (that is, when the last cached result still matches the term, more matches may exist in the database); this is sketched below.
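A sketch of those steps (Python again, purely for illustration; run_sql is a made-up helper standing in for the real database call, and LIMIT plays the role of TOP here):

    LIMIT = 100
    cache = {}   # term -> (rows, was_truncated)

    def lookup(term):
        # 1. Exact cache hit.
        if term in cache:
            return cache[term][0]
        # 2. A cached shorter prefix whose result set was NOT cut off at the
        #    limit can safely be filtered in memory; if it was truncated,
        #    rows matching the longer term may be missing, so fall through.
        for prefix, (rows, truncated) in cache.items():
            if term.startswith(prefix) and not truncated:
                return [r for r in rows if r.startswith(term)]
        # 3. Otherwise go to the database and cache the (limited) result.
        rows = run_sql(f"SELECT name FROM users WHERE name LIKE %s LIMIT {LIMIT}",
                       (term + "%",))
        cache[term] = (rows, len(rows) == LIMIT)
        return rows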
Hope it helps.
Using SQL versus Solr's terms component is really not a comparison. At their core they solve the problem the same way by making an index and then making simple calls to it.
What I would want to know is "what are you trying to auto-complete?".
Ultimately, the easiest and most surefire way to scale a system is to make a simple solution and then just scale the system by replicating data. Trying to cache calls or predict results just makes things complicated, and doesn't get to the root of the problem (i.e. you can only take them so far, for example if each request misses the cache).
Perhaps a little more info about how your data is structured and how you want to see it extracted would be helpful.
