Probabilistic set for very low probability - data-structures

I'm looking for a set data structure that's optimised for a very low probability that an item is part of the set.
The use case is the Gnip/Twitter compliance firehose where we get approx 1,000 events per second (that's deletions from all of Twitter). We have a table of, let's say 10 million stored tweets (growing by that amount each year), and if an item appears in the firehose I have to delete it. I'm guessing there will be a match every 100,000 seconds (to pull a number out of the air).
I had thought of a bloom filters, possibly several chained, but given that there's a very low chance of a hit, I'm always going to need to go through the entire chain and things would eventually get linear.
Is there a good sublinear data structure for this?

I don't see the problem. It seems to me that if checking the Bloom filter tells you that you have the tweet stored, you then look up that tweet in your data store. If it's there, you delete it. If it's not there, you don't delete it.
You have 10 million stored tweets, and you expect it to grow by about 10 million per year. So build a Bloom filter that has a capacity of a billion, with a 0.1% probability of false positives. According to the Bloomfilter calculator, that will cost you 1.67 gigabytes.
Understand, that "false positives" number assumes that the filter contains the 1 billion keys. When your filter is very sparsely populated, the probability of false positives is much lower.
If you're getting a thousand tweets per second and the Bloom filter has a false positive rate of 0.1%, then in the worst case you'll get an average of one false positive per second. So once per second your code will have to hit the database to determine if the tweet is there.
But it'll be many years before you get to that. With only 10 million existing records and a growth rate of 10 million per year, it'll be 10 years before the filter is even 10% full. You could probably drop the filter size to 500 million (860 MB), and still not notice a big hit due to false positives.

A Bloom Filter should be fine, assuming that it fits in memory. If it won't fit completely in memory, consider using the solution described in this paper.
Alternatively, if you really want to squeeze a bit extra performance, you can use a Cuckoo Filter, but it will be harder for you to find an open source implementation; here is one in Go.

Related

Elasticsearch query over huge data

We have over 100 million data store in Elasticsearch.
Dataset are too much to be fully loaded into our service memory.
Each data has a column called amount. The search is to find out several (sometimes over 10 thousand) target data that their sum of the amount equals or close to an input value.
Below is out current solution:
We merge the 100 million data input 4000 buckets by using ES's bucket. Each bucket's amount is the sum of every data it contains.
We load the 4000 buckets into our service. Afterwards we find out the solution mentioned above based on the 4000 buckets.
The obvious disadvantage is the lack of accuracy. The difference between the sum of results we find and the input target is sometimes quite large.
We are three young guys lack of experience, we need some instructions.

Redis GEORADIUS with one ZSET versus a lot of ZSETs of particular size

What will work faster, one big ZSET with geodata where I'll query for 100m radius with GEORADIUS
OR
a lot of ZSETs where each ZSET is responsible for 100m X 100m square covering the whole world? and named after this 100m squares like:
left_corner1_49_2440000_28_5010000
left_corner2_49_2450000_28_5010000
.......
and have all the 100 meters to the right and bottom inside the sets.
So when searching for the nearest point I'll just omit the redundant digits from gps like: 49.2440408, 28.5011694 will become
49.2440000, 28.5010000 so this way I'll know the ZSETS's name where just to get all the exact values with 100 meters precision.
OR to question it in general form: how are the ZSET's names are stored and accessed in redis? If I have too much ZSETS will it impact performance while accessing them?
Precise comparison of this approaches could only be done via benchmark and it would be specific to your dataset and configuration. But architecturally speaking, your pros and cons are:
BIG ZSET: less bandwidth and less operations (CPU cycles) taken to execute, no problems on borders (possible duplicates with many ZSETS), can get throughput with sharding;
MANY ZSETS: less latency for other operations (while big ZSET is going, other commands are waiting), can get throughput with sharding AND latency with clustering.
As for bottom line question, I did not see implementation code, but set names should be the same keys as any other keys you use. This is what Redis FAQ says about number of keys:
What is the maximum number of keys a single Redis instance can hold? <...>
Redis can handle up to 2^32 keys, and was tested in practice to handle
at least 250 million keys per instance.
UPDATE:
Look at what Redis docs say about GEORADIUS:
Time complexity: O(N+log(M)) where N is the number of elements inside
the bounding box of the circular area delimited by center and radius
and M is the number of items inside the index.
It means that items outside of your query make O(log(M)) impact on your query. So, 17 hops for 10m items or 21 hop for 1b items which is quite affordable. The question left is will you do partitioning between nodes?

given 10 billion URL with average length 100 characters per each url, check duplicate

Suppose I have 1GB memory available, how to find the duplicates among those urls?
I saw one solution on the book "Cracking the Coding Interview", it suggests to use hashtable to separate these urls into 4000 files x.txt, x = hash(u)%4000 in the first scan. And in the 2nd scan, we can check duplicates in each x.txt separately file.
But how can I guarantee that each file would store about 1GB url data? I think there's a chance that some files would store much more url data than other files.
My solution to this problem is to implement the file separation trick iteratively until the files are small enough for the memory available for me.
Is there any other way to do it?
If you don't mind a solution which requires a bit more code, you can do the following:
Calculate only the hashcodes. Each hashcode is exactly 4 bytes, so you have perfect control of the amount of memory that will be occupied by each chunk of hashcodes. You can also fit a lot more hashcodes in memory than URLs, so you will have fewer chunks.
Find the duplicate hashcodes. Presumably, they are going to be much fewer than 10 billion. They might even all fit in memory.
Go through the URLs again, recomputing hashcodes, seeing if a URL has one of the duplicate hashcodes, and then comparing actual URLs to rule out false positives due to hashcode collisions. (With 10 billion urls, and with hashcodes only having 4 billion different values, there will be plenty of collisions.)
This is a bit long for a comment.
The truth is, you cannot guarantee that a file is going to be smaller than 1 Gbyte. I'm not sure where the 4,000 comes from. The total data volume is about 1,000 Gbytes, so the average file size would be 250 Mbytes.
It is highly unlikely that you would ever be off by a factor of 4 in size. Of course, it is possible. In that case, just split the file again into a handful of other files. This adds a negligible amount to the complexity.
What this doesn't account for is a simple case. What if one of the URLs has a length of 100 and appears 10,000,000 times in the data? Ouch! In that case, you would need to read a file and "reduce" it by combining each value with a count.

Increase Solr performance when querying a subset of documents

The Usecase
I have an index of potentially millions of documents. I want to make around 20'0000 searches on a subset of these documents (around 25'000 documents). These 25'000 documents could take up around 100 MB stored in Solr (consisting of stored and indexes text fields).
The Problem
As the number of indexed documents increases, the performance of the queries decreases a lot. For example running 20'000 searches that hit 25'000 documents on 100'000 document index takes around 4 minutes. Running the same searches on 200'000 document index takes around 20 minutes.
So is there any way to cache these 25'000 documents in RAM before hitting them with searches?
UPDATE
Some things that really helped:
reducing returned row count (In almost all cases I had to iterate through returned results and in almost all cases where were no more than 100 matching results, but I had set rows to a very large value. Reducing the row count improved the performance around 2x. This seemed counter intuitive. If there are only 79 matches and I set returned row count to 100 it performs better than in a case when where are 79 matches and I set the row count to 1000. In the first case Solr already returns found item count and does it fast. Why should there be a performance difference?)
reducing multithreading (I had added multiple threads for querying because on the development box there were more resources available. On the resource constrained production box it was slowing things down. Using only one or two threads got me around 2x speed improvement.)
Some things that did not really help:
splitting up field queries (I was already using field queries everywhere it was possible, but I was combining them in one fq for each query fq=name:a AND type:b. Splitting them up with fq=name:a&fq=type:b caches them separately (see Apache Solr documentation) and could improve performance. But it did not make a huge difference in this case.
changing caching settings in this case filterCache seemed to have the most potential. However, increasing it or changing its settings did not make a huge difference.
A few things that are recommended for performance:
Have enough spare RAM on the box so index files can be in OS cache
Try to play around with solr caching settings in SolrConfig
Play around with autowarming after commits
Try to develop your queries to limit the result set. Large result sets, specifically if using grouping and faceting will kill performance. Now 200,000 document index is really quite small, so you should not have any problems, but I thought I'd mention this for when you scale.
Try to use Filter query (FQ) whenever possible. They are much faster than doing field:val in q, plus they are cached.

Performance with sequentially increasing primary key

Looking for guidance on selecting a database provider for a specific key pattern.
The only key field will be a pre-allocated unique sequentially-increasing number.
During each day between
50 and 100 thousand items will be added,
processed (updated), and then retained for a week or so,
after which usually the lowest-numbered records will be deleted. The number of
records will not fluctuate by very much from day to day but may drop at weekends.
The numbers will probably wrap back to 1 after 100M or so.
I need to find a database implementation where the efficiency of the index lookup,
addition and deletion remains constant. Should I be worried that the performance might drop off as the key value range moves continuously upwards?
index lookup, addition and deletion remains constant
You could ensure it remains constant by rebuilding the indexes every insert (just constantly really slow - no performance drop off at all :)), or close to constant by running index maintenance every hour/day etc.
that the performance might drop off as the key value range moves continuously upwards?
As long as you've got an index, it should be logN performance - e.g. having 1,000,000 rows will be around half the speed of 1,000 rows (when searching for an indexed value). (1,000,000,000,000 will be half that speed again).
So no, you shouldn't need to worry about performance.
The numbers will probably wrap back to 1 after 100M or so.
Ok - if you want. Generally, no need really - just use a big int.
As always with performance: test what you want to do. Make a script that inserts 10,000,000 rows, and see what happens.
My point here being that if you're going to wrap ids at 100M records, the worst you can do is actually have them all allocated. This would represent the fragmented index condition as well (where you only have say 100K records, but they're distributed in a space of 10M) - but you will do index/database maintenance right?

Resources