Is the return value of the Java API kRingDistances ordered?
Can there be other H3 indexes numerically between the H3 indexes of two adjacent cells?
For example: List<List<Long>> result = h3.kRingDistances(index, k)
here is the result: [[613344576152797183], [613344576150700031, 613344574395383807, 613344575655772159, 613344575651577855, 613344576148602879, 613344576146505727], [613344576159088639, 613344574454104063, 613344574393286655, 613344574384898047, 613344574386995199, 613344575647383551, 613344575643189247, 613344575653675007, 613344576167477247, 613344576175865855, 613344576156991487, 613344576154894335]]
I'm not 100% clear on the question, but two answers here:
H3 indexes themselves are not ordered sequentially, but indexes that are numerically close to each other are also geographically close. There's an illustration of indexing order here: https://beta.observablehq.com/#nrabinowitz/h3-indexing-order
The indexes from kRing are not returned in any guaranteed order. The guarantees are fairly well explained in the docs - kRing output order is undefined. Other functions like kRingDistances give you H3 indexes grouped by distance from the origin, but not necessarily ordered within each ring. hexRange and hexRing do guarantee ordered hexagons, but will fail with an error code if they encounter pentagon distortion.
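To make the distinction concrete, here's a minimal sketch against the h3-java bindings (assuming the 3.x API; the 4.x bindings renamed these methods):

import com.uber.h3core.H3Core;
import java.util.List;

public class RingOrderDemo {
    public static void main(String[] args) throws Exception {
        H3Core h3 = H3Core.newInstance();

        // Origin cell taken from the result in the question.
        long origin = 613344576152797183L;

        // kRingDistances groups cells by their grid distance from the origin:
        // rings.get(0) is the origin itself, rings.get(1) the immediate
        // neighbors, and so on. The order *within* each inner list is undefined.
        List<List<Long>> rings = h3.kRingDistances(origin, 2);
        System.out.println("ring 1 (unordered within the ring): " + rings.get(1));

        // hexRing returns a single ring in a defined traversal order,
        // but throws if the ring crosses a pentagon.
        List<Long> orderedRing = h3.hexRing(origin, 1);
        System.out.println("ring 1 (ordered): " + orderedRing);
    }
}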
Related
I'm attempting to retrieve H3 index keys directly adjacent to my current location. I'm wondering if this can be done by mutating/calculating the coordinate directly or if I have to use the library bindings to do this?
Take this example:
./bin/geoToH3 --resolution 6 --latitude 43.6533055 --longitude -79.4018915
This returns the key 862b9bc77ffffff. I now want to retrieve the keys of all 6 neighbors (the kRing values, I believe, is how to describe it).
A tangential, though equally curious, question might render the above irrelevant: if I were attempting to query entries across all 7 indexes, is there a better way than using an OR statement seeking out all 7 values? Since the index is numeric, I'm wondering if I could just check for a range within the numeric representation.
The short answer is that you need to use kRing (either through the bindings or the command-line tools) to get the neighbors. While there are some limited cases where you could get the neighbors through bit manipulation of the index, in many cases the numeric index of a neighbor might be distant. The basic rule is that while indexes that are numerically close are geographically close, the reverse is not necessarily true.
For the same reason, you generally can't use a range query to look for nearby hexagons. The general lookup pattern is to find the neighboring cells of interest in code, using kRing, then query for all of them in your database.
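Here's a minimal sketch of that lookup pattern, assuming the h3-java 3.x bindings and a hypothetical events table with an h3_index column:

import com.uber.h3core.H3Core;
import java.util.List;
import java.util.stream.Collectors;

public class NeighborLookup {
    public static void main(String[] args) throws Exception {
        H3Core h3 = H3Core.newInstance();

        // Origin cell from the geoToH3 example in the question.
        String origin = "862b9bc77ffffff";

        // kRing(origin, 1) returns the origin plus its (up to) 6 neighbors.
        List<String> cells = h3.kRing(origin, 1);

        // Query by explicit membership, not by numeric range: a neighbor's
        // index is not guaranteed to be numerically close to the origin's.
        String placeholders = cells.stream().map(c -> "?").collect(Collectors.joining(", "));
        String sql = "SELECT * FROM events WHERE h3_index IN (" + placeholders + ")";
        System.out.println(sql);
        System.out.println(cells);
    }
}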
From a data structure point of view how does Lucene filter over a range of continuous values?
I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example if you want to know which documents contain the terms "red" and "blue" then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
But Lucene also provides very fast lookup on ranges of continuous values, and if Lucene still uses this same compressed bit array approach, I don't understand how this happens efficiently in computation or memory. Here's my assumption; you tell me how close I am: Lucene discretizes the continuous field and creates a bit array for each of these (now discrete) values. Then when you want a range across values, you grab the bit arrays corresponding to this range and just AND them together. Since I suspect there would be TONS of arrays to AND together, maybe you do this at several levels of granularity so that you can AND together big chunks as much as possible.
UPDATE:
Oh! I just realized another alternative that would be faster. Since we know that the bit arrays for the discretized ranges mentioned above will not overlap, we can store the bit arrays sequentially. If we have start and end values for our range, we can keep an index to the corresponding points in this array of bit arrays. At that point we just jump into the array of bit arrays and start scanning it as if we're scanning a single bit array. Am I close?
Range queries (say 0 to 100) are a union of the lists of all the terms (1, 2, 3, 4, 5...) in the range. The problem is when the range has to visit many terms, because that means processing many short term lists.
It would be better to process only a few long lists (which is what Lucene is optimized for). So when you use a numeric field and index a number (like 4), we actually index it redundantly several times, adding some "fake terms" at lower precision. This allows us to process a range like 0 to 100 with, say, 7 terms instead of 100: "0-63", "64-95", 96, 97, 98, 99, 100. In this example, "0-63" and "64-95" are the additional redundant terms that represent ranges of values.
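Here's a toy sketch of the index-time side of that idea - not Lucene's actual encoding (classic Lucene used a precisionStep trie for numerics; modern Lucene uses BKD-tree "points") - emitting the exact value plus coarser redundant terms by dropping low-order bits:

import java.util.ArrayList;
import java.util.List;

public class TrieTerms {
    // Emit the exact value plus redundant lower-precision "fake terms",
    // here by dropping the low bits in steps of 4 (a precisionStep of 4).
    static List<String> termsForValue(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            long prefix = value >>> shift;   // drop `shift` low-order bits
            terms.add(shift + ":" + prefix); // tag the term with its precision level
            if (prefix == 0) break;          // nothing left to coarsen further
        }
        return terms;
    }

    public static void main(String[] args) {
        // 1234 is indexed as itself plus coarser prefixes, each of which
        // stands for a whole range of values.
        System.out.println(termsForValue(1234L, 4));
    }
}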
I got a chance to study up on this in depth. The tl;dr if you don't want to read these links at the bottom:
At index time, a number such as 1234 gets converted into multiple tokens: the original number [1234] and several tokens that represent less precise versions of it: [123x], [12xx], and [1xxx]. (This is somewhat of a simplification for ease of communicating the idea.)
At query time, you take advantage of the fact that the less precise tokens let you search over whole ranges of values. So Lucene doesn't do the naive thing - sweeping through the term dictionary of full-precision terms, pulling out all matching numbers, and doing a Boolean OR search over all of them. Instead, Lucene uses the terms that cover the largest possible ranges and ORs together this much smaller set. For example, to search for all numbers from 1776 to 2015, Lucene would OR together these tokens: [1776], [1777], [1778], [1779], [178x], [179x], [18xx], [19xx], [200x], [2010], [2011], [2012], [2013], [2014], [2015].
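Here's a toy sketch of the query-time side under that simplified decimal scheme (real Lucene works on binary prefixes, not decimal digits): cover the range with the coarsest aligned tokens possible.

import java.util.ArrayList;
import java.util.List;

public class RangeCover {
    // Cover [lo, hi] with as few tokens as possible, preferring coarse
    // decimal blocks like "18xx" over 100 individual numbers.
    static List<String> cover(long lo, long hi) {
        List<String> tokens = new ArrayList<>();
        long cur = lo;
        while (cur <= hi) {
            long block = 1;
            // Grow the block while it stays aligned and inside the range.
            while (cur % (block * 10) == 0 && cur + block * 10 - 1 <= hi) {
                block *= 10;
            }
            String token = block == 1
                    ? Long.toString(cur)
                    : (cur / block) + "x".repeat((int) Math.log10(block));
            tokens.add(token);
            cur += block;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Mirrors the example above: 1776..2015 collapses to a handful of
        // tokens instead of 240 individual terms.
        System.out.println(cover(1776, 2015));
    }
}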
Here's a nice walkthrough:
What's an inverted index.
How are they queried? Technically, queries with naive techniques are O(n) where n is the number of documents, but practically it's still a very fast O(n), and I think you'll see why when you read it.
How can they be queried faster with skip lists? (TBH I haven't read this, but I know what it's going to say.)
Inverted Indices can be used to construct fast numeric range queries when the numbers are indexed properly.
Here's the class that queries the inverted index. And it has good documentation.
Here's the class that indexes the data. And here's the most important spot in the code that does the tokenization.
I have had a look at this post about geohashes. According to the author, the final step in calculating the hash is interleaving the x and y index values. But is this really necessary? Is there a proper reason not to just concatenate these values, as long as the hash table is built according to that altered indexing rule?
From the wiki page
Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision).
If you simply concatenated the x and y coordinates, then to reduce precision users would have to be careful to remove exactly the right number of characters from both the x part and the y part.
There is a related (and more important) reason than arbitrary precision: Geohashes with a common prefix are close to one another. The longer the common prefix, the closer they are.
54.321 -2.345 has geohash gcwm48u6
54.322 -2.346 has geohash gcwm4958
(See http://geohash.org to try this)
This feature enables fast lookup of nearby points (though there are some complications), and only works because we interleave the two dimensions to get a sort of approximate 2D proximity metric.
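Here's a small sketch of that interleaving (essentially Morton / Z-order encoding of quantized coordinates, which is what geohash does before its base-32 step); the quantization range and bit width are arbitrary choices for illustration:

public class InterleaveDemo {
    // Quantize a coordinate in [min, max] into a 16-bit integer.
    static int quantize(double value, double min, double max) {
        return (int) ((value - min) / (max - min) * ((1 << 16) - 1));
    }

    // Interleave the bits of x and y (Morton / Z-order encoding).
    static long interleave(int x, int y) {
        long result = 0;
        for (int i = 15; i >= 0; i--) {
            result = (result << 1) | ((x >> i) & 1);
            result = (result << 1) | ((y >> i) & 1);
        }
        return result;
    }

    public static void main(String[] args) {
        int lat1 = quantize(54.321, -90, 90), lon1 = quantize(-2.345, -180, 180);
        int lat2 = quantize(54.322, -90, 90), lon2 = quantize(-2.346, -180, 180);

        // The two nearby points share a long binary prefix when interleaved...
        System.out.println(Long.toBinaryString(interleave(lon1, lat1)));
        System.out.println(Long.toBinaryString(interleave(lon2, lat2)));
        // ...whereas concatenating all the longitude bits and then all the
        // latitude bits would push part of the location information far away
        // from the front of the code, breaking the common-prefix property.
    }
}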
As the wikipedia entry goes on to explain:
When used in a database, the structure of geohashed data has two advantages. First, data indexed by geohash will have all points for a given rectangular area in contiguous slices (the number of slices depends on the precision required and the presence of geohash "fault lines"). This is especially useful in database systems where queries on a single index are much easier or faster than multiple-index queries. Second, this index structure can be used for a quick-and-dirty proximity search - the closest points are often among the closest geohashes.
Note that the converse is not always true - if two points happen to lie on either side of a subdivision (e.g. either side of the equator) then they may be extremely close but have no common prefix. Hence the complications I mentioned earlier.
I have a task to perform fast search in a huge in-memory array of objects by some of the objects' fields. I need to select the subset of objects satisfying some criteria.
The criteria may be specified as a floating point value or a range of such values (e.g. 2.5..10).
The problem is that the float property to be searched on is not quite uniformly distributed; it could contain a few objects with values in the range 10-20 (for example), another million objects with values 0-1, and another million with values 100-150.
So, is it possible to build an index for searching those objects efficiently? Code samples are welcome.
If the in-memory array is sorted, then binary search would be my first attempt. The Wikipedia entry has example code as well.
http://en.wikipedia.org/wiki/Binary_search_algorithm
If you're doing lookups only, a single sort followed by multiple binary searches is good.
You could also try a perfect hash algorithm, if you want the ultimate in lookup speed and little more.
If you need more than just lookups, check out treaps and red-black trees. The former are fast on average, while the latter are decent performers with a low operation duration variability.
You could try a range tree, for the range requirement.
I fail to see what the distribution of values has to do with building an index (with the possible exception of exact duplicates). Since the data fits in memory, just extract all the fields with their original position, sort them, and use a binary search as suggested by #MattiLyra.
Are we missing something?
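Here's a minimal sketch of that approach; the Item type and its price field are hypothetical stand-ins for your objects. It builds one sorted (value, position) array as the index and binary-searches it for the lower bound of the range; the skewed value distribution doesn't matter here.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FloatFieldIndex {
    record Item(double price, String name) {}      // hypothetical object
    record Entry(double value, int position) {}    // (field value, original index)

    static Entry[] buildIndex(Item[] items) {
        Entry[] index = new Entry[items.length];
        for (int i = 0; i < items.length; i++) {
            index[i] = new Entry(items[i].price(), i);
        }
        Arrays.sort(index, Comparator.comparingDouble(Entry::value));
        return index;
    }

    // First position whose value is >= key (a standard lower-bound search).
    static int lowerBound(Entry[] index, double key) {
        int lo = 0, hi = index.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (index[mid].value() < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    static List<Item> rangeQuery(Item[] items, Entry[] index, double from, double to) {
        List<Item> result = new ArrayList<>();
        for (int i = lowerBound(index, from); i < index.length && index[i].value() <= to; i++) {
            result.add(items[index[i].position()]);
        }
        return result;
    }

    public static void main(String[] args) {
        Item[] items = {
            new Item(0.4, "a"), new Item(120.0, "b"), new Item(2.5, "c"),
            new Item(9.99, "d"), new Item(0.7, "e"), new Item(150.0, "f")
        };
        Entry[] index = buildIndex(items);
        System.out.println(rangeQuery(items, index, 2.5, 10.0)); // c and d
    }
}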
I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
set A to be the set of documents which have word 1 (by searching the index).
set B to be the set of documents which have word 2 (ditto).
compute the intersection of A and B.
Now, step three probably takes O(n log n) time to perform. For very large A and B that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
But can't we still have the following worst case? If we let A be the set of documents matching one common word and B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means the search engine has to go linearly through all elements x of B, checking whether they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
You said the list for some-word -> [doc385, doc211, doc39977, ...] is sorted by rank, descending. I don't think search engines do this; the doc list is usually sorted by doc ID, and each doc stores a rank (score) for the word.
When a query comes in, it contains several keywords. For each word you can find a doc list. For all the keywords together you can do merge operations and compute the relevance of each doc to the query, and finally return the top-ranked docs to the user.
And the query process can be distributed to gain better performance.
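Here's a toy sketch of that merge over doc-ID-sorted posting lists (real engines also carry per-document scores and use skip pointers, which this omits):

import java.util.ArrayList;
import java.util.List;

public class PostingIntersection {
    // Intersect two posting lists that are sorted by doc ID.
    // Runs in O(|a| + |b|); with skip lists the shorter list can drive the
    // walk and jump over large stretches of the longer one.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                result.add(a[i]);
                i++; j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] word1 = {3, 8, 12, 39, 120, 385};  // doc IDs containing word 1
        int[] word2 = {8, 39, 50, 385, 1024};    // doc IDs containing word 2
        System.out.println(intersect(word1, word2)); // [8, 39, 385]
    }
}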
Even without ranking, I wonder how the intersection of two sets is computed so fast by Google.
Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete" and 位置 ("site", "location") in Chinese, and 極端な ("extreme") in Japanese.
Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds) "
Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds) "
Google search for "極端な" returns "About 7,590,000 results (0.25 seconds)"
It is extremely improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding the Russian word "игра" (game):
Search for игра: About 212,000,000 results (0.37 seconds)
Search for all of them: "игра concrete 位置 極端な" returns "About 12,600 results (0.33 seconds)"
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.
There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses some optimization of the kind "give back some high-ranked results, and stop after 0.5 sec".
How this is implemented, I don't know. Any ideas?
Most systems implement TF-IDF in one way or another. TF-IDF is the product of two functions: term frequency and inverse document frequency.
The IDF function relates the document frequency to the total number of documents in a collection. The common intuition for this function is that it should give a higher value for terms that appear in few documents and a lower value for terms that appear in all documents, making them irrelevant.
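As a concrete illustration, here's one common TF-IDF variant (real systems use many different weightings, e.g. log-scaled tf, smoothed idf, or BM25):

public class TfIdf {
    // tf = how often the term occurs in this document (normalized by length);
    // idf = log of (total docs / docs containing the term).
    static double tfIdf(int termCountInDoc, int docLength, int totalDocs, int docsContainingTerm) {
        double tf = (double) termCountInDoc / docLength;
        double idf = Math.log((double) totalDocs / docsContainingTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a 100-word doc scores far higher when it
        // occurs in 10 of 1,000,000 docs than when it occurs in half of them.
        System.out.println(tfIdf(3, 100, 1_000_000, 10));      // rare term
        System.out.println(tfIdf(3, 100, 1_000_000, 500_000)); // common term
    }
}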
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in Information Retrieval: Implementing Search Engines chapter 2. Another idea to investigate further is also to look how Solr implements this.
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade (score) first and only then by ID. Since the same ID always has the same grade, this does not hurt set intersection time.
So Google runs the intersection only until it finds 10 results, and then does a statistical estimation to tell you how many more results it found.
A worst case is almost impossible.
If all words are "common", then the intersection gives the first 10 results very fast. If there is a rare word, then the intersection is fast because its complexity is O(N log M), where N is the size of the smallest group.
You need to remember that Google keeps its indexes in memory and uses parallel computing. For example, you can split the problem into two searches, each searching only half of the web, then merge the results and take the best. Google has millions of computers.
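Here's a toy sketch of that early-terminating intersection: probe the smallest posting list against a larger sorted one with binary search (the O(N log M) behaviour mentioned above) and stop once enough hits are found.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopKIntersection {
    // Probe each doc ID of the smallest posting list against a larger sorted
    // list: O(N log M) with N the size of the small list. Stop at k hits.
    static List<Integer> intersectTopK(int[] small, int[] large, int k) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : small) {
            if (Arrays.binarySearch(large, doc) >= 0) {
                hits.add(doc);
                if (hits.size() == k) break; // early termination
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] rareWord   = {42, 977, 5310, 88012, 90001};        // small list
        int[] commonWord = {1, 2, 3, 42, 500, 977, 5310, 90001}; // large, sorted
        System.out.println(intersectTopK(rareWord, commonWord, 10)); // [42, 977, 5310, 90001]
    }
}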