Fast way to search based on non-literal comparison
I am developing a small search over rather large data sets, basically all strings. The relations between the table fields are simple enough, though the comparison mustn't be literal, i.e. it should be able to correlate "filippo", "philippo", "filipo" and so forth.
I have found a few ways it could be done, very frequently stumbling on Levenshtein distance (this, here and here), though I am not sure it is practical in my specific case.
In a nutshell I have two tables, a small one with "search keys" and a more massive one in which the search should be performed. Both tables have the same fields and they both have the same "meaning". E.g.
KEYS_TABLE
# | NAME | MIDNAME | SURNAME | ADDRESS | PHONE
1 | John | Fake | Doe | Sesame St. | 333-12-32
2 | Ralph | Stue | Michel | Bart. Ghost St. | 778-13000
...
and
SEARCH_TABLE
# | NAME | MIDNAME | SURNAME | ADDRESS | PHONE
...
532 | Jhon | F. | Doe | Sesame Street | 3331232
...
999 | Richard | Dalas | Doe | Sesame St. | 333-12-32
All I want to do is obtain some sort of metric or rank: for each given record in KEYS_TABLE, report all records from SEARCH_TABLE above a certain relevance (defined either by the metric or simply by some "KNN"-like method).
I say that Levenshtein distance might not be practical because it would have to be calculated for every field of every row pair in KEYS_TABLE x SEARCH_TABLE. Considering that SEARCH_TABLE has about 400 million records and KEYS_TABLE varies from 100k to 1 million, that is on the order of 4 × 10^13 to 4 × 10^14 row pairs, which is way too large.
I was hoping there was some way I could enrich both tables beforehand, or some simpler (cheaper) way to perform the search.
Worth mentioning that I am allowed to transform the data at will, e.g. normalise St. to st, Street to st, remove special chars and so on.
What would be my options?
One approach (heuristic!) I can think of is:
1. In addition to the original fields in the table, for each field also store its normalized form obtained by some stemming algorithm. If you are using Java, Lucene's EnglishAnalyzer might help you with this step.
2. Do an exact comparison using the standard methods to find, for each entry in table1, a list of candidates. An entry e2 in table2 is a candidate for entry e1 in table1 if they have some common field where the normalized forms match. That can be done efficiently using some data structure that allows quick string searches; there are plenty of these.
3. For each entry e1, find the "best" candidate(s) in its list, using the exact metric you chose (for example your suggested Levenshtein distance).
4. You might want to do some post-processing to make sure you don't have two elements in table1 mapped to the same element in table2, if that's an issue.
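The steps above could be sketched roughly as follows, in plain Python. This is only an illustration under assumptions: the `normalize` rules, the `ABBREVIATIONS` table, the function names and the extra row 1000 are all hypothetical, and a simple regex clean-up stands in for a real stemmer. Exact matches on normalized fields produce the candidate list, and Levenshtein distance is computed only for those candidates.

```python
import re
from collections import defaultdict

FIELDS = ("name", "midname", "surname", "address", "phone")
ABBREVIATIONS = {"street": "st"}  # hypothetical normalisation table

def normalize(text):
    """Lowercase, strip special characters and unify known abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", "", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def levenshtein(a, b):
    """Dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def build_index(rows):
    """Map (field, normalized value) to the ids of the rows that contain it."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        for field in FIELDS:
            index[(field, normalize(row[field]))].add(row_id)
    return index

def rank_candidates(key_row, rows, index):
    """Return sorted (distance, row id) pairs for rows sharing a normalized field."""
    candidates = set()
    for field in FIELDS:
        candidates |= index.get((field, normalize(key_row[field])), set())

    def distance(row_id):
        return sum(levenshtein(normalize(key_row[f]), normalize(rows[row_id][f]))
                   for f in FIELDS)

    return sorted((distance(row_id), row_id) for row_id in candidates)

search_rows = {
    532: dict(zip(FIELDS, ("Jhon", "F.", "Doe", "Sesame Street", "3331232"))),
    999: dict(zip(FIELDS, ("Richard", "Dalas", "Doe", "Sesame St.", "333-12-32"))),
    1000: dict(zip(FIELDS, ("Ann", "Mary", "Lee", "Elm Road", "555-00-00"))),  # made-up row
}
key_row = dict(zip(FIELDS, ("John", "Fake", "Doe", "Sesame St.", "333-12-32")))
ranking = rank_candidates(key_row, search_rows, build_index(search_rows))
```

Here row 1000 shares no normalized field with the key and is never scored, while rows 532 and 999 are ranked by summed distance. On the real tables the index would have to live in a database or search engine rather than in a dictionary.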
Depending on what misspellings are likely, you might be able to use Soundex or Metaphone for your searches.
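To get a feel for what a phonetic key does and does not catch, here is a small sketch of the classic American Soundex rules in plain Python (the function name and table layout are my own; a library implementation may differ in edge cases).

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex key of a name ("" if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets the code, so repeats after it count again
    return (key + "000")[:4]
```

With this, "John" and "Jhon" share the key J500, and "filippo" and "filipo" share F410. "philippo" gets P410, because Soundex keeps the first letter as is, so that particular variant would need Metaphone or an extra normalisation rule (for example ph to f).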
Related
I have an index with 0.5M of records. In my UI I want to show this data within a table paginated.
+---+---+---+---+
| A | B | C | D |
+---+---+---+---+
| | | | |
The user can sort, for example, A, C, and D columns (asc/desc). Not in conjunction, but by any of these 3 columns separately.
From what I can see, Index Sorting allows ordering the data in each segment by the specified set of columns.
From my understanding, I can specify a sorting setting for the index to store column A sorted and this should make the sorting exactly by this field faster. Or I can specify A + C, and exactly A in conjunction with C should be faster.
Can I benefit from Index Sorting in my scenario? Or simply rely on ES default configuration?
Create another index with a similar data set and try it out, using the Reindex API to populate it. That way you can see for yourself whether it improves the performance or not.
Do you even need the optimization, considering that it adds overhead at index time?
I'm a bit confused. I cannot find any information about how to execute a range query against a sorted string table.
LevelDB and RocksDB support a range iterator which allows you to query between ranges, which is perfect for NoSQL. What I don't understand is how it is implemented to be efficient.
The tables are sorted in memory (and on disk) - what algorithm or data structure allows one to query a Sorted String Table efficiently in a range query? Do you just loop through the entries and rely on the cache lines being full of data?
Usually I would put a prefix tree in front, and this gives me indexing of keys. But I am guessing Sorted String Tables do something different and take advantage of sorting in some way.
Each layer of the LSM (except for the first one) is internally sorted by the key, so you can just keep an iterator into each layer and use the one pointing to the lexicographically smallest element. The files of a layer look something like this on disk:
Layer N
---------------------------------------
File1 | File2 | File3 | ... | FileN <- filename
n:File2 |n:File3|n:File4| ... | <- next file
a-af | af-b | b-f | ... | w-z <- key range
---------------------------------------
aaron | alex | brian | ... | walter <- value omitted for brevity, but these are key:value records
abe | amy | emily | ... | xena
... | ... | ... | ... | ...
aezz | azir | erza | ... | zoidberg
---------------------------------------
First Layer (either 0 or 1)
---------------------------------------
File1 | File2 | ... | FileK
alex | amy | ... | andy
ben | chad | ... | dick
... | ... | ... | ...
xena | yen | ... | zane
---------------------------------------
...
Assume that you are looking for everything in the range ag-d (exclusive). A "range scan" is just to find the first matching element and then iterate the files of the layer. So you find that File2 is the first to contain any matching elements, and scan up to the first element starting with 'ag'. You iterate over File2, then look at the next file for File2 (n:File3). You check the key-range it contains and find that it contains more elements from the range you are interested in, so you iterate it until you hit the first entry starting with 'd'. You do the same thing in every layer, except the first. The first layer has files which are not sorted among each other, but they are internally sorted, so you can just keep an iterator per file. You also keep one more for the current memtables (in-memory data, only persisted in a log).
This never becomes too expensive, because the first layer is typically compacted on a small constant threshold. As the files in every layer are sorted and the files are internally sorted by the key too, you can just advance the smallest iterator until all iterators are exhausted. Apart from the initial search, every step has to look at a fixed number of iterators (assuming a naive approach) and is thus O(1). Most LSMs employ a block cache, and thus the sequential reads typically hit the cache most of the time.
Last but not least, be aware that this is mostly a conceptual explanation, because most implementations have a few extra tricks up their sleeves that make these things more efficient. You have to know which data is contained in which file ranges anyway when you do a major compaction, i.e., merge layer N into layer N + 1. Even the file-level operation may look quite different: RocksDB, e.g., maintains a coarse index with the key offsets at the beginning of each file to avoid scanning over the often much larger key/value pair portion of the file.
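The conceptual scan can be sketched in a few lines of Python. This is a toy model, not how LevelDB or RocksDB are actually coded: each sorted run (a memtable or a layer) is assumed to be an in-memory list of (key, value) pairs, runs are passed newest first, the start of the range is treated as inclusive and the end as exclusive, and the function names are made up.

```python
import bisect
import heapq

def scan_run(run, start, end):
    """Yield (key, value) pairs with start <= key < end from one sorted run."""
    # (start,) sorts before every (start, value), so this finds the first key >= start.
    position = bisect.bisect_left(run, (start,))
    while position < len(run) and run[position][0] < end:
        yield run[position]
        position += 1

def tagged(run, age, start, end):
    """Attach the run's age so that equal keys are ordered newest first."""
    for key, value in scan_run(run, start, end):
        yield key, age, value

def range_scan(runs, start, end):
    """Merge the per-run iterators, always advancing the smallest one."""
    streams = [tagged(run, age, start, end) for age, run in enumerate(runs)]
    last_key = None
    for key, _, value in heapq.merge(*streams):
        if key != last_key:  # older versions of a key already seen are skipped
            yield key, value
            last_key = key

memtable = [("amy", "v2"), ("chad", "v1")]
layer = [("aaron", "v1"), ("abe", "v1"), ("aezz", "v1"), ("alex", "v1"),
         ("amy", "v1"), ("azir", "v1"), ("brian", "v1"), ("emily", "v1")]
result = list(range_scan([memtable, layer], "ag", "d"))
```

The binary search is the "initial search" mentioned above. After that, each step only compares the heads of a fixed number of iterators, and the newer value for "amy" shadows the older one.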
I am using Spark for some large data processing, but I think this problem is fairly independent of it. I have the following data set, with some other columns:
--------------------------------------------------
| Name | Corrected_Name |
--------------------------------------------------
| Pawan | Varun |
--------------------------------------------------
| Varun | Naresh |
--------------------------------------------------
| Dona | Pia |
--------------------------------------------------
Now I am trying to correct all the names, so in this case I will have to find the chain Pawan -> Varun -> Naresh. Is there a way to handle this in Spark, or with some other algorithm?
First of all, note that names are commonly a bad identifier due to frequent duplication. If you would eventually have to "squash" the chain (transform 2 rows into one), reducing by name itself will cause chaos.
Regarding the original question, this is a common case where iterative calculations are needed. This type of use case has two possible directions:
1. In memory (assumptions should be made over the data) - collect all the data into a single machine, perform the mapping in memory and broadcast the result to the other machines.
2. Distributed mapping (assumes nothing about the data, very expensive) - perform a distributed next-step lookup; this can be optimized to perform up to log(n) join-cache-count operations.
PySpark code example for (2):
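from pyspark.sql.functions import col

# get_all_data() and merge_udf are the answerer's own helpers and are not shown here; merge_udf is
# assumed to return its second argument when that is not null and its first argument otherwise.
forward = get_all_data()
current_count = -1
while current_count != 0:
    # Attach to every row the correction of its current Corrected_Name, if one exists.
    forward = forward.selectExpr("Name", "Corrected_Name as original_last_name", "Corrected_Name as connection").join(forward.selectExpr("Corrected_Name as Corrected_Name_Tmp", "Name as connection"), "connection", "left")
    forward_clean = forward.withColumn("Corrected_Name", merge_udf(col("original_last_name"), col("Corrected_Name_Tmp"))).cache()
    # Number of rows whose chain advanced in this pass; the loop stops once it reaches zero.
    current_count = forward_clean.filter(forward_clean.Corrected_Name_Tmp.isNotNull()).count()
    forward = forward_clean.drop(col("original_last_name")).drop(col("Corrected_Name_Tmp")).drop(col("connection"))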
This code keeps all rows; each one maps the original "Name" to the last element of its "Corrected_Name" chain.
Note: (2) is very wasteful but assumes nothing. It can be optimized to finish in log(n) iterations by making each lookup do more work, and the lookup can be simplified if you need only the first element in each chain. (1) is preferred in terms of calculation size, but you will have to benchmark the memory footprint.
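For comparison, the in-memory direction (1) might look like the following plain-Python sketch. It assumes the (Name, Corrected_Name) pairs have already been collected to the driver into a dictionary; `resolve_chains` is a made-up name, and the cycle handling (stop at the first repeated name) is an arbitrary choice.

```python
def resolve_chains(corrections):
    """Map every name to the last name of its correction chain."""
    final = {}
    for name in corrections:
        path, seen = [], set()
        current = name
        # Walk the chain until it ends, hits an already resolved name, or loops.
        while current in corrections and current not in final and current not in seen:
            seen.add(current)
            path.append(current)
            current = corrections[current]
        end = final.get(current, current)
        for visited in path:
            final[visited] = end  # every name on the path shares the same end
    return final

corrections = {"Pawan": "Varun", "Varun": "Naresh", "Dona": "Pia"}
resolved = resolve_chains(corrections)
```

Each name is resolved once, so the work is linear in the number of rows. The resulting dictionary is what would then be broadcast to the executors.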
I have a dataset where each record could contain a different number of features.
There are 56 features in total, and each record can contain from 1 to 56 of them.
Each feature is like a flag: it either exists in the record or not, and if it exists, there is another value, a double, that holds its value.
An example of the dataset is this:
I would like to know if it is possible to train my kNN algorithm using different features for each record, so for example one record has 3 features plus a label, another one has 4 features plus a label, etc.
I am trying to implement this in Python, but I have no idea how to do it.
Yes it is definitely possible. The one thing you need to think about is the distance measure.
The default distance used for kNN classifiers is usually Euclidean distance. However, Euclidean distance requires records (vectors) with an equal number of features (dimensions).
The distance measure you use highly depends on what you think should make records similar.
If you have a correspondence between the features of two records, so you know that feature i of record x describes the same feature as feature i of record y, you can adapt Euclidean distance. For example, you could either ignore missing dimensions (such that they don't add to the distance if a feature is missing in one record) or penalize missing dimensions (such that a certain penalty value is added whenever a feature is missing in a record).
If you do not have a correspondence between the features of two records, then you would have to look at set distances, e.g., minimum matching distance or Hausdorff distance.
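As a rough illustration of the adapted Euclidean distance, the sketch below stores each record as a dictionary from feature id to value, so a missing feature is simply an absent key. The function names, the `penalty` parameter and the tiny training set are made up for the example.

```python
import math
from collections import Counter

def partial_euclidean(x, y, penalty=None):
    """Euclidean distance over dict records; penalty=None ignores missing features."""
    total = 0.0
    for feature in set(x) | set(y):
        if feature in x and feature in y:
            total += (x[feature] - y[feature]) ** 2
        elif penalty is not None:
            total += penalty ** 2  # feature present in only one of the two records
    return math.sqrt(total)

def knn_predict(train, query, k=3, penalty=None):
    """Majority label among the k training records closest to the query."""
    nearest = sorted(train, key=lambda item: partial_euclidean(item[0], query, penalty))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [({1: 1.0, 2: 1.0}, "a"), ({1: 1.2}, "a"),
         ({1: 9.0, 3: 2.0}, "b"), ({1: 8.5, 2: 7.0, 3: 2.5}, "b")]
prediction = knn_predict(train, {1: 1.1, 3: 2.0}, k=3, penalty=1.0)
```

One caveat with the "ignore" variant: two records with no feature in common get distance 0, so a penalty (or a normalisation by the number of shared features) is usually the safer choice.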
Every instance in your dataset should be represented by the same number of features. If you have data with a variable number of features (e.g. each data point is a vector of x and y where each instance has a different number of points), then you should treat the absent features as missing values.
Therefore you need to deal with missing values. For example:
Replace missing values with the mean value for each column.
Select an algorithm that is able to handle missing values such as Decision trees.
Use a model that is able to predict missing values.
EDIT
First of all, you need to bring the data into a better format. Currently, each feature is represented by two columns, which is not a very nice technique. Therefore I would suggest restructuring the data as follows:
+----+----------+----------+----------+-------+
| ID | Feature1 | Feature2 | Feature3 | Label |
+----+----------+----------+----------+-------+
| 1  | 15.12    | ?        | 56.65    | True  |
| 2  | ?        | 23.6     | ?        | True  |
| 3  | ?        | 12.3     | ?        | False |
+----+----------+----------+----------+-------+
Then you can either replace missing values (denoted with ?) with 0 (this depends on the "meaning" of each feature) or use one of the techniques that I've already mentioned before.
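A minimal sketch of that restructuring plus mean imputation, in plain Python, could look like this. It assumes each record arrives as a dictionary from a zero-based feature index to its value; the function names are hypothetical and `None` plays the role of "?".

```python
def to_matrix(records, n_features):
    """Turn sparse dict records into fixed-width rows, with None for missing values."""
    return [[record.get(j) for j in range(n_features)] for record in records]

def impute_column_means(rows):
    """Replace every None with the mean of the values present in its column."""
    n_features = len(rows[0])
    means = []
    for j in range(n_features):
        present = [row[j] for row in rows if row[j] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_features)]
            for row in rows]

records = [{0: 15.12, 2: 56.65}, {1: 23.6}, {1: 12.3}]  # the three rows of the table above
matrix = to_matrix(records, 3)
filled = impute_column_means(matrix)
```

After this step every row has the same length, so any standard kNN implementation can be used on `filled`. Whether the column mean or 0 is the right fill value depends on what a missing feature means in your data.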
I have 50M different texts as input, and for each of them the top (up to) 10 most relevant tags have been extracted.
There are ~100K distinct tags.
I would like to develop an algorithm that, given a text id T1 as input (present in the original input data set), computes the most related text id T2, where T2 is the text that has the most tags in common with T1.
id | tags
-------------
1 | A,B,C,D
2 | B,D,E,F
3 | A,B,D,E
4 | B,C,E
In the example above, the most similar id to 1 is 3 as they have 3 tags in common
This seems to be the same algorithm that shows the most related questions in StackOverflow.
My first idea was to map both texts and tags to integers to build a big (50M * 100K) binary matrix that is very sparse.
This matrix fits in memory, but I do not know how to use it.
As this is for a web application, I would like to deliver the result in real time conditions (at most a few ms, with possible multi-threading).
My main languages are Scala and Java.
Thanks for your help
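One possible way to use the sparse matrix is to store it column-wise, i.e. as an inverted index from each tag to the ids of the texts carrying it, and to count overlaps only among texts that share at least one tag with T1. The sketch below is in Python for brevity rather than Scala or Java; the function names are made up, and ties are broken by the smaller id, which is an arbitrary choice.

```python
from collections import Counter, defaultdict

def build_tag_index(tags_by_id):
    """Inverted index: tag -> list of text ids carrying that tag."""
    index = defaultdict(list)
    for text_id, tags in tags_by_id.items():
        for tag in tags:
            index[tag].append(text_id)
    return index

def most_related(text_id, tags_by_id, index):
    """Return (other id, common tag count) for the text sharing the most tags."""
    overlap = Counter()
    for tag in tags_by_id[text_id]:
        for other in index[tag]:
            if other != text_id:
                overlap[other] += 1
    if not overlap:
        return None
    # Highest count wins; among equal counts the smaller id is preferred.
    return max(overlap.items(), key=lambda pair: (pair[1], -pair[0]))

tags_by_id = {1: {"A", "B", "C", "D"}, 2: {"B", "D", "E", "F"},
              3: {"A", "B", "D", "E"}, 4: {"B", "C", "E"}}
index = build_tag_index(tags_by_id)
best = most_related(1, tags_by_id, index)
```

For id 1 this finds id 3 with 3 common tags, as in the example. The cost of a query is the total length of the posting lists of T1's (at most 10) tags, so very frequent tags dominate the latency and may need to be capped or down-weighted to stay within a few milliseconds.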