How do I use eland to upload a CSV containing dense vectors to Elasticsearch? - elasticsearch

I want to use approximate KNN search in Elasticsearch. This requires some fields to have a dense vector mapping type with additional parameters. I would like to use eland to upload CSVs of my data, including the embeddings.
For example, I have the following data frame.
from pandas import DataFrame

f = DataFrame({"text": ["blue square", "red triangle"], "embedding": [[1.0, 2.0], [3.0, 4.0]]})
           text   embedding
0   blue square  [1.0, 2.0]
1  red triangle  [3.0, 4.0]
I want to run pandas_to_eland(f, ...) to create a new index containing the data in the CSV. I want the "embedding" field to be a dense vector that can be used by KNN search, which requires that its mapping look like this:
"embedding": {
"type": "dense_vector",
"dims": 2,
"index": true,
"similarity": "cosine"
}
Obviously I could just manually create the mappings for the entire CSV before indexing the data, but that will get tedious for larger data frames. I'd like to have eland/Elasticsearch figure out all the mappings and customize my "embedding" column in code at the time the index is created.
pandas_to_eland(f, ...) does have an es_type_overrides parameter that allows you to specify a mapping data type, but it does not enable any further mapping customization.
Is there some way to get eland to do this without manually creating the entire mapping beforehand?
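For reference, the manual workaround mentioned above could look roughly like the sketch below, assuming the 8.x Python client and a made-up index name my-index: create the index with only the embedding mapping spelled out, then let pandas_to_eland append into it and infer the rest.

from elasticsearch import Elasticsearch
from eland import pandas_to_eland
from pandas import DataFrame

es = Elasticsearch("http://localhost:9200")

f = DataFrame({"text": ["blue square", "red triangle"],
               "embedding": [[1.0, 2.0], [3.0, 4.0]]})

# Pre-create the index so "embedding" gets the full dense_vector mapping;
# the remaining fields can be left to dynamic mapping.
es.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 2,
                "index": True,
                "similarity": "cosine",
            }
        }
    },
)

# Append into the existing index instead of letting eland create it.
pandas_to_eland(f, es, "my-index", es_if_exists="append", es_refresh=True)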

Related

Efficient data-structure to searching data only in documents a user can access

Problem description:
The goal is to efficiently query strings from a set of JSON documents while respecting document-level security, such that a user is only able to retrieve data from documents they have access to.
Suppose we have the following documents:
Document document_1, which has no restrictions:
{
  "id": "document_1",
  "set_of_strings_1": [
    "the",
    "quick",
    "brown"
  ],
  "set_of_strings_2": [
    "fox",
    "jumps",
    "over"
  ],
  "isPublic": true
}
Document document_2, which can only be accessed by 3 users:
{
  "id": "document_2",
  "set_of_strings_1": [
    "the",
    "lazy"
  ],
  "set_of_strings_2": [
    "dog"
  ],
  "isPublic": false,
  "allowed_users": [
    "Alice",
    "Bob",
    "Charlie"
  ]
}
Now suppose user Bob (has access to both documents) makes the following query:
getStrings(
user_id: "Bob",
set_of_strings_id: "set_of_strings_1"
)
The correct response should be the union of set_of_strings_1 from both documents:
["the", "quick", "brown", "lazy"]
Now suppose user Dave (has access to document_1 only) makes the following query:
getStrings(
user_id: "Dave",
set_of_strings_id: "set_of_strings_1"
)
The correct response should be set_of_strings_1 from document_1:
["the", "quick", "brown"]
A further optimization is to handle prefix tokens. E.g. for the query
getStrings(
user_id: "Bob",
set_of_strings_id: "set_of_strings_1",
token: "t"
)
The correct response should be:
["the"]
Note: empty token should match all strings.
However, I am happy to perform a simple in-memory prefix-match after the strings have been retrieved. The bottleneck here is expected to be the number of documents, not the number of strings.
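For reference, that in-memory prefix match itself is straightforward; a minimal Python sketch (with the empty token matching everything, as required above):

def prefix_match(strings, token=""):
    # An empty token matches every string.
    return [s for s in strings if s.startswith(token)]

prefix_match(["the", "quick", "brown", "lazy"], "t")  # -> ["the"]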
What I have tried:
Approach 1: Naive approach
The naive solution here would be to:
put all the documents in a SQL database
perform a full-table scan to get all the documents (we can have millions of documents)
iterate through all the documents to figure out user permissions
filter out the set of documents the user can access
iterate through the filtered list to get all the strings
This is too slow.
Approach 2: Inverted indices
Another approach considered is to create an inverted index from users to documents, e.g.
users     documents_they_can_see
user_1    document_1, document_2, document_3
user_2    document_1
user_3    document_1, document_4
This will efficiently give us the document ids, which we can use against some other index to construct the string set.
If this next step is done naively, it still involves a linear scan through all the documents the user is able to access. To avoid this, we can create another inverted index mapping document_id#set_of_strings_id to the corresponding set of strings; then we take the union of all those sets to get the result and run the prefix match afterwards. However, this involves taking the union of a large number of sets.
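To make Approach 2 concrete, here is a small Python sketch with plain dicts standing in for the two inverted indices (all names and data are illustrative):

# Illustrative in-memory stand-ins for the two inverted indices.
user_to_docs = {
    "Bob": {"document_1", "document_2"},
    "Dave": {"document_1"},
}
doc_set_to_strings = {
    "document_1#set_of_strings_1": {"the", "quick", "brown"},
    "document_2#set_of_strings_1": {"the", "lazy"},
}

def get_strings(user_id, set_of_strings_id, token=""):
    result = set()
    # Union of the per-document string sets; this is the potentially expensive step.
    for doc_id in user_to_docs.get(user_id, set()):
        result |= doc_set_to_strings.get(f"{doc_id}#{set_of_strings_id}", set())
    # In-memory prefix match on the merged set.
    return sorted(s for s in result if s.startswith(token))

get_strings("Bob", "set_of_strings_1")        # ['brown', 'lazy', 'quick', 'the']
get_strings("Dave", "set_of_strings_1", "t")  # ['the']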
Approach 3: Caching
Use redis with the following data model:
key                          value
user_id#set_of_strings_id    [String]
Then we perform prefix match in-memory on the set of strings we get from the cache.
We want this data to be fairly up-to-date so the source-of-truth datastore still needs to be performant.
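A sketch of Approach 3 with redis-py, using one set per user_id#set_of_strings_id key exactly as in the data model above (how the cache is kept in sync with the source of truth is left out):

import redis

r = redis.Redis()

# Populate the cache entry for Bob (normally done by the sync/update path).
r.sadd("Bob#set_of_strings_1", "the", "quick", "brown", "lazy")

def get_strings(user_id, set_of_strings_id, token=""):
    members = r.smembers(f"{user_id}#{set_of_strings_id}")
    # In-memory prefix match on the cached set.
    return sorted(m.decode() for m in members if m.decode().startswith(token))

get_strings("Bob", "set_of_strings_1", "t")  # ['the']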
I don't want to reinvent the wheel. Is there a data structure or some off-the-shelf system that does what I am trying to do?

Finding all words and their frequencies in an elasticsearch index

Elasticsearch Newbie here. I have an elasticsearch cluster and an index http://localhost:9200/products and each product looks like this:
{
  "name": "laptop",
  "description": "Intel Laptop with 16 GB RAM",
  "title": "...."
}
I want all keywords in a field and their frequencies across all documents in an index. For example:
description: intel -> 2500, laptop -> 40000, etc. I looked at term vectors https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html but that only lets me do it for a single document. I want it across all documents in a particular field.
I wrote a plug-in for this, but it's an expensive call (depending on how many terms you want to get and the cardinality of the terms): https://github.com/nirmalc/es-termstat
Currently, there is no way to get term vectors for all documents in an index at once. You can either use the term vectors API for a single document's term frequencies or the multi-term vectors API for several documents at a time. A possible workaround looks like this:
make a scan request in order to get all documents of a given type, and
for each page of results, build a multi-term vectors request like the one below to get their term vectors (a Python sketch of the full loop follows the example).
POST /products/_mtermvectors
{
  "ids": ["1", "2"],
  "parameters": {
    "fields": ["description"],
    "term_statistics": true
  }
}
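A rough Python sketch of that loop, assuming the 8.x elasticsearch-py client and the products index / description field from the question; total frequencies are summed up in application code:

from collections import Counter
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
frequencies = Counter()

def add_batch(ids):
    # One multi-term-vectors request per batch of document ids.
    resp = es.mtermvectors(index="products", ids=ids,
                           fields=["description"], term_statistics=True)
    for doc in resp["docs"]:
        terms = doc.get("term_vectors", {}).get("description", {}).get("terms", {})
        for term, stats in terms.items():
            frequencies[term] += stats["term_freq"]

batch = []
# Scroll through every document id in the index.
for hit in helpers.scan(es, index="products", query={"query": {"match_all": {}}}):
    batch.append(hit["_id"])
    if len(batch) == 100:
        add_batch(batch)
        batch = []
if batch:
    add_batch(batch)

print(frequencies.most_common(10))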

Elasticsearch: search word forms only

I have a collection of docs and they have a field tags which is an array of strings. Each string is a word.
Example:
[{
  "id": 1,
  "tags": ["man", "boy", "people"]
}, {
  "id": 2,
  "tags": ["health", "boys", "people"]
}, {
  "id": 3,
  "tags": ["people", "box", "boxer"]
}]
Now I need to query only the docs which contain the word "boy" and its forms ("boys" in my example). I do not want elasticsearch to return doc number 3, because it does not contain a form of "boy".
If I use a fuzzy query I will get all three docs, including doc number 3, which I do not need. As far as I understand, elasticsearch uses Levenshtein distance to determine whether a doc is relevant or not.
If I use a match query I will get doc number 1 only, but not both (1, 2).
I wonder whether there is any way to query docs by word-form matching. Is there a way to make elastic match "duke", "duchess", "dukes" but not "dikes", "buke", "bike" and so on? This is a more complicated case with "duke", but I need to support such cases as well.
Could it perhaps be solved using some specific analyzer settings?
With "word-form matching" I guess you are referring to matching morphological variations of the same word. This could be about addressing plural, singular, case, tense, conjugation etc. Bear in mind that the rules for word variations are language specific
Elasticsearch's implementation of fuzziness is based on the Damerau–Levenshtein distance. It handles mutations (changes, transformations, transpositions) independent of a specific language, solely based on the number if edits.
You would need to change the processing of your strings at indexing and at search time to get the language-specific variations addressed via stemming. This can be achieved by configuring a suitable an analyzer for your field that does the language-specific stemming.
Assuming that your tags are all in English, your mapping for tags could look like:
"tags": {
"type": "text",
"analyzer": "english"
}
As you cannot change the type or analyzer of an existing index you would need to fix your mapping and then re-index everything.
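A sketch of that re-index step with the Python client (assuming the 8.x client; the index names docs and docs_v2 are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# New index with the English analyzer on the tags field.
es.indices.create(
    index="docs_v2",
    mappings={"properties": {"tags": {"type": "text", "analyzer": "english"}}},
)

# Copy the existing documents across.
es.reindex(source={"index": "docs"}, dest={"index": "docs_v2"},
           wait_for_completion=True)

# A match query for "boy" is now stemmed at search time as well,
# so it matches docs tagged "boys" but not "box" or "boxer".
es.search(index="docs_v2", query={"match": {"tags": "boy"}})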
I'm not sure whether "duke" and "duchess" are considered to be the same word (and therefore addressed by the stemmer). If not, you would need to use a customised analyzer that allows you to configure synonyms.
See also Elasticsearch Reference: Language Analyzers

Protocol buffers Fieldmask on Collections within resource

If I want to update the "amount" field within a particular element inside the "f_units" collection in the below resource (protocol buffer), what will the FieldMask look like to update the amount field? Does the field mask operate on array indices for collections?
{
  "f_sel": {
    "f_units": [
      {
        "id": "1",
        "amount": {
          "coefficient": 1000,
          "exponent": -2
        }
      },
      {
        "id": "2",
        "amount": {
          "coefficient": 2000,
          "exponent": -2
        }
      }
    ]
  }
}
Will it be "f_sel.f_units.0.amount"? How can I update the amount using FieldMask?
As far as I know, there is no way to replace individual elements of a repeated field with an index in a FieldMask.
Instead, you'd update the amount field for the element within f_units you wish to change and set the FieldMask to
"f_sel.f_units"
It would be slightly more efficient to only have to send a delta to the original list, but it would be hard to prevent bugs. For example, what if the proto was modified in the meantime and the specified index (presuming there was a way to specify one) for the repeated field was no longer in range?
As an aside, Google does propose the concept of MergeOptions which defines semantics for how repeated fields are to be handled when merging. Currently, it appears they intend for you either to replace the repeated field in its entirety or append to the end of the destination field. Both of these merging strategies avoid the aforementioned bug that could be caused by specifying an invalid index.
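For illustration, constructing that mask with the Python protobuf library; the surrounding update request and stub are hypothetical and depend on your service definition:

from google.protobuf import field_mask_pb2

# Per-element paths such as "f_sel.f_units.0.amount" are not supported,
# so the mask targets the whole repeated field.
mask = field_mask_pb2.FieldMask(paths=["f_sel.f_units"])

# Hypothetical update call: send the resource with the already-modified
# f_units list plus the mask, and the server replaces that field wholesale.
# stub.UpdateResource(UpdateResourceRequest(resource=resource, update_mask=mask))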

Geospatial marker clustering with elasticsearch

I have several hundred thousand documents in an elasticsearch index with associated latitudes and longitudes (stored as geo_point types). I would like to be able to create a map visualization that looks something like this: http://leaflet.github.io/Leaflet.markercluster/example/marker-clustering-realworld.388.html
So, I think what I want is to run a query with a bounding box (i.e., the map boundaries that the user is looking at) and return a summary of the clusters within this bounding box. Is there a good way to accomplish this in elasticsearch? A new indexing strategy perhaps? Something like geohashes could work, but it would cluster things into a rectangular grid, rather than the arbitrary polygons based on point density as seen in the example above.
#kumetix - Good question. I'm responding to your comment here because the text was too long to put in another comment. The geohash_precision setting dictates the maximum precision at which a geohash aggregation will be able to return results. For example, if geohash_precision is set to 8, we can run a geohash aggregation on that field with at most precision 8. This would, according to the reference, return results grouped in geohash boxes of roughly 38.2m x 19m. A precision of 7 or 8 would probably be accurate enough for showing a web-based heatmap like the one I mentioned in the above example.
As far as how geohash_precision affects the cluster internals, I'm guessing the setting stores a geohash string of length <= geohash_precision inside the geo_point. Let's say we have a point at the Statue of Liberty: 40.6892,-74.0444. The 12-character geohash for this is dr5r7p4xb2ts. Setting geohash_precision in the geo_point to 8 would internally store the strings:
d
dr
dr5
dr5r
dr5r7
dr5r7p
dr5r7p4
dr5r7p4x
and a geohash_precision of 12 would additionally internally store the strings:
dr5r7p4xb
dr5r7p4xb2
dr5r7p4xb2t
dr5r7p4xb2ts
resulting in a little more storage overhead for each geo_point. Setting the geohash_precision to a distance value (1km, 1m, etc) probably just stores it at the closest geohash string length precision value.
Note: How to calculate geohashes using python
$ pip install python-geohash
>>> import geohash
>>> geohash.encode(40.6892,-74.0444)
'dr5r7p4xb2ts'
In Elasticsearch 1.0, you can use the new Geohash Grid aggregation.
Something like geohashes could work, but it would cluster things into a rectangular grid, rather than the arbitrary polygons based on point density as seen in the example above.
This is true, but the geohash grid aggregation handles sparse data well, so all you need is enough points on your grid and you can achieve something pretty similar to the example in that map.
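A sketch of that combination (bounding-box filter plus geohash grid aggregation) with the Python client on a reasonably recent Elasticsearch; the geo_centroid sub-aggregation, added in later versions, gives a point to place each cluster marker at, and the index/field names follow the example query further below:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="things",
    size=0,
    query={
        "geo_bounding_box": {
            "Location": {
                "top_left": {"lat": 45.27, "lon": -34.45},
                "bottom_right": {"lat": -35.32, "lon": 1.85},
            }
        }
    },
    aggs={
        "clusters": {
            "geohash_grid": {"field": "Location", "precision": 5},
            "aggs": {"centroid": {"geo_centroid": {"field": "Location"}}},
        }
    },
)

# One bucket per geohash cell: doc_count is the cluster size and the
# centroid is where to draw the marker.
for bucket in resp["aggregations"]["clusters"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["centroid"]["location"])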
Try this:
https://github.com/triforkams/geohash-facet
We have been using it to do server-side clustering and it's pretty good.
Example query:
GET /things/thing/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "Location": {
            "top_left": {
              "lat": 45.274886437048941,
              "lon": -34.453125
            },
            "bottom_right": {
              "lat": -35.317366329237856,
              "lon": 1.845703125
            }
          }
        }
      }
    }
  },
  "facets": {
    "places": {
      "geohash": {
        "field": "Location",
        "factor": 0.85
      }
    }
  }
}
