Searching Elastic Search for a specific index position of an array field - elasticsearch

The records in my ES index are of the form:
person: {
  firstName: "ABC",
  lastName: "Def",
  specialValues: [3, 6, null, 9]
}
I want to retrieve all documents whose person.specialValues[1] has the value 6.
Is it possible to do so using Elasticsearch?
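Elasticsearch flattens array values at index time, so a standard query cannot target a specific array position. A common workaround (a sketch only, assuming you can reindex; the index name my_index and the position/value field names are made up, and the mapping uses current 7.x-style syntax) is to store each value together with its position as nested objects and query both fields at once:
PUT my_index
{
  "mappings": {
    "properties": {
      "person": {
        "properties": {
          "specialValues": {
            "type": "nested",
            "properties": {
              "position": { "type": "integer" },
              "value": { "type": "integer" }
            }
          }
        }
      }
    }
  }
}
GET my_index/_search
{
  "query": {
    "nested": {
      "path": "person.specialValues",
      "query": {
        "bool": {
          "filter": [
            { "term": { "person.specialValues.position": 1 } },
            { "term": { "person.specialValues.value": 6 } }
          ]
        }
      }
    }
  }
}
The nested query only matches when position and value occur on the same inner object, which is what makes the position-specific lookup possible.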

Related

Adding fuzziness to an ElasticSearch prefix query

I have two documents:
{id: 1, name: "james"}
{id: 2, name: "james kennedy"}
I am using the match_bool_prefix query for autocomplete, and I would like to be able to match the document with id: 1 even if I misspell james.
Query: jamis
Desired output: finding document with id: 1.
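One way to get there (a sketch, assuming an index named people with a text field name; whether it fits depends on how much noise the fuzzy clause is allowed to add) is to pair the prefix-style match with a separate fuzzy match in a bool/should, so a misspelled complete term like jamis can still recall document 1:
GET people/_search
{
  "query": {
    "bool": {
      "should": [
        { "match_bool_prefix": { "name": "jamis" } },
        { "match": { "name": { "query": "jamis", "fuzziness": "AUTO" } } }
      ],
      "minimum_should_match": 1
    }
  }
}
This only guarantees recall; how the two documents rank relative to each other still depends on scoring.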

Find docs by an id field that does not exist in a given document's nested field containing ids

I have an int id field and a nested int-array ids field on my docs. I want to find docs whose id field does not exist in a given document's id list.
Example:
[{
  my_id: 1,
  other_ids: [2,3]
},{
  my_id: 2,
  other_ids: [1]
},{
  my_id: 3,
  other_ids: [2]
}]
Thanks.
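One possible approach (a sketch, assuming the documents live in an index named my_index, that the driving document is the one with _id "1", and that other_ids is a plain integer array rather than an array of nested objects) is a terms lookup inside must_not, which reads other_ids from that document's _source and excludes every matching my_id:
GET my_index/_search
{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "my_id": {
            "index": "my_index",
            "id": "1",
            "path": "other_ids"
          }
        }
      }
    }
  }
}
With the example data this excludes my_id 2 and 3 and returns document 1 itself; add another must_not term on my_id if the driving document should be excluded as well.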

Paginating Distinct Values in Mongo

I would like to retrieve distinct values for a field from a collection in my database. The distinct command is the obvious solution. The problem is that some fields have a large number of possible values and are not simple primitive values (i.e. they are complex sub-documents rather than just strings). This makes the results large enough to overload the client I am delivering them to.
The obvious solution is to paginate the resulting distinct values, but I cannot find a good way to do this optimally. Since distinct doesn't have pagination options (limit, skip, etc.), I am turning to the aggregation framework. My basic pipeline is:
[
  {$match: {... the documents I am interested in ...}},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$limit: 10},
]
This gives me the first 10 unique values for myfield. To get the next page, a $skip stage is added to the pipeline. So:
[
  {$match: {... the documents I am interested in ...}},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
But sometimes the field I am collecting unique values from is an array. This means I have to unwind it before grouping. So:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$myfield'},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
Other times the field I am getting unique values for may not be an array, but its parent node might be an array. So:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$list'},
  {$group: {_id: '$list.myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
Finally, sometimes I also need to filter on data inside the field I am getting distinct values from. This means I sometimes need another $match stage after the $unwind:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$list'},
  {$match: {... filter within list.myfield ...}},
  {$group: {_id: '$list.myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
The above are all simplifications of my actual data. Here is a real pipeline from the application:
[
  {"$match": {
    "source_name": "fmls",
    "data.surroundings.school.nearby.district.id": 1300120,
    "compiled_at": {"$exists": true},
    "data.status.current": {"$ne": "Sold"}
  }},
  {"$unwind": "$data.surroundings.school.nearby"},
  {"$match": {
    "source_name": "fmls",
    "data.surroundings.school.nearby.district.id": 1300120,
    "compiled_at": {"$exists": true},
    "data.status.current": {"$ne": "Sold"}
  }},
  {"$group": {"_id": "$data.surroundings.school.nearby"}},
  {"$sort": {"_id": 1}},
  {"$skip": 10},
  {"$limit": 10}
]
I hand the same $match document to both the initial filter and the filter after the $unwind because the $match document comes from a third party and is somewhat opaque, so I don't really know which parts of the query filter outside my $unwind versus inside the data I am getting distinct values for.
Is there an obviously better way to go about this? In general, my strategy works, but some queries take 10-15 seconds to return results. There are about 200,000 documents in the collection, although only about 60,000 remain after the first $match in the pipeline is applied (which can use indexes).
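One alternative worth trying (a sketch only, assuming the distinct values sort cleanly; db.collection and lastSeenValue are placeholders for the real collection and the last _id returned on the previous page) is range-based pagination: instead of a growing $skip, filter past the previous page's last value after grouping:
db.collection.aggregate([
  {$match: {/* ... the documents I am interested in ... */}},
  {$unwind: '$list'},
  {$group: {_id: '$list.myfield'}},
  {$match: {_id: {$gt: lastSeenValue}}},  // omit on the first page
  {$sort: {_id: 1}},
  {$limit: 10},
])
This avoids re-scanning and discarding everything a large $skip would have skipped, though the $group over the ~60,000 matched documents still dominates, so keeping that first $match selective and indexed matters most.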

ElasticSearch custom score script does not preserve array ordering

I am using ElasticSearch with a function_score property to retrieve documents sorted by createdOn. The createdOn field is stored as an Array representing date values, i.e.
"createdOn": [ 2014, 4, 24, 22, 11, 47 ]
Where createdOn[0] is the year, createdOn[1] is the month, createdOn[2] is the day, etc. I am testing the following query, which should return documents scored by year. However, the doc['createdOn'] array does not preserve the order of the elements. In this query, doc['createdOn'].values[0] returns 4, not 2014.
POST /example/1
{
  "name": "apple",
  "createdOn": [2014, 8, 22, 5, 12, 32]
}
POST /example/2
{
  "name": "apple",
  "createdOn": [2011, 8, 22, 5, 12, 32]
}
POST /example/3
{
  "name": "apple",
  "createdOn": [2013, 8, 22, 5, 12, 32]
}
POST /example/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": "doc['createdOn'].values[0]"
      }
    }
  }
}
It appears that this is due to the way ElasticSearch caches data (http://elasticsearch-users.115913.n3.nabble.com/Accessing-array-field-within-Native-Plugin-td4042848.html):
The docFieldDoubles method gets its values from the in-memory
structures of the field data cache. This is done for performance. The
field data cache is not loaded from the source of the document (because
this would be slow) but from the Lucene index, where the values are
sorted (for lookup speed). The get API works based on the original
document source, which is why you see those values in order (note: ES
doesn't parse the source for the get API, it just gives you back
what you've put in it).
You can access the original document (which will be parsed) using the
SourceLookup (available from the source method) but it will be slow as
it needs to go to disk for every document.
I'm not sure about the exact semantics of what you are trying to
achieve, but did you try looking at nested objects? Those allow you to
store a list of objects in a way that keeps values together, like
[{"key": "k1", "value": "v1"}, ...].
The only apparent solution, other than using the source method (which is slow), is to use nested queries. Any ideas on how I could rewrite my query using nested queries? It seems like the only efficient way to sort this query by year.

CouchDB - hierarchical comments with ranking. Hacker News style

I'm trying to implement a basic way of displaying comments the way Hacker News does, using CouchDB: not only ordered hierarchically, but with each level of the tree ordered by a "points" variable.
The idea is that I want a view that returns the comments in the order I expect, rather than making many Ajax calls to retrieve them and then reordering them client-side.
This is what I got so far:
Each document is a "comment".
Each comment has a property path which is an ordered list containing all its parents.
So for example, imagine I have 4 comments (with _id 1, 2, 3 and 4). Comment 2 is children of 1, comment 3 is children of 2, and comment 4 is also children of 1. This is what the data would look like:
{ _id: 1, path: ["1"] },
{ _id: 2, path: ["1", "2"] },
{ _id: 3, path: ["1", "2", "3"] }
{ _id: 4, path: ["1", "4"] }
This works quite well for the hierarchy. A simple view will already return things ordered the way I want it.
The issue comes when I want to order each "level" of the tree independently. For example, documents 2 and 4 belong to the same branch, but on that level they are ordered by their ID. Instead I want them ordered based on a "points" variable that I want to add to the path - but I can't figure out where to add this variable for it to work the way I want.
Is there a way to do this? Consider that the "points" variable will change in time.
Because each level needs to be sorted recursively by score, Couch needs to know the score of each parent to make this work the way you want it to.
Taking your example with the following scores (1: 10, 2: 10, 3: 10, 4: 20), you'd want the ordering to come out like the following:
.1
.1.4
.1.2
.1.2.3
Your document needs a scores array like this:
{ _id: 1, path: [1], scores: [10] },
{ _id: 2, path: [1, 2], scores: [10,10] },
{ _id: 3, path: [1, 2, 3], scores: [10,10,10] },
{ _id: 4, path: [1, 4], scores: [10,20] }
Then you'll use the following sort key in your view.
emit([doc.scores, doc.path], doc)
The path gets used as a tiebreaker because there will be cases where sibling comments have the exact same score. Without the tiebreaker, their descendants could lose their grouping (by chain of ancestry).
Note: This approach returns scores from low to high, whereas you probably want scores high to low with the path/tiebreaker still low to high. A workaround is to populate the scores array with the inverse of each score (1/score), so that the default ascending key order puts higher raw scores first while leaving the path tiebreaker ascending:
{ _id: 1, path: [1], scores: [0.1] },
{ _id: 2, path: [1, 2], scores: [0.1, 0.1] },
{ _id: 3, path: [1, 2, 3], scores: [0.1, 0.1, 0.1] },
{ _id: 4, path: [1, 4], scores: [0.1, 0.05] }
With these keys the view already comes back in the desired order (.1, .1.4, .1.2, .1.2.3); requesting it with descending=true is not needed and would also reverse the path tiebreaker.
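A minimal map function for this scheme might look like the sketch below, assuming each comment document already stores its path and the inverted scores of itself and its ancestors (keeping those ancestor scores current when points change is the hard part and is not shown here); the view name is made up:
// design doc _design/ranking, view "comments_by_rank" (hypothetical names)
function (doc) {
  // Only index comment documents that carry the precomputed arrays.
  if (doc.path && doc.scores) {
    // Key: inverted ancestor scores first (ascending = higher raw score first),
    // then path as a tiebreaker so equal-score siblings keep their subtrees grouped.
    emit([doc.scores, doc.path], null);
  }
}
Emitting null and querying with include_docs=true keeps the view index small; that part is a general CouchDB habit rather than anything specific to this scheme.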
For anyone interested, there is also a mailing-list thread on this question with several variants of solutions:
http://mail-archives.apache.org/mod_mbox/couchdb-dev/201205.mbox/thread (thread "Hierarchical comments Hacker News style", 16/05/2012)
