Elasticsearch - group by element in child collection - elasticsearch

I have documents of following format in an elastic search index:
{
"item":"Firefox",
"tags":["a","b","c"]
},
{
"item":"Chrome",
"tags":["b","c","d"]
}
I want to group by each element in the tags property, so that I get results like:
"a" = 1, "b" = 2, "c" = 2, "d" = 1
Any help or pointers would be appreciated.

If you index (write) your document to,
index= x, type =y then ,
POST x/y/_search
{
"size":0,
"aggs":{
"t":{
"terms" :{
"field" :"tags"
}
}
}
}
To know its working, just learn elasticsearch.

Related

elasticsearch: how to add input value with existing value for integer field

I have a document existing in ES like this:
{
"noteCount": 1,
....
}
there is another incoming document like this:
{
"noteCount": 2,
....
}
While updating the ES document, for the noteCount object, I need to save it as 3 which is 1 + 2.
Any way I can make it work?
You can use painless scripting with an update query.
PUT /test/_doc/1?pretty
{
"noteCount" : 1,
"tags" : ["red"]
}
POST /test/_update/1?pretty
{
"script" : {
"source": "ctx._source.noteCount += params.count",
"lang": "painless",
"params" : {
"count" : 1
}
}
}
GET /test/_doc/1

Finding common values of array field of two documents in ElasticSearch

I have two documents in Elastic search with the following values
uid preferences
1 [10,20,30,40,50,60,70,80,100]
2 [20,70,30,100,1000,77,45]
Is there any way we can do array intersect on preferences for these two records and get the result [20,70,30,100] ? Currently we are getting these two records to app server and doing intersect , but wanted to check if there is any direct way of getting the intersect values from Elasticsearch directly .Thank You .
I'd solve this using a parameterized scripted metric aggregation. Here's a more readable version:
{
"size": 0,
"query": {
"terms": {
"id": [
1,
2
]
}
},
"aggs": {
"preferences_intersection": {
"scripted_metric": {
"init_script": "state.shared_vals = [];",
"map_script": "state.shared_vals.addAll(new ArrayList(doc['preferences']));",
"combine_script": """
return state.shared_vals.stream()
.filter(i -> Collections.frequency(state.shared_vals, i) >= params.compared_docs_count)
.sorted((o1, o2) -> o1.compareTo(o2))
.collect(Collectors.toCollection(TreeSet::new))
""",
"reduce_script": "return states[0]",
"params": {
"compared_docs_count": 2
}
}
}
}
}
Notice how the terms query was applied along with params.compared_docs_count so we can check the # of occurrences of the common values.
Here's the compact version of the query without triple quotes:
{"size":0,"query":{"terms":{"id":[1,2]}},"aggs":{"preferences_intersection":{"scripted_metric":{"init_script":"state.shared_vals = [];","map_script":"state.shared_vals.addAll(new ArrayList(doc['preferences']));","combine_script":" return state.shared_vals.stream()\n .filter(i -> Collections.frequency(state.shared_vals, i) >= params.compared_docs_count)\n .sorted((o1, o2) -> o1.compareTo(o2))\n .collect(Collectors.toCollection(TreeSet::new))","reduce_script":"return states[0]","params":{"compared_docs_count":2}}}}}

Elasticsearch Common of two Aggregations

I want to find the common doc counts of aggregation on top authors and top co-authors which are fields inside biblio data field of source in an index.
What I am currently doing is:
1.Calculating Aggregation on top 10 authors.(A,B,C,D.....).
2.Calculating Aggregation on top 10 co-authors (X,Y,Z,....).
3.Calculating doc count of intersection like count of common docs between these pairs :
[(A,X), (B,Y)....]. <-----RESULT
I tried sub-bucket aggregation but it gave me :
[A:(top 10 corresponding A), B:(top 10 corresponding B).....].
Ok, so from the comments above continue as an answer to make it easier to read and no character limit.
Comment
I don't think you can use pipeline aggregation to achieve it.
It's not a lot to process on client side i guess. only 20 records (10 for authors and 10 for co-authors) and it would be simple aggregate query.
Another option would be to just get top 10 across both fields and also simple agg query.
But if you really need intersection of both top10s on ES side go with Scripted Metric Aggregation. you can lay your logic in the code
First option is as simple as:
GET index_name/_search
{
"size": 0,
"aggs": {
"firstname_dupes": {
"terms": {
"field": "authorFullName.keyword",
"size": 10
}
},
"lastname_dupes": {
"terms": {
"field": "coauthorFullName.keyword",
"size": 10
}
}
}
}
and then you do intersection of the results on the client side.
Second would look like:
GET index_name/_search
{
"size": 0,
"aggs": {
"name_dupes": {
"terms": {
"script": {
"source": "return [doc['authorFullName.keyword'].value,doc['coauthorFullName.keyword'].value]"
}
, "size": 10
}
}
}
}
but it's not really an intersection of top10 authors and top10 coauthors. it's an intersection of all and then getting top10.
The third option is to write Scripted Metric Aggregation. Didn't have time to spend on algorithmic side of things (it should be optimized) but it might look as this one. For sure java skills will help you. Also make sure you understand all the stages of Scripted Metric Aggregation execution and performance issues you might have using it.
GET index_name/_search
{
"size": 0,
"query" : {
"match_all" : {}
},
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : "state.fnames = [:];state.lnames = [:];",
"map_script" :
"""
def key = doc['authorFullName.keyword'];
def value = '';
if (key != null && key.value != null) {
value = state.fnames[key.value];
if(value==null) value = 0;
state.fnames[key.value] = value+1
}
key = doc['coauthorFullName.keyword'];
if (key != null && key.value != null) {
value = state.lnames[key.value];
if(value==null) value = 0;
state.lnames[key.value] = value+1
}
""",
"combine_script" : "return state",
"reduce_script" :
"""
def intersection = [];
def f10_global = new HashSet();
def l10_global = new HashSet();
for (state in states) {
def f10_local = state.fnames.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(10).map(e->e.getKey()).collect(Collectors.toList());
def l10_local = state.lnames.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(10).map(e->e.getKey()).collect(Collectors.toList());
for(name in f10_local){f10_global.add(name);}
for(name in l10_local){l10_global.add(name);}
}
for(name in f10_global){
if(l10_global.contains(name)) intersection.add(name);
}
return intersection;
"""
}
}
}
}
Just a note, the queries here assume you have keyword on those properties. If not just adjust them to your case.
UPDATE
PS, just noticed you mentioned you need common counts, not common names. not sure what is the case but instead of map(e->e.getKey()) use map(e->e.getValue().toString()). See the other answer on similar problem

Elasticsearch filter where poi nearby

I want to perform the following pseudo query on Elastic:
SELECT SomeTypeOfObjects WHERE distance to
PointOfInterests < 20km AND PointOfInterests.type = 'bar';
I have the following data in Elastic:
SomeTypeOfObjects (around 10.000.000 rows)
----------
id = x
geo_location = x,x
PointOfInterests( around 400.000 rows )
---------
id = x
geo_location = x,x
type=bar, hospital, etc
Would this be possible without 2 queries or feeding the query all possible geo locations?
If you saved your lat,lon as geo-point, you should be able to execute a geo_distance query like this:
GET /my_locations/location/_search
{
"query": {
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "12km",
"pin.location" : {
"lat" : 40,
"lon" : -70
}
}
}
}
}
}
Which apparently doesn't answer your question. I guess the best option would be to
1. Get coordinates of all bars
2. Use geo_distance with multiple points:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-query.html

How can I do this in painless script Elasticsearch 5.3

We're trying to replicate this ES plugin https://github.com/MLnick/elasticsearch-vector-scoring. The reason is AWS ES doesn't allow any custom plugin to be installed. The plugin is just doing dot product and cosine similarity so I'm guessing it should be really simple to replicate that in painless script. It looks like groovy scripting is deprecated in 5.0.
Here's the source code of the plugin.
/**
* #param params index that a scored are placed in this parameter. Initialize them here.
*/
#SuppressWarnings("unchecked")
private PayloadVectorScoreScript(Map<String, Object> params) {
params.entrySet();
// get field to score
field = (String) params.get("field");
// get query vector
vector = (List<Double>) params.get("vector");
// cosine flag
Object cosineParam = params.get("cosine");
if (cosineParam != null) {
cosine = (boolean) cosineParam;
}
if (field == null || vector == null) {
throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": field or vector parameter missing!");
}
// init index
index = new ArrayList<>(vector.size());
for (int i = 0; i < vector.size(); i++) {
index.add(String.valueOf(i));
}
if (vector.size() != index.size()) {
throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": index and vector array must have same length!");
}
if (cosine) {
// compute query vector norm once
for (double v: vector) {
queryVectorNorm += Math.pow(v, 2.0);
}
}
}
#Override
public Object run() {
float score = 0;
// first, get the ShardTerms object for the field.
IndexField indexField = this.indexLookup().get(field);
double docVectorNorm = 0.0f;
for (int i = 0; i < index.size(); i++) {
// get the vector value stored in the term payload
IndexFieldTerm indexTermField = indexField.get(index.get(i), IndexLookup.FLAG_PAYLOADS);
float payload = 0f;
if (indexTermField != null) {
Iterator<TermPosition> iter = indexTermField.iterator();
if (iter.hasNext()) {
payload = iter.next().payloadAsFloat(0f);
if (cosine) {
// doc vector norm
docVectorNorm += Math.pow(payload, 2.0);
}
}
}
// dot product
score += payload * vector.get(i);
}
if (cosine) {
// cosine similarity score
if (docVectorNorm == 0 || queryVectorNorm == 0) return 0f;
return score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm));
} else {
// dot product score
return score;
}
}
I'm trying to start with just getting a field from index. But I'm getting error.
Here's the shape of my index.
I've enabled delimited_payload_filter
"settings" : {
"analysis": {
"analyzer": {
"payload_analyzer": {
"type": "custom",
"tokenizer":"whitespace",
"filter":"delimited_payload_filter"
}
}
}
}
And I have a field called #model_factor to store a vector.
{
"movies" : {
"properties" : {
"#model_factor": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "payload_analyzer"
}
}
}
}
And this is the shape of the document
{
"#model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
"name": "Test 1"
}
Here's how I use the script
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "def termInfo = doc['_index']['#model_factor'].get('1', 4);",
"lang": "painless",
"params": {
"field": "#model_factor",
"vector": [0.1,2.3,-1.6,0.7,-1.3],
"cosine" : true
}
}
},
"boost_mode": "replace"
}
}
}
And this is the error I got.
"failures": [
{
"shard": 2,
"index": "test",
"node": "ShL2G7B_Q_CMII5OvuFJNQ",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"caused_by": {
"type": "wrong_method_type_exception",
"reason": "wrong_method_type_exception: cannot convert MethodHandle(List,int)int to (Object,String)String"
},
"script_stack": [
"termInfo = doc['_index']['#model_factor'].get('1',4);",
" ^---- HERE"
],
"script": "def termInfo = doc['_index']['#model_factor'].get('1',4);",
"lang": "painless"
}
}
]
The question is how do I access the index field to get #model_factor in painless scripting?
Option 1
Due to the fact that #model_factor is a text field, in painless scripting, it would be possible to access it, setting fielddata=true in the mapping. So the mapping should be:
{
"movies" : {
"properties" : {
"#model_factor": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "payload_analyzer",
"fielddata" : true
}
}
}
}
And then it can be scored accessing doc-values:
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "return Double.parseDouble(doc['#model_factor'].get(1)) * params.vector[1];",
"lang": "painless",
"params": {
"vector": [0.1,2.3,-1.6,0.7,-1.3]
}
}
},
"boost_mode": "replace"
}
}
}
Problems with Option 1
So it is possible to access the field data value setting fielddata=true, but in this case, the value is the vector index as a term, not the value of the vector which is stored in the payload. Unfortunately, it looks like there is no way to access the Token Payload (where the real vector index value is stored) using painless scripting and doc-values. See the source code for elasticsearch and another similar question re: accessing term info.
So the answer is that using painless scripting is NOT possible to access the payload.
I tried also to store the vector values with a simple pattern tokenizer but when accessing the term vector values the order is not preserved, and this is probably the reason for which the author of the plugin decided to use the term as a string and then retrieve the position 0 of the vector as the term "0" and then find the real vector value in the payload.
Option 2
A very simple alternative is to use n fields in the documents, each of them represents a position in the vector, so in your example, we have a 5 dim vector with values stored in v0...v4 directly as double:
{
"#model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
"name": "Test 1",
"v0" : 1.2,
"v1" : 0.1,
"v2" : 0.4,
"v3" : -0.2,
"v4" : 0.3
}
and then the painless scripting should be:
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "return doc['v0'].getValue() * params.vector[0];",
"lang": "painless",
"params": {
"vector": [0.1,2.3,-1.6,0.7,-1.3]
}
}
},
"boost_mode": "replace"
}
}
}
It should be easily possible to iterate on the input vector length and get the fields dynamically to calculate the dot product modifying doc['v0'].getValue() * params.vector[0] that I wrote for simplicity.
Problems with Option2
Option 2 is viable as long as the vector dimension remains not big. I think that default Elasticsearch max number of fields per document is 1000, but it can be changed also in AWS environment:
curl -X PUT \
'https://.../indexName/_settings' \
-H 'cache-control: no-cache' \
-H 'content-type: application/json'
-d '{
"index.mapping.total_fields.limit": 2000
}'
Moreover, it should be tested also the script speed on a large number of documents.
Maybe in re-scoring / re-ranking scenarios, it is a viable solution.
Option 3
The third option is really an experiment and the most fascinating in my opinion.
It tries to exploit the internal Elasticsearch representation of the Vector Space Model and does not use any scripting to score but reuse the default similarity score based on tf/idf.
Lucene, that seats at Elasticsearch core, is already using internally a modification of the cosine similarity to calculate the similarity score between documents in his Vector Space Model representation of terms as the formula below, taken from the TFIDFSImilarity javadoc, shows:
In particular, the weights of the vector representing the field are the tf/idf values of the terms of that field.
So we could index a document with termvectors, using as term the index of the vector. If we repeat it N times, we represent the value of the vector, exploiting the tf part of the scoring formula.
This means that the domain of the vector should be transformed and rescaled in {1.. Infinite} Positive Integer numbers domain. We start from 1 so that we are sure that all the documents contain all the terms, it will make it easier to exploit the formula.
For example, the vector: [21, 54, 45] can be indexed as a field in a document using a simple whitespace analyzer and the following value:
{
"#model_factor" : "0<repeated 21 times> 1<repeated 54 times> 2<repeated 45 times>",
"name": "Test 1"
}
then to query, i.e. calculate the dot product, we boost the single terms that represent the index position of the vector.
So using the same example above the input vector: [45, 1, 1] will be transformed in the query:
"should": [
{
"term": {
"#model_factor": {
"value": "0",
"boost": 45
}
}
},
{
"term": {
"#model_factor": "1" // boost:1 by default
}
},
{
"term": {
"#model_factor": "2" // boost:1 by default
}
}
]
norm(t,d) should be disabled in the mapping so that it is not used in the formula above. The idf part is constant for all the documents because all of them contains all the terms (having all the vectors the same dimension).
queryNorm(q) is the same for all the documents in the formula above so it is not a problem.
coord(q,d) is a constant because all the documents contain all the terms.
Problems with Option 3
Need to be tested.
It works only for positive numbers vectors, see this question in math stackoverflow for making it works also for negative numbers.
It is not the exact same of a dot product but very close to find similar documents based on raw vectors.
Scalability on large vector dimension can be an issue at querying time because this means we need to do a N dim terms query with different boosting.
I will try it in a test index and edit this question with the results.

Resources