I want to find the number of documents shared between the top authors and the top co-authors. Both are fields inside the biblio data field of the source in an index.
What I am currently doing is:
1. Calculating an aggregation on the top 10 authors (A, B, C, D, ...).
2. Calculating an aggregation on the top 10 co-authors (X, Y, Z, ...).
3. Calculating the doc count of the intersection, i.e. the count of common docs between these pairs:
[(A,X), (B,Y), ...]. <-----RESULT
I tried a sub-bucket aggregation, but it gave me:
[A:(top 10 corresponding to A), B:(top 10 corresponding to B), ...].
OK, I'm continuing from the comments above as an answer, since that is easier to read and has no character limit.
Comment
I don't think you can use a pipeline aggregation to achieve this.
There isn't much to process on the client side, I guess: only 20 records (10 for authors and 10 for co-authors), and it would be a simple aggregation query.
Another option would be to just get the top 10 across both fields, which is also a simple aggregation query.
But if you really need the intersection of both top 10s on the ES side, go with a Scripted Metric Aggregation; you can put your logic in the script.
The first option is as simple as:
GET index_name/_search
{
  "size": 0,
  "aggs": {
    "top_authors": {
      "terms": {
        "field": "authorFullName.keyword",
        "size": 10
      }
    },
    "top_coauthors": {
      "terms": {
        "field": "coauthorFullName.keyword",
        "size": 10
      }
    }
  }
}
and then you compute the intersection of the two result lists on the client side.
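The client-side step can be sketched as follows. This is a minimal Python sketch, not part of the original answer: it assumes the two bucket lists have already been pulled out of the search response, and the sample buckets and the helper names (bucket_counts, intersect_top_terms) are hypothetical.

```python
def bucket_counts(buckets):
    """Map each terms-aggregation bucket key to its doc_count."""
    return {b["key"]: b["doc_count"] for b in buckets}


def intersect_top_terms(author_buckets, coauthor_buckets):
    """Return names present in both bucket lists with both doc counts."""
    authors = bucket_counts(author_buckets)
    coauthors = bucket_counts(coauthor_buckets)
    common = sorted(authors.keys() & coauthors.keys())
    return {name: (authors[name], coauthors[name]) for name in common}


# Hypothetical buckets, shaped like the "buckets" arrays of a terms aggregation.
author_buckets = [
    {"key": "A", "doc_count": 12},
    {"key": "B", "doc_count": 9},
    {"key": "C", "doc_count": 4},
]
coauthor_buckets = [
    {"key": "B", "doc_count": 7},
    {"key": "X", "doc_count": 5},
    {"key": "A", "doc_count": 3},
]

result = intersect_top_terms(author_buckets, coauthor_buckets)
```

Each entry maps a shared name to its (author count, co-author count) pair, so the caller can decide which of the two counts to report.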
The second would look like:
GET index_name/_search
{
  "size": 0,
  "aggs": {
    "name_dupes": {
      "terms": {
        "script": {
          "source": "return [doc['authorFullName.keyword'].value,doc['coauthorFullName.keyword'].value]"
        },
        "size": 10
      }
    }
  }
}
but this is not really the intersection of the top 10 authors and the top 10 co-authors. It combines the values of both fields and then takes the top 10.
The third option is to write a Scripted Metric Aggregation. I didn't have time to work on the algorithmic side of things (it should be optimized), but it might look like the one below. Java skills will certainly help you. Also make sure you understand all the stages of Scripted Metric Aggregation execution and the performance issues you might run into when using it.
GET index_name/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "profit": {
      "scripted_metric": {
        "init_script": "state.fnames = [:]; state.lnames = [:];",
        "map_script": """
          def key = doc['authorFullName.keyword'];
          def value = '';
          if (key != null && key.value != null) {
            value = state.fnames[key.value];
            if (value == null) value = 0;
            state.fnames[key.value] = value + 1;
          }
          key = doc['coauthorFullName.keyword'];
          if (key != null && key.value != null) {
            value = state.lnames[key.value];
            if (value == null) value = 0;
            state.lnames[key.value] = value + 1;
          }
        """,
        "combine_script": "return state",
        "reduce_script": """
          def intersection = [];
          def f10_global = new HashSet();
          def l10_global = new HashSet();
          for (state in states) {
            def f10_local = state.fnames.entrySet().stream()
              .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
              .limit(10).map(e->e.getKey()).collect(Collectors.toList());
            def l10_local = state.lnames.entrySet().stream()
              .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
              .limit(10).map(e->e.getKey()).collect(Collectors.toList());
            for (name in f10_local) { f10_global.add(name); }
            for (name in l10_local) { l10_global.add(name); }
          }
          for (name in f10_global) {
            if (l10_global.contains(name)) intersection.add(name);
          }
          return intersection;
        """
      }
    }
  }
}
Just a note: the queries here assume you have a keyword sub-field on those properties. If not, just adjust them to your case.
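To make the stages of the scripted metric easier to follow, here is a rough Python simulation of the same logic. It is only a sketch under assumptions: each inner list stands for the documents of one shard, the field names author and coauthor are shortened stand-ins, and the function names are hypothetical. Note that, like the script above, it unions the per-shard top N, which is not necessarily the same as the global top N.

```python
from collections import Counter


def map_shard(docs):
    """init_script + map_script: count names per field on one shard."""
    state = {"fnames": Counter(), "lnames": Counter()}
    for doc in docs:
        if doc.get("author"):
            state["fnames"][doc["author"]] += 1
        if doc.get("coauthor"):
            state["lnames"][doc["coauthor"]] += 1
    return state  # combine_script returns the state unchanged


def reduce_states(states, n=10):
    """reduce_script: union the per-shard top n of each field, then intersect."""
    f_global, l_global = set(), set()
    for state in states:
        f_global.update(name for name, _ in state["fnames"].most_common(n))
        l_global.update(name for name, _ in state["lnames"].most_common(n))
    return sorted(f_global & l_global)


# Two hypothetical shards with two documents each.
shards = [
    [{"author": "A", "coauthor": "B"}, {"author": "A", "coauthor": "C"}],
    [{"author": "B", "coauthor": "A"}, {"author": "D", "coauthor": "B"}],
]

common = reduce_states([map_shard(docs) for docs in shards])
```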
UPDATE
P.S. I just noticed you mentioned that you need common counts, not common names. I'm not sure what your exact case is, but instead of map(e->e.getKey()) use map(e->e.getValue().toString()). See the other answer on a similar problem.
I ran into what seems to be a bug in Painless: if a vector function such as l2norm() is used, the outcome stays the same as in the first iteration. I'm using the Painless script in a function score; I hope the query below sheds some light. I'm throwing an exception to see the value in each iteration, and it is always the score of the first vector. I know this because I cycled the parameters a couple of times, and the score is always "stuck" on the first one. So what I think is happening is that l2norm() (and all vector functions?!) are object instances that can only be instantiated once. If that is the case, what would be a workaround?
Link to the ES discussion: https://discuss.elastic.co/t/painless-bug-using-for-loops-and-vector-functions/267263
{
"query": {
"nested": {
"path": "media",
"query": {
"function_score": {
"boost_mode": "replace",
"query": {
"bool": {
"filter": [{
"exists": {
"field": "media.full_body_dense_vector"
}
}]
}
},
"functions": [{
"script_score": {
"script": {
"source": "if (params.filterVectors.size() > 0 && params.filterCutOffScore >= 0) {\n for (int i=0; i < params.filterVectors.size();i++) {\n def c = params.filterVectors[i]; double euDistance = l2norm(c, doc['media.full_body_dense_vector']);\n if (i==1) { throw new Exception(euDistance + ''); } \n }\n return 1.0f;",
"params": {
"filterVectors":[
[1.0,2.0,3.0],[0.1,0.4,0.5]
],
"filterCutOffScore": 1.04
},
"lang": "painless"
}
}
}]
}
}
}
},
"size": 500,
"from": 0,
"track_scores": true
}
While l2norm is a static method, it certainly shouldn't return the same value regardless of its arguments!
I've investigated a bit and it seems there's only a loop-level bug. When you call l2norm outside of the loop with either parametrized or hard-coded vectors, the results will always be different -- as they should be. But not within the for loop (I've tested a while loop too -- same result). Here's a minimum reproducible example that could be used to report a bug on github:
"script": {
"source": """
def field = doc['media.full_body_dense_vector'];
def hardcodedVectors = [ [1,2,3], [0.1,0.4,0.5] ];
def noLoopDistances = [
l2norm(hardcodedVectors[0], field),
l2norm(hardcodedVectors[1], field)
];
def hardcodedDistances = [];
for (vector in hardcodedVectors) {
double euDistance = l2norm(vector, field);
hardcodedDistances.add(euDistance);
}
def parametrizedDistances = [];
for (vector in params.filterVectors) {
double euDistance = l2norm(vector, field);
parametrizedDistances.add(euDistance);
}
def comparisonMap = [
"no-loop": noLoopDistances,
"hardcoded": hardcodedDistances,
"parametrized": parametrizedDistances
];
Debug.explain(comparisonMap);
""",
"params": {
"filterVectors": [ [1,2,3], [0.1,0.4,0.5] ]
},
"lang": "painless"
}
which yields
{
"no-loop":[
8.558621384311845, // <-- the only time two different l2norm calls behave correctly
11.071133967619906
],
"parametrized":[
8.558621384311845,
8.558621384311845
],
"hardcoded":[
8.558621384311845,
8.558621384311845
]
}
What this tells me is that it's not a matter of runtime caching but rather something else that should be investigated further by the Elastic team.
The workaround, for now, would be to keep using the parametrized vectors but, instead of looping, perform stone-age-like checks:
if (params.filterVectors.length == 0) {
// default to something
} else if (params.filterVectors.length == 1) {
// call l2norm once
} else if (params.filterVectors.length == 2) {
// call l2norm twice, separately
}
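To check which of the returned values are correct, the expected distances can be computed outside Elasticsearch. The following Python sketch is only a reference calculation: the l2norm function here is a plain Euclidean distance written for illustration (not the Painless built-in), and doc_vector is a hypothetical stored vector.

```python
import math


def l2norm(query_vector, doc_vector):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(query_vector) != len(doc_vector):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(query_vector, doc_vector)))


filter_vectors = [[1, 2, 3], [0.1, 0.4, 0.5]]
doc_vector = [4.0, 6.0, 3.0]  # hypothetical value of the dense_vector field

# One distance per query vector; different query vectors should give different results.
distances = [l2norm(v, doc_vector) for v in filter_vectors]
```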
P.S. Throwing a new Exception() in order to debug Painless is fine. Using Debug.explain is even better for reasons explained in this sub-chapter on Debugging of my Elasticsearch Handbook.
First off, thanks to Joe for confirming I wasn't imagining things and it's indeed a bug. Second, the lovely ElasticSearch team has been triaging the issue and confirmed it's a bug, so the answer to this post is a link to the Github Issue so in the future, people can track in which ElasticSearch version this behaviour is patched.
I have two documents in Elasticsearch with the following values:
uid preferences
1 [10,20,30,40,50,60,70,80,100]
2 [20,70,30,100,1000,77,45]
Is there any way to do an array intersection on preferences for these two records and get the result [20,70,30,100]? Currently we fetch both records to the app server and compute the intersection there, but I wanted to check whether there is a direct way of getting the intersected values from Elasticsearch. Thank you.
I'd solve this using a parameterized scripted metric aggregation. Here's a more readable version:
{
"size": 0,
"query": {
"terms": {
"id": [
1,
2
]
}
},
"aggs": {
"preferences_intersection": {
"scripted_metric": {
"init_script": "state.shared_vals = [];",
"map_script": "state.shared_vals.addAll(new ArrayList(doc['preferences']));",
"combine_script": """
return state.shared_vals.stream()
.filter(i -> Collections.frequency(state.shared_vals, i) >= params.compared_docs_count)
.sorted((o1, o2) -> o1.compareTo(o2))
.collect(Collectors.toCollection(TreeSet::new))
""",
"reduce_script": "return states[0]",
"params": {
"compared_docs_count": 2
}
}
}
}
}
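Notice how the terms query is combined with params.compared_docs_count so that we can check the number of occurrences of the common values. Note that reduce_script returns only states[0], so as written this works only when the compared documents live on the same shard.
Here's the compact version of the query without triple quotes: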
{"size":0,"query":{"terms":{"id":[1,2]}},"aggs":{"preferences_intersection":{"scripted_metric":{"init_script":"state.shared_vals = [];","map_script":"state.shared_vals.addAll(new ArrayList(doc['preferences']));","combine_script":" return state.shared_vals.stream()\n .filter(i -> Collections.frequency(state.shared_vals, i) >= params.compared_docs_count)\n .sorted((o1, o2) -> o1.compareTo(o2))\n .collect(Collectors.toCollection(TreeSet::new))","reduce_script":"return states[0]","params":{"compared_docs_count":2}}}}}
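For comparison, the same frequency-based idea can be sketched in plain Python, which is roughly what the app-server approach from the question would do. This is an illustrative sketch; the function name intersect_preferences is hypothetical, and unlike the script it counts each value only once per document.

```python
from collections import Counter


def intersect_preferences(docs, compared_docs_count):
    """Keep values that occur in at least compared_docs_count documents."""
    counts = Counter()
    for preferences in docs:
        counts.update(set(preferences))  # count each value once per document
    return sorted(v for v, c in counts.items() if c >= compared_docs_count)


docs = [
    [10, 20, 30, 40, 50, 60, 70, 80, 100],
    [20, 70, 30, 100, 1000, 77, 45],
]

common = intersect_preferences(docs, compared_docs_count=2)
```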
If you use Elasticsearch 5.5 with dynamic field mapping and index decimal values, these values get the float type, as I can see when I check the mappings. When you then run a terms aggregation, the bucket keys lose precision: the value 0.62 becomes something like 0.6200000047683716.
Code fragment
"aggregations": {
"float_numbers": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 0.6200000047683716,
"doc_count": 1
}
]
}
}
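The key above is what you get when 0.62 is stored as a 32-bit float and then widened back to a 64-bit double. A small Python sketch (using only the standard struct module; the helper name as_float32 is hypothetical) reproduces the effect outside Elasticsearch:

```python
import struct


def as_float32(value):
    """Round a double to the nearest 32-bit float, then widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# The widened value is no longer the double closest to 0.62.
widened = as_float32(0.62)
error = abs(widened - 0.62)
```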
The same issue is described in the linked question.
I am posting this because I found a solution that I have not seen elsewhere and that helped me a lot.
The solution is to map the field as a double instead of a float. This can be achieved with dynamic templates (see the Elasticsearch documentation on dynamic templates and dynamic field mapping).
Example solution:
Add dynamic_templates to an index that has no documents yet.
PUT term-test
{
"mappings": {
"demo_typ": {
"dynamic_templates": [
{
"all_to_double": {
"match_mapping_type": "double",
"mapping": {
"type": "double"
}
}
}
]
}
}
}
Add data
POST term-test/demo_typ
{
"numeric_field": 0.62,
"long_filed": 44
}
Check mapping
GET term-test/_mapping
Do aggregation
GET term-test/_search
{
"query": {
"match_all": {}
},
"aggs": {
"float_numbers": {
"terms": {
"field": "numeric_field"
}
}
}
}
In the Java Api you can do the following
1: First create the index
elasticClient.admin()
.indices()
.prepareCreate(indexName)
.execute()
.actionGet();
2: Update the mapping
JSON
{
"dynamic_templates": [
{
"all_to_double": {
"match_mapping_type": "double",
"mapping": {
"type": "double"
}
}
}
]
}
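JSON to XContentBuilder (I got the code from the linked answer):
// Imports assume Jackson 2.x and the Elasticsearch 5.x Java API; adjust to your versions.
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import java.io.IOException;
import java.util.Map;

public XContentBuilder getXContentBuilderFromJson(final String json) {
    try {
        // Parse the JSON into a generic map, then let Elasticsearch serialize it.
        Map<String, Object> map = new ObjectMapper().readValue(json, new TypeReference<Map<String, Object>>() {});
        return XContentFactory.jsonBuilder().map(map);
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}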
Update mapping
elasticClient.admin().indices()
.preparePutMapping(indexName)
.setType(yourType)
.setSource(getXContentBuilderFromJson(json))
.execute()
.actionGet();
3: Insert data
Numbers lose precision. This is because of how floating-point numbers work: 0.62 can't be expressed as a * 2^b with integers a and b, so neither doubles nor floats can represent it exactly.
Because floats and doubles cannot represent such values exactly, it is generally a bad idea to run terms aggregations on them.
As a workaround, you can round the bucket keys (for example with Math.round) after running the aggregation.
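A client-side version of that rounding workaround might look like the sketch below. It is written in Python rather than Java for brevity, the helper name round_bucket_keys is hypothetical, and the choice of two decimal places is an assumption about the data. Buckets whose keys collapse to the same rounded value are merged.

```python
def round_bucket_keys(buckets, decimals=2):
    """Round terms-aggregation keys and merge buckets that end up with the same key."""
    merged = {}
    for bucket in buckets:
        key = round(bucket["key"], decimals)
        merged[key] = merged.get(key, 0) + bucket["doc_count"]
    return [{"key": key, "doc_count": count} for key, count in merged.items()]


# Bucket shaped like the aggregation response shown earlier in this thread.
buckets = [{"key": 0.6200000047683716, "doc_count": 1}]
cleaned = round_bucket_keys(buckets)
```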
We're trying to replicate this ES plugin https://github.com/MLnick/elasticsearch-vector-scoring. The reason is AWS ES doesn't allow any custom plugin to be installed. The plugin is just doing dot product and cosine similarity so I'm guessing it should be really simple to replicate that in painless script. It looks like groovy scripting is deprecated in 5.0.
Here's the source code of the plugin.
/**
* @param params index that a scored are placed in this parameter. Initialize them here.
*/
@SuppressWarnings("unchecked")
private PayloadVectorScoreScript(Map<String, Object> params) {
params.entrySet();
// get field to score
field = (String) params.get("field");
// get query vector
vector = (List<Double>) params.get("vector");
// cosine flag
Object cosineParam = params.get("cosine");
if (cosineParam != null) {
cosine = (boolean) cosineParam;
}
if (field == null || vector == null) {
throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": field or vector parameter missing!");
}
// init index
index = new ArrayList<>(vector.size());
for (int i = 0; i < vector.size(); i++) {
index.add(String.valueOf(i));
}
if (vector.size() != index.size()) {
throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": index and vector array must have same length!");
}
if (cosine) {
// compute query vector norm once
for (double v: vector) {
queryVectorNorm += Math.pow(v, 2.0);
}
}
}
@Override
public Object run() {
float score = 0;
// first, get the ShardTerms object for the field.
IndexField indexField = this.indexLookup().get(field);
double docVectorNorm = 0.0f;
for (int i = 0; i < index.size(); i++) {
// get the vector value stored in the term payload
IndexFieldTerm indexTermField = indexField.get(index.get(i), IndexLookup.FLAG_PAYLOADS);
float payload = 0f;
if (indexTermField != null) {
Iterator<TermPosition> iter = indexTermField.iterator();
if (iter.hasNext()) {
payload = iter.next().payloadAsFloat(0f);
if (cosine) {
// doc vector norm
docVectorNorm += Math.pow(payload, 2.0);
}
}
}
// dot product
score += payload * vector.get(i);
}
if (cosine) {
// cosine similarity score
if (docVectorNorm == 0 || queryVectorNorm == 0) return 0f;
return score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm));
} else {
// dot product score
return score;
}
}
I'm trying to start with just getting a field from the index, but I'm getting an error.
Here's the shape of my index.
I've enabled delimited_payload_filter
"settings" : {
"analysis": {
"analyzer": {
"payload_analyzer": {
"type": "custom",
"tokenizer":"whitespace",
"filter":"delimited_payload_filter"
}
}
}
}
And I have a field called #model_factor to store a vector.
{
"movies" : {
"properties" : {
"#model_factor": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "payload_analyzer"
}
}
}
}
And this is the shape of the document
{
"#model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
"name": "Test 1"
}
Here's how I use the script
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "def termInfo = doc['_index']['#model_factor'].get('1', 4);",
"lang": "painless",
"params": {
"field": "#model_factor",
"vector": [0.1,2.3,-1.6,0.7,-1.3],
"cosine" : true
}
}
},
"boost_mode": "replace"
}
}
}
And this is the error I got.
"failures": [
{
"shard": 2,
"index": "test",
"node": "ShL2G7B_Q_CMII5OvuFJNQ",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"caused_by": {
"type": "wrong_method_type_exception",
"reason": "wrong_method_type_exception: cannot convert MethodHandle(List,int)int to (Object,String)String"
},
"script_stack": [
"termInfo = doc['_index']['#model_factor'].get('1',4);",
" ^---- HERE"
],
"script": "def termInfo = doc['_index']['#model_factor'].get('1',4);",
"lang": "painless"
}
}
]
The question is how do I access the index field to get #model_factor in painless scripting?
Option 1
Because #model_factor is a text field, it can be accessed in Painless scripting by setting fielddata=true in the mapping. So the mapping should be:
{
"movies" : {
"properties" : {
"#model_factor": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "payload_analyzer",
"fielddata" : true
}
}
}
}
And then it can be scored by accessing doc values:
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "return Double.parseDouble(doc['#model_factor'].get(1)) * params.vector[1];",
"lang": "painless",
"params": {
"vector": [0.1,2.3,-1.6,0.7,-1.3]
}
}
},
"boost_mode": "replace"
}
}
}
Problems with Option 1
So it is possible to access the field data value by setting fielddata=true, but in this case the value is the vector index as a term, not the vector value, which is stored in the payload. Unfortunately, it looks like there is no way to access the token payload (where the real vector value is stored) using Painless scripting and doc values. See the source code for Elasticsearch and another similar question about accessing term info.
So the answer is that it is NOT possible to access the payload using Painless scripting.
I also tried storing the vector values with a simple pattern tokenizer, but when accessing the term vector values the order is not preserved. This is probably why the author of the plugin decided to use the position as a term string, i.e. to look up position 0 of the vector as the term "0" and then read the real vector value from the payload.
Option 2
A very simple alternative is to use n fields in the documents, each representing a position in the vector. In your example, we have a 5-dimensional vector with values stored in v0...v4 directly as doubles:
{
"#model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
"name": "Test 1",
"v0" : 1.2,
"v1" : 0.1,
"v2" : 0.4,
"v3" : -0.2,
"v4" : 0.3
}
and then the painless scripting should be:
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "return doc['v0'].getValue() * params.vector[0];",
"lang": "painless",
"params": {
"vector": [0.1,2.3,-1.6,0.7,-1.3]
}
}
},
"boost_mode": "replace"
}
}
}
It should be easy to iterate over the input vector length and fetch the fields dynamically to calculate the full dot product, extending the doc['v0'].getValue() * params.vector[0] expression that I wrote for simplicity.
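As a reference for what the extended script should return, here is the same loop sketched in Python over the example document above. This is not Painless: the function name dot_product and the prefix parameter are hypothetical, and the dictionary stands in for the doc values of the v0...v4 fields.

```python
def dot_product(doc, vector, prefix="v"):
    """Multiply each query component by the matching v<i> field and sum the products."""
    return sum(doc[f"{prefix}{i}"] * weight for i, weight in enumerate(vector))


# Field values from the example document and the query vector from the script params.
doc = {"v0": 1.2, "v1": 0.1, "v2": 0.4, "v3": -0.2, "v4": 0.3}
query_vector = [0.1, 2.3, -1.6, 0.7, -1.3]

score = dot_product(doc, query_vector)
```

Comparing this value with the score Elasticsearch returns for the same document is a quick way to verify the script.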
Problems with Option 2
Option 2 is viable as long as the vector dimension stays small. I think the default Elasticsearch limit on the number of fields in an index is 1000, but it can be changed, also in an AWS environment:
curl -X PUT \
  'https://.../indexName/_settings' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "index.mapping.total_fields.limit": 2000
  }'
Moreover, the script speed should also be tested on a large number of documents.
Maybe it is a viable solution in re-scoring / re-ranking scenarios.
Option 3
The third option is really an experiment, and the most fascinating one in my opinion.
It tries to exploit Elasticsearch's internal representation of the Vector Space Model and does not use any scripting to score; instead it reuses the default similarity score based on tf/idf.
Lucene, which sits at the core of Elasticsearch, already uses a modification of cosine similarity internally to calculate the similarity score between documents in its Vector Space Model representation of terms, as the formula below, taken from the TFIDFSimilarity javadoc, shows:
score(q,d) = coord(q,d) * queryNorm(q) * SUM over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )
In particular, the weights of the vector representing the field are the tf/idf values of the terms of that field.
So we could index a document with term vectors, using the vector index as the term. If we repeat the term N times, we represent the value at that position, exploiting the tf part of the scoring formula.
This means that the domain of the vector must be transformed and rescaled into the positive integers {1, 2, ...}. We start from 1 so that we are sure all documents contain all the terms, which makes it easier to exploit the formula.
For example, the vector [21, 54, 45] can be indexed as a field in a document using a simple whitespace analyzer and the following value:
{
"#model_factor" : "0<repeated 21 times> 1<repeated 54 times> 2<repeated 45 times>",
"name": "Test 1"
}
Then, to query, i.e. to calculate the dot product, we boost the individual terms that represent the index positions of the vector.
So, continuing the example above, the input vector [45, 1, 1] will be transformed into the query:
"should": [
{
"term": {
"#model_factor": {
"value": "0",
"boost": 45
}
}
},
{
"term": {
"#model_factor": "1" // boost:1 by default
}
},
{
"term": {
"#model_factor": "2" // boost:1 by default
}
}
]
norm(t,d) should be disabled in the mapping so that it is not used in the formula above. The idf part is constant for all the documents because all of them contain all the terms (all the vectors have the same dimension).
queryNorm(q) is the same for all the documents in the formula above so it is not a problem.
coord(q,d) is a constant because all the documents contain all the terms.
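The two transformations described for this option, encoding a vector as repeated terms and turning a query vector into boosted term clauses, could be generated with a sketch like the one below. It is Python for illustration only; the function names encode_vector and build_should_clauses are hypothetical, and the field name defaults to the #model_factor field used in this thread.

```python
def encode_vector(vector):
    """Repeat each position index as a term, once per unit of its value."""
    terms = []
    for position, value in enumerate(vector):
        if not isinstance(value, int) or value < 1:
            raise ValueError("values must be positive integers")
        terms.extend([str(position)] * value)
    return " ".join(terms)


def build_should_clauses(query_vector, field="#model_factor"):
    """Build one boosted term clause per position of the query vector."""
    return [
        {"term": {field: {"value": str(position), "boost": boost}}}
        for position, boost in enumerate(query_vector)
    ]


field_value = encode_vector([21, 54, 45])    # text for the indexed field
clauses = build_should_clauses([45, 1, 1])   # contents of the "should" array
```

The clauses list goes into the should array of a bool query; a boost of 1 is written out explicitly here instead of relying on the default.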
Problems with Option 3
It needs to be tested.
It works only for vectors of positive numbers; see this question on math stackoverflow for making it work with negative numbers as well.
It is not exactly the same as a dot product, but it is very close for finding similar documents based on raw vectors.
Scalability with large vector dimensions can be an issue at query time, because an N-dimensional vector requires a query with N differently boosted terms.
I will try it on a test index and edit this post with the results.
I have documents of the following format in an Elasticsearch index:
{
"item":"Firefox",
"tags":["a","b","c"]
},
{
"item":"Chrome",
"tags":["b","c","d"]
}
I want to group by each element in the tags property, so that I get results like:
"a" = 1, "b" = 2, "c" = 2, "d" = 1
Any help or pointers would be appreciated.
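If you index (write) your documents to index = x and type = y, then:
POST x/y/_search
{
  "size": 0,
  "aggs": {
    "t": {
      "terms": {
        "field": "tags"
      }
    }
  }
}
The terms aggregation treats each element of the tags array as a separate value, so every tag gets its own bucket with a doc_count. If tags is mapped as a text field, aggregate on its keyword sub-field (for example tags.keyword) instead.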
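To show what the aggregation computes, here is the equivalent counting done in plain Python on the two sample documents. It is only an illustration of the expected result; the function name count_tags is hypothetical.

```python
from collections import Counter


def count_tags(docs):
    """Count, for each tag, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.get("tags", [])))  # one count per document and tag
    return dict(counts)


docs = [
    {"item": "Firefox", "tags": ["a", "b", "c"]},
    {"item": "Chrome", "tags": ["b", "c", "d"]},
]

tag_counts = count_tags(docs)
```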