scaled_float being presented as string - ruby

I have several fields mapped as scaled_float. The problem is that when I run a search, those fields are returned as strings instead of numbers.
In my Elasticsearch document model I have the following:
indexes :price, type: 'float', index: false
and when I search, the result comes back like this:
...,
"price": "89.00"
but I wanted it to be
"price": 89.00
Is it possible?

Related

Elasticsearch performance impact on choosing mapping structure for index

I am receiving data in a format like,
{
name:"index_name",
status: "good",
datapoints: [{
paramType: "ABC",
batch: [{
time:"timestamp1<epoch in sec>",
value: "123"
},{
time:"timestamp2<epoch in sec>",
value: "123"
}]
},
{
paramType: "XYZ",
batch: [{
time:"timestamp1<epoch in sec>",
value: "123"
},{
time:"timestamp2<epoch in sec>",
value: "124"
}]
}]
}
I would like to store the data into elasticsearch in such a way that I can query based on a timerange, status or paramType.
As mentioned here, I can define datapoints or batch as a nested data type, which allows the objects inside the array to be indexed independently.
Another way I can think of is to split the structure into separate documents, e.g.
{
name : "index_name",
status: "good",
paramType:"ABC",
time:"timestamp<epoch in sec>",
value: "123"
}
Which one will be the most efficient?
If I choose the second way, I know there may be ~1000 elements in the batch array and 10-15 entries in the paramType array, which means ~15k documents will be generated and 15k*5 fields (= 75k) key-value pairs will be repeated in the index?
Here, the advantages and disadvantages of using nested are explained, but no performance-related stats are provided. In my case, there won't be any updates to the inner objects, so I'm not sure which one will be better. Also, I have two nested objects, so I would like to know how I can query for data within a time range if I use nested.
A flat structure will perform better than a nested one. Nested queries are slower than term queries. Also, at index time a single nested document is internally represented as a bunch of documents; they are just indexed in the same block.
As long as your requirements are met, the second option works better.
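A minimal sketch of the second option, written in Python for illustration. It flattens one incoming payload into one document per reading and builds a time-range filter. The helper names `flatten_payload` and `time_range_query` are hypothetical, and the sketch assumes that `time` and `value` are converted to numbers before indexing and that `status` and `paramType` are mapped as exact-match (not analyzed) fields. The `bool`/`filter` form also assumes a reasonably recent Elasticsearch query DSL.

```python
from typing import Any, Dict, Iterator, List, Optional


def flatten_payload(payload: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat document per (paramType, batch entry) pair."""
    for datapoint in payload["datapoints"]:
        for entry in datapoint["batch"]:
            yield {
                "name": payload["name"],
                "status": payload["status"],
                "paramType": datapoint["paramType"],
                # Assumption: store numbers, not strings, so ranges compare numerically.
                "time": int(entry["time"]),
                "value": float(entry["value"]),
            }


def time_range_query(start: int, end: int,
                     status: Optional[str] = None,
                     param_type: Optional[str] = None) -> Dict[str, Any]:
    """Build a search body filtering on time and, optionally, status/paramType."""
    filters: List[Dict[str, Any]] = [{"range": {"time": {"gte": start, "lte": end}}}]
    if status is not None:
        filters.append({"term": {"status": status}})
    if param_type is not None:
        filters.append({"term": {"paramType": param_type}})
    return {"query": {"bool": {"filter": filters}}}


# Made-up sample payload in the shape described in the question.
payload = {
    "name": "index_name",
    "status": "good",
    "datapoints": [
        {"paramType": "ABC", "batch": [{"time": "1000", "value": "123"},
                                       {"time": "1060", "value": "123"}]},
        {"paramType": "XYZ", "batch": [{"time": "1000", "value": "123"},
                                       {"time": "1060", "value": "124"}]},
    ],
}

docs = list(flatten_payload(payload))
query = time_range_query(1000, 1030, status="good", param_type="ABC")
```

With flat documents, a time-range search is a plain `range` filter and needs no nested query at all, which is the main practical gain of this layout.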

Elasticsearch array scoring

I'm using elasticsearch to search multiple array fields in my type, which looks something like
t1 = { field1: ["foo", "bar"],
field2: ["foo", "foo", "foo", "foo"],
field3: ["foo", "foo", "foo", "foo", "foo", "foo"]
}
And then I'm using a multi_match query to get matches, something along
multi_match: { query: "foo",
fields: "field*"
}
When computing the score of t1, elasticsearch adds the score of queries in field1, field2 and field3 which is what I want. However, they are not contributing equally, field3 contributes to the score the most since "foo" occurs multiple times there.
I want now to compute the score within each array field by not adding up the score of all array entries, but by just taking the maximum of them. In my example, all fields contained would have the same score then since they all have one exact match.
This question was already asked on the elasticsearch forum, but has not been answered so far.
I've been stumped on this myself, it really seems like there should be a simple, builtin way to just specify max instead of sum.
Not sure if this is exactly what you're going for, because you lose the match score on any particular item in the array. So you're not getting max of the match score of the best particular item, just a boolean value if anything matches. If it's something more nuanced (say a person's full name, where you want a better match for first and last vs just one or the other) this may not be acceptable because you're throwing out your scores.
If it is acceptable, this workaround seems to work:
{function_score: {
query: {bool: {should: [
{term: {field1: 'foo'}},
{term: {field2: 'foo'}},
{term: {field3: 'foo'}},
]}},
functions: [
{filter: {term: {field1: 'foo'}}, weight: 1},
{filter: {term: {field2: 'foo'}}, weight: 1},
{filter: {term: {field3: 'foo'}}, weight: 1},
],
score_mode: 'sum',
boost_mode: 'replace',
}}
We need the "query" part to give us the results to further filter, even though we discard the score. This seems like it should really be a filter, but just wrapping this same thing in the filtered query doesn't work. There may be a better option here.
Then, the weight functions just basically give a 1 if there's a match on that field and 0 otherwise. The score_mode tells it to sum those weights, so in your case they all match so we get 3. The boost_mode tells how to combine with the original query, "replace" tells it to ignore the original query score (which has the problem you mentioned that multiple matches in an array are being summed). So, the total score of this query is 3 because there are 3 matches.
It seems more complicated to me, but in my relatively limited testing I haven't noticed performance issues or anything. I'd love to see a better answer if someone more familiar with elasticsearch has one.
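To avoid hand-copying one clause per field, the same workaround can be generated from a list of field names. This is only a sketch in Python; `per_field_match_count_query` is a hypothetical helper, and it builds the same `function_score` body as above with one weight function per field.

```python
from typing import Any, Dict, List


def per_field_match_count_query(fields: List[str], term: str) -> Dict[str, Any]:
    """Score each document by the number of fields that contain `term`."""
    clauses = [{"term": {field: term}} for field in fields]
    return {
        "query": {
            "function_score": {
                # The inner query only selects documents; its score is discarded.
                "query": {"bool": {"should": clauses}},
                # Each function contributes 1 if its field matches, nothing otherwise.
                "functions": [{"filter": clause, "weight": 1} for clause in clauses],
                "score_mode": "sum",       # add up the per-field weights
                "boost_mode": "replace",   # ignore the original query score
            }
        }
    }


body = per_field_match_count_query(["field1", "field2", "field3"], "foo")
```

For the example document t1, all three filters match, so the resulting score is 3 regardless of how many times "foo" appears inside each array.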

ElasticSearch Aggregations: subtracting aggregations based upon match

Using a simple albeit somewhat artificial example, let's say that I have several inventory docs stored in ElasticSearch where every document represents either the purchase or the sale of an item:
[
{item_id: "foobar", type: "cost", value: 12.34, timestamp:149382734621},
{item_id: "bizbaz", type: "sale", value: 45.12, timestamp:149383464621},
{item_id: "foobar", type: "sale", value: 32.74, timestamp:149384824621},
{item_id: "foobar", type: "cost", value: 12.34, timestamp:149387435621},
{item_id: "bizbaz", type: "sale", value: 45.12, timestamp:149388434621},
{item_id: "bizbaz", type: "cost", value: 41.23, timestamp:149389424621},
{item_id: "foobar", type: "sale", value: 32.74, timestamp:149389914621},
{item_id: "waahoo", type: "sale", value: 11.23, timestamp:149389914621},
...
]
And for a specified time range I want to calculate the current profit for each item. So for example I would want to return:
foobar_profit = sum(value of all documents item_id="foobar" and type="sale")
-sum(value of all documents item_id="foobar" and type="cost")
bizbaz_profit = sum(value of all documents item_id="bizbaz" and type="sale")
-sum(value of all documents item_id="bizbaz" and type="cost")
...
There are two aspects that I don't yet understand how to achieve.
1. I know how to aggregate over terms, so this would allow me to sum the value of all "foobar" items regardless of type. But I don't know how to sum over all documents that match on two fields. For instance, I want to aggregate the above data set on the compound key (item_id,type). The dataset above would then yield the aggregations:
(foobar,cost)->24.68
(foobar,sale)->65.48
(bizbaz,cost)->41.23
(bizbaz,sale)->90.24
(waahoo,sale)->11.23
2. Presuming I can do #1, I will have aggregations like foobar_cost and foobar_sale. But I don't know how to combine two aggregations so that in this case foobar_profit = foobar_sale - foobar_cost. So the above aggregations would become
foobar_profit->40.8
bizbaz_profit->49.01
waahoo_profit->11.23
Some final notes:
In the example above, I only list 3 item_ids, but consider that there will be thousands of item_ids, so I can't do special-case queries per item_id.
Also, for a particular item, the cost and sale items will come in at different times, so we can't put the cost and sale price in the same document and diff the fields.
I can send back all the data and do the last step of the aggregations client side, but this might be a ton of data. Really, I need to do it on server side if possible so that I can sort the results by profit and return the top N.
You can just use nested aggregations. See here for a working example: https://gist.github.com/mattweber/71033b1bf2ebed1afd8e
I use a MatchAll Query in this example but you can replace that with a RangeQuery or whatever you need.
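For the subtraction and the top-N sort, here is one possible request body, sketched in Python. This is not taken from the linked gist. It assumes an Elasticsearch version that supports the `bucket_script` and `bucket_sort` pipeline aggregations and Painless-style `params` in scripts (older releases, including those current when this was asked, do not), and that `item_id` and `type` are exact-match fields. `profit_aggregation` and `profit_by_item` are hypothetical names; the second function is a pure-Python reference for what the aggregation should return on the sample data.

```python
from collections import defaultdict
from typing import Any, Dict, List


def sum_where_type(doc_type: str) -> Dict[str, Any]:
    """Filter a bucket down to one type and sum its values."""
    return {
        "filter": {"term": {"type": doc_type}},
        "aggs": {"total": {"sum": {"field": "value"}}},
    }


def profit_aggregation(start_ms: int, end_ms: int, top_n: int = 10) -> Dict[str, Any]:
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "query": {"range": {"timestamp": {"gte": start_ms, "lte": end_ms}}},
        "aggs": {
            "by_item": {
                # Assumption: 10000 is large enough to cover every item_id.
                "terms": {"field": "item_id", "size": 10000},
                "aggs": {
                    "sales": sum_where_type("sale"),
                    "costs": sum_where_type("cost"),
                    "profit": {
                        "bucket_script": {
                            "buckets_path": {"sales": "sales>total", "costs": "costs>total"},
                            "script": "params.sales - params.costs",
                        }
                    },
                    "top_profit": {
                        "bucket_sort": {"sort": [{"profit": {"order": "desc"}}], "size": top_n}
                    },
                },
            }
        },
    }


def profit_by_item(docs: List[Dict[str, Any]]) -> Dict[str, float]:
    """Client-side reference: sum(sale) - sum(cost) per item_id."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: {"sale": 0.0, "cost": 0.0})
    for doc in docs:
        totals[doc["item_id"]][doc["type"]] += doc["value"]
    return {item: round(t["sale"] - t["cost"], 2) for item, t in totals.items()}


# The sample documents from the question.
sample = [
    {"item_id": "foobar", "type": "cost", "value": 12.34, "timestamp": 149382734621},
    {"item_id": "bizbaz", "type": "sale", "value": 45.12, "timestamp": 149383464621},
    {"item_id": "foobar", "type": "sale", "value": 32.74, "timestamp": 149384824621},
    {"item_id": "foobar", "type": "cost", "value": 12.34, "timestamp": 149387435621},
    {"item_id": "bizbaz", "type": "sale", "value": 45.12, "timestamp": 149388434621},
    {"item_id": "bizbaz", "type": "cost", "value": 41.23, "timestamp": 149389424621},
    {"item_id": "foobar", "type": "sale", "value": 32.74, "timestamp": 149389914621},
    {"item_id": "waahoo", "type": "sale", "value": 11.23, "timestamp": 149389914621},
]

body = profit_aggregation(149382734621, 149389914621, top_n=3)
profits = profit_by_item(sample)
```

The nested `terms` plus `filter` sub-aggregations cover the compound (item_id, type) key, and the pipeline step does the subtraction per bucket, so only the top N profits need to travel back to the client.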

How do I model my document in MongoDB to make it paginable for nested attributes?

I'm trying to cache my tweets and show them based on my saved keyword. However, as the tweets grow over time I need to paginate them.
I'm using Ruby and Mongoid, and this is what I have come up with so far.
class SavedTweet
include Mongoid::Document
field :saved_id, :type => String
field :slug, :type => String
field :tweets, :type => Array
end
And the tweets array would be like this
{id: "id", text: "text", created_at: "created_at"}
So it's like a bucket for each keyword that you can save. My first problem is that MongoDB cannot sort the second level of a document, which in this case is tweets, and that makes pagination much harder because I cannot use skip and limit. I would have to load all the tweets, put them in the cache and paginate from there.
The question is how I should model my problem to make it paginable in MongoDB and not in memory. I'm assuming that doing it in MongoDB would be faster. Right now I'm in the early stage of my application, so it's easier to change the model now than later. If you have any suggestions or opinions, I'd really appreciate them.
An option could be to save tweets in a different collection and link them with your SavedTweet class. It will be easy to query and you could use skip and limit without problems.
{id: "id", text: "text", created_at: "created_at", saved_tweet:"_id"}
EDIT: a better explanation, with two additional options
As far as I can see, you have three options, if I understand your requirements correctly:
Use the same schema that you are already using. You would have two problems: you cannot use skip and limit with a usual query, and you have a limit of 16 MB per document. I think the first one could be resolved with an Aggregation Framework query ($unwind, $skip and $limit could be helpful). The second one could be a problem if you have a lot of tweets in the array, because a single document cannot exceed 16 MB.
Use two collections to store your tweets. One collection would have the same structure that you already have. For example:
{
save_id:"23232",
slug:"adkj"
}
And the other collection would have one document per tweet.
{
id: "id",
text: "text",
created_at: "created_at",
saved_tweet:"_id"
}
With the saved_tweet field you link saved_tweets to tweets in a 1-to-N relation. This way, you can query the tweet collection and still use the limit and skip operators.
Save all info in the same document. If your saved_tweet collection only has those fields, you can save all the info in a single document (one document for each tweet). Something like this:
{
save_id:"23232",
slug:"adkj",
tweet:
{
tweet_id: "id",
text: "text",
created_at: "created_at"
}
}
With this solution you are duplicating fields, because save_id and slug would be the same in other documents of the same saved_tweet, but it could be an option if you have a small number of fields and those fields are not subdocuments or arrays.
I hope it is clear now.
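For the first option, the aggregation pipeline could look roughly like the sketch below. It is written in Python rather than Ruby purely for illustration; `tweets_page_pipeline` is a hypothetical helper, and the sketch assumes the embedded tweets carry a sortable `created_at` value. The resulting list is what you would hand to the driver's aggregate call on the collection.

```python
from typing import Any, Dict, List


def tweets_page_pipeline(slug: str, page: int, per_page: int = 20) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline returning one page of embedded tweets."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return [
        {"$match": {"slug": slug}},             # pick the saved-keyword bucket
        {"$unwind": "$tweets"},                 # one output document per embedded tweet
        {"$sort": {"tweets.created_at": -1}},   # newest first
        {"$skip": (page - 1) * per_page},
        {"$limit": per_page},
    ]


pipeline = tweets_page_pipeline("adkj", page=3, per_page=20)
```

This keeps the sorting and paging on the server, but every page still unwinds the whole array, and the 16 MB document limit remains, so the separate-collection option scales better as the tweets grow.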

Elasticsearch: field "title" was indexed without position data; cannot run PhraseQuery

I have an index in ElasticSearch with the following mapping:
mappings: {
feed: {
properties: {
html_url: {
index: not_analyzed
omit_norms: true
index_options: docs
type: string
}
title: {
index_options: offsets
type: string
}
created: {
store: true
format: yyyy-MM-dd HH:mm:ss
type: date
}
description: {
type: string
}
}
}
}
I get the following error when performing a phrase search ("video games"):
IllegalStateException[field \"title\" was indexed without position data; cannot run PhraseQuery (term=video)];
Single-word searches work fine. I tried "index_options: positions" as well, but with no luck. The title field contains text in multiple languages and is sometimes empty. Interestingly, it seems to fail randomly: for example, it would fail at 200K documents or at 800K using the same dataset. Is there a reason some titles wouldn't get indexed with positions?
Elasticsearch version 0.90.5
Just in case someone else has the same issue: there was another type/table (feed2) in the same index with the same "title" field, which was set to "not_analyzed".
For some reason, even if you specify the type (http://elasticsearchhost.com:9200/index_name/feed/_search), the other type is still searched as well. Changing the mapping for the feed2 type fixed the problem.
You probably have another field named 'title' with a different mapping in another type but in the same index.
Basically, if you have 2 fields with the same name in the same index, even if they are in different types, they cannot have different mappings. To be more precise, even if they have the same type (e.g. "string") but one of them is "analyzed" and the other is "not_analyzed", problems will arise.
I mean, yeah, you can try to set up 2 different mappings, and ElasticSearch will not complain, but when searching you get strange results and everything will go bananas.
You can read more about this issue here where they say:
[...] In the end, we opted to enforce the rule that all fields with the same name in the same index must have the same mapping [...]
And yeah, considering how the promise of ElasticSearch has always been "it just works" this little detail took a lot of people by surprise.
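If you want to check an index for this before it bites you, something like the following sketch could help. It is plain Python over the mappings as a dictionary (for example, the parsed response of the get-mapping API); `conflicting_fields` is a hypothetical helper, and the sample mapping is made up apart from the feed/feed2 "title" conflict described above.

```python
import json
from typing import Any, Dict


def conflicting_fields(mappings: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return fields whose definition differs between types of one index.

    `mappings` is shaped like {type_name: {"properties": {field: definition}}}.
    """
    seen: Dict[str, Dict[str, Any]] = {}
    for type_name, mapping in mappings.items():
        for field, definition in mapping.get("properties", {}).items():
            seen.setdefault(field, {})[type_name] = definition

    conflicts = {}
    for field, by_type in seen.items():
        # Serialize with sorted keys so key order does not count as a difference.
        distinct = {json.dumps(d, sort_keys=True) for d in by_type.values()}
        if len(distinct) > 1:
            conflicts[field] = by_type
    return conflicts


sample_mappings = {
    "feed": {"properties": {
        "title": {"type": "string", "index_options": "offsets"},
        "description": {"type": "string"},
    }},
    "feed2": {"properties": {
        "title": {"type": "string", "index": "not_analyzed"},
        "description": {"type": "string"},
    }},
}

conflicts = conflicting_fields(sample_mappings)
```

Here only "title" is reported, since "description" is defined identically in both types.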
