Elasticsearch data size optimization

I was wondering, would it be good practice to optimize data like this in Elasticsearch?
Old data
{
"user_id": 1,
"firstname": "name",
"lastname": "name",
"email": "email"
}
New data
{
"uid": 1,
"f": "name",
"l": "name",
"e": "email"
}
Let's say I have billions of documents with long key names. Would I save a lot of space by using short key names instead?
Or does Elasticsearch compress data by default, so that I don't need to worry about this?
I prefer to keep the data readable, but if shorter keys could save a lot of space, that's a whole different thing.
This question was asked five years ago and got only one answer, so it would be nice to have more comments on it.
You can read it here:
Elasticsearch scheme optimization
Any thoughts from experienced elasticsearch developers?
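One rough way to get a feel for the question is to measure it locally. The sketch below builds synthetic documents with the long and the short key names from the question and compares raw and zlib-compressed sizes. The generated values, the document count, and the use of zlib are assumptions made only for illustration; this says nothing definitive about Elasticsearch's on-disk format, which has its own compression and index structures, so a real test index is the only reliable measurement.

```python
import json
import zlib

LONG_KEYS = ("user_id", "firstname", "lastname", "email")
SHORT_KEYS = ("uid", "f", "l", "e")


def make_docs(keys, count=1000):
    """Build `count` synthetic user documents using the given key names."""
    uid, first, last, email = keys
    return [
        {uid: i, first: f"first{i}", last: f"last{i}", email: f"user{i}@example.com"}
        for i in range(count)
    ]


def measure(docs):
    """Return (raw_bytes, zlib_compressed_bytes) for newline-delimited JSON."""
    raw = "\n".join(json.dumps(d, separators=(",", ":")) for d in docs).encode("utf-8")
    return len(raw), len(zlib.compress(raw))


long_raw, long_packed = measure(make_docs(LONG_KEYS))
short_raw, short_packed = measure(make_docs(SHORT_KEYS))

print(f"long keys:  raw={long_raw} compressed={long_packed}")
print(f"short keys: raw={short_raw} compressed={short_packed}")
```

Because the key names repeat in every document, a general-purpose compressor tends to absorb much of their cost, which is why the raw and compressed gaps are worth comparing separately.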

Related

Elasticsearch re-index all vs join

I'm pretty new to Elasticsearch and all its concepts. I would like to understand how I could accomplish what I have in my relational DB in an Elasticsearch architecture.
The scenario is the following.
I have an index "data":
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories": ["A", "A1", "B"]
}
The requirement says that data can be queried by:
some text search in the content field
that belongs to a specific type or category
So far, so simple, so good.
This data will not be complete at creation time. Categories might be added to or removed from the data later, so many data uploads/re-indexes might happen along the way.
For example:
create the data
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories": ["A"]
}
Then it was decided that all data with type=T1 must belong to both A & B categories.
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories": ["A", "B"]
}
If I have a billion hits for type=T1, I would have to update/re-index a billion entries. Maybe that is how things should work, and this is where my question lands.
Is it OK to re-index all the data just to add/remove a category, or would it be possible to have a second, much smaller index just for this association and somehow join both indexes at query time?
Something like this:
Data:
{
"id": "00001",
"content" : "some text here ..",
"type": "T1"
}
DataCategories:
{
"type": "T1",
"categories" : ["A", "B"]
}
Is it acceptable/possible?
This is a common scenario, but unfortunately there is no 1:1 mapping of RDBMS features onto text search engines like Lucene/Elasticsearch.
Possible options:
1 - For the best query performance, reindex. This may not be practical, depending on how often your data changes.
2 - Consider parent-child. It is a slower option, but it often meets performance requirements. The category could be a parent document with several thousand children.
3 - If the change is a category rename, consider storing category IDs and translating them to text in the application.
4 - Update documents in place, depending on how many need updating: for a few thousand, run an update query; for more, reindex.
Suggested reading - https://www.elastic.co/blog/managing-relations-inside-elasticsearch
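As a rough illustration of option 4, the sketch below builds a request body for Elasticsearch's `_update_by_query` endpoint that appends a category to every document of a given type. It only constructs the JSON; nothing is sent. The helper name `add_category_request` is hypothetical, and the body assumes that `type` is mapped as a keyword field and that every matching document already has a `categories` array.

```python
import json


def add_category_request(doc_type, category):
    """Build an _update_by_query body that appends `category` to matching docs."""
    return {
        # Select only the documents of the given type.
        "query": {"term": {"type": doc_type}},
        "script": {
            "lang": "painless",
            # Append the category unless the array already contains it.
            "source": (
                "if (!ctx._source.categories.contains(params.category)) {"
                " ctx._source.categories.add(params.category) }"
            ),
            "params": {"category": category},
        },
    }


body = add_category_request("T1", "B")
print(json.dumps(body, indent=2))
```

The body would be posted to the `_update_by_query` endpoint of the index from the question. With a billion matching documents this is still a billion document rewrites, which is the trade-off the options above describe.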

Elasticsearch - query based on event frequency

I have multiple indexes that store user tracking logs. One of them is index-pageview. How can I query the list of users who viewed the page 10 times between 2021-12-11 and 2021-12-13 on the iOS operating system?
Log example:
index: index-pageview
[
{
"user_id": 1,
"session_id": "xxx",
"timestamp": "2021-12-11 hh:mm:ss",
"platform": "IOS"
},
{
"user_id": 1,
"session_id": "yyy",
"timestamp": "2021-12-13 hh:mm:ss",
"platform": "Android"
}
]
You can try building a normal bool query on timestamp and platform, and then use either a terms aggregation (possibly with min_doc_count: 10) or collapse on user_id. Both approaches have limitations, though:
the aggregation might be slower (needs benchmarking)
the number of aggregation buckets is limited (10k by default)
collapse works on at most size docs at a time (also capped at 10k), so you might need scrolling and app-side processing
The performance of either approach might be pretty poor. If you need to run queries like this very often, I would consider using another storage engine (SQL? Something fancier?).
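The aggregation variant could look roughly like the sketch below, which only builds the search body and post-processes a response. The helper names are hypothetical, and the body assumes that `platform` is a keyword field and that `timestamp` is a date field accepting the `yyyy-MM-dd HH:mm:ss` format. Note that `min_doc_count` means "at least", so an exact count has to be filtered on the application side.

```python
def frequent_viewers_request(start, end, platform, min_views):
    """Build a search body: filter by platform and time, then bucket by user."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"platform": platform}},
                    {
                        "range": {
                            "timestamp": {
                                "gte": start,
                                "lt": end,
                                "format": "yyyy-MM-dd HH:mm:ss",
                            }
                        }
                    },
                ]
            }
        },
        "aggs": {
            "users": {
                "terms": {
                    "field": "user_id",
                    "size": 10000,
                    "min_doc_count": min_views,
                }
            }
        },
    }


def users_with_exact_count(response, count):
    """Keep only the user ids whose bucket has exactly `count` page views."""
    buckets = response["aggregations"]["users"]["buckets"]
    return [b["key"] for b in buckets if b["doc_count"] == count]


body = frequent_viewers_request(
    "2021-12-11 00:00:00", "2021-12-14 00:00:00", "IOS", 10
)
```

The exclusive upper bound of 2021-12-14 is one way to include all of 2021-12-13; whether the end date should be inclusive is a reading of the question, not something it states.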

Nesting relationship with ElasticSearch

I am trying to find revenue per actor in a movie. It is pretty straightforward, but here's an example of what I have now:
// without actor
{
"ID": 1,
"Timestamp": "2014-01-01 00:02:12",
"Title": "Great White Shark",
"Amount": 4.99
}
It is not an issue if I have, for example, 100M entries in financials and I ask for the aggregate where the title=GreatWhiteShark.
However, when I add the actors, the structure becomes extremely verbose and probably increases my storage size by 10x:
{
"ID": 1,
"Timestamp": "2014-01-01 00:02:12",
"Title": "Great White Shark",
"Amount": 4.99,
"Actors": ["Christopher Plummer", "Andrew Garfield", "Heath Ledger",
"Lily Cole", "Jude Law", "Verne Troyer", "Johnny Depp",
"Tom Waits", "George MacKay", "Tom Holland", "Saoirse Ronan",
"Seymour Cassel", "Sofia Milos"]
}
This is so I can ask a question such as "How much money did movies with Christopher Plummer make in 2011?".
Is there a better way to do the above structure? My main concern is performance, and secondary would be storage size.
Performance should be very good: Elasticsearch builds an inverted index for the actors array anyway, so querying an actor returns all associated movies quickly.
As for space reduction, you can try encoding each actor name as an integer ID instead of an actor slug. But you should try the slug variant first, as it preserves readability and integration with Kibana, etc.
Your proposed structure is perfectly suited to Elasticsearch.
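To make the inverted-index argument concrete, here is a toy in-memory version in plain Python. It is only an illustration of the idea, not how Lucene stores data. The first record is taken from the question (with a shortened actor list); the other records, their titles, and their amounts are made-up placeholders.

```python
from collections import defaultdict

sales = [
    {"ID": 1, "Timestamp": "2014-01-01 00:02:12", "Title": "Great White Shark",
     "Amount": 4.99, "Actors": ["Christopher Plummer", "Jude Law"]},
    # Placeholder records, invented for this example only.
    {"ID": 2, "Timestamp": "2011-03-05 10:00:00", "Title": "Placeholder A",
     "Amount": 3.5, "Actors": ["Christopher Plummer"]},
    {"ID": 3, "Timestamp": "2011-07-09 18:30:00", "Title": "Placeholder B",
     "Amount": 2.0, "Actors": ["Johnny Depp"]},
]


def build_actor_index(docs):
    """Map each actor name to the IDs of the documents that list it."""
    index = defaultdict(list)
    for doc in docs:
        for actor in doc["Actors"]:
            index[actor].append(doc["ID"])
    return index


def revenue_for(actor, year, docs, index):
    """Sum Amount over the actor's documents whose timestamp falls in `year`."""
    by_id = {doc["ID"]: doc for doc in docs}
    return sum(
        by_id[doc_id]["Amount"]
        for doc_id in index.get(actor, [])
        if by_id[doc_id]["Timestamp"].startswith(str(year))
    )


actor_index = build_actor_index(sales)
plummer_2011 = revenue_for("Christopher Plummer", 2011, sales, actor_index)
```

The lookup touches only the documents listed under the actor, which is why a long actors array costs storage but does not slow down a query for a single actor.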

How to find related songs or artists using Freebase MQL?

I have a Freebase mid such as /m/0mgcr, which is The Offspring.
What's the best way to use MQL to find related artists?
Or, if I have a song mid such as /m/0l_f7f, which is Original Prankster by The Offspring,
what's the best way to use MQL to find related songs?
So, the revised question is, given a musical artist, find all other musical artists who share all of the same genres assigned to the first artist.
MQL doesn't have any operators that can work across parts of the query tree, so this can't be done in a single query. But given that you're likely doing this from a programming language, it can be done pretty simply in two steps.
First, we'll get all genres for our subject artist, sorted by the number of artists that they contain using this query (although the last part isn't strictly necessary):
[{
"id": "/m/0mgcr",
"name": null,
"/music/artist/genre": [{
"name": null,
"id": null,
"artists": {
"return": "count"
},
"sort": "artists.count"
}]
}]
Then, using the genre with the smallest number of artists for maximum selectivity, we'll add in the other genres to make it even more specific. Here's a version of the query with the artists that match on the three most specific genres (the base genre plus two more):
[{
"id": "/m/0mgcr",
"name": null,
"/music/artist/genre": [{
"name": null,
"id": null,
"artists": {
"return": "count"
},
"sort": "artists.count",
"limit": 1,
"a:artists": [{
"name": null,
"id": null,
"a:genre": {
"id": "/en/ska_punk"
},
"b:genre": {
"id": "/en/melodic_hardcore"
}
}]
}]
}]
Which gives us: Authority Zero, Millencolin, Michael John Burkett, NOFX, Bigwig, Huelga de Hambre, Freygolo, The Vandals
Two things are worth noting about this query. This fragment:
"sort": "artists.count",
"limit": 1,
limits our initial genre selection to the single genre with the fewest artists (i.e., Skate Punk), while the prefix notation:
"a:genre": {"id": "/en/ska_punk"},
"b:genre": {"id": "/en/melodic_hardcore"}
gets around the JSON limitation of not allowing more than one key with the same name. The prefixes are ignored and just need to be unique (this is the same reason for the a:artists elsewhere in the query).
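Since the second query is assembled from the genre list returned by the first one, the prefixed keys can be generated programmatically. The sketch below is one possible way to do that; the function name `similar_artists_query` is hypothetical, and the code only builds the query structure shown above without contacting any service.

```python
import json
import string


def similar_artists_query(artist_id, extra_genre_ids):
    """Build the second-step query with uniquely prefixed genre clauses."""
    if len(extra_genre_ids) > len(string.ascii_lowercase):
        raise ValueError("this sketch supports at most 26 extra genres")

    artist_clause = {"name": None, "id": None}
    # One single-letter prefix per genre keeps the JSON keys unique.
    for prefix, genre_id in zip(string.ascii_lowercase, extra_genre_ids):
        artist_clause[f"{prefix}:genre"] = {"id": genre_id}

    return [{
        "id": artist_id,
        "name": None,
        "/music/artist/genre": [{
            "name": None,
            "id": None,
            "artists": {"return": "count"},
            "sort": "artists.count",
            "limit": 1,
            "a:artists": [artist_clause],
        }],
    }]


query = similar_artists_query("/m/0mgcr", ["/en/ska_punk", "/en/melodic_hardcore"])
print(json.dumps(query, indent=2))
```

Python's `None` serializes to JSON `null`, so the printed structure matches the hand-written query.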
So, having worked through that whole little exercise, I'll close by saying that there are probably better ways of doing this. Instead of an absolute match, you may get better results with a scoring function that looks at % overlap for the most specific genres or some other metric. Things like common band members, collaborations, contemporaneous recording history, etc, etc, could also be factored into your scoring. Of course this is all beyond the capabilities of raw MQL and you'd probably want to load the Freebase data for the music domain (or some subset) into a graph database to run these scoring algorithms.
In point of fact, both last.fm and Google think a better list would include bands like Sum 41, blink-182, Bad Religion, Green Day, etc.
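The percentage-overlap idea could be prototyped outside MQL along these lines. This is a minimal sketch of one possible scoring function, not a recommendation of a specific metric; the artist names and genre labels are invented placeholders, not Freebase data.

```python
def genre_overlap(subject_genres, candidate_genres):
    """Fraction of the subject's genres that the candidate shares (0.0 to 1.0)."""
    subject = set(subject_genres)
    if not subject:
        return 0.0
    return len(subject & set(candidate_genres)) / len(subject)


def rank_candidates(subject_genres, candidates):
    """Return (score, name) pairs, best score first, ties broken by name."""
    scored = [
        (genre_overlap(subject_genres, genres), name)
        for name, genres in candidates.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Placeholder data, invented for this example only.
subject = ["g1", "g2", "g3", "g4"]
candidates = {
    "artist_x": ["g1", "g2", "g3", "g4"],
    "artist_y": ["g1", "g2"],
    "artist_z": ["g9"],
}
ranking = rank_candidates(subject, candidates)
```

Unlike the absolute match, a partial overlap still produces a score, so artists that share most but not all genres are not dropped.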

Multiple atomic updates using MongoDB?

I am using CodeIgniter and Alex Bilbie's MongoDB library.
In my API that I am developing users can upload images and other users can comment on them.
I have chosen to include the comments as sub documents to the images.
Each comment contains:
Fullname (of author)
Comment
Created_at
In other words, the user's full name is "hard coded" into each comment, so if they later decide to change their name, I have a problem.
I read that I can use atomic updates to update all occurrences of the name (like in comments), but how can I do this using Alex's library? Can I update all places where the name is wrong?
UPDATE
This is what the image document looks like with the comments.
I think it is pretty strange that MongoDB encourages the use of subdocuments but then does not include a way to update multiple items in an array.
{
"_id": ObjectId("4e9ead773dc793dc01020000"),
"description": "An image",
"category": "accident",
"comments": [
{
"id": ObjectId("4e96bd063dc7937202000000"),
"fullname": "James Bond",
"comment": "This is a comment.",
"created_at": "2011-10-19 13:02:40"
}
],
"created_at": "2011-10-19 12:59:03"
}
Thankful for all help!
I am not familiar with CodeIgniter, but maybe the MongoDB shell syntax will help you:
db.comments.update( {"Fullname":"Andrew Orsich"},
{ $set : { Fullname: "New name"} }, false, true )
The last flag (true) indicates that you want to update multiple documents, so it is possible to update all comments in one update operation.
BTW: denormalizing (not 'hard coding') data is a usual practice in MongoDB and NoSQL in general. Also, operations that require updating a lot of documents usually run asynchronously. But it is up to you.
Update:
db.comments.update( {"comments.Fullname":"Andrew Orsich"},
{ $set : { "comments.$.Fullname": "New name"} }, false, true )
However, the above query will update the full name only in the first matching comment of each nested array. If you need to change more than one array element, you will need to use multiple update statements.
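The "multiple update statements" approach amounts to repeating the positional update until nothing matches any more. The sketch below simulates that loop on plain Python dictionaries so the behaviour is easy to see; it makes no driver calls, the function names are hypothetical, and the field name `fullname` follows the image document in the question.

```python
def set_first_match(image, old_name, new_name):
    """Mimic the positional `$` update: change only the first matching comment."""
    for comment in image["comments"]:
        if comment["fullname"] == old_name:
            comment["fullname"] = new_name
            return True
    return False


def rename_everywhere(images, old_name, new_name):
    """Repeat the multi-document update until no comment matches; return pass count."""
    passes = 0
    while True:
        # One pass corresponds to one multi-document update statement.
        changed = [set_first_match(img, old_name, new_name) for img in images]
        if not any(changed):
            return passes
        passes += 1


images = [
    {"comments": [{"fullname": "James Bond"}, {"fullname": "James Bond"}]},
    {"comments": [{"fullname": "James Bond"}, {"fullname": "Someone Else"}]},
]
passes_needed = rename_everywhere(images, "James Bond", "New name")
```

The number of passes equals the largest number of matching comments in any single document, so an author who commented many times on one image needs that many update statements.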
