Nesting relationship with ElasticSearch - elasticsearch

I am trying to find revenue per actor in a movie. It is pretty straightforward, but here's an example of what I have now:
// without actor
{
"ID": 1,
"Timestamp": "2014-01-01 00:02:12",
"Title": "Great White Shark",
"Amount": 4.99
}
It is not an issue if I have, for example, 100M entries in financials and I ask for the aggregate where the title=GreatWhiteShark.
However, when I add in an Actor, the structure becomes extremely verbose, and probably increases my storage size by 10x --
{
"ID": 1,
"Timestamp": "2014-01-01 00:02:12",
"Title": "Great White Shark",
"Amount": 4.99,
"Actors": ["Christopher Plummer", "Andrew Garfield", "Heath Ledger",
"Lily Cole", "Jude Law", "Verne Troyer", "Johnny Depp",
"Tom Waits", "George MacKay", "Tom Holland", "Saoirse Ronan",
"Seymour Cassel", "Sofia Milos"]
}
This is so I can ask a question such as "How much money did movies with Christopher Plummer make in 2011?".
Is there a better way to do the above structure? My main concern is performance, and secondary would be storage size.

Performance should be very good: Elasticsearch builds an inverted index for the actors array anyway, so querying an actor returns all associated movies instantly.
As for reducing space, you can try encoding each actor name as an integer ID instead of an actor slug. But try the slug variant first, since it preserves readability and integrations with Kibana, etc.
Your proposed structure is perfectly suited for Elasticsearch by all means.
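As a rough illustration of the question "How much money did movies with Christopher Plummer make in 2011?", here is a minimal sketch that builds a search body in Python. It assumes the Actors field is mapped as a keyword (exact match) and Timestamp as a date; the function name and the aggregation name "revenue" are made up for this example. The resulting dictionary would be sent to the index's _search endpoint with whichever client you use.

```python
import json


def revenue_for_actor(actor, year):
    """Build a search body that sums Amount for one actor in one calendar year.

    Assumes "Actors" is a keyword field and "Timestamp" is a date field.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    # Exact match against any element of the Actors array.
                    {"term": {"Actors": actor}},
                    # Half-open range covering the whole calendar year.
                    {"range": {"Timestamp": {
                        "gte": f"{year}-01-01 00:00:00",
                        "lt": f"{year + 1}-01-01 00:00:00",
                        "format": "yyyy-MM-dd HH:mm:ss",
                    }}},
                ]
            }
        },
        "aggs": {"revenue": {"sum": {"field": "Amount"}}},
    }


body = revenue_for_actor("Christopher Plummer", 2011)
print(json.dumps(body, indent=2))
```

Putting both conditions in the filter context skips relevance scoring, which is not needed when only the sum matters.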

Related

restructure elasticsearch index to allow filtering on sum of values

I've an index of products.
Each product has several variants (a few or hundreds; each has a color and a size, e.g. Red, S).
Each variant is available (in a certain quantity) at several warehouses (around 100 warehouses).
Warehouses have codes e.g. AB, XY, CD etc.
If I had my choice, I'd index it as:
stock: {
Red: {
S: { AB: 100, XY: 200, CD: 20 },
M: { AB: 0, XY: 500, CD: 20 },
2XL: { AB: 5, XY: 0, CD: 9 }
},
Blue: {
...
}
}
Here's a kind of customer query I might receive:
Show me all products, that have Red.S color in stock (minimum 100) at warehouses AB & XY.
So this would probably be a filter like
Red.S.AB > 100 AND Red.S.XY > 100
I'm not writing the whole filter query here, but it's straightforward in Elasticsearch.
We might also get SUM queries, e.g. the sum of inventories at AB & XY should be > 500.
That'd be easy through a script filter, say Red.S.AB + Red.S.XY > 500
The problem is that, given 100 warehouses, 100 sizes and 25 colors, this easily needs 100*100*25 = 250k mappings. Elasticsearch simply can't handle that many keys.
The easy answer is to use nested documents, but nested documents pose a particular problem: we cannot sum across a given selection of nested documents, and nested docs are slow, especially when we're going to have 250k per product.
I'm open to solutions other than Elasticsearch as well. We're on a Rails/Postgres stack.
You have your product index with variants, that's fine, but I'd use another index for managing anything related to the multi-warehouse stock. One document per product/size/color/warehouse with the related count. For instance:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "XY",
"quantity": 200
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "CD",
"quantity": 20
}
etc...
That way, you'll be much more flexible with your stock queries, because all you'll need is to filter on the fields (product, color, size, warehouse) and simply aggregate on the quantity field, sums, averages or whatever you might think of.
You will probably need to leverage the bucket_script pipeline aggregation in order to decide whether sums are above or below a desired threshold.
It's also much easier to maintain stock movements by simply indexing the new quantity for any given combination than by updating the master product document every time an item leaves stock.
No script, no nested documents required.
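To make the "sum of inventories at AB & XY should be > 500" case concrete, here is a hedged sketch of a search body against the flat stock index described above. It uses bucket_selector, a sibling of the bucket_script pipeline aggregation mentioned in this answer, because bucket_selector drops the buckets that fail the threshold instead of computing a new value. The function name, the aggregation names and the terms size of 1000 are arbitrary choices for illustration, and the fields are assumed to be mapped as keyword/numeric types.

```python
import json


def stock_sum_query(color, size, warehouses, min_total):
    """Find products whose summed quantity across the given warehouses exceeds min_total."""
    return {
        "size": 0,  # only the per-product buckets are of interest
        "query": {"bool": {"filter": [
            {"term": {"color": color}},
            {"term": {"size": size}},
            {"terms": {"warehouse": warehouses}},
        ]}},
        "aggs": {
            "by_product": {
                # One bucket per product; 1000 is an arbitrary cap for this sketch.
                "terms": {"field": "product", "size": 1000},
                "aggs": {
                    "total": {"sum": {"field": "quantity"}},
                    # Keep only the buckets whose summed quantity passes the threshold.
                    "enough_stock": {
                        "bucket_selector": {
                            "buckets_path": {"total": "total"},
                            "script": f"params.total > {int(min_total)}",
                        }
                    },
                },
            }
        },
    }


query = stock_sum_query("Red", "S", ["AB", "XY"], 500)
print(json.dumps(query, indent=2))
```

The products that satisfy the condition are the keys of the buckets that remain under by_product in the response.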
The best possible solution would be to create a separate index for each warehouse, with one document per product/size/color/warehouse holding the related values, like this:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
This will reduce your mappings to 100 * 25 = 2500 per index.
For the other operations, I think @Val has covered them in his answer, which is quite impressive.
As for external solutions: your task is to store data, search it and fetch it, and Elasticsearch and Apache Solr are the best search engines for that kind of task. I have not tried Apache Solr, but I would highly recommend Elasticsearch because of its features, its active community support and its fast searching. Searching can also be made faster using analyzers and tokenizers. It also offers full-text search and term-level search, so you can tailor searching to the situation or problem statement.

how to break down search results with elasticsearch?

I have documents in my Elasticsearch index that represent suppliers. Each document is a supplier, and each supplier has branches as well. It looks like this:
{
"id": 1,
"supplierName": "John Flower Shop",
"supplierAddress": "107 main st, Los Angeles",
"branches": [
{
"branchId": 11,
"branchName": "John Flower Shop New York",
"branchAddress": "34 5th Ave, New York"
},
{
"branchId": 12,
"branchName": "John Flower Shop Miami",
"branchAddress": "56 ragnar st, Miami"
}
]
}
Currently I expose an API that allows searching the fields supplierName, supplierAddress, branchName and branchAddress.
The use case is a search box on my website that calls the backend and puts the results in a dropdown for the user to choose the supplier.
My issue is that, given the example document above, if you search for "John Flower Shop Miami", the result will be the whole document, and what gets presented is the top-level supplier name.
What I want is to present "John Flower Shop Miami", and I'm not sure how to tell which part of the result matched the search.
Has anyone had to do something like this before?
Handling relationships in Elasticsearch takes a bit of work, but it can be done. I recommend reading the chapter on handling relationships in the ES guide to get the big picture.
Then my advice is to index your branches as nested documents, so they are stored as distinct documents in your index.
This requires changing your query syntax to use nested queries, which can be a pain in the a..., but in exchange you get the inner_hits functionality.
It lets you know which subdocument (nested document) matched your query.
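A minimal sketch of what that could look like, assuming branches is mapped with type nested. The function names are hypothetical, and the response shape used in dropdown_labels (inner hits listed under inner_hits.branches.hits.hits, each carrying the branch fields in _source) is an assumption you should check against the response of your Elasticsearch version.

```python
def supplier_search_body(text):
    """Search top-level supplier fields and nested branch fields in one request."""
    return {
        "query": {"bool": {"should": [
            {"multi_match": {
                "query": text,
                "fields": ["supplierName", "supplierAddress"],
            }},
            {"nested": {
                "path": "branches",
                "query": {"multi_match": {
                    "query": text,
                    "fields": ["branches.branchName", "branches.branchAddress"],
                }},
                # Ask Elasticsearch to report which nested branches matched.
                "inner_hits": {},
            }},
        ]}}
    }


def dropdown_labels(response):
    """Prefer the names of the matching branches; fall back to the supplier name."""
    labels = []
    for hit in response["hits"]["hits"]:
        inner = (hit.get("inner_hits", {})
                    .get("branches", {})
                    .get("hits", {})
                    .get("hits", []))
        if inner:
            labels.extend(branch["_source"]["branchName"] for branch in inner)
        else:
            labels.append(hit["_source"]["supplierName"])
    return labels
```

Note that with this choice a supplier whose branches also match is represented by its branches only; showing both the supplier and its branches would be an equally reasonable design.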

Dynamic Achievement System algorithm / design

I'm developing an Achievement System that must have a CRUD interface, which admins use to create new achievements and their rules. I need some help with its design and algorithm so that it can easily evolve with new rules as admins ask for them.
Rules sample
Medal one: must complete any 5 courses with a score of at least 90
Medal two: must complete two specific courses with a score of at least 85
Medal three: must be top 5 in general ranking at least once
Medal four: must have more than 5000 points
I'll basically store that as metadata in a relational database, probably with these columns below:
action
action quantity
course quantity
score
id course
ranking
position
points
I want to know if there is any known algorithm / design to this kind of problem? Or perhaps I should store them differently to make it easier? Don't know, I want suggestions.
Your doubts may be right. In my opinion, a database is the wrong way to organize this data. Every new kind of achievement you want to create would add extra columns to your database, and most achievements wouldn't use most of the columns. A more flexible data structure, one that doesn't expect for every entry to use all of the possible achievement criteria at once by default, would probably be more useful. Most languages support JSON, so I suggest you use that. The structure could be something like this:
[
{
"name": "Medal One",
"requirements": {
"coursesCompleted": 5,
"scoreMin": 90
}
},
{
"name": "Medal Two",
"requirements": {
"specificCoursesCompleted": [
"Course 1",
"Course 2"
],
"scoreMin": 85
}
},
{
"name": "Medal Three",
"requirements": {
"generalRankingMin": 5
}
},
{
"name": "Medal Four",
"requirements": {
"pointsMin": 5000
}
}
]
You can see here how the criteria types are sometimes reused, but they can be omitted when not needed and new ones can be added to a few achievements without bloating the rest of the dataset as well.
PS: I made the criteria names very verbose for demonstration purposes; shortening them or not in actual use is up to preference.
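One way such a structure could be evaluated is with a small table of checks keyed by criterion name, so that a new kind of rule means adding one entry rather than changing the storage schema. The sketch below is only an illustration under assumptions: the shape of the user record (courses, best_rank, points) is invented, scoreMin is treated as a modifier on the course criteria, and pointsMin is treated as a separate criterion for total points, compared strictly to mirror "more than 5000 points".

```python
def qualifies(requirements, user):
    """Return True if the user record satisfies every criterion in requirements."""
    # scoreMin narrows which completed courses count towards the course criteria.
    score_min = requirements.get("scoreMin", 0)
    passed = {name for name, score in user["courses"].items() if score >= score_min}

    checks = {
        "coursesCompleted": lambda n: len(passed) >= n,
        "specificCoursesCompleted": lambda names: set(names) <= passed,
        # "Top N at least once": the best position ever reached must be N or better.
        "generalRankingMin": lambda pos: (
            user["best_rank"] is not None and user["best_rank"] <= pos
        ),
        # Strict comparison, matching "more than 5000 points".
        "pointsMin": lambda pts: user["points"] > pts,
        # Already applied above as a modifier, so it never fails on its own.
        "scoreMin": lambda _: True,
    }

    for key, value in requirements.items():
        if key not in checks:
            raise ValueError(f"unknown criterion: {key}")
        if not checks[key](value):
            return False
    return True


medals = [
    {"name": "Medal One", "requirements": {"coursesCompleted": 5, "scoreMin": 90}},
    {"name": "Medal Two", "requirements": {
        "specificCoursesCompleted": ["Course 1", "Course 2"], "scoreMin": 85}},
    {"name": "Medal Three", "requirements": {"generalRankingMin": 5}},
    {"name": "Medal Four", "requirements": {"pointsMin": 5000}},
]

# Hypothetical user record, invented for this example.
user = {
    "courses": {"Course 1": 95, "Course 2": 88, "Course 3": 91,
                "Course 4": 90, "Course 5": 93, "Course 6": 99},
    "best_rank": 7,
    "points": 5000,
}

earned = [m["name"] for m in medals if qualifies(m["requirements"], user)]
print(earned)
```

Raising on an unknown criterion makes a typo in an admin-defined rule fail loudly instead of silently awarding the medal.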

Elasticsearch data size optimization

I was wondering, would it be good practice to optimize data like this in Elasticsearch?
Old data
{
"user_id": 1,
"firstname": "name",
"lastname": "name",
"email": "email"
}
New data
{
"uid": 1,
"f": "name",
"l": "name",
"e": "email"
}
Let's say I have billions of documents with long key names. Would it save a lot of space if I used short key names instead?
Or does Elasticsearch compress data by default, so I don't need to worry about this?
I prefer to have more readable data, but if it could save a lot of space, then it's a whole different thing.
This question was asked five years ago and it had only one answer, so it would be nice to have more comments about this.
You can read it here:
Elasticsearch scheme optimization
Any thoughts from experienced elasticsearch developers?

How to find related songs or artists using Freebase MQL?

I have a Freebase mid such as /m/0mgcr, which is The Offspring.
What's the best way to use MQL to find related artists?
Or say I have a song mid such as /m/0l_f7f, which is Original Prankster by The Offspring.
What's the best way to use MQL to find related songs?
So, the revised question is, given a musical artist, find all other musical artists who share all of the same genres assigned to the first artist.
MQL doesn't have any operators which can work across parts of the query tree, so this can't be done in a single query, but given that you're likely doing this from a programming language, it can be done pretty simply in two steps.
First, we'll get all genres for our subject artist, sorted by the number of artists that they contain using this query (although the last part isn't strictly necessary):
[{
"id": "/m/0mgcr",
"name": null,
"/music/artist/genre": [{
"name": null,
"id": null,
"artists": {
"return": "count"
},
"sort": "artists.count"
}]
}]
Then, using the genre with the smallest number of artists for maximum selectivity, we'll add in the other genres to make it even more specific. Here's a version of the query with the artists that match on the three most specific genres (the base genre plus two more):
[{
"id": "/m/0mgcr",
"name": null,
"/music/artist/genre": [{
"name": null,
"id": null,
"artists": {
"return": "count"
},
"sort": "artists.count",
"limit": 1,
"a:artists": [{
"name": null,
"id": null,
"a:genre": {
"id": "/en/ska_punk"
},
"b:genre": {
"id": "/en/melodic_hardcore"
}
}]
}]
}]
Which gives us: Authority Zero, Millencolin, Michael John Burkett, NOFX, Bigwig, Huelga de Hambre, Freygolo, The Vandals
The things to note about this query are that this fragment:
"sort": "artists.count",
"limit": 1,
limits our initial genre selection to the single genre with the fewest artists (i.e. Skate Punk), while the prefix notation:
"a:genre": {"id": "/en/ska_punk"},
"b:genre": {"id": "/en/melodic_hardcore"}
is there to get around the JSON limitation of not allowing more than one key with the same name. The prefixes are ignored and just need to be unique (this is the same reason for the a:artists elsewhere in the query).
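Since the second query has to be assembled from the results of the first, the two-step approach lends itself to a small helper. The sketch below only builds the query structure shown above; the function name is hypothetical, and actually sending the query to an MQL endpoint is left out.

```python
import json
from string import ascii_lowercase


def related_artists_query(artist_id, extra_genre_ids):
    """Build the second-step MQL query for an artist and extra genres to require."""
    artist_clause = {"name": None, "id": None}
    # Each extra genre needs a unique prefix ("a:", "b:", ...) because a JSON
    # object cannot hold two "genre" keys. zip() stops after 26 genres.
    for prefix, genre_id in zip(ascii_lowercase, extra_genre_ids):
        artist_clause[f"{prefix}:genre"] = {"id": genre_id}

    return [{
        "id": artist_id,
        "name": None,
        "/music/artist/genre": [{
            "name": None,
            "id": None,
            "artists": {"return": "count"},
            # Start from the single genre with the fewest artists.
            "sort": "artists.count",
            "limit": 1,
            "a:artists": [artist_clause],
        }],
    }]


mql = related_artists_query("/m/0mgcr", ["/en/ska_punk", "/en/melodic_hardcore"])
print(json.dumps(mql, indent=2))
```

Python's None is serialized as JSON null, which is what MQL uses to ask for a value to be filled in.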
So, having worked through that whole little exercise, I'll close by saying that there are probably better ways of doing this. Instead of an absolute match, you may get better results with a scoring function that looks at % overlap for the most specific genres or some other metric. Things like common band members, collaborations, contemporaneous recording history, etc, etc, could also be factored into your scoring. Of course this is all beyond the capabilities of raw MQL and you'd probably want to load the Freebase data for the music domain (or some subset) into a graph database to run these scoring algorithms.
In point of fact, both last.fm and Google think a better list would include bands like Sum 41, blink-182, Bad Religion, Green Day, etc.
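As one possible reading of the "% overlap" idea above, the sketch below scores candidates by Jaccard similarity between genre sets (shared genres divided by all genres of either artist). Jaccard is my choice for illustration, not something the answer prescribes, and the candidate names and genre sets in the usage lines are placeholders rather than real Freebase data.

```python
def genre_overlap(a, b):
    """Jaccard similarity between two collections of genre ids (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_overlap(subject_genres, candidates):
    """Return candidate names sharing at least one genre, best overlap first."""
    scored = [
        (genre_overlap(subject_genres, genres), name)
        for name, genres in candidates.items()
    ]
    # Sort by descending score, then by name so that ties are stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


# Placeholder data: the candidates and their genre sets are invented.
subject = {"/en/skate_punk", "/en/ska_punk", "/en/melodic_hardcore"}
candidates = {
    "candidate_1": {"/en/skate_punk", "/en/ska_punk"},
    "candidate_2": {"/en/skate_punk", "/en/ska_punk", "/en/melodic_hardcore"},
    "candidate_3": {"/en/jazz"},
}
ranking = rank_by_overlap(subject, candidates)
print(ranking)
```

Unlike the exact-match query, this keeps artists that share only some genres and simply ranks them lower.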

Resources