Detecting changes when comparing documents within an index in ElasticSearch - elasticsearch

I'm using elastic search to store website crawl data in one index. Docs look something like this:
{"crawl_id": 1, "url": "http://www.example.com", "status": 200}
{"crawl_id": 1, "url": "http://www.example.com/test", "status": 200}
{"crawl_id": 2, "url": "http://www.example.com", "status": 200}
{"crawl_id": 2, "url": "http://www.example.com/test", "status": 500}
How would I compare 2 different crawls? For instance,
when comparing crawl_id 2 against crawl_id 1, I want to know which pages changed their status code from 200 to 500.
I'd like to get the list of documents, but also aggregate on those results.
For instance 1 page changed from 200 to 500.
Any ideas?

I would use parent/child documents for that: parents representing each URL, children representing each crawl event. Then I'd select parents by searching the children (I'm not sure whether this feature is still maintained under that name or whether it has been replaced by the join data type).
I'd also have a look at document versions and see which approach fits my requirements better.
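Whatever the index layout, the comparison itself is simple once you have the hits of both crawls in hand. Here is a minimal client-side sketch in Python: the two lists below stand in for the results of two queries filtered by crawl_id, and the field names match the documents in the question.

```python
def diff_crawls(old_docs, new_docs):
    """Return {url: (old_status, new_status)} for URLs whose status changed."""
    old = {d["url"]: d["status"] for d in old_docs}
    new = {d["url"]: d["status"] for d in new_docs}
    return {
        url: (old[url], status)
        for url, status in new.items()
        if url in old and old[url] != status
    }

# Stand-ins for the hits of two queries, one per crawl_id:
crawl_1 = [
    {"crawl_id": 1, "url": "http://www.example.com", "status": 200},
    {"crawl_id": 1, "url": "http://www.example.com/test", "status": 200},
]
crawl_2 = [
    {"crawl_id": 2, "url": "http://www.example.com", "status": 200},
    {"crawl_id": 2, "url": "http://www.example.com/test", "status": 500},
]

changes = diff_crawls(crawl_1, crawl_2)
print(changes)  # {'http://www.example.com/test': (200, 500)}
```

The aggregate asked for ("1 page changed from 200 to 500") then falls out of the returned dict, e.g. by counting entries per (old, new) pair.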

Elasticsearch collapse by a field A, with top hit by field B, with results sorted by field C

Suppose I have these documents in elasticsearch:
{"name":"alpha", "grp":1, "priority": 1}
{"name":"beta", "grp":1, "priority": 3}
{"name":"gamma", "grp":2, "priority": 5}
{"name":"zeta", "grp":2, "priority": 1}
I want to query my index and get a single hit per grp.
The hit of a grp must be the document with highest priority value.
My overall query needs to return all fields, and be sorted by name.
Sample output:
{"name":"beta", "grp":1, "priority": 3}
{"name":"gamma", "grp":2, "priority": 5}
Query collapse doesn't seem to do the trick as I would need to sort by priority rather than name.
The collapsing is done by selecting only the top sorted document per collapse key
https://www.elastic.co/guide/en/elasticsearch/reference/current/collapse-search-results.html
I feel like there must be some combination of aggregations that will get the result I'm looking for, but I'm bashing my head into a wall. Please help me!?
There is no way to achieve it using collapse (yet), you can see the current progress here: https://github.com/elastic/elasticsearch/issues/45646
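Until collapse supports an inner sort distinct from the outer one, a common workaround is a terms aggregation on grp with a top_hits sub-aggregation sorted by priority. Here is a sketch of such a request body, built as a plain Python dict; the aggregation names "by_grp" and "best" are made up, and the final sort by name would be done client-side, since buckets can't be ordered by a field of their top hit.

```python
# One bucket per grp, keeping only the highest-priority document per bucket.
query_body = {
    "size": 0,  # only the aggregation buckets are wanted, not raw hits
    "aggs": {
        "by_grp": {
            "terms": {"field": "grp"},
            "aggs": {
                "best": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{"priority": {"order": "desc"}}],
                    }
                }
            },
        }
    },
}

# Client side, the per-group winners would then be sorted by name, e.g.:
# winners = [b["best"]["hits"]["hits"][0]["_source"] for b in buckets]
# winners.sort(key=lambda d: d["name"])
```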

restructure elasticsearch index to allow filtering on sum of values

I have an index of products.
Each product has several variants (anywhere from a few to hundreds; each has a color & size, e.g. Red).
Each variant is available (in a certain quantity) at several warehouses (around 100 warehouses).
Warehouses have codes, e.g. AB, XY, CD, etc.
If I had my choice, I'd index it as:
stock: {
  Red: {
    S: { AB: 100, XY: 200, CD: 20 },
    M: { AB: 0, XY: 500, CD: 20 },
    2XL: { AB: 5, XY: 0, CD: 9 }
  },
  Blue: {
    ...
  }
}
Here's a kind of customer query I might receive:
Show me all products, that have Red.S color in stock (minimum 100) at warehouses AB & XY.
So this would probably be a filter like
Red.S.AB > 100 AND Red.S.XY > 100
I'm not writing the whole filter query here, but it's straightforward in Elasticsearch.
We might also get SUM queries, e.g. the sum of inventories at AB & XY should be > 500.
That'd be easy through a script filter, say Red.S.AB + Red.S.XY > 500
The problem is, given 100 warehouses, 100 sizes, and 25 colors, this easily needs 100 * 100 * 25 = 250k mappings. Elasticsearch simply can't handle that many keys.
The easy answer is to use nested documents, but nested documents pose a particular problem: we cannot sum across a given selection of nested documents, and nested docs are slow, especially when we're going to have 250k per product.
I'm open to external solutions beyond Elasticsearch as well. We're a Rails/Postgres stack.
You have your product index with variants, that's fine, but I'd use another index for managing anything related to the multi-warehouse stock. One document per product/size/color/warehouse with the related count. For instance:
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "AB",
  "quantity": 100
}
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "XY",
  "quantity": 200
}
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "CD",
  "quantity": 20
}
etc...
That way, you'll be much more flexible with your stock queries, because all you'll need is to filter on the fields (product, color, size, warehouse) and simply aggregate on the quantity field, sums, averages or whatever you might think of.
You will probably need to leverage the bucket_script pipeline aggregation in order to decide whether sums are above or below a desired threshold.
It's also much easier to maintain the stock movements by simply indexing the new quantity for any given combination than having to update the master product document every time an item gets out of the stock.
No script, no nested documents required.
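As a sketch of that idea: filter the per-warehouse stock documents, bucket by product, sum the quantities, and use a bucket_selector pipeline aggregation (a close sibling of bucket_script that filters buckets instead of computing a value) to keep only products whose sum clears the threshold. The request body below is built as a plain Python dict; the aggregation names "by_product", "total_qty", and "enough_stock" are illustrative.

```python
# "Sum of Red/S stock at warehouses AB and XY must exceed 500."
query_body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"color": "Red"}},
                {"term": {"size": "S"}},
                {"terms": {"warehouse": ["AB", "XY"]}},
            ]
        }
    },
    "aggs": {
        "by_product": {
            "terms": {"field": "product"},
            "aggs": {
                # Sum the quantities of the matching stock docs per product...
                "total_qty": {"sum": {"field": "quantity"}},
                # ...and drop any product bucket below the threshold.
                "enough_stock": {
                    "bucket_selector": {
                        "buckets_path": {"total": "total_qty"},
                        "script": "params.total > 500",
                    }
                },
            },
        }
    },
}
```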
The best possible solution would be to create separate indexes for the warehouses. Each warehouse index would hold one document per product/size/color/warehouse with the related values, like this:
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "AB",
  "quantity": 100
}
This will reduce your mappings to 100 * 25 = 2,500 per index.
As for the other operations, I feel @Val has covered them in his answer, which is quite impressive.
Coming to external solutions, your task is to store data, search it, and fetch it. Elasticsearch and Apache Solr are the best-known search engines for these kinds of tasks. I have not tried Apache Solr, but I would highly recommend going with Elasticsearch because of its features, active community support, and fast searching. Searching can also be sped up using analyzers and tokenizers, and features like full-text search and term-level search let you tailor searching to the problem at hand.

Joining two indexes in Elastic Search like a table join

I am relatively new to Elasticsearch. I have an index called post which contains documents like this:
{
  "id": 1,
  "link": "https://www.instagram.com/p/XXXXX/",
  "profile_id": 11,
  "like_count": 100,
  "comment_count": 12
}
I have another index called profile which contain documents like this:
{
  "id": 11,
  "username": "superman",
  "name": "Superman",
  "followers": 12312
}
So, as you guys can see, I have all profiles data under the index called profile and all posts data under the index called post. The "profile_id" present in the post document is linked with the "id" present in the profile document.
Is there any way, when I am querying the post index and filtering the post documents, for the profile data to appear along with each post document, based on the "profile_id" present in the post document? Or can I somehow fetch both with a multi-index search?
Thank you guys in advance, any help will be appreciated.
For the sake of performance, Elasticsearch encourages you to denormalize your data and model your documents according to the responses you wish to get from your queries. However, in your case, I would suggest defining the post-profile relation with a join datatype (link to Elastic documentation) and using the parent-join queries to run your searches (link to Elastic documentation).
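A minimal sketch of that setup, with the mapping and query bodies as plain Python dicts: "profile_post" is a made-up name for the join field, profile is the parent relation, and post the child. Note that with a join field, parents and children must live in the same index, and each child is indexed with routing set to its parent's id.

```python
# Join-field mapping: one index holding both profiles and posts.
mapping = {
    "mappings": {
        "properties": {
            "profile_post": {
                "type": "join",
                "relations": {"profile": "post"},  # parent: profile, child: post
            }
        }
    }
}

# Example search: posts whose parent profile has over 10000 followers.
# inner_hits returns the matching parent profile alongside each post hit.
query_body = {
    "query": {
        "has_parent": {
            "parent_type": "profile",
            "query": {"range": {"followers": {"gt": 10000}}},
            "inner_hits": {},
        }
    }
}
```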

Boosting on facet count

Is it possible to boost a document based on how many of the 'same kind' are in the current search result in a Solr/Lucene query?
An example:
I'm looking at 'red dress' and this is the current situation on the facet counts:
"facet_counts": {
  "facet_queries": {},
  "facet_fields": {
    "sku_fashion": [
      "children", 994,
      "home", 9,
      "men", 245,
      "women-apparel", 2582,
      "women-jewelry-access", 3,
      "women-shoes-handbags", 2
    ]
  }
}
For this user, a personalisation signal is going to make me blindly boost all the items in the men fashion category, but it looks like they are not worth pushing up, given that they are less than 8% of the entire result set (they are probably junk that is better not shown to the user).
The problem is that I have no idea how to access this info from the function query I use to re-score the documents based on the personalisation signals.
Ideally, I would love to access the above info and kill the personalisation signal telling me to boost the men fashion.
Any idea?
Best
Ugo

How can I query/filter an elasticsearch index by an array of values?

I have an elasticsearch index with numeric category ids like this:
{
  "id": "50958",
  "name": "product name",
  "description": "product description",
  "upc": "00302590602108",
  "categories": [
    "26",
    "39"
  ],
  "price": "15.95"
}
I want to be able to pass an array of category ids (a parent id with all of its children, for example) and return only results that match one of those categories. I have been trying to get it to work with a term query, but no luck yet.
Also, as a new user of elasticsearch, I am wondering if I should use a filter/facet for this...
ANSWERED!
I ended up using a terms query (as opposed to term). I'm still interested in knowing if there would be a benefit to using a filter or facet.
As you already discovered, a termQuery would work. I would suggest a termFilter though, since filters are faster and cacheable.
Facets won't limit results, but they are excellent tools: they count hits for specific terms within your total results, and can be used for faceted navigation.
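For reference, in current Elasticsearch the old termFilter's behaviour lives in a bool query's filter clause (unscored and cacheable). A sketch of the request body for this question, built as a plain Python dict:

```python
# Match products whose categories field contains any of the given ids.
category_ids = ["26", "39"]

query_body = {
    "query": {
        "bool": {
            # filter context: no relevance scoring, results are cacheable
            "filter": [
                {"terms": {"categories": category_ids}}
            ]
        }
    }
}
```

A document matches if at least one of its categories values appears in the supplied list, which is exactly the "parent id with all of its children" case.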
