Stormcrawler / Elasticsearch and keeping track of inbound links to a page - elasticsearch

When we are searching the results of the Stormcrawler crawl in the Elasticsearch index, people are inevitably comparing the results to Google, and the searched results are comparing unfavorably to the google search of the same topic. One of the ways Google helps to determine the rank of various pages is to track the in-bound links to any given page.
In thinking about the search results on our page, and looking through the status index, I came across the field url.path. url.path appears to contain the entire path that led to the current page.
Would it be possible to create a multivalue field in the index that gets populated with just the last url from whatever bolt/function generates the url.path. That way the field would end up being an array of all pages that are directly linking to the current document.
With that info, you could potentially count the values and get an idea of the relative popularity of the current doc by all of the pages linking to it.
Is something like that possible with Stormcrawler?

This would be possible with some modifications of the code. By default, we keep the info about a discovered URL, including the path that led to it, only for the first instance of that URL being discovered. There could be various ways of implementing this, for instance with a custom bolt accumulating the inlinks into Redis or a Graph DB.
Your underlying question is about relevance tuning with Elasticsearch. This depends of course on what fields are sent by the crawler but not only. I know of some StormCrawler users who used it with ES as a replacement for Google Search Appliance with great success. Info about inlinks could help, but you should be able to get decent results without it.

Related

How to get most retrieved items from an elasticsearch index?

I have uploaded some data to elasticsearch and I would like to keep track of how many times a data point is returned by past searches, that is to say, the most popular searched items.
Does elasticsearch provide such functionality to achieve this without implementing and updating a counter myself?
Cheers.

Elastic search pagination starting with a particular document

I am using Elastic search to show a paginated list of products in a grid view in a mobile app. Now the user can scroll through the list and click on any product to view the details.
Now the detail view also supports scrolling through the products via swipe left and right. So for the detail view, I want to fetch paginated results from elastic search starting from a particular product.
For now I am calculating the index of the product in list view and then doing the math to fetch that particular page and scroll to the index.
Is there a better way to do this?
You can use the scroll api to get paginated results for your search query.
Alternatively, you can use the search_after api which seems to perform better for large number of results but it is available only for 7.x+ elastic versions.
I'm not totally certain — I haven't tested it myself — but I think the suggestion of using the search after API is the way to go.
You need to use something like the point in time feature, which is what the search after uses. Without it, you have no guarantee that the data in the database aren't changing. If the data change, then your search result may change. If that changes, then what comes "next" also changes, and that may no longer correspond to what you want.
E.g., if you currently have 10 search results and your item of interest is at index 5, if someone adds a document that moves your point of interest to index 6, then naïvely asking for the next item would return the same thing!
The point in time feature creates a snapshot of the database at a moment in time, so you don't have to worry about new or modified documents messing things up.
As an aside, using the point in time feature at scale is probably (again, making educated guesses here) not a very good idea. Elasticsearch has to keep a mini-snapshot of the whole database (!) every time you call that for the duration.
You're probably better off limiting the number of items people can page through to something large but manageable, and then reloading a new page if someone gets to the end. If you pull 500 products initially and someone gets to the end (which seems unlikely to me a priori), you could re-issue the search paged forward 500 items, deduplicate at the boundary and no one will be the wiser.

Alternatives for real time score by popularity with elasticsearch

I would like boost a document's score by popularity. I'd like it to be as real-time as possible.
In order to meet the real time requirement, it seems I have to re-index each document each time it's popularity changes (per view). This seems highly inefficient.
An alternative is to run a batch process that periodically re-indexes documents that have been recently viewed, but this becomes less real-time, and still requires re-indexing entire documents when only one field (the popularity) has changed.
A third approach (which we have implemented) is to use a plugin to grab a document's popularity from an external source and use a script to include it in scoring. This works as well, but slows down search for large document spaces. Using rescore helps, but it only allows us to sort a subset of the documents returned.
Is there a better option (a way to add popularity to the index without reindexing the entire document or a better way to integrate external data with elastic search)?
You can try the following to have realtime popularity field.
Include a popularity field as part of your index.
Increment popularity every time a document is retrieved. You can do this using partial update scripts.
Use function score query to boost the document.
Java API:
new FunctionScoreQueryBuilder(matchQuery("canonical_name",
phrase).analyzer("standard")
.minimumShouldMatch("100%")).add(
fieldValueFactorFunction("popularityScore")
.modifier(Modifier.LOG1P).factor(2f))
.boostMode("sum"))
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/boosting-by-popularity.html
We implemented a hybrid of your second and third approach. We had an external source (in our case a DB) that stored popularity values for a doc id and all queries regarding popularity where served from there. Additionaly we had a cron that updated all documents every hour by reindexing. The reason we reindexed is because we had other analysis done on the document that needed the new popularity but technically you can only have the db as it serves all request purposes.
DB are genearly faster when it comes to number retrieval for a doc id than eelstic search/lucene/solr. Hope this helps.
I know this is a old question, but Elasticsearch has released a experimental feature where you can provide ranks per document in the search query:
https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
Basically, if you believe that some documents will be returned from a certain search query, you can provide those documents (their ids) along with a rank (per document) in the search query. If a provided document id is within the search result, its rank will be used to boost itself.
Since you have to provide an array of document ids and their ranks in the search query, you need some way to determine (beforehand) if these documents are expected in the search result.
This feature just seems the wrong way around at first, since you need to figure out potential results before you execute the actual search. But maybe it's something. It's real time at least.
https://www.elastic.co/guide/en/elasticsearch/reference/6.7/search-rank-eval.html

How to handle pagination when the source data changes frequently

Specifically, I'm using Elasticsearch to do pagination, but this question could apply to any database.
Elasticsearch provides methods to paginate search results with handy from and to parameters.
So I run a query get me the most recent data from result 1 to 10
This works great.
The user clicks "next page" and the query is:
get me the most recent data from result 11 to 20
The problem is that in the time between the two queries, 2 new records have been added to the backing database, which means the paginated results will overlap (the last 2 from the first page show up as first two on the second page).
What's the best solution to avoid this? Right now, I'm adding a filter to the query that tell it to only include results later than the last result of the previous query. But it just seems hackish.
A filter is not a bad option, if you're already indexing a relevant timestamp. You have to track that timestamp on the client side in order to correctly prepare your queries. You also have to know when to get rid of it. But those aren't insurmountable problems.
The Scroll API is a solid option for this, because it effectively snapshots in time on the Elasticsearch side. The intent of the Scroll API is to provide a stable search query for deep pagination, which has to deal with the exact issue of change that you're experiencing.
You begin a Scrolling Search by supplying your query and the scroll parameter, for which Elasticsearch returns a scroll_id. You then make requests to /_search/scroll supplying that ID, each of which return a page of results and a new scroll_id for the next request.
(Note that you don't want the scan search type here. That's used to extract documents en masse, and does not apply any sorting.)
Compared to filtering, you do still have to track a value: the scroll_id for your next page of results. Whether that's easier than tracking a timestamp depends on your app.
There are other potential downsides to consider. Elasticsearch persists the context for your search on a single node within the cluster. Conceivably these could accumulate in your cluster, depending on how heavily you rely on scrolling search. You'll want to test the performance implications there. And if I recall correctly, scrolling searches also do not persist through a node failure or restart.
The ES documentation for the Scroll API provides good details on all of the above.
Bottom line: filtering by timestamp is actually not a bad choice. The Scroll API is another valid option, designed for a similar use case, but is not without its drawbacks.
Realise this is a bit old but with ElasticSearch 6.3 there's now the search_after feature for the request body which allows for cursor type paging:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher.
You need to use scan API for this. Scan and scroll API let's you do point in time search and pagination.
Scan API -

Data model for fields that change frequently in ElasticSearch

What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when the indexing of the document will likely be expensive given the number of indexed field and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.
Maybe you shouldn't update so frequently. Perhaps things like vote/views should only be periodically updated in ES, while more critical fields like answers/questions be pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. of course, that method will require more memory/disk space, and it will take up more cpu/time when you query for totals. An alternative would be to have the parent store searchable fields, and let the child hold the metadata (where the child's fields are not analyzed). this will allow you to make frequent updates without having to undergo an expensive re-index, since there is nothing to index.
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...
I think best way to handle the change is to split the document (you can use Parent child relationship, or just have parent id), and make document as small as possible (moving changeable part to new types) .
This can be a way to accomplish your requirement say SO,
You can use multiple types for this, consider This post (Views and Vote count).
Create a type for post, view and vote.
For a post , index a document to post type (index post id, title description tag), and for every view of that post you can index a document to view type (with id of post), and if voted you can index vote with (no of votes , id of post and other info you need [like positive or negative flag] ) to vote type.
So, to get views for post, use filter of post id, and get document counts in views type
To get no of votes, use stat aggregation for no of votes , or terms aggregation followed by stat aggregation for getting positive and negative votes.
This is way I think is best, and there can be other opinion too.
Thanks
What I do is that I use a database like mongo or mysql for storing properties that get updated frequently and use elastic search to store documents for text searching.
Example: I want to keep data about a book and its contents and I also want to keep the total number of views, updating and reindexing the document each time a user views it is a total overkill.

Resources