ElasticSearch suggestions on blog name, authors and tags

ElasticSearch suggestions on blog name, authors and tags - laravel

I've been looking all over the place for a good answer to my question but I just can't find any...
I'm using ElasticSearch along with Laravel. I've used ElasticSearch on another project but never used suggestions. I'm following this tutorial as I think it provides a great starting point for using Laravel with ElasticSearch: https://blog.madewithlove.be/post/how-to-integrate-your-laravel-app-with-elasticsearch/
My question is about suggestions; I want my search to be a search-as-you-type just like the one you would find on Spotify. I want my users to type a few letters in the search box and have the results be organized into multiple categories: blogs, authors, tags.
If I index my data into one index, with authors and tags being blog's nested objects, I can easily get suggestions using the completion suggester for blog names, but not for nested objects. I could also split each model and index data separately into different indexes, but that would mean I would have to make 3 queries to get my results back.
Am I doing something wrong? Should I structure my data differently? Is making 3 queries the way to go or is there a way to have a single query output search results from different indexes?
Thanks!
Xavier

Something that I did when I built a search-as-you-type was I used a separate index for suggestions. In your situation, you'd index the name (title, author, whatever) in one field and the type in another. Then you could search on one field and display the grouped results.
The advantage here is speed. This will likely be a heck of a lot faster than trying to do a suggester on your nested data. (Which you can probably do, but I'm not sure how.) And speed is pretty important for this type of feature.

Related

How to calculate relevance in Elasticsearch based on associated documents

Main question:
I have one data type, let's call them People, and another associated data type, let's say Reports. A person can have many reports associated with them and vice versa in our relational database. These reports can be pretty long, sometimes over 1000 words mostly in English.
We want to be able to search people by a keyword, where the results are the people whose reports are most relevant to the keyword. For example if Person A's reports mention "art director" a lot more than any other person, we want Person A to show up high in the search results if someone searched "art director".
More details:
The key thing here, is that we don't want to combine all the reports together and add them as a field for the Person model. I think that with 100,000s of our People records and 1,000,000s of long reports, this would make the index super big. (And I think there might be limits on how long the text of a field can be.)
The reports are also indexed on their own, so that people can do full text searches of all the reports without considering the People records. This works great already.
To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
Is this possible, and if so how?
P.S. I am using the Searchkick Ruby gem to generate the Elasticsearch queries through an API. But I can also use the Elasticsearch DSL directly if necessary.

Answering to your questions.
1.(...) we want Person A to show up high in the search results if someone searched "art director".
That's exactly what Elasticsearch does, so I would recommend you to start with a simple match query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
From there you can start adding up more complexity.
Elasticsearch uses TF-IDF which means:
TF(Term Frequency): The most frequent a term is within a document, more relevant it is.
IDF(Inverse Document Frequency): The most frequent a term is across the entire dataset the less relevant it is.
2.(...) To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
You are right. The recommendation is not indexing a book as a field, but index the different chapter/pages/etc.. as documents.
https://www.elastic.co/guide/en/elasticsearch/reference/current/general-recommendations.html
There are some structures you can use. Which one to use will depend on how big is the scale of your data, en how do you want to show this data to your users.
The structures are:
Joined field type (parent=author child=report pages)
Nested field type (array of report pages within an author)
Collapsed results (each doc being a book page, collapse by author)
We can discuss a lot about the best one, but I invite you to try yourself.
Some guidelines:
If the number of reports outnumber for a lot to the author you can use joined field type.

Elasticsearch - Autocomplete return word/term/token suggestions instead of whole documents

I am trying to implement a simple auto completion for query terms.
There are many different approaches but most of them do return documents instead of terms
- or the authors simply stopped explaining from that point and i am not able to adapt.
A user is typing in a query - e.g. phil
What i want is to provide a list of term completion suggestions like philipp, philius, philadelphia, ...
I am able to get document matches via (edge)ngrams, phrase_prefix and so on but i am am stuck at retrieving matching terms (completion suggestions).
Can someone give me a hint?
I have documents like this {"title":"...", "description":"...", "content":"..."}
All fields have larger string values but especially the field content contains fulltext content.
I do not want to suggest the whole title of a document containing e.g. Philadelphia. Just the word "Philadelphia".

Looking for something like that, myself.
In SOLR it was relatively simple to configure (although a pain to build and keep up-to-date) using solr.SpellCheckComponent. Somehow the same underlying Lucene functionality is used differently between SOLR and ElasticSearch, and in ElasticSearch it is geared towards finding whole documents (or whole field values, if you will) or so it seems...
Despite the profusion of "elasticsearch autocomplete" articles, none appears to deal with this particular issue. Like it doesn't exist. Maybe their use case is different and ElasticSearch works for them just fine, who knows?
At this point I think that preparing the exact field values to use with ElasticSearch autocomplete (yes, that's the input field values, not analyzer tokens) maybe the only way to solve the problem. Which is terrible, because the performance is going to be very low.

Try term suggester:
The term suggester suggests terms based on edit distance. The provided
suggest text is analyzed before terms are suggested. The suggested
terms are provided per analyzed suggest text token. The term suggester
doesn’t take the query into account that is part of request.

ElasticSearch Wrapping Head on Index Types

I'm looking into elastic search right now and I am having a hard time grasping how index types fit into the data model, I've read examples and documentation but none really goes in depth or the examples seem to use a data model that is composed of several submodels.
I am currently using mongodb to store my data, let's take this example of an Article collection that I want to be indexed for search, my doc looks like this:
Article = {
title: String,
publisher: String,
subject: String,
description: String,
year: Integer,
}
Now I want each of those fields to be searchable, so I would make an elasticsearch index of 'Article'. I will need to define each field and how it should be analysed and whether it is stored or not, that I understand.
Now how does an index type come in here? As far as I am aware, Lucene does not have this concept, this is a layer added by Elasticsearch.
For example, maybe some of you may say that we can logically group the documents by subject or publisher and create index types on those but how is this different from searching by subject or publisher?
Is it more of a performance related aspect that we have index types?

Not a very easy question to answer, but I am going to give it a try. But be warned this just my opinion.
First of all, if you do not want to keep certain documents together in an index, just because it feels they should, create separate indices. There is not really a penalty for using more indices over more types. The only thing I can think of is that you could create analysers and mappings that you can reuse over the different types.
You can use types if you feel documents belong together, they have similar structure but not necessary the same structure. Be warned though, do not create separate mappings for fields with the same name in different types within the same index. Lucene does not like this.
Than there is the final scenario, in parent-child relationships, here you need types. This way the parent and it's children can be places in the same shard which is better for performance.
Hope that helps a bit.

If I'm not mistaken, the catch with using more than one data type in one index is almost identical to using different indices. Say, you can store (as I did) documents of types "simple_address", "delivery_address", "some_strange_but_official_address_info" in the same index "address" to make your code a bit more sane. But if you don't use parent-child links, it's equivalent to just having three indices.
Speaking of your example, you should wrap your head around what would you like to search. If, for instance, you add comments in equation, it's better to use some kind of separation - either as parent-child or different indices with manual mapping by keys. And, obviously, you should have different mappings for "Article" and "Comment" types.

Data model for fields that change frequently in ElasticSearch

What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when the indexing of the document will likely be expensive given the number of indexed field and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.

Maybe you shouldn't update so frequently. Perhaps things like vote/views should only be periodically updated in ES, while more critical fields like answers/questions be pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. of course, that method will require more memory/disk space, and it will take up more cpu/time when you query for totals. An alternative would be to have the parent store searchable fields, and let the child hold the metadata (where the child's fields are not analyzed). this will allow you to make frequent updates without having to undergo an expensive re-index, since there is nothing to index.
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...

I think best way to handle the change is to split the document (you can use Parent child relationship, or just have parent id), and make document as small as possible (moving changeable part to new types) .
This can be a way to accomplish your requirement say SO,
You can use multiple types for this, consider This post (Views and Vote count).
Create a type for post, view and vote.
For a post , index a document to post type (index post id, title description tag), and for every view of that post you can index a document to view type (with id of post), and if voted you can index vote with (no of votes , id of post and other info you need [like positive or negative flag] ) to vote type.
So, to get views for post, use filter of post id, and get document counts in views type
To get no of votes, use stat aggregation for no of votes , or terms aggregation followed by stat aggregation for getting positive and negative votes.
This is way I think is best, and there can be other opinion too.
Thanks

What I do is that I use a database like mongo or mysql for storing properties that get updated frequently and use elastic search to store documents for text searching.
Example: I want to keep data about a book and its contents and I also want to keep the total number of views, updating and reindexing the document each time a user views it is a total overkill.

apache cassandra query/full text search

I've been playing around with apache's cassandra project. Done a fair bit of readin and i have some fairly complex examples that i've done, including inserting single and batch sets of data, retrieving a single and multiple data sets based on keys.
Some of the articles i've looked at include
http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example
http://github.com/digg/lazyboy
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
http://www.sodeso.nl/?p=80
I've got a fairly good grasp of the concepts explained and have even implemented a simple app.
None of the articles describe how one would go about performing a query where, for eg, the query is a search term a user has typed in.
Does anyone know how or can suggest how i'd go about performing such a query?
Or perhaps a way to create a searchable index, full text search or anything even remotely close?

You will probably split text into words, and than use these words as keys to your "index". Each word will contain timestamp ordered column family with list of IDs to your articles, messages etc. So you can only perform simple searches over keys (words).
When searching more than one word, use intersection over these column families.
This is very simple approach, if you need more complex queries look at Lucandra - http://github.com/tjake/Lucandra - Lucandra is a fulltext search engine with Cassandra as backend storage.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio