I have a table of movies (movie_id, title); one movie can have many titles (in different languages).
I would like to implement full-text search across all titles, with movies of the same relevance ordered by date. I'm currently using Sphinx with this:
sql_joined_field = all_movie_titles from query; select movie_id as id, title from tbl_movie_titles order by movie_id
It's the only field used for search.
As I understand it, this makes Sphinx match the keyword against every title of a movie, but some movies have 2 titles while others have, for example, 10. Because keywords are often repeated across the different titles of one movie, Sphinx computes the relevance weight from the matches in all of that movie's titles. As a result, two movies that should have the same relevance get different weights. I've tried different rankers, but the results are still bad. How can I make Sphinx calculate the weight for each title of a movie independently and then take the highest?
If this task is easier to solve with another search engine, such as Elasticsearch, please tell me.
Thanks
You've effectively created a field that contains all the titles concatenated into one long string (that is the 'joined' in the definition).
So a multi-titled movie will contain the same words multiple times, which, as you say, can affect ranking.
You currently seem to be set up with one Sphinx document per movie (regardless of how many titles the movie has).
One option would be to have one document per title instead (i.e. per movie/language combination); the ranking will then be computed 'within' a single title. Each title then needs its own unique document id, since movie_id is no longer unique per document.
Because you (presumably) only want one result per movie, you can use the query-time GROUP BY option (which means making sure movie_id is stored as an attribute).
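To make the intended effect concrete, here is a minimal sketch in plain Python (not Sphinx itself) of what "one document per title, then group by movie_id" amounts to: keep the best-scoring title per movie, then order by weight and break ties by date. The hit tuples, weights, and dates are made up for illustration, and `best_per_movie` is a hypothetical helper name.

```python
from datetime import date

# Hypothetical per-title hits, as a search engine might return them when
# every title is its own document: (movie_id, title, weight, release_date).
hits = [
    (1, "The Matrix", 2500, date(1999, 3, 31)),
    (1, "Matrix", 2600, date(1999, 3, 31)),
    (2, "The Matrix Reloaded", 2500, date(2003, 5, 15)),
    (3, "Matrix", 2600, date(2010, 1, 1)),
]


def best_per_movie(hits):
    """Keep the highest-weighted title per movie, then sort the movies."""
    best = {}
    for movie_id, title, weight, released in hits:
        current = best.get(movie_id)
        if current is None or weight > current[2]:
            best[movie_id] = (movie_id, title, weight, released)
    # Highest weight first; among equal weights, newest release first.
    return sorted(best.values(), key=lambda h: (-h[2], -h[3].toordinal()))


ranked = best_per_movie(hits)
for movie_id, title, weight, released in ranked:
    print(movie_id, title, weight, released)
```

The number of titles a movie has no longer influences its weight here: only its single best title counts, which is the behaviour the question asks for.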
Related
Disclaimer: Possible duplicate of this SO question, not sure...
Let's assume I have something similar to IMDB (e.g. catalog of movies) and I want to store it in Elasticsearch.
Single Movie record contains Title, Description, and Categories (strings, e.g. "Children", "Action", etc).
Let's assume that users are allowed to search free text, which can be anything: words from the title, from the description, or from the categories (e.g. "movie for children").
I'm wondering, from a search performance perspective, which is more efficient: querying each of the fields, or creating one big field that is a concatenation of all the fields and querying only that.
I'm currently evaluating whether to use elasticsearch or solr in a project and working through the cases that need to be implemented. I found one case for which I couldn't find any documentation, which felt a bit strange since the case seems quite common. The categories are user supplied, so I don't know them in advance. Consider the following part of a taxonomy with documents that can have multiple categories:
Root (3)
  Books (2)
    Sci-fi (1)
      DocumentA
    Fantasy (2)
      DocumentA
      DocumentC
  Movies (1)
    Action (1)
      DocumentB
  Games (1)
    Adventure
      DocumentB
In this case DocumentB could be an entry for e.g. Indiana Jones. Normal term hierarchies can be implemented using the path hierarchy tokenizer in solr/elastic, so DocumentC would have 'Root/Books/Fantasy' as category with a path split on '/'.
DocumentB, however, would need to have two paths ('Root/Movies/Action' and 'Root/Games/Adventure'). I thought about dynamically adding one category_n field per path for the document in elastic with the path hierarchy tokenizer and then doing the category search on all the category_* fields, but I don't know if that would be the right approach, especially considering that the document count for the facets is not simple: the count of a parent node is not the sum of its children (documents can be in multiple child categories and should not be counted more than once).
What would be a good way to implement this in solr/elastic?
Cheers
I ended up using ES with a single category field into which I put every path to the node, so 'Root/Movies/Action' and 'Root/Games/Adventure'. I then used a path hierarchy tokenizer splitting on / for this field.
ES supports putting multiple paths in that field and searching them. I then used an aggregation with bucketing on the categories, which yielded exactly what I wanted: documents are not counted multiple times if they occur more than once in a branch.
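As a rough illustration of why this gives the right counts, here is a small Python sketch (not Elasticsearch code) that mimics the idea: each path is expanded into all of its prefixes, roughly as a path hierarchy tokenizer would, and each bucket holds a set of document ids so a document is counted once per node. The function names are hypothetical; the documents and paths come from the taxonomy in the question.

```python
from collections import defaultdict

# Documents and their category paths, taken from the example taxonomy.
docs = {
    "DocumentA": ["Root/Books/Sci-fi", "Root/Books/Fantasy"],
    "DocumentB": ["Root/Movies/Action", "Root/Games/Adventure"],
    "DocumentC": ["Root/Books/Fantasy"],
}


def path_prefixes(path, sep="/"):
    """'Root/Books/Fantasy' -> ['Root', 'Root/Books', 'Root/Books/Fantasy']."""
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def facet_counts(docs):
    """Count distinct documents under every node of the hierarchy."""
    buckets = defaultdict(set)
    for doc_id, paths in docs.items():
        for path in paths:
            for prefix in path_prefixes(path):
                # A set guarantees each document is counted once per node,
                # even if several of its paths pass through that node.
                buckets[prefix].add(doc_id)
    return {prefix: len(ids) for prefix, ids in buckets.items()}


counts = facet_counts(docs)
for node in sorted(counts):
    print(node, counts[node])
```

DocumentA reaches 'Root/Books' through two paths but is counted there only once, so a parent's count is not simply the sum of its children.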
I've designed a news hub system that reads RSS links and stores the full news items in a database. Now I want to implement a search system using tags. Each news item has its own tags. There are lots of ways to implement this, but I don't know which is the most common or performs best. Currently I'm using Elasticsearch as the database, with multiple-keyword search. Which of these is best?
1- Store the tags in a list, or in a string with a separator, and search among them?
2- Work like a relational system: a table of tags and a table of news tags with one record per news tag, so 5 records for a news item with 5 tags?
3- Another approach that I don't know of?
Seems like you want something like the inverted index
This is an index, that for each term (hashtag in your case) holds a list of document ids which contain this hashtag.
For example, if you have 3 documents: d1,d2,d3 with the hash tags:
d1: #tag1, #tag2
d2: #tag3
d3: #tag3, #tag2
The inverted index will be:
#tag1: d1
#tag2: d1,d3
#tag3: d2,d3
Using the inverted index, it is fairly easy to find all documents that contain a certain term (hashtag in your case), by simply going over the list that is attached to this term.
This datastructure is also very efficient for union (or queries) and intersection (and queries).
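A minimal in-memory sketch of this structure in Python, using the three example documents above. The function names are hypothetical, and a real engine such as Elasticsearch builds and stores this index for you; the sketch only shows the idea.

```python
from collections import defaultdict

# The three example documents and their hashtags.
docs = {
    "d1": ["#tag1", "#tag2"],
    "d2": ["#tag3"],
    "d3": ["#tag3", "#tag2"],
}


def build_inverted_index(docs):
    """Map each tag to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def search_any(index, tags):
    """OR query: union of the posting sets."""
    result = set()
    for tag in tags:
        result |= index.get(tag, set())
    return result


def search_all(index, tags):
    """AND query: intersection of the posting sets."""
    postings = [index.get(tag, set()) for tag in tags]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(docs)
print(sorted(index["#tag2"]))
print(sorted(search_any(index, ["#tag1", "#tag3"])))
print(sorted(search_all(index, ["#tag2", "#tag3"])))
```

Sets are used for the posting lists here because they make union and intersection one-liners; production indexes typically keep sorted lists of ids instead.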
This data structure is very popular in information retrieval for full-text search and is also often used in semi-structured search.
For more information, you can read about Information Retrieval in general. Manning's Introduction to Information Retrieval presents this data structure in the book's first chapter.
ElasticSearch will handle that very well and you have multiple ways of implementing that behavior.
What you want is a parent child relationship between a news article (parent) and its tags (children).
Depending on whether you need to update the hashtags after indexing your news articles or not, you could go with storing them in the news article or as separate documents pointing to the news article document as their parent.
See more details here: http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
You mentioned a choice between storing the tags as a list or as a comma-separated string. Go with the list, as that is more idiomatic and ElasticSearch can handle JSON objects (with a string, you would actually have to analyze it and turn it into a list of tokens anyway).
I have an app where people can list stuff to sell/swap/give away, with 200-character descriptions. Let's call them sellers.
Other users can search for things - let's call them buyers.
I have a system set up using Django, MySQL and Sphinx for text search.
Let's say a buyer is looking for "t-shirts". They don't get any results they want. I want the app to give the buyer the option to check a box to say "Tell me if something comes up".
Then when a seller lists a "Quicksilver t-shirt", this would trigger a sort of reverse search on all saved searches to notify those buyers that a new item matching their query has been listed.
Obviously I could trigger Sphinx searches on every saved search every time any new item is listed (in a loop) to look for matches - but this would be insane and intensive. This is the effect I want to achieve in a sane way - how can I do it?
You literally build a reverse index!
Store the 'searches' in the database, and build an index on them.
So 't-shirts' would be a document in this index.
Then when a new product is submitted, you run a query against this index. Use 'Quorum' syntax, or even match-any mode, to get matches on just one keyword.
So in your example, the query would be "Quicksilver t-shirt"/1 which means match Quicksilver OR t-shirt. The same holds for much longer titles, or even the whole description.
The result of that query would be a list of the (single-word*) original searches that matched. Note this also assumes your index is set up to treat - as a word character.
*Note it's slightly more complicated if you allow more complex queries: multiple keywords, negations, OR brackets, phrases, etc. In that case the reverse search just gives you POTENTIAL matches, so you need to confirm that each saved search really matches the new item. That is still a number of queries, but you don't need to run it on all of them.
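The two-step idea (cheap candidate lookup, then confirmation) can be sketched in plain Python without Sphinx. Everything here is an assumption for illustration: the saved searches, the tokenizer, and the rule that a saved search matches when all of its keywords appear in the listing. No stemming is done, so the saved search is written as "t-shirt" rather than "t-shirts".

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into words, keeping '-' inside a word."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_search_index(saved_searches):
    """Reverse index: keyword -> ids of the saved searches that use it."""
    index = defaultdict(set)
    for search_id, query in saved_searches.items():
        for token in tokenize(query):
            index[token].add(search_id)
    return index


def matching_searches(listing, saved_searches, index):
    """Return ids of saved searches whose keywords all occur in the listing."""
    words = set(tokenize(listing))
    # Step 1: any saved search sharing at least one word is a candidate
    # (the equivalent of the quorum/1 or match-any query).
    candidates = set()
    for word in words:
        candidates |= index.get(word, set())
    # Step 2: confirm each candidate really matches the new listing.
    return sorted(
        search_id
        for search_id in candidates
        if set(tokenize(saved_searches[search_id])) <= words
    )


# Hypothetical saved searches, keyed by an id.
saved_searches = {1: "t-shirt", 2: "quicksilver hoodie", 3: "surfboard"}
index = build_search_index(saved_searches)
notify = matching_searches("Quicksilver t-shirt", saved_searches, index)
print(notify)
```

Only the candidates (searches 1 and 2 here) are re-checked, rather than every saved search; search 2 is then rejected because "hoodie" is missing from the listing.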
btw, I think the technical term for these 'reverse' searches is Prospective Search
http://en.wikipedia.org/wiki/Prospective_search
I have within my SOLR index song objects which belong to a higher level album object. An example is shown below:
<song>
<album title>Blood Sugar Sex Magic</album title>
<song title>Under the Bridge</song title>
<description>A sad song about junkies</description>
</song>
What I can do at the moment is create a facet on the album title so that a search on songs will also show me what albums contain hits for that keyword.
The default behaviour for SOLR is that the facets are shown in the order of most hits to least. However what I want to achieve is the facet list to be sorted according to the relevancy of the top hit for that album.
For example a search on the word "sad" may show a facet with one hit for "Blood Sugar Sex Magic" and there may also be an album called "Sad Clown songs" where there are 10 hits. "Sad clown songs" will show as the first facet even though it may be that "Under the bridge" comes up as the most relevant song.
My question is how can I get all the facets back but then have them ordered by the relevancy of the songs within them? If I would need to change or extend some underlying SOLR code what would that be?
Thanks in advance.
Solr can only sort facets in lexicographical order or by count (see the facet.sort parameter).
If you want to implement a different sort order I'd start in the SimpleFacets class.
In the end, we decided the easiest way to do this without needing to modify SOLR source code, would be to query solr, ask for the facets then iterate through the results.
Not ideal, but works for now.
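A minimal sketch of that client-side step in Python, assuming the Solr response has already been parsed into (album title, song title, score) tuples. The tuple layout, the scores, and the two "Sad Clown songs" track names are made up for illustration, and `albums_by_top_hit` is a hypothetical helper.

```python
def albums_by_top_hit(results):
    """Order albums by the score of their best-matching song.

    `results` is an iterable of (album_title, song_title, score) tuples.
    Returns a list of (album_title, hit_count, best_score).
    """
    best = {}
    counts = {}
    for album, _song, score in results:
        counts[album] = counts.get(album, 0) + 1
        if album not in best or score > best[album]:
            best[album] = score
    return sorted(
        ((album, counts[album], best[album]) for album in best),
        key=lambda entry: -entry[2],
    )


# Hypothetical hits for the query "sad".
results = [
    ("Blood Sugar Sex Magic", "Under the Bridge", 9.1),
    ("Sad Clown songs", "Sad Clown", 7.4),
    ("Sad Clown songs", "Sadder Clown", 6.0),
]

facets = albums_by_top_hit(results)
for album, hit_count, best_score in facets:
    print(album, hit_count, best_score)
```

The album with a single but highly relevant hit now comes first, while the hit counts are still available for display. One caveat: this only sees the rows that were actually fetched, so it needs enough results to cover every album of interest.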
You could use Edismax to perform your search query, and use result grouping to group by a specific field, in your case you mentioned Album Title.
https://lucene.apache.org/solr/guide/7_0/result-grouping.html
https://lucene.apache.org/solr/guide/7_0/the-extended-dismax-query-parser.html