How to implement Tag search? - algorithm

I've designed a news hub system which reads RSS links and stores the whole news items in the database. Now I want to implement a search system using tags. Each news item has its own tags. There are lots of algorithms to implement this, but I don't know which is the most common one that gives the best performance. Currently I'm using the Elastic search database and I use multiple keyword search. Which one of these is the best?
1- Store the tags in a list or in a string with a separator and search among them?
2- Work like a relational system: have a table of tags and a table of news tags with one record for each tag of each news item, i.e. 5 records for a news item with 5 tags?
3- Another algorithm which I don't know?

Seems like you want something like the inverted index.
This is an index that, for each term (hashtag in your case), holds a list of the ids of the documents that contain this hashtag.
For example, if you have 3 documents: d1,d2,d3 with the hash tags:
d1: #tag1, #tag2
d2: #tag3
d3: #tag3, #tag2
The inverted index will be:
#tag1: d1
#tag2: d1,d3
#tag3: d2,d3
Using the inverted index, it is fairly easy to find all documents that contain a certain term (hashtag in your case) by simply going over the list that is attached to this term.
This data structure is also very efficient for union (OR queries) and intersection (AND queries).
It is very popular in information retrieval for full-text search and is also often used in semi-structured search.
For more information, you can read about Information Retrieval in general. Manning's Introduction to Information Retrieval presents this data structure in the book's first chapter.
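As a rough illustration of the idea above, here is a minimal in-memory sketch in Python. The function names (build_inverted_index, search_all, search_any) are made up for this example, and a real search engine stores its postings lists far more compactly; the sample data is the d1/d2/d3 example from this answer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each tag to the set of document ids that carry it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index

def search_all(index, tags):
    """AND query: documents that contain every one of the tags."""
    postings = [index.get(tag, set()) for tag in tags]
    return set.intersection(*postings) if postings else set()

def search_any(index, tags):
    """OR query: documents that contain at least one of the tags."""
    return set().union(*(index.get(tag, set()) for tag in tags))

# The three example documents from the answer.
docs = {
    "d1": ["#tag1", "#tag2"],
    "d2": ["#tag3"],
    "d3": ["#tag3", "#tag2"],
}
index = build_inverted_index(docs)

print(sorted(search_all(index, ["#tag2", "#tag3"])))  # intersection of two postings lists
print(sorted(search_any(index, ["#tag1", "#tag3"])))  # union of two postings lists
```

A lookup for a single tag is just index[tag]; AND and OR queries reduce to set intersection and union over the postings lists.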

ElasticSearch will handle that very well and you have multiple ways of implementing that behavior.
What you want is a parent child relationship between a news article (parent) and its tags (children).
Depending on whether you need to update the hashtags after indexing your news articles or not, you could go with storing them in the news article or as separate documents pointing to the news article document as their parent.
See more details here: http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
You mentioned a choice between storing the tags as a list or as a comma-separated string. Go with the list: it is more idiomatic, ElasticSearch handles JSON natively, and a separated string would have to be analyzed and turned into a list of tokens anyway.
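To make the list option concrete, here is a hedged sketch that only builds the request bodies as Python dicts; it does not talk to a cluster. The field names (title, tags), the sample article and the helper tags_query are assumptions for illustration, and the mapping syntax assumes a recent Elasticsearch version where the keyword field type and typeless mappings are available.

```python
import json

# Hypothetical news document: tags stored as a JSON array, not a delimited string.
article = {
    "title": "Example headline",
    "tags": ["politics", "economy", "europe"],
}

# Hypothetical mapping: "keyword" indexes each tag as one exact, un-analyzed term.
mapping = {"mappings": {"properties": {"tags": {"type": "keyword"}}}}

def tags_query(tags, match_all=True):
    """Build a bool query body that filters on tags.

    match_all=True  -> every tag must be present (AND)
    match_all=False -> at least one tag must be present (OR)
    """
    if match_all:
        clauses = [{"term": {"tags": tag}} for tag in tags]
    else:
        clauses = [{"terms": {"tags": list(tags)}}]
    return {"query": {"bool": {"filter": clauses}}}

print(json.dumps(tags_query(["politics", "europe"]), indent=2))
```

Filter clauses do not contribute to scoring, which is usually what you want for a plain "has this tag" lookup.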

Related

Elastic Search and Search Ranking Models

I am new to Elastic Search. I would like to know if the following steps are how people typically use ES to build a search engine.
1- Use Elastic Search to get a list of qualified documents/results based on a user's input.
2- Build and use a search ranking model to sort this list.
3- Use this sorted list as the output of the search engine to the user.
I would probably add a few steps:
1- Think about your information model.
   What kinds of documents are you indexing?
   What are the important fields and what field types are they?
   What fields should be shown in the search result?
   All this becomes part of your mapping.
2- Index documents.
   Are the underlying data changing or can you index it just once?
   How are you detecting new documents/deletes/updates?
   This will be included in your connectors, which can be set up in multiple ways, for example using the Documents API.
3- A bit of trial and error to sort out your ranking model.
   Depending on your use case, the default ranking may be enough.
   Have a look at the Search API to try out different rankings.
4- Use the search result list to present the results to the end user.
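For the trial-and-error step, one common starting point is to adjust field boosts in the query before building a separate ranking model. The sketch below only constructs a multi_match search body as a Python dict; the field names (title, body, url), the default boost and the helper build_search_body are assumptions for illustration, not part of the original answer.

```python
def build_search_body(user_input, title_boost=3.0, size=10):
    """Build a multi_match query that weights title matches above body matches."""
    return {
        "size": size,
        # Only return the fields that the result page needs to display.
        "_source": ["title", "url"],
        "query": {
            "multi_match": {
                "query": user_input,
                # "field^N" multiplies the score contribution of that field by N.
                "fields": [f"title^{title_boost}", "body"],
            }
        },
    }

body = build_search_body("winter holidays")
print(body["query"]["multi_match"]["fields"])
```

Changing title_boost and comparing the top results for a handful of representative queries is a cheap way to see whether the default ranking is already good enough.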

Google Search Appliance sort by metadata content

I'm trying to refine the search results received by my application by including the sort parameter in my HTTP requests. I've combed through the documentation here, but I can't find exactly what I'm looking for.
I'm searching for DOC filetypes, and I am able to sort by date or sort by metadata, as in alphabetizing by title, author, etc. I can also filter by whether or not the title contains certain keywords. What I want to do is to sort by whether or not the title contains certain keywords (these documents appearing first in the results), but to still keep the other results.
For example, with keywords [winter, Christmas, holiday] I could do a descending sort by the sum of inmeta:title~winter, inmeta:title~Christmas, inmeta:title~holiday and the top result might be
Winter holidays other than Christmas
followed by documents with one or two of the keywords, followed by documents that meet the other search parameters but contain no keywords.
Is this possible in GSA?
I finally achieved what I was trying to do, so figured I'd post in case it helps anyone else.
As far as I know, it is impossible to create a query with this capability, but with Google's Custom Search API, you can create a search engine with the desired keywords in the context file (by editing the XML file directly or by adding keywords through the CSE console). Then you can formulate the query as usual, but perform the search on your personalized engine.
https://developers.google.com/custom-search/docs/ranking
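For comparison, the ordering described in the question (most title keyword hits first, everything else kept) can also be approximated by re-sorting the results on the client side. This is a different technique from the custom search engine approach above and only reorders the results that were already fetched. The helper names and sample titles are made up for this sketch.

```python
def keyword_score(title, keywords):
    """Count how many keywords occur in the title (case-insensitive substring match)."""
    lowered = title.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered)

def rerank(titles, keywords):
    """Stable sort: more keyword hits first, original order kept within ties."""
    return sorted(titles, key=lambda t: keyword_score(t, keywords), reverse=True)

# Hypothetical result titles, in the order the search returned them.
titles = [
    "Annual report",
    "Christmas party",
    "Winter holidays other than Christmas",
    "Holiday schedule",
]
keywords = ["winter", "Christmas", "holiday"]

for title in rerank(titles, keywords):
    print(keyword_score(title, keywords), title)
```

Because the sort is stable, documents with the same number of hits keep the relevance order they had in the original result list.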

Good way to exclude records in SOLR or Elasticsearch

For a matchmaking portal, we have a requirement that if a customer has viewed the complete profile details of a bride or groom, that profile must be excluded from the customer's further search results. Currently, along with the other details, we store the ids of the viewers in a comma-separated field on that bride's or groom's record.
E.g., if A viewed B, then in B's record we add A to the field saw_me (comma separated).
While searching, say the id of the currently searching member is 123456; we then fire a query like
Select * from profiledetails where (OTHER CON) AND 123456 not in saw_me;
The problem is that the saw_me field value keeps growing. Is there a better way to handle this requirement? Please guide.
If this is using Solr:
First, DON'T add the 'AND NOT ...' clauses to the main query in the q param; add them to fq. This has several benefits, for example the fq result is cached.
Until the list of values reaches maybe 1000s of entries, this approach is simple and should work fine.
Once the list is huge, it may be time to move to a post filter with a high cost (so that it is evaluated last). The post filter would look up the docs to remove in an external source (redis, db...).
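A small sketch of the first point, building the Solr request parameters in Python. It assumes saw_me is indexed so that each viewer id is a separate term (for example a multi-valued field); the main query string and the helper build_solr_params are hypothetical.

```python
from urllib.parse import urlencode

def build_solr_params(main_query, viewer_id, field="saw_me"):
    """Keep the scoring query in q and put the exclusion in fq."""
    return {
        "q": main_query,
        # Negative filter: drop profiles whose saw_me field contains this viewer.
        "fq": f"-{field}:{viewer_id}",
    }

# Hypothetical main query; 123456 is the searching member's id from the question.
params = build_solr_params("age:[25 TO 30]", 123456)
print(urlencode(params))
```

Keeping the exclusion out of q means it does not affect relevance scoring and can be cached separately from the main query.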
In my opinion, no matter how much the saw_me field grows, it will not make much difference to search time, because tokens are stored in an inverted index and doc_values are built at index time in a column-oriented layout for efficient reads, with caching support from the OS. ES handles these things for you efficiently.

Hierarchical faceting with Lucene/Solr/Elasticsearch where document can have multiple parents

I'm currently evaluating whether to use elasticsearch or solr in a project and moving through the cases that need to be implemented. I found one case on which I couldn't find any documentation which felt a bit strange to me since the case seemed to be quite common to me. The categories are user supplied so I don't know them in advance. Consider the following part of a taxonomy with documents that can have multiple categories:
Root (3)
  Books (2)
    Sci-fi (1)
      DocumentA
    Fantasy (2)
      DocumentA
      DocumentC
  Movies (1)
    Action (1)
      DocumentB
  Games (1)
    Adventure
      DocumentB
In this case DocumentB could be an entry for e.g. Indiana Jones. Normal term hierarchies can be implemented using the path hierarchy tokenizer in solr/elastic, so DocumentC would have 'Root/Books/Fantasy' as category with a path split on '/'.
DocumentB, however, would need to have two paths ('Root/Movies/Action' and 'Root/Games/Adventure'). I thought about dynamically adding one category_n field per path to the document in elastic, each with the path hierarchy tokenizer, and then doing the category search on all the category_* fields, but I don't know if that would be the right approach. In particular, the document count for the facets is not simple, because the count of a parent node is not the sum of its children (documents can be in multiple child categories and should not be counted more than once).
What would be a good way to implement this in solr/elastic?
Cheers
I ended up using ES with a single category field in which I put every path to the node, so for DocumentB both 'Root/Movies/Action' and 'Root/Games/Adventure'. I then used a path hierarchy tokenizer splitting on / for this field.
ES supports putting multiple paths in that field and searching them. I then used an aggregation with bucketing on the categories, which yielded exactly what I wanted: documents are not counted multiple times if they occur more than once in a branch.
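To see why the counts come out right, here is a small Python emulation of the idea, not of ES itself: every path is expanded into its ancestor prefixes (roughly what a path hierarchy tokenizer emits), and each bucket counts distinct documents. The function names are made up for this sketch; the data is the taxonomy from the question.

```python
from collections import defaultdict

def path_prefixes(path, delimiter="/"):
    """Expand 'Root/Books/Fantasy' into ['Root', 'Root/Books', 'Root/Books/Fantasy']."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def facet_counts(docs):
    """Count distinct documents per category node."""
    buckets = defaultdict(set)
    for doc_id, paths in docs.items():
        for path in paths:
            for prefix in path_prefixes(path):
                # A set keeps a document from being counted twice under one node.
                buckets[prefix].add(doc_id)
    return {node: len(ids) for node, ids in buckets.items()}

docs = {
    "DocumentA": ["Root/Books/Sci-fi", "Root/Books/Fantasy"],
    "DocumentB": ["Root/Movies/Action", "Root/Games/Adventure"],
    "DocumentC": ["Root/Books/Fantasy"],
}
counts = facet_counts(docs)

for node in sorted(counts):
    print(node, counts[node])
```

DocumentA sits in two children of Books but is counted once for Books, which is the behaviour the question asks for.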

Multilanguage elastic search

I will be indexing posts in ElasticSearch. For now there are two languages: English and Chinese, so each post has one (English) or two translations plus some data that is common to both languages. My question is: how should I index posts?
1- Create two indices, posts-en and posts-cn, and store posts separately?
2- Create a single index posts and keep the data in a format like this:
{
  commonParam1: 1,
  commonParam2: "somevalue",
  ...
  titleEn: "English title",
  titleCn: "Chinese title",
  contentEn: "Content EN",
  contentCn: "Content CN",
  ...
}
Unless you have a compelling reason to split a single document across two indexes I'd strongly advise keeping it all in one index.
With one index you can easily use a different analyzer for each language-specific field. Adding mappings for new languages in the future is fairly straightforward. It lets you index each document in a single call, as opposed to two (one per language) if you index separately. You also reduce duplicated data (e.g. the common data).
I'd also take a good look at this post: http://gibrown.wordpress.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/
It's a good discussion on analyzing and indexing for multiple languages into Elasticsearch.
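As a sketch of the single-index option, the mapping below gives each language-specific field its own analyzer. It is written as a Python dict that could be sent as the index creation body. The field names come from the question; the field types, the choice of the built-in english and cjk analyzers, and the helper add_language are assumptions, and the syntax assumes a recent Elasticsearch version with the text field type.

```python
import copy

# Hypothetical index body for a single "posts" index.
posts_index = {
    "mappings": {
        "properties": {
            "commonParam1": {"type": "integer"},
            "commonParam2": {"type": "keyword"},
            "titleEn": {"type": "text", "analyzer": "english"},
            "contentEn": {"type": "text", "analyzer": "english"},
            "titleCn": {"type": "text", "analyzer": "cjk"},
            "contentCn": {"type": "text", "analyzer": "cjk"},
        }
    }
}

def add_language(index_body, suffix, analyzer):
    """Return a copy of index_body with title/content fields for one more language."""
    updated = copy.deepcopy(index_body)
    props = updated["mappings"]["properties"]
    for base in ("title", "content"):
        props[f"{base}{suffix}"] = {"type": "text", "analyzer": analyzer}
    return updated

with_german = add_language(posts_index, "De", "german")
print(sorted(with_german["mappings"]["properties"]))
```

Adding a language later only adds two fields to the mapping; the common fields and the existing documents stay as they are.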

Resources