Limiting Data With Lucene.NET

We are using SQL Server 2012 Full-Text indexing; however, we would like to move our database to SQL Azure. The migration tool is telling us that Full-Text indexing is not compatible with SQL Azure (even v12, which is in preview, does not support it, so it doesn't look like they intend to support it).
Because of this we are looking at alternatives, and the best I have found so far is Lucene.NET with AzureDirectory (https://azuredirectory.codeplex.com). This would allow us to store the index in blob storage and cache it locally on the file system of the VMs which host the web sites (also in Azure).
The issue we have is that the data we intend to index consists of items such as news stories which are not visible to all users, because of a publishing model we have that limits items to be visible to just a subset of users. With Full-Text indexing, when searching for a news story we can limit the results for the searching user with a simple join on what is visible to them; with Lucene we will not be able to do this.
The idea we have come up with is to store each news story in the index with a collection of UserIds that are allowed to view it. I am afraid I am very new to Lucene and cannot work out the best way to do this. We are adding a news story to the index like so:
document.Add(new Field("Title",
news.Title,
Field.Store.YES,
Field.Index.ANALYZED,
Field.TermVector.NO));
document.Add(new Field("Content",
news.Content,
Field.Store.YES,
Field.Index.ANALYZED,
Field.TermVector.NO));
However, if we have a collection of userIds defined as IEnumerable<int>, how could we add these to the news story index and then search on them effectively for a given user id? Additionally, what will the performance hit be if we are adding 100s or 1000s of UserIds to a Lucene document? Is there a better way to go than down this road, as this might be a terrible idea (it probably is a terrible idea)?

I also ran into that problem while migrating to Azure and ended up with that same permissions model. Since your userIds are integers and won't contain special characters, you can rely on many of the Lucene.NET analyzers, such as StandardAnalyzer and WhitespaceAnalyzer, to split a list of IDs into terms as long as you input a single string. Just separate each ID with a space or comma, depending on what the analyzer splits on.
You should be able to do something simple like this to index the IDs...
IEnumerable<int> userIds = new int[] { 123, 456, 789 };
document.Add(new Field("AllowedUserIDs",
                       String.Join(" ", userIds),
                       Field.Store.NO,
                       Field.Index.ANALYZED_NO_NORMS));
Then just make sure to query with a TermQuery to match whole terms (IDs). Something like...
int currentUserID = 123;
string queryString = "airplane";
BooleanQuery query = new BooleanQuery();
// The current user must be in the allowed list...
query.Add(new TermQuery(new Term("AllowedUserIDs", currentUserID.ToString())), Occur.MUST);
// ...and at least one of the content clauses must also match. Without this,
// the SHOULD clauses only affect scoring once a MUST clause is present,
// so every visible document would be returned.
query.Add(new TermQuery(new Term("Title", queryString)), Occur.SHOULD);
query.Add(new TermQuery(new Term("Content", queryString)), Occur.SHOULD);
query.MinimumNumberShouldMatch = 1;
I can't speak very specifically to the performance concerns, but we have a few hundred IDs in our lists and it hasn't seemed to impact query times since we added it. Really, it's no different from searching a news article of a few hundred or a few thousand words.

Related

Good way to exclude records in SOLR or Elasticsearch

For a matchmaking portal, we have one requirement wherein, if a customer has viewed the complete profile details of a bride or groom, we have to exclude that profile from further search results. Currently, along with other details, we store the viewed profile IDs in a comma-separated field against that bride's or groom's record.
E.g., if A viewed B, then in B's record, under the field saw_me, we add A (comma separated).
While searching, say the currently searching member's ID is 123456; then we fire a query like
Select * from profiledetails where (OTHER CON) AND 123456 not in saw_me;
The problem here is that the saw_me field value is growing without bound. Is there a better way to handle this requirement? Please advise.
If this is using Solr:
First, DON'T add the 'AND NOT ...' clauses along with the main query in the q param; add them to fq instead. This has many benefits (the fq will be cached); see the sketch below.
Until you get to a list of maybe 1000s of values, this approach is simple and should work fine.
After you reach the point where the list is huge, it may be time to move to a post filter with a high cost (so it is evaluated last). That would look up the docs to remove in an external source (Redis, a DB, ...).
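For illustration, a minimal .NET sketch of such a request (the Solr URL, core name, and main query here are hypothetical; the point is that the per-user exclusion goes in fq, which Solr caches, rather than in q):
using System;
using System.Net.Http;
using System.Threading.Tasks;

static async Task<string> SearchProfilesAsync(int currentUserId)
{
    // Main relevance query stays in q; the exclusion becomes a cached filter query.
    var url = "http://localhost:8983/solr/profiles/select"
            + "?q=" + Uri.EscapeDataString("gender:bride AND city:chennai")
            + "&fq=" + Uri.EscapeDataString("-saw_me:" + currentUserId)
            + "&wt=json";
    using (var http = new HttpClient())
    {
        return await http.GetStringAsync(url);
    }
}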
In my opinion, no matter how much the saw_me field grows, it will not make much difference in search time, because tokens are stored in an inverted index, and doc_values are created at index time in a column-major fashion for efficient reads, with caching support from the OS. ES handles these things for you efficiently.

Sorting by a non-key (arbitrary) field in CouchDB

I have a fairly large CouchDB database (approximately 3 million documents). I have various view functions returning slices of the data that can't be modified (or at least, should only be modified as a last resort).
I need the ability to sort on an arbitrary field for reporting purposes. For smaller DBs, I return the entire object, json_parse it in our PHP backend, then sort there. However, we're often getting Out Of Memory errors when doing this on our largest DBs.
After some research, I'm leaning towards accessing a sort key (via URL parameter) in a list function and doing the sort there. This is an idea I've stolen from here. Excerpt:
function(head, req) {
    var row;
    var rows = [];
    // Drain every row of the view into memory
    while ((row = getRow())) {
        rows.push(row);
    }
    // Sort descending by the emitted value
    rows.sort(function(a, b) {
        return b.value - a.value;
    });
    send(JSON.stringify({"rows": rows}));
}
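To sort on an arbitrary field named in the URL, the same idea extends to something like this (a sketch, assuming the view emits the whole document as the row value and the list is called as .../_list/sorted/myview?sort=fieldName, where the design doc and view names are hypothetical):
function(head, req) {
    // The sort field arrives as a query-string parameter
    var sortField = req.query.sort;
    var row;
    var rows = [];
    while ((row = getRow())) {
        rows.push(row);
    }
    rows.sort(function(a, b) {
        if (a.value[sortField] < b.value[sortField]) return -1;
        if (a.value[sortField] > b.value[sortField]) return 1;
        return 0;
    });
    send(JSON.stringify({"rows": rows}));
}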
It seems to be working for smaller DBs, but it still needs a lot of work to be production ready.
Is this:
a) a good solution?
b) going to work with 3, 5, or 10 million rows?
You can't avoid loading everything into memory by using a list function. So with enough data, eventually, you'll get an out-of-memory error, just as you're getting with PHP.
If you can live within the memory constraints, it's a reasonable solution, with some advantages.
Otherwise, investigate using something like Lucene, Elasticsearch, or Cloudant Search (Clouseau & Dreyfus).
In our environment we have more than 5 million records. The CouchDB is designed such that each and every document has some specific fields which distinguish it from the other categories of documents.
For example, there are a number of documents with the field DocumentType "User" or DocumentType "XXX".
This DocumentType field allows us to separate documents into different categories.
So if you have 3 million docs and around 10 categories, each category will have about 300k docs.
Now you can design the system such that you always pass Couch the DocIds you need. That way it will be faster.
So the view can be like
function(doc) {
    // Only index the category of documents we care about
    if (doc.DocumentType === 'XXX' && doc._id) {
        emit(doc.FieldYouWant, doc._id);
    }
}
This is how our backend is designed in production.
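A hedged sketch of querying such a view from .NET over CouchDB's HTTP API (the database, design document, and view names are all hypothetical):
using System.Net.Http;
using System.Threading.Tasks;

static async Task<string> GetXxxDocsAsync()
{
    // Rows come back sorted by the emitted key (doc.FieldYouWant);
    // include_docs=true fetches the full documents alongside the ids.
    var url = "http://localhost:5984/mydb/_design/docs/_view/xxx_by_field"
            + "?include_docs=true&limit=100";
    using (var http = new HttpClient())
    {
        return await http.GetStringAsync(url);
    }
}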

Elasticsearch with multiple search criteria

I am trying to build a full-text search engine using Elasticsearch. We have an application which has conferences running across the globe, with data on both future and past conferences. For a POC we have already loaded the conference details into Elasticsearch; each document contains fields like title, date, venue, and geo_location of the venue.
I am able to do a simple search using a match-all query. Also, using function_score I can get the currently ongoing conferences, and using the user's geo location I can get conferences near the user's location.
But there are some use cases where I got stuck and could not proceed. The use cases are:
1) If the user searches with "title + location", then I should not use the user's current geo location; rather, whatever city_name the user provided, use that place's geo location and retrieve those docs. I know some programming is also required here.
2) The user searches with "title + year", e.g. cardio 2014. The user is interested to see all the cardiology conferences of 2014, and it should retrieve only that year's documents. But using function score it is retrieving the current year's documents.
First of all, let me know whether the above two use cases can be handled in a single query. I am thinking to handle it in one request, but got stuck.
A proper solution would require you to write your own query parser in your application (outside of Elasticsearch) that parses the query and extracts dates, locations, etc. Once all features are extracted, the parser should generate a bool query where each feature becomes an appropriate must clause. So the date would become a range query, the location a geo query (e.g. geo_distance), and everything else would go into a match query for full-text matching. Then this query can be sent to Elasticsearch.
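A minimal sketch of the generated query via NEST (syntax assumes a recent NEST version; the index name, POCO, and the parsed values for "cardio 2014" near a given city are all illustrative):
using System;
using Nest;

public class Conference
{
    public string Title { get; set; }
    public DateTime Date { get; set; }
    public GeoLocation GeoLocation { get; set; }
}

var client = new ElasticClient();
var response = client.Search<Conference>(s => s
    .Index("conferences")
    .Query(q => q
        .Bool(b => b
            .Must(
                // free text extracted by the parser
                m => m.Match(mm => mm.Field(f => f.Title).Query("cardio")),
                // "2014" extracted as a year becomes a date range
                m => m.DateRange(r => r.Field(f => f.Date)
                    .GreaterThanOrEquals(new DateTime(2014, 1, 1))
                    .LessThan(new DateTime(2015, 1, 1))),
                // a recognised city name becomes a geo_distance clause,
                // using coordinates looked up from the city, not the user
                m => m.GeoDistance(g => g.Field(f => f.GeoLocation)
                    .Distance("100km")
                    .Location(new GeoLocation(51.5, -0.12)))))));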

Elastic Search: Modelling data containing variable fields

I need to store data that can be represented in JSON as follows:
Article {
    "Id": 1,
    "Category": "History",
    "Title": "War stories",
    // Comments could be pretty long and also be changed frequently
    "Comments": "Nice narration, Reminds me of the difficult Times, Tough Decisions",
    // Might change frequently
    "Tags": "truth, reality, history",
    "UserSpecifiedNotes": [
        // The array may contain different users for different articles
        {
            "userid": 20,
            "note": "Good for work"
        },
        {
            "userid": 22,
            "note": "Homework is due for work"
        }
    ]
}
After having gone through different articles, denormalization of the data is one way to handle this. But since the common fields could be pretty long and may change frequently, I would like not to repeat them. What could be other, better ways to represent and search this data? Parent-child? Inner objects?
Currently I would be dealing with a lot of inserts and updates and few searches, but whenever a search is done it has to be very fast. I am using NEST (the .NET client) for Elasticsearch. The search query is expected to work as follows:
Input: searchString and a userID
Behavior: return the articles containing searchString in either the Title, Comments, Tags, or the note for the given userID, sorted in order of relevance
In a normal scenario the main content of the article will change very rarely, whereas the "UserSpecifiedNotes"/comments against an article will be generated/added much more frequently. This is an ideal use case for a parent-child relation.
With an inner object you still have to reindex the whole main article and its "UserSpecifiedNotes"/comments every time a new note comes in. With a parent-child relation you will just be adding a new note.
With the details you have specified, you can take the approach of 4 indices:
Main Article (id, category, title, description etc)
Comments (commented by, comment text etc)
Tags (tags, any other meta tag)
UserSpecifiedNotes (userId, notes)
Having said that, what needs to be kept in mind is your actual requirement. A parent-child relation will need more memory and may slow down search performance a tiny bit, but indexing will be faster.
On the other hand, a nested object will increase your indexing time significantly, as you need to collect all the data related to an article before indexing. You can of course store everything and just add to it as an update. For simpler maintenance and ease of implementation, I would suggest using parent-child.
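For illustration, a sketch of such a parent-child setup with NEST (this uses the join-field style of newer Elasticsearch versions; older versions used a _parent mapping instead, and every name here is hypothetical):
using Nest;

public class ArticleDoc
{
    public string Title { get; set; }
    public int? UserId { get; set; }
    public string Note { get; set; }
    public JoinField Relation { get; set; } // "article" (parent) or "note" (child)
}

var client = new ElasticClient();

// One index whose join field relates articles to notes
client.Indices.Create("articles", c => c
    .Map<ArticleDoc>(m => m
        .AutoMap()
        .Properties(p => p
            .Join(j => j
                .Name(d => d.Relation)
                .Relations(r => r.Join("article", "note"))))));

// Index the parent article once
client.Index(new ArticleDoc { Title = "War stories", Relation = JoinField.Root("article") },
    i => i.Index("articles").Id("1"));

// Adding a note touches only the child document; it must be routed to the parent's shard
client.Index(new ArticleDoc { UserId = 20, Note = "Good for work", Relation = JoinField.Link("note", "1") },
    i => i.Index("articles").Id("note-1").Routing("1"));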

How to implement Tag search?

I've designed a news hub system which reads RSS links and stores whole news items in the database. Now I want to implement a search system using tags. Each news item has its own tags. There are lots of ways to implement this, but I don't know which is the most common and has the best performance. Currently I'm using an Elasticsearch database and multiple-keyword search. Which one of these is best?
1- Store the tags in a list or in a string with a separator and search among them?
2- Work like a relational system and have a table of tags, plus a news-tags table with one record for each tag of each news item (so 5 records for the 5 tags of one news item)?
3- Another algorithm which I don't know?
Seems like you want something like an inverted index.
This is an index that, for each term (hashtag in your case), holds a list of the document ids which contain that hashtag.
For example, if you have 3 documents d1, d2, d3 with the hashtags:
d1: #tag1, #tag2
d2: #tag3
d3: #tag3, #tag2
The inverted index will be:
#tag1: d1
#tag2: d1,d3
#tag3: d2,d3
It is fairly easy, using the inverted index, to find all documents that contain a certain term (hashtag in your case), by simply going over the list that is attached to this term.
This data structure is also very efficient for union (OR queries) and intersection (AND queries).
This DS is very popular in information retrieval for full-text search and is also often used in semi-structured search.
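As a toy C# illustration of the structure and of those operations (Elasticsearch maintains the real thing for you under the hood):
using System.Collections.Generic;

// term (hashtag) -> set of ids of documents containing it
var index = new Dictionary<string, HashSet<string>>();

void Add(string docId, params string[] tags)
{
    foreach (var tag in tags)
    {
        if (!index.TryGetValue(tag, out var docs))
            index[tag] = docs = new HashSet<string>();
        docs.Add(docId);
    }
}

Add("d1", "#tag1", "#tag2");
Add("d2", "#tag3");
Add("d3", "#tag3", "#tag2");

// Lookup: documents containing #tag2 -> { d1, d3 }
var tag2Docs = index["#tag2"];

// Intersection (AND query): #tag2 AND #tag3 -> { d3 }
var both = new HashSet<string>(index["#tag2"]);
both.IntersectWith(index["#tag3"]);

// Union (OR query): #tag1 OR #tag3 -> { d1, d2, d3 }
var either = new HashSet<string>(index["#tag1"]);
either.UnionWith(index["#tag3"]);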
For more information, you can read about information retrieval in general; Manning's Introduction to Information Retrieval presents this data structure in the book's first chapter.
ElasticSearch will handle that very well and you have multiple ways of implementing that behavior.
What you want is a parent child relationship between a news article (parent) and its tags (children).
Depending on whether you need to update the hashtags after indexing your news articles or not, you could either store them inside the news article document or as separate documents pointing to the news article document as their parent.
See more details here: http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
You mentioned a choice between storing the tags as a list or as a comma-separated string. Go with the list, as that is more idiomatic and Elasticsearch handles JSON arrays natively (the string would be analyzed and turned into a list of tokens anyway).
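A small sketch of the list approach with NEST (the index, POCO, and field names are illustrative; this assumes a recent NEST version and that single-word lowercase tags survive the default analyzer intact):
using System.Collections.Generic;
using Nest;

public class NewsArticle
{
    public string Title { get; set; }
    public List<string> Tags { get; set; }
}

var client = new ElasticClient();

// Tags are indexed as a JSON array, no separator handling needed
client.Index(new NewsArticle
{
    Title = "Some headline",
    Tags = new List<string> { "truth", "reality", "history" }
}, i => i.Index("news"));

// Find articles carrying any of the given tags
var result = client.Search<NewsArticle>(s => s
    .Index("news")
    .Query(q => q.Terms(t => t.Field(f => f.Tags).Terms("truth", "history"))));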
