Fuzzy Arabic Search In Index - elasticsearch

I am trying to use Elasticsearch's fuzzy search feature with Arabic search queries.
More details about it are here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness
Unfortunately, I get mixed results.
Sometimes I do get relevant results that contain minor errors (in those cases almost all results are relevant), results which would not be returned without the fuzzy logic.
But for poor queries, which normally return only a few results (fewer than 10), I get hundreds of irrelevant ones.
Does anyone know how I should treat those queries, so that when there is a lot of noise it gets eliminated, and when there are many relevant results they are all present? How should I tune the fuzziness so that it isn't harmful?
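Not a full answer, but a minimal sketch of the knobs usually involved, assuming a hypothetical index called `articles` with an Arabic-analyzed `body` field (both names, the example query term, and every numeric value below are placeholders): `fuzziness`, `prefix_length`, and `max_expansions` control how far each term is allowed to expand, while `minimum_should_match` and `min_score` cut off the noisy tail.

```python
# Hedged sketch: constraining fuzzy expansion on a match query.
# Index/field names and all thresholds are placeholders, not a recommendation.
# (elasticsearch-py 7.x style; newer clients take the query as keyword arguments.)
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "query": {
        "match": {
            "body": {
                "query": "كتاب",               # the (possibly misspelled) user query
                "fuzziness": "AUTO",            # 0 edits for short terms, 1-2 for longer ones
                "prefix_length": 1,             # require the first character to match exactly
                "max_expansions": 20,           # cap how many fuzzy variants each term expands to
                "minimum_should_match": "75%"   # drop hits matching too few of the query terms
            }
        }
    },
    "min_score": 2.0  # optional: discard very low-scoring (noisy) hits; value needs tuning
}

response = es.search(index="articles", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("body", "")[:80])
```

The general idea is to keep `fuzziness` at `AUTO` rather than a fixed 2, require an exact prefix so short terms don't expand wildly, and rely on `minimum_should_match`/`min_score` so that queries with many genuinely relevant hits keep them while single-term noise gets dropped. All of the numeric values need evaluation against your own data.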

Related

Sort results by relevance without filtering in Algolia

Is there a way to sort results in Algolia by relevance instead of filtering them? In our case we have quite a few important attributes, but we only have around 700 products, so many times a search using facets ends up with few or no results.
To avoid this, we are looking for a solution to reorder the list by relevance, showing the best results on top while still allowing users to see the other, less relevant results. Basically not filtering products out, but just reordering them by relevance based on a combination of attributes we set.
Thanks
When setting filters leads you to few or no results, and you'd like to avoid that by still showing less relevant results, two solutions come to mind:
Use optionalFilters instead of filters. You get the same behavior as with filtering, but the Algolia API also returns results that don't match the filters and ranks them lower. This is the ideal solution, as it takes a single API round trip.
Perform a second search without filters when the first search returns fewer records than a threshold of your choice. This is a more manual approach and takes up to two API calls.
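A hedged sketch of both options using the Algolia Python client (v2-style API); the application ID, API key, index name, attribute values, and threshold are all placeholders:

```python
# Hypothetical sketch of both options with the Algolia Python client.
# App ID, API key, index name, and attribute values are placeholders.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_SEARCH_API_KEY")
index = client.init_index("products")

# Option 1: optionalFilters -- matching records rank higher, non-matching ones
# are still returned (just lower), so the result list is never emptied.
results = index.search("running shoes", {
    "optionalFilters": ["brand:Acme", "color:red"]
})

# Option 2: hard filters first, then retry without them if too few hits come back.
THRESHOLD = 5
results = index.search("running shoes", {"filters": "brand:Acme AND color:red"})
if results["nbHits"] < THRESHOLD:
    results = index.search("running shoes")  # second, unfiltered round trip
```

Option 1 keeps everything in one round trip, which is why it is the preferred route; option 2 only pays for the second call when the filtered search comes back too thin.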

Elastic query for large text document match

If I have an elastic index of news articles, with the news body text in a newsBody field, can you do a search to see if another newsBody 'matches' one in the index? The other newsBody text may have slight variations, however.
So not exact matching, but being able to test for similarity between large bodies of text. This is important as often news articles will be nearly identical but differ in ~30 out of 400 words.
So I'd like to be able to pass in a newsBody, and query it against the whole index, looking for similarity to any 'matches'.
I think the similarity module may help, but haven't got anywhere yet: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
Thanks,
Daniel
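The similarity module the question links to governs how matches are scored, not how to submit a whole document as a query. For the "find near-duplicate articles" part, one commonly used tool (a different technique than the similarity module, and only a hedged sketch here) is the `more_like_this` query, which analyzes the supplied text, keeps its most distinctive terms, and scores indexed documents by how many of those terms they share. The index name and thresholds below are assumptions:

```python
# Hedged sketch: find indexed articles whose newsBody is similar to a candidate text.
# Index name and threshold values are assumptions and would need tuning.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

candidate_text = "...full body of the incoming news article..."

query = {
    "query": {
        "more_like_this": {
            "fields": ["newsBody"],
            "like": candidate_text,           # raw text; it does not need to be indexed first
            "min_term_freq": 1,
            "max_query_terms": 100,           # how many distinctive terms to sample from the text
            "minimum_should_match": "70%"     # how much term overlap counts as 'similar'
        }
    }
}

response = es.search(index="news", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```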

MongoDB text index search slow for common words in large table

I am hosting a MongoDB database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
It seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask MongoDB to evaluate those last (lazily), on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have the query "products chocolate", where "products" is common and "chocolate" is uncommon. I would like to be able to ask MongoDB to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common-term filter on the application side after it has received the records found by the db (a sketch of this is below). It is preferable for all query logic to happen on the database, but I am open to application-side processing for a speed payoff.
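A hedged sketch of that application-side workaround, assuming a hypothetical catalog collection, a cached common-word list, and `title`/`description` as the indexed fields (all placeholders):

```python
# Hedged sketch of the workaround described above: query MongoDB with only the
# uncommon terms, then re-apply the common terms on the application side.
# Collection name, field names, and the common-word list are placeholders.
from pymongo import MongoClient

COMMON_WORDS = {"products", "item", "new"}   # e.g. the cached 50 most frequent index terms

client = MongoClient("mongodb://localhost:27017")
coll = client["shop"]["catalog"]

def search(raw_query: str, limit: int = 50):
    terms = raw_query.lower().split()
    uncommon = [t for t in terms if t not in COMMON_WORDS]
    common = [t for t in terms if t in COMMON_WORDS]

    # If every term is common there is nothing to pre-filter on; fall back to the full query.
    text_query = " ".join(uncommon) if uncommon else raw_query

    cursor = (coll.find({"$text": {"$search": text_query}},
                        {"score": {"$meta": "textScore"}, "title": 1, "description": 1})
                  .sort([("score", {"$meta": "textScore"})])
                  .limit(limit * 10))   # over-fetch, since app-side filtering discards rows

    # Re-apply the common terms in application code.
    hits = []
    for doc in cursor:
        haystack = (doc.get("title", "") + " " + doc.get("description", "")).lower()
        if all(t in haystack for t in common):
            hits.append(doc)
            if len(hits) >= limit:
                break
    return hits
```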
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all of them. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical collections, each with my 6.8M records, with different indexes - one for common words and one for uncommon words. This feels kludgy and clunky, but I am willing to do it for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system? I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record collection is not the largest that MongoDB has seen. Thanks!
Well, I worked around these performance issues by allowing MongoDB full text search to search in an OR-based format. I'm prioritizing my results by fine-tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem, because my weighted results that appear at the top will most likely be consumed before my user gets to the less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND-only searching, just switch back to OR and control your results using weights. It performs leaps and bounds better.
hth
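A hedged sketch of what that looks like in practice - a weighted text index plus the default OR-style `$text` search ordered by `textScore`. Field names and weights are placeholders:

```python
# Hedged sketch of the answer above: a weighted text index plus the default OR-style
# $text search, ordered by textScore. Field names and weights are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["shop"]["catalog"]

# Weighted text index: matches in 'title' count 10x more than matches in 'description'.
coll.create_index(
    [("title", "text"), ("description", "text")],
    weights={"title": 10, "description": 1},
    name="catalog_text",
)

# Unquoted terms are OR'ed by default, which avoids the slow AND-style intersection;
# relevance ordering pushes documents matching more (and heavier) terms to the top.
cursor = (coll.find({"$text": {"$search": "products chocolate"}},
                    {"score": {"$meta": "textScore"}})
              .sort([("score", {"$meta": "textScore"})])
              .limit(20))
```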
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe you're seeing the same issue here, which is why the OR (a.k.a. IN) approach works for you.

Applications of semantic web/ontology in information retrieval?

What are the uses of the Semantic Web in information retrieval? By Semantic Web here I mean structured sources like DBpedia and Freebase.
I have integrated information in RDF with Lucene in several projects, and I think a lot of the value you can get from the integration is that you can go beyond the simple keyword search that Lucene would normally enable. That opens up possibilities for full-text search over your RDF information, but also semantically enriched full-text search.
In the former case, there is no 'like' operator in SPARQL, and the regex function, while similarly capable to SQL's LIKE, is not really tractable to evaluate against a dataset of any appreciable size. However, if you're able to use Lucene to do the search instead of relying on regex, you can get better scale and performance out of a single keyword search over your RDF.
In the latter case, if the query engine is integrated with the Lucene text/RDF index - think LARQ (both Jena and Stardog support this) - you can do far more complex semantic searches over your full-text index. Queries like 'get all the genres of movies where there are at least 10 reviews and the reviews contain the phrase "two thumbs up"'. That's difficult to swing with a Lucene index alone, but becomes quite trivial in the intersection of Lucene & SPARQL.
You can use DBpedia in Information Retrieval, since it has the structured information from Wikipedia.
Wikipedia has knowledge of almost every topic of interest in the form of articles, categories, and info-boxes, and that structure is used in information retrieval systems to extract meaningful information in the form of triples, i.e. Subject, Predicate & Object.
You can query the information via SPARQL using DBpedia's public SPARQL endpoint.
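A hedged sketch of doing that from Python with the SPARQLWrapper library against the public DBpedia endpoint; the particular query (listing a few films and their labels) is only an illustration:

```python
# Hedged sketch: pulling structured triples from DBpedia over its public SPARQL
# endpoint using the SPARQLWrapper library. The query itself is just an illustration.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?film ?label WHERE {
        ?film a dbo:Film ;
              rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["film"]["value"], "-", row["label"]["value"])
```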

Lucene Standard Analyzer vs Snowball

Just getting started with Lucene.Net. I indexed 100,000 rows using the standard analyzer, ran some test queries, and noticed plural queries don't return results if the original term was singular. I understand the Snowball analyzer adds stemming support, which sounds nice. However, I'm wondering if there are any drawbacks to going with Snowball over standard? Am I losing anything by going with it? Are there any other analyzers out there to consider?
Yes, by using a stemmer such as Snowball, you are losing information about the original form of your text. Sometimes this will be useful, sometimes not.
For example, Snowball will stem "organization" into "organ", so a search for "organization" will return results with "organ", without any scoring penalty.
Whether or not this is appropriate to you depends on your content, and on the type of queries you are supporting (for example, are the searches very basic, or are users very sophisticated and using your search to accurately filter down the results). You may also want to look into less aggressive stemmers, such as KStem.
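A quick, hedged way to see how aggressive Snowball is on your own vocabulary before committing to the analyzer is to run terms through NLTK's implementation of the same stemmer family (this is Python/NLTK rather than Lucene.Net, purely for illustration):

```python
# Quick illustration of how aggressive Snowball stemming is, using NLTK's
# implementation of the same algorithm family (not Lucene itself).
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

for word in ["organization", "organs", "organize", "running", "runs", "ran"]:
    print(f"{word:>14} -> {stemmer.stem(word)}")

# "organization", "organs", and "organize" all collapse to "organ", so a search
# for one will match documents containing the others, with no scoring penalty.
```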
The Snowball analyzer will increase your recall, because it is much more aggressive than the standard analyzer. So you need to evaluate your search results to see whether, for your data, you need to increase recall or precision.
I just finished an analyzer that performs lemmatization. That's similar to stemming, except that it uses context to determine a word's type (noun, verb, etc.) and uses that information to derive the stem. It also keeps the original form of the word in the index. Maybe my library can be of use to you. It requires Lucene Java, though, and I'm not aware of any C#/.NET lemmatizers.
