I'm implementing end-user search for an e-commerce website. The catalogue contains images, text, and prices for each item. LLMs are all the hype at the moment, but I'm not sure how well proven their performance is compared with keyword-based search for e-commerce.
I've tried tensor-based search and it appears to perform well, but it's hard to benchmark search against relevance, so I'm not sure about putting it into production.
What frameworks are people using to decide when to use tensor/vector-based search vs. keyword-based search?
LibShortText is an open source tool for short-text classification and analysis.
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/
I have tried to figure out whether it also works with languages other than English (e.g. German), but I didn't find a hint.
Who knows the answer? Thank you in advance.
I think so (but it may need some extra preprocessing). LIBSVM and LIBLINEAR are both language-agnostic. Since LibShortText is built on top of LIBLINEAR, it should work for all languages too.
According to this paper, it has internal pre-processing methods to extract features.
libshorttext.converter: For given short texts, LibShortText follows
the bag-of-word model to generate features. Users apply procedures in
this library to pre-process short texts by tokenization, stemming
(optional), and stop-word removal (optional). The library also allows
users to choose between unigram and bigram features.
However, it looks like its stemming and stop-word removal only support English. So if you want better features extracted from non-English text, you might want to use your own pre-processing methods, for example with nltk.
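As a rough illustration of what "your own pre-processing" could look like, here is a minimal sketch of bag-of-words feature extraction with optional stop-word removal, optional stemming, and a unigram/bigram switch. It is not LibShortText's converter: the function names, the tiny German stop-word set, and the `stem` hook are all assumptions made for this example. In practice you would pass in a real stop-word list and a real stemmer (for instance from nltk).

```python
import re
from collections import Counter

# Tiny illustrative German stop-word set (an assumption for this sketch;
# a real list, e.g. from nltk, would be much longer).
GERMAN_STOP_WORDS = {"der", "die", "das", "und", "ist", "ein", "eine", "für"}


def tokenize(text):
    """Lowercase the text and split it on non-word characters (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def extract_features(text, stop_words=frozenset(), stem=None, use_bigrams=False):
    """Return bag-of-words counts for one short text.

    stop_words: tokens to drop (optional).
    stem: optional callable mapping a token to its stem (hypothetical hook).
    use_bigrams: if True, also count pairs of adjacent remaining tokens.
    """
    tokens = [t for t in tokenize(text) if t not in stop_words]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    features = Counter(tokens)
    if use_bigrams:
        # Bigrams are built after stop-word removal, so they can span removed words.
        features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features


features = extract_features(
    "Das ist ein schönes Kleid", GERMAN_STOP_WORDS, use_bigrams=True
)
print(features)
```

The resulting counts would still have to be written out in whatever input format your classifier expects.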
I am currently trying to figure out analysis schemes for my ElasticSearch cluster. I am using ES to index pdf, word, powerpoint and excel documents. I am using Apache Tika to extract the text.
My problem is that I do not know beforehand what languages to expect in the file contents. They could be written in any language.
My question is, is there a way to make ES analyze text regardless of the language? Or should I have a pre-defined field for each language with its own tokenizer, analyzer and stopwords?
I suggest taking a look at the ElasticSearch plugin elasticsearch-mapper-attachments. I used it to build document search functionality.
When it comes to supporting multiple languages, we have had the best experience with one index per language. If you can identify the language before indexing you can insert the document into the appropriate index. This makes it easier to add new languages vs. a field per language approach.
One thing to remember is the "Don't use Types for Languages" note at the bottom of the "One Language per Document" page. Doing that can mess up search in a way that is very difficult to debug.
If you need to detect the language, there are two options mentioned at the bottom of the "Pitfalls of Mixing Languages" page.
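The index-per-language routing described above can be sketched as follows. This is only an illustration: the stop-word based detector is a deliberately naive stand-in for a real language-detection library, and the `documents-<lang>` index naming scheme, the function names, and the word lists are assumptions, not anything Elasticsearch prescribes.

```python
import re

# Tiny stop-word samples per language. Illustrative only; a real language
# detector would be far more robust than this heuristic.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "des"},
}


def detect_language(text, default="unknown"):
    """Guess the language by counting distinct stop words found in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(tokens & words) for lang, words in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def index_for(text, prefix="documents"):
    """Pick the target index name (hypothetical naming scheme) for a document."""
    return f"{prefix}-{detect_language(text)}"


print(index_for("Der Hund und die Katze ist nicht hier"))
print(index_for("The price of the item is low"))
```

Documents whose language cannot be determined land in a fallback index here, which you could analyze with a generic analyzer.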
Let's say I have a big corpus (for example in English or an arbitrary language), and I want to perform some semantic search on it.
For example I have the query:
"Be careful: [art] armada of [sg] is coming to [do sg]!"
And the corpus contains the following sentence:
"Be careful: an armada of alien ships is coming to destroy our planet!"
It can be seen that my query string could contain "semantic placeholders", such as:
[art] - some placeholder for articles (for example a / an in English)
[sg], [do sg] - some placeholders for NPs and VPs (subjects and predicates)
I would like to develop a library capable of handling these queries efficiently.
I suspect that some kind of POS tagging would be necessary for parsing the text, but I don't want to reimplement an existing full-text search engine to make it work, so I'm wondering how I could integrate this behaviour into a search engine like Lucene.
I know there are SpanQueries, which could behave similarly in some cases, but as far as I can see, Lucene doesn't do any semantic processing of stored texts.
Is it possible to implement behaviour like this? Or do I have to write my own search engine?
With Lucene, you could add additional tokens to a single item in a TokenStream, but I wouldn't know how to deal with tags that span more than one word.
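To make the idea concrete, here is a small sketch in plain Python (not Lucene code) of what stacking an extra token on the same position buys you. Each position holds the word itself plus any tag for it, so a phrase query can match either one. The tag lexicon, the function names, and the restriction to single-word placeholders are assumptions for this example; placeholders that span several words, such as [do sg], are not handled, which is exactly the open problem mentioned above.

```python
import re

# Hypothetical tag lexicon: maps a word to the placeholder stored beside it.
TAGS = {"a": "[art]", "an": "[art]", "the": "[art]"}


def analyze(text):
    """Return one set of terms per token position: the word plus its tag, if any."""
    positions = []
    for word in re.findall(r"\w+", text.lower()):
        terms = {word}
        if word in TAGS:
            terms.add(TAGS[word])
        positions.append(terms)
    return positions


def parse_query(query):
    """Split a query into words and single-word placeholders like [art]."""
    return re.findall(r"\[\w+\]|\w+", query.lower())


def phrase_match(query, text):
    """True if the query terms match consecutive positions somewhere in the text."""
    q = parse_query(query)
    doc = analyze(text)
    return any(
        all(q[j] in doc[i + j] for j in range(len(q)))
        for i in range(len(doc) - len(q) + 1)
    )


print(phrase_match("[art] armada of", "Be careful: an armada of alien ships"))
```

In Lucene terms, the extra term in each set would correspond to a token emitted with a position increment of zero.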
Can you suggest a lightweight fuzzy text search library?
What I want to do is allow users to find the correct data for search terms that contain typos.
I could use a full-text search engine like Lucene, but I think it's overkill.
Edit:
To make the question clearer, here is the main scenario for that library:
I have a large list of strings. I want to be able to search this list (something like MSVS' IntelliSense), but it should be possible to filter the list by a string that is not present in it but is close enough to some string that is.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
The main language of the indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable, which means it's good for small applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. With information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
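To give an idea of the n-gram technique mentioned above, here is a minimal in-memory sketch in plain Python. It does not use Lucene and is not the exact indexing scheme I use there. The class name, the choice of padded character bigrams, the Jaccard score, and the 0.3 threshold are all assumptions for illustration.

```python
from collections import defaultdict


def bigrams(text):
    """Character bigrams of the lowercased text, padded with '$' at both ends."""
    padded = f"${text.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


class NGramIndex:
    """Inverted index from character bigrams to the strings containing them."""

    def __init__(self, strings):
        self.grams = {s: bigrams(s) for s in strings}
        self.postings = defaultdict(set)
        for s, grams in self.grams.items():
            for g in grams:
                self.postings[g].add(s)

    def search(self, query, threshold=0.3):
        """Return strings whose bigram Jaccard similarity to the query
        is at least `threshold`, best match first."""
        q = bigrams(query)
        candidates = set()
        for g in q:
            candidates |= self.postings.get(g, set())
        scored = []
        for s in candidates:
            score = len(q & self.grams[s]) / len(q | self.grams[s])
            if score >= threshold:
                scored.append((score, s))
        return [s for score, s in sorted(scored, key=lambda p: (-p[0], p[1]))]


index = NGramIndex(["Red", "Green", "Blue"])
print(index.search("Gren"))
print(index.search("Geen"))
```

Only strings sharing at least one bigram with the query are scored, which is what keeps this approach usable on large lists.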
I'm not sure how well suited Lucene is to fuzzy searching; a custom library may be a better choice. For example, this search is implemented in Java and works pretty fast, but it is custom-made for that task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding. Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
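For reference, here is a short sketch of classic American Soundex, which shows how English-centric the letter groupings are. It is written from the commonly published rules and is not taken from any particular library. Daitch-Mokotoff uses a much larger rule table and is not implemented here.

```python
# Letter groups of American Soundex. Vowels, y, h, and w carry no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Two names are treated as a match when their codes are equal, so this is best used as a coarse filter before a finer comparison.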
You didn't specify your development platform, but if it's PHP then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it is LAMP-based, it's far lighter than Lucene on Java, and it can easily be extended to other file types, provided you can find a conversion library or command-line converter; there are lots of OSS solutions around to do this.
Try Walnutil. It is based on the Lucene API and integrates with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use some methods from walnutilsoft; for more complicated search cases you can use the Lucene API. See the web-based example that uses indexes created with the Walnutil tools. You can also see some code examples written in Java and C# that you can use to create different types of search.
This tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
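If you just want something to experiment with, Levenshtein distance is short enough to write yourself. Below is a minimal sketch of the standard dynamic-programming version, plus a hypothetical `closest` helper (the helper's name, the case-insensitive comparison, and the default `max_distance` are assumptions for this example).

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def closest(query, candidates, max_distance=1):
    """Candidates within max_distance edits of the query, nearest first.
    The comparison is case-insensitive."""
    scored = [(levenshtein(query.lower(), c.lower()), c) for c in candidates]
    return [c for d, c in sorted(scored) if d <= max_distance]


print(closest("Gren", ["Red", "Green", "Blue"]))
print(closest("Geen", ["Red", "Green", "Blue"]))
```

This compares the query against every candidate, so for a large list you would combine it with an index or a cheaper pre-filter.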
A powerful, lightweight solution is Sphinx.
It's smaller than Lucene and it supports disambiguation.
It's written in C++, it's fast and battle-tested, it has libraries for every environment, and it's used by large companies such as craigslist.org.