Getting Sentiments Using Elasticsearch

I have a big dataset of Negative/Positive/Neutral sentiments in different languages. I have a PHP API (based on the Naive Bayes algorithm) to get sentiments based on the given dataset. There are two issues:
1) It is extremely slow.
2) It does not work properly for non-English languages such as Chinese and Vietnamese.
I am looking at solving this in Elasticsearch because ES is good at language handling, it can handle a huge dataset, and we can use the built-in tokenizers of ES.
Is there any way to implement a sentiment engine based on the Naive Bayes algorithm in Elasticsearch?
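Elasticsearch has no built-in Naive Bayes classifier, but one possible shape for this, sketched here only as an illustration, is to let ES do the language-aware tokenization through its _analyze API (with built-in or plugin analyzers per language) and run the Naive Bayes step outside of ES. The elasticsearch Python client (7.x-style calls), scikit-learn and the toy training data below are assumptions, not part of the original setup:

```python
# Sketch: Elasticsearch handles language-aware tokenization via the
# _analyze API; scikit-learn's MultinomialNB does the Naive Bayes part.
# Assumes the `elasticsearch` Python client (7.x-style body argument)
# and scikit-learn; analyzer names and training data are placeholders.
from elasticsearch import Elasticsearch
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

es = Elasticsearch("http://localhost:9200")

def es_tokenize(text, analyzer="standard"):
    """Tokenize text with an Elasticsearch analyzer (e.g. a language
    analyzer, or a plugin analyzer such as smartcn for Chinese)."""
    resp = es.indices.analyze(body={"analyzer": analyzer, "text": text})
    return [t["token"] for t in resp["tokens"]]

# Toy labelled data standing in for the real sentiment dataset.
train = [
    ("I love this product", "positive"),
    ("This is terrible", "negative"),
    ("It arrived yesterday", "neutral"),
]
texts, labels = zip(*train)

# Reuse the ES tokenizer so training and prediction see the same tokens.
vectorizer = CountVectorizer(analyzer=es_tokenize)
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["I really love it"])))
```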

Related

Semantic Similarity using Elasticsearch

I went through some blogs that say the Universal Sentence Encoder is used in Elasticsearch for semantic similarity. Can we use BERT instead of USE? They also say the embedding search has to go through all the documents. Can it be optimised?
https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
Sure, you can use BERT. However, it will add runtime overhead for transforming the data into vector embeddings. By the way, you should also explore other similarity search alternatives, such as pinecone.io, which offers a managed vector search service.
Absolutely! You'll just have to make use of dense_vector fields in order to search for vectors, which is what BERT produces.
For more information on dense vectors:
https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
For more information on how to optimize embeddings search, you can check out https://www.gsitechnology.com/sites/default/files/AppNotes/GSIT-Elasticsearch-Plugin-AppBrief.pdf
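For illustration, here is a minimal sketch of the dense_vector mapping and a script_score query using cosineSimilarity, assuming the elasticsearch Python client (7.x-style calls) and embeddings produced by an external model such as BERT; the index name, field names and 384-dimension size are placeholders:

```python
# Sketch: index externally computed embeddings as a dense_vector field
# and rank documents with cosineSimilarity in a script_score query.
# Index/field names and the 384-dim size are placeholders; the vectors
# would come from a model such as BERT.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="docs", body={
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 384},
        }
    }
})

doc_vector = [0.1] * 384  # stand-in for a real document embedding
es.index(index="docs", body={"text": "some document", "embedding": doc_vector})
es.indices.refresh(index="docs")

query_vector = [0.1] * 384  # stand-in for the embedded query
resp = es.search(index="docs", body={
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # +1.0 keeps scores non-negative, as Elasticsearch requires.
                "source": "cosineSimilarity(params.qv, 'embedding') + 1.0",
                "params": {"qv": query_vector},
            },
        }
    }
})
print([hit["_source"]["text"] for hit in resp["hits"]["hits"]])
```

Note that script_score still scores every document matched by the inner query; newer Elasticsearch releases (8.0+) add approximate kNN search over indexed dense_vector fields, which is one way to avoid going through all the documents, alongside plugins such as the GSI one linked above.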

New approach to indexing in Elasticsearch

I want to define a new approach to indexing in Elasticsearch, so I will edit the TF-IDF method.
Where can I find the TF-IDF implementation in Elasticsearch?
Which packages in the Elasticsearch source code do I need to modify to implement the new approach?
The TF/IDF similarity algorithm is implemented in Lucene; however, there are ways to define another similarity algorithm to be used inside Elasticsearch via the similarity module. In addition to TF/IDF, there are currently seven more similarities supported:
BM25
Classic similarity
DFR similarity
DFI similarity
IB similarity
LM Dirichlet similarity
LM Jelinek Mercer similarity
Each of them has different parameters that you can tune. Maybe it'd be a good idea to test each of them before venturing into creating your own.
More info about the available Lucene similarity algorithms: https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/similarities/Similarity.html
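For illustration, this is roughly how custom similarities are declared through the similarity module and attached to fields, shown here via the elasticsearch Python client; the index name, similarity names and parameter values are only examples:

```python
# Sketch: declare custom similarities in the index settings and assign
# them per field in the mapping. Names and parameter values are
# illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="articles", body={
    "settings": {
        "index": {
            "similarity": {
                # A BM25 variant with tuned parameters.
                "my_bm25": {"type": "BM25", "k1": 1.3, "b": 0.6},
                # A DFR similarity with explicit components.
                "my_dfr": {
                    "type": "DFR",
                    "basic_model": "g",
                    "after_effect": "l",
                    "normalization": "h2",
                    "normalization.h2.c": "3.0",
                },
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "similarity": "my_bm25"},
            "body": {"type": "text", "similarity": "my_dfr"},
        }
    }
})
```

Recent Elasticsearch versions also provide a scripted similarity type, which lets you experiment with your own scoring formula without modifying the Lucene or Elasticsearch source.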

How to suggest travel destinations to user based on a user query?

I am trying to build a mini information retrieval system that suggests travel destinations to users based on the indexing I do on the data corpus, and I am using Solr for the search functionality. What should I take into consideration to approach this job, and which parameters should I treat as significant in the query?
Your question is too broad to answer fully. There are lots of concepts involved in searching and making recommendations (not only in the travel domain). Since I have developed a project that includes search and recommendation features, I can give you some hints.
Search
What and how to index your content (i.e. what you search by); this affects the accuracy of searching.
Search relevance measure; you can measure relevance by term weighting or a string similarity algorithm (see the sketch after this list).
Search result sorting, which is based on the search relevance measure.
Recommendation
You can study Collective Intelligence to get the full concepts of recommendation.
A learning process to update the recommendation algorithm (machine learning or artificial intelligence).
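To make the relevance-measure hint concrete, here is a small illustrative sketch that ranks destination descriptions against a user query with TF-IDF term weighting and cosine similarity; the toy corpus and the scikit-learn scoring are stand-ins for what Solr's built-in similarity and field boosts would do for you:

```python
# Sketch: rank destination descriptions against a query using TF-IDF
# term weighting and cosine similarity. The corpus is a toy example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

destinations = [
    "Bali: beaches, surfing, temples and tropical weather",
    "Zermatt: alpine skiing, hiking and mountain views",
    "Kyoto: temples, gardens, traditional culture and food",
]
query = "tropical beach holiday with surfing"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(destinations)
query_vec = vectorizer.transform([query])

# Higher cosine similarity means a more relevant destination.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for score, dest in sorted(zip(scores, destinations), reverse=True):
    print(f"{score:.3f}  {dest}")
```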

Applications of semantic web/ontology in information retrieval?

What are the uses of the Semantic Web in information retrieval? By Semantic Web here I mean structured data sources such as DBpedia and Freebase.
I have integrated information in RDF with Lucene in several projects, and I think a lot of the value you get from the integration is that you can go beyond the simple keyword search that Lucene would normally enable. That opens up possibilities for full-text search over your RDF information, but also for semantically enriched full-text search.
In the former case, there is no 'like' operator in SPARQL, and the regex function, while similar in capability to SQL's LIKE, is not really tractable to evaluate against a dataset of any appreciable size. However, if you are able to use Lucene to do the search instead of relying on regex, you can get better scale and performance out of a single keyword search over your RDF.
In the latter case, if the query engine is integrated with the Lucene text/RDF index (think LARQ; both Jena and Stardog support this), you can do far more complex semantic searches over your full-text index, queries like 'get all the genres of movies that have at least 10 reviews where the reviews contain the phrase "two thumbs up"'. That is difficult to pull off with a Lucene index alone, but becomes quite trivial at the intersection of Lucene and SPARQL.
You can use DBpedia in information retrieval, since it contains the structured information from Wikipedia.
Wikipedia has knowledge of almost every topic of interest in the form of articles, categories and infoboxes, and that knowledge is used in information retrieval systems to extract meaningful information as triples, i.e. Subject, Predicate and Object.
You can query the information via SPARQL using the public DBpedia endpoint (https://dbpedia.org/sparql).
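For example, here is a small illustrative query against that endpoint using the SPARQLWrapper Python library; the classes and properties used are just examples:

```python
# Sketch: query the public DBpedia SPARQL endpoint for city labels and
# abstracts. The query itself is only an example.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?label ?abstract WHERE {
        ?city a dbo:City ;
              rdfs:label ?label ;
              dbo:abstract ?abstract .
        FILTER (lang(?label) = "en" && lang(?abstract) = "en")
    }
    LIMIT 5
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"], "-", row["abstract"]["value"][:80])
```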

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web app I'm currently working on. The back-end is in Java, and it just so happens that the search engine that everyone recommends on here, Lucene, is coded in Java as well. I, however, am shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distances of each term indexed. I feel the approach I want to take (detailed below) would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see how Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram based approach to the problem: read and tokenize each item from the database and save them to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams and use them, along with their corresponding locations, to read in the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating those not within a given length range, performing edit distance calculations, etc.) on this set of data instead of doing so for the entire dataset.
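In code, a rough sketch of this idea (with an in-memory dict standing in for the per-n-gram files; the Levenshtein helper and the distance threshold are just illustrative):

```python
# Sketch of the positional n-gram index described above: the index maps
# (ngram, position) -> set of terms; a query is tokenized the same way,
# candidates are gathered from matching buckets, then filtered by edit
# distance. A dict stands in for the per-n-gram files.
from collections import defaultdict

N = 3

def ngrams(term):
    """Yield (ngram, position) pairs for a term."""
    for i in range(len(term) - N + 1):
        yield term[i:i + N], i

def build_index(terms):
    index = defaultdict(set)
    for term in terms:
        for gram, pos in ngrams(term):
            index[(gram, pos)].add(term)
    return index

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_search(index, query, max_dist=2):
    # Gather candidates sharing at least one positional n-gram with the query...
    candidates = set()
    for gram, pos in ngrams(query):
        candidates |= index.get((gram, pos), set())
    # ...then filter the (small) candidate set by edit distance.
    return sorted(t for t in candidates if edit_distance(query, t) <= max_dist)

index = build_index(["bear", "beau", "beacon", "beautiful", "beats by dre"])
print(fuzzy_search(index, "beat"))  # -> ['bear', 'beau']
```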
My question is... well I guess I have a couple of questions.
Have there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implement fuzzy-search, (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x fuzzy query used to evaluate the Levenshtein distance between the queried term and every index term (a brute-force approach). Given that this approach is rather inefficient, the Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with n-grams similar to the queried term and would then score these terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this has changed a lot in Lucene 4.0 (an ALPHA preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
For the record, as you are dealing with an English corpus, Lucene (or Solr, but I guess you could use them in vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
Lucene 4.0 alpha was just released, and many things are easier to customize now, so you could also build on it and create a custom fuzzy search.
In any case, Lucene has many years of performance improvements behind it, so you would hardly be able to achieve the same performance. Of course, your own implementation might be good enough for your case...

Resources