Insights into Word/Document Embeddings - gensim

How can I get insights into the word or document embeddings I have created?
For example, with the TF-IDF vectorizer I can output the top n features. Is there a similar approach for gaining knowledge about an embedding model?
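For gensim embeddings, the closest analogue to "top n features" is inspecting nearest neighbours in the vector space. A minimal sketch, assuming a gensim 4.x Word2Vec model; the toy corpus and parameters are placeholders:

```python
from gensim.models import Word2Vec

# Toy corpus; substitute your own tokenized documents.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(sentences, vector_size=50, min_count=1, seed=42)

# Nearest neighbours in the embedding space -- a rough analogue of a
# vectorizer's "top n features" for sanity-checking the model.
print(model.wv.most_similar("cat", topn=3))

# Pairwise cosine similarity between two words.
print(model.wv.similarity("cat", "dog"))
```

On a real corpus, checking that the neighbours of a few well-understood words make semantic sense is a quick qualitative test of embedding quality.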

Related

Semantic Similarity using Elastic Search

I went through some blog posts saying that the Universal Sentence Encoder (USE) is used in Elasticsearch for semantic similarity. Can we use BERT instead of USE? They also say the embedding search has to go through all the documents; can it be optimised?
https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
Sure, you can use BERT, but it will incur a higher runtime for transforming the data into vector embeddings. By the way, you should also explore other similarity-search alternatives, such as pinecone.io, which offers a managed vector search service.
Absolutely! You'll just have to make use of dense_vector fields in order to index and search the vectors that BERT produces.
For more information on dense vectors:
https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
For more information on how to optimize embeddings search, you can check out https://www.gsitechnology.com/sites/default/files/AppNotes/GSIT-Elasticsearch-Plugin-AppBrief.pdf
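As a concrete illustration of the dense_vector approach, here is a minimal sketch assuming Elasticsearch 7.x and the official elasticsearch-py client; the index name, field names, vector dimensionality, and the embed() encoder call are all placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Mapping with a dense_vector field; dims must match the encoder's output
# (768 is typical for BERT-base, but this is an assumption).
es.indices.create(index="docs", body={
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "text_vector": {"type": "dense_vector", "dims": 768},
        }
    }
})

# embed() is a hypothetical function wrapping your sentence encoder
# (BERT, USE, etc.); it must return a plain list of floats.
query_vector = embed("how do vector searches work?")

# Brute-force scoring of every document via script_score -- this is why
# the search "has to go through all the documents".
resp = es.search(index="docs", body={
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.qv, 'text_vector') + 1.0",
                "params": {"qv": query_vector},
            },
        }
    }
})
```

The `+ 1.0` offset keeps scores non-negative, which Elasticsearch requires for script_score.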

Sentiment towards a keyword

I have been looking around for sentiment and text analysis services, but most of them seem to analyse the whole text and provide one result for it.
Is there a way of analysing the same piece of text against two different keywords? For example, the same article could be talking about two entities, positively towards one and negatively towards the other.
How could one get these two sentiments within the same text? Is there a service or API already for that?
I have found IBM's AlchemyAPI, but it doesn't seem to return accurate results...
What you want is aspect-based sentiment analysis. There are many algorithms for it, with varying precision and recall.
You can use Aylien's text analysis API.
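If you want to experiment before committing to a paid service, a crude sentence-scoped heuristic can approximate aspect-based sentiment: score only the sentences that mention each entity. A minimal sketch using NLTK's VADER (the example text is made up, and this is far simpler than a true aspect-based model):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

nltk.download("vader_lexicon")
nltk.download("punkt")

sia = SentimentIntensityAnalyzer()

def keyword_sentiment(text, keyword):
    # Average the VADER compound score over sentences mentioning the keyword.
    scores = [sia.polarity_scores(s)["compound"]
              for s in sent_tokenize(text)
              if keyword.lower() in s.lower()]
    return sum(scores) / len(scores) if scores else None

article = ("Acme delighted investors with record profits. "
           "Meanwhile, Globex is being sued for fraud.")
print(keyword_sentiment(article, "Acme"))
print(keyword_sentiment(article, "Globex"))
```

This breaks down when both entities appear in the same sentence, which is exactly the case where a proper aspect-based model earns its keep.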

Keyword suggestion Algorithm

I have been working on a project which requires me to give keyword/keyphrase suggestions based on the description of a product.
What I have currently: the description of the product, and the category of the product (which may or may not be present).
What I want: machine-generated keywords/keyphrases based on the description.
What research I have done (NLP-based approach): this problem can be broken down into two separate approaches.
Not using the past data: just summarizing the current description.
Method: tokenization, stemming, stopword removal, etc. (preprocessing).
Shallow NLP (constituency parsing), retaining only NP and JJ phrases (a minimal sketch of this step follows below).
This is an approach that doesn't use the descriptions already present in the database.
What I was looking for is a better approach that uses ML algorithms and also makes use of my past product-description data.
I was thinking about applying shallow parsing to the entire dataset, and then suggesting keywords that occur in more than N products.
What algorithm or approach would come in handy?
How can I use my data?
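To make the shallow-parsing step described above concrete, here is a minimal sketch of NP/JJ candidate extraction with spaCy; the model name and example description are assumptions, and spaCy's dependency-based noun chunks stand in for the NP output of a constituency parser:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

description = "Wireless noise-cancelling headphones with 30-hour battery life."
doc = nlp(description)

# Noun phrases as candidate keyphrases.
candidates = [chunk.text.lower() for chunk in doc.noun_chunks]
# Standalone adjectives (Penn tag JJ) as candidate keywords.
candidates += [tok.text.lower() for tok in doc if tok.tag_ == "JJ"]
print(candidates)
```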
Try looking at basic models like term frequency or TF-IDF first; these give you the important words: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Then look into text clustering (to group related texts together) and topic-detection approaches (these can help you find the prominent words and topics of a document).
You can then find keywords for each cluster (also taking the document categories into account), and try to find the words most relevant to other words.
I suggest reading some (or all) chapters of this book: http://nlp.stanford.edu/IR-book/
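A minimal sketch of the TF-IDF step over past product descriptions, using scikit-learn (the example descriptions are placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "wireless headphones with long battery life",
    "bluetooth speaker, waterproof, with long battery life",
    "mechanical keyboard with rgb backlight",
]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(descriptions)
terms = np.array(vec.get_feature_names_out())

# Top-k TF-IDF terms per description = candidate keywords/keyphrases.
for i, row in enumerate(X.toarray()):
    top_terms = terms[row.argsort()[::-1][:3]]
    print(i, list(top_terms))
```

The same TF-IDF matrix can then feed the clustering or topic-detection step (e.g. KMeans or LDA over X).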

Unsupervised automatic tagging algorithms?

I want to build a web application that lets users upload documents, videos, images, music, and then give them an ability to search them. Think of it as Dropbox + Semantic Search.
When a user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words, no user input is needed to determine what the file is about. Suppose Document1.docx is a research paper on data mining; then when a user searches for "data mining", "research paper", or "document1", that file should be returned in the search results, since "data mining" and "research paper" will most likely be among the auto-generated tags for that document.
1. Which algorithms would you recommend for this problem?
2. Is there a natural language processing library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!
The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents assigns words to topics with certain probabilities, so when a user searches for a word you can retrieve the documents most likely to be relevant to it.
There have been some extensions to images and music as well, see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.
LDA has efficient implementations in several languages:
many implementations from the original researchers
http://mallet.cs.umass.edu/, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation
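A minimal sketch with gensim's LdaModel, just to show the mechanics (the toy corpus and num_topics are placeholders):

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["data", "mining", "research", "paper", "algorithm"],
    ["music", "audio", "signal", "recording"],
    ["image", "object", "detection", "vision"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# The top words of each topic can serve as auto-generated tags.
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])

# Topic mixture for a newly uploaded document.
new_bow = dictionary.doc2bow(["research", "data", "mining"])
print(lda.get_document_topics(new_bow))
```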
These guys propose an alternative to LDA:
"Automatic Tag Recommendation Algorithms for Social Recommender Systems"
http://research.microsoft.com/pubs/79896/tagging.pdf
I haven't read through the whole paper, but they present two algorithms:
A supervised learning version. This isn't that bad; you can use Wikipedia to train the algorithm.
A "prototype" version. I haven't had a chance to go through this yet, but it's the one they recommend.
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the meantime, here's the approach:
Use TextRank, as per http://qr.ae/36RAP, to generate a tag list for a single document, independently of the other documents (a toy sketch follows below).
Use the algorithm from "Using Machine Learning to Support Continuous Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list from step 1 into the existing tag list.
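A toy sketch of the TextRank idea from step 1, with no POS filtering or stemming (a real implementation would add both); it just runs PageRank over a word co-occurrence graph using networkx:

```python
import itertools
import networkx as nx

def textrank_keywords(words, window=3, topn=5):
    # Build a graph whose edges connect words co-occurring within a window.
    g = nx.Graph()
    for i in range(max(len(words) - window + 1, 1)):
        for a, b in itertools.combinations(words[i:i + window], 2):
            if a != b:
                g.add_edge(a, b)
    # PageRank scores words by how central they are in the graph.
    ranks = nx.pagerank(g)
    return sorted(ranks, key=ranks.get, reverse=True)[:topn]

tokens = ("compatibility of systems of linear constraints "
          "over the set of natural numbers").split()
print(textrank_keywords(tokens))
```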
Text documents can be tagged using this keyphrase extraction algorithm/package.
http://www.nzdl.org/Kea/
Currently it supports a limited set of document types (agricultural and medical, I believe), but you can train it for your own requirements.
I'm not sure how the image/video part would work out unless you're doing very accurate object detection (which has its own shortcomings). How are you planning to do it?
You want Doc-Tags (https://www.Doc-Tags.com), a commercial product that automatically generates contextually accurate document tags, unsupervised. The built-in reporting functionality makes the product a lightweight document management system.
For developers wanting to customize their own approach, the source code is available (very cheap), and the back-end service xAIgent (https://xAIgent.com) is very inexpensive to use.
I posted a blog article today to answer your question.
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and their variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo sites and source code.
Thanks, Scott

Search ranking/relevance algorithms

When developing a database of articles in a Knowledge Base (for example), what are the best ways to sort and display the most relevant answers to a user's question?
Would you use additional data such as keyword weighting based on whether previous users found the article of help, or do you find a simple keyword matching algorithm to be sufficient?
Perhaps the easiest and most naive approach that will give immediately useful results would be to implement tf-idf:
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
In a recent related question of mine here I learned of an excellent free book on this topic which you can download or read online:
An Introduction to Information Retrieval
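As a starting point, tf-idf plus cosine similarity already gives a workable ranking. A minimal sketch with scikit-learn (the article texts are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

articles = [
    "How to reset your password",
    "Troubleshooting VPN connection issues",
    "Setting up two-factor authentication",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(articles)

query = "password reset"
q = vec.transform([query])

# linear_kernel on L2-normalized tf-idf vectors equals cosine similarity.
scores = linear_kernel(q, X).ravel()
for idx in scores.argsort()[::-1]:
    print(round(scores[idx], 3), articles[idx])
```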
That's a hard question, and companies like Google put a lot of effort into addressing it. Have a look at the Google Enterprise Search Appliance or Exalead Enterprise Search.
Then, as a personal opinion, I don't think any "naive" approach is going to improve the results much compared to naive keyword search with ordering by the number of views of the documents.
If you have the possibility of exposing your knowledge base to the web, then just do it and let your favorite search engine handle the search for you.
I think the angle here is not the retrieval itself; it's about scoring the relevance of the retrieved information (a more reactive and passive approach), which can later be used to improve the search engine.
I guess you can try:
kNN on TF-IDF vectors for retrieving information
Hand-tagging the retrieved results with a relevance score
Then regressing on that score to predict the score for an unknown search result, and sorting by it
Just a thought...
The third point is actually based on the Rocchio algorithm; a sketch follows below.
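For reference, a minimal numpy sketch of the Rocchio relevance-feedback update, using the common textbook weights (alpha=1, beta=0.75, gamma=0.15); the vectors are placeholders:

```python
import numpy as np

def rocchio(query_vec, relevant, non_relevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    # q' = alpha*q + beta*mean(relevant) - gamma*mean(non_relevant)
    q_new = alpha * query_vec
    if len(relevant):
        q_new = q_new + beta * relevant.mean(axis=0)
    if len(non_relevant):
        q_new = q_new - gamma * non_relevant.mean(axis=0)
    return np.clip(q_new, 0, None)  # negative term weights are usually dropped

q = np.array([1.0, 0.0, 0.5])          # tf-idf query vector
rel = np.array([[0.8, 0.2, 0.6]])      # vectors of documents marked relevant
nonrel = np.array([[0.1, 0.9, 0.0]])   # vectors of documents marked irrelevant
print(rocchio(q, rel, nonrel))
```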
It would help to have a little more specificity about your exact problem. There are a lot of different techniques you can use, and many of them are driven by other pieces of data. You can of course use Lucene and build your own indexes; there are Lucene bindings for many languages. Moving up a level, there is also the Solr project, which is Lucene with a lot of tools and extra functionality around it. That may be more along the lines of what you are looking for.
Intent is tricky, and most modern search engines rely on statistical intent to aid the ordering of results. You can always add an "is this article useful?" button and store the query text that leads to useful documents. You could then add a layer of information to the index to boost specific words or phrases and help them point to certain documents.
Some things to think about: How many documents are there? What is their average length? Are they updated frequently? What do users do with the documents? What does the spread of unique words to documents look like? (More simply: is it easy to match a query to specific documents based on common unique features?)
If it is on the web, you can always make a Google Custom Search Engine that just searches your site, although you may find this sub-optimal for a variety of reasons.
You can always start with a simple index and gradually make it more sophisticated by talking with users and capturing data.
Keyword matching is not enough when dealing with questions; you need to understand intent, which, as Joannes says, is a very hot topic in search.
