Can I use my own labels for carrot2 - carrot2

I am modifying the source code of Carrot2 for a project.
According to the Lingo algorithm, it first generates the most probable labels and then builds the clusters that best fit those labels, right?
So, can I feed my own set of labels into Carrot2 to see how it clusters the documents around them?

Unfortunately, you can't provide your own labels for clustering with Lingo.
On the other hand, the label-to-document assignment algorithm is very simple in Lingo -- if the document contains the label's words, it will be assigned to the label. Therefore, you can achieve the same effect by, for example, indexing your documents in Lucene and then querying the index using the predefined labels you have.
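For illustration, here is a minimal sketch of that word-containment rule in plain Python (this is not Carrot2 code; the documents and labels are made up, and a real setup would use a Lucene index with proper tokenization and stemming):

```python
# Minimal sketch: attach documents to predefined labels using the same rule
# Lingo uses, i.e. a document is assigned to a label if it contains all of
# the label's words. Documents and labels below are invented examples.
documents = {
    "doc1": "Data mining and clustering of search results",
    "doc2": "Search results clustering with Lingo",
    "doc3": "Photos of cats and dogs",
}
labels = ["search results clustering", "cats"]

def assign(documents, labels):
    clusters = {label: [] for label in labels}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        for label in labels:
            if all(w in words for w in label.lower().split()):
                clusters[label].append(doc_id)
    return clusters

print(assign(documents, labels))
# {'search results clustering': ['doc1', 'doc2'], 'cats': ['doc3']}
```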

Related

Is there a way to show at what percentage a selected document is similar to others on ElasticSearch?

I need to build a search engine using Elasticsearch and the steps will be as following:
Search on the search engine with a search string.
The relevant results will be displayed and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the document and will have an option "More Like This" (which will return documents similar to the selected one). I know that this is done using the MLT query.
Now my question is: besides returning documents similar to the selected one, how can I also return the percentage by which each of those documents is similar to it?
There are a couple of things you can do.
using function_score query
more_like_this query is essentially a full-text search, and it returns documents ordered by their relevance score. It could be possible to convert the score directly to a percentage, but it is not advised (here and more specifically here).
Instead, one can define a custom score with the help of a function_score query, which can be designed so that it returns a meaningful percentage.
This, of course, comes with the additional cost of complexity, and the definition of "similarity" becomes more of an art than a science.
using dense_vector
One may opt to use the (still experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of a fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: a distance of two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be computed somehow, and the quality of the similarity will only be as good as the quality of those vectors.
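As a rough sketch of the idea in plain Python (the vectors below are made up and stand in for whatever embeddings you compute; mapping cosine similarity onto 0-100% is just one simple option):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

selected_doc = [0.12, 0.54, 0.33]   # embedding of the selected document (invented)
candidate    = [0.10, 0.50, 0.40]   # embedding of a candidate "similar" document

# Cosine similarity lies in [-1, 1]; rescale it to a 0-100% figure.
percentage = (cosine_similarity(selected_doc, candidate) + 1) / 2 * 100
print(f"{percentage:.1f}% similar")
```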
The bottom line is that to make this work with Elasticsearch, a fair amount of computation and logic has to be added outside of it, either in the form of pre-computed models or carefully curated scoring algorithms. Out of the box, Elasticsearch is not really built for percentage-style similarity.
Hope that helps!
If you're going the route of semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box system for improving Elasticsearch results with SOTA models.

How can we design system for document search?

I was recently asked a system design question where I needed to "design a system for document search", and the first thing that came to my mind was how Elasticsearch works. So I came up with the inverted index approach that is used to support text search. The inverted index has a record for each term. Each record has the list of documents that the term appears in. Documents are identified by an integer document ID. The list of document IDs is sorted in ascending order.
So I said something along those lines, but I am not sure whether this is how it should work in a distributed fashion, because we may have a lot of documents to index, so we need some load balancing and we somehow need to partition the inverted index or the data. For example, which process uploads the documents, and which process tokenizes them (will it be just one machine or a group of machines)? Basically, I wanted to understand the right way to design a system like this with the proper components, and how we should talk about it in a system design interview. What are the things we should touch on in an interview for this problem?
What's the right way to design a system for document search in a distributed fashion with the right components?
OK... it is a vast subject. Elasticsearch was built for exactly that, and so was Google, although there is a technological gap between Elasticsearch and Google Search.
Writing your own implementation is still possible, but there is a ton of work to do to be as efficient as Elasticsearch. The quick answer is: use Elasticsearch.
You may be curious, or for some reason you may need to write it yourself. So here is how it works:
TF-IDF and cosine distance
As you said, you will first tokenize,
then represent the tokenized text as a vector and measure the angular distance between that vector and the search words.
Imagine your language has only 3 words: "foo", "bar" and "bird".
Then a text with "foo bar bird" can be represented by the vector [1,1,1].
A text with
A) "foo foo foo foo bird" will be [4,0,1]
and another with
B) "foo bar" will be [1,1,0].
If you search for "bar", which is represented by [0,1,0], you look for the text with the minimal angular distance to your search. The angle between the search vector and B is 45°, which is smaller than the 90° to A, so B is the better match.
In reality a language has far more than 3 words, so you will compute the distance in a vector space of many more dimensions, because 1 word = 1 dimension :)
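To make those toy numbers concrete, here is a small plain-Python sketch computing the angles (the vocabulary is just foo/bar/bird as above):

```python
import math

def angle_deg(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.degrees(math.acos(dot / norm))

# vocabulary order: [foo, bar, bird]
A = [4, 0, 1]        # "foo foo foo foo bird"
B = [1, 1, 0]        # "foo bar"
query = [0, 1, 0]    # "bar"

print(round(angle_deg(query, A), 1))  # 90.0 -> the query shares no words with A
print(round(angle_deg(query, B), 1))  # 45.0 -> B is the closer (better) match
```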
TF-IDF stands for term frequency-inverse document frequency.
It weights the frequency of a word in a document by the inverse of the frequency of that word across all documents. What it does is highlight the important words in a document.
Let me explain:
words like "that", "in", "a", "the", etc. are everywhere, so they are not important;
across all of your texts the word "corpus" has a frequency of, say, 0.0001%;
in one particular text it is cited 5 times and has a frequency of 0.1%;
so it is quite rare in the corpus but comparatively important in that text;
so when you search for "corpus" you want to get first the text where it appears those 5 times.
So instead of a vector of raw occurrence counts you will have a vector of weighted frequencies, for example [0.2, 1, 0.0005].
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
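To make that concrete, here is a minimal TF-IDF sketch in plain Python (the three-document corpus is made up, and real engines use smoothed variants of this formula):

```python
import math

docs = [
    "the corpus is small but the corpus is ours",
    "the cat sat on the mat",
    "the dog chased the cat",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # frequency in this document
    df = sum(1 for d in tokenized if term in d)     # how many documents contain it
    idf = math.log(N / df)                          # rare terms get boosted
    return tf * idf

print(tf_idf("the", tokenized[0]))     # 0.0   -> appears in every document
print(tf_idf("corpus", tokenized[0]))  # ~0.24 -> rare in the corpus, frequent here
```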
I have a little surprise for you at the end: this is what is behind the scoring in Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
In addition, Elasticsearch gives you replication, scalability, distribution and anything else you could dream of.
There are still reasons not to use Elasticsearch:
research purposes
you are actually writing another search engine, like DuckDuckGo (why not)
you are curious
the scaling and distribution part
Read about Lucene first: https://en.wikipedia.org/wiki/Apache_Lucene
It depends greatly on the volume of text and words to be indexed.
If it is around 1M texts you don't need to distribute it.
You will need a distributed inverted index if you index something big like Wikipedia (Wikipedia uses Elasticsearch for its search box). For example:
say the word foo is in texts A, B, C and R,
so I will partition my index:
I would use a distributed cache with words as the keys and lists of pointers to the vectors as values, and store the values in memory-mapped files.
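A toy in-memory version of that index, in plain Python (a real system would shard this map across machines and keep the postings in memory-mapped files as described):

```python
from collections import defaultdict

# Invented mini-corpus; document IDs match the foo-in-A,B,C,R example above.
docs = {
    "A": "foo bird foo",
    "B": "foo bar",
    "C": "foo baz",
    "R": "foo",
}

# inverted index: term -> sorted list of document IDs containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

inverted = {term: sorted(ids) for term, ids in index.items()}
print(inverted["foo"])   # ['A', 'B', 'C', 'R']
```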
A search engine must be fast, so if you do the thing yourself you will reduce the external libraries and use C++.
At Google they ended up in a situation where the vectors took so much space that they needed to store them across multiple machines, so they invented GFS, a distributed file system.
Computing the cosine distance between multidimensional vectors is time consuming, so I would do that computation on a GPU, because GPUs are efficient for floating point operations on matrices and vectors.
Actually it is a little bit crazy to reimplement all of that unless you have a good reason to, for example a very good business model :)
I would probably use Kubernetes, Docker and Mesos to virtualise all my components, and I would look for something similar to GFS if high volume is needed.
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
You will need to get the text back, so I would use any NIO web server that scales, in any language: nginx to serve the static pages and something like Netty or Vert.x to handle the search and build the response with links to the texts (it depends on how many users you want to serve per second).
All of that only if I plan to index something bigger than Wikipedia, and if I plan to invent something better than Elasticsearch (a hard task, good luck).
Wikipedia, for example, is less than 1 TB of text.
Finally
If you do it with Elasticsearch you are done in one week, maybe two, and in production. If you do it yourself, you will need at least one senior developer, a data scientist and an architect, and something like a year or more depending on the volume of text you want to index. I can't stop asking myself "what is it for?".
Actually, if you read the source code of Lucene you will know exactly what you need. They have already done it; Lucene is the engine of Elasticsearch.
Twitter uses Lucene for its real-time search.

Algorithm to recognize keywords' categories in a One-search-box-for-all model query

I'm aiming at providing a one-search-box-for-everything model in a search engine project, like LinkedIn.
I've tried to express my problem using an analogy.
Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.
Some sample queries:
"information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}
"ACM paper by authoABC on design patterns" : three dimensions {conf-name, author, topic}
"Multi-threaded programming at javaranch" : two dimensions {topic, website}
I have to identify those dimensions and the corresponding keywords in a big query before I can retrieve the final result from the database.
Points
I've access to all the possible values to all the dimensions. For example, I've all the conference names, author names, etc.
There's very little overlap of terms across dimensions.
My approach (naive)
Using Lucene, index all the keywords in each dimension with a dedicated field called "dimension" and another field with the actual value.
Ex:
1) {name:IEEE, dimension:conference}, etc.
2) {name:ooad, dimension:topic}, etc.
3) {name:xyz, dimension:author}, etc.
Search the index with the query as-it-is.
Iterate through results up to some extent and recognize first document with a new dimension.
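Stripped of Lucene, my approach essentially boils down to a keyword-to-dimension lookup, roughly like this sketch (the values are made up; the real version would index the full value lists and add analysis and fuzzy matching):

```python
# keyword -> dimension, built from the known value lists (toy subset)
dimension_index = {
    "ieee": "conference",
    "acm": "conference",
    "ooad": "topic",
    "design patterns": "topic",
    "authorxyz": "author",
}

def recognize(query):
    """Return one recognized keyword per dimension found in the query."""
    found = {}
    q = query.lower()
    for keyword, dimension in dimension_index.items():
        if keyword in q and dimension not in found:
            found[dimension] = keyword
    return found

print(recognize("design patterns paper at IEEE by authorXYZ"))
# {'conference': 'ieee', 'topic': 'design patterns', 'author': 'authorxyz'}
```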
Problems
Not sure when to stop recognizing the dimensions from the result set. For example, the query may contain only two dimensions but the results may match 3 dimensions.
If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.
References to papers, articles, or pointing-out the right terminology that describes my problem domain, etc. would certainly help.
Any guidance is highly appreciated.
Solution 1: How about solving your problem using Natural Language Processing Named Entity Recognition (NER)? NER can be done using simple regular expressions (in cases where the data is fairly static) or with a machine learning technique such as Hidden Markov Models to figure out the named entities in your sequence data. The reason I stress HMMs over other supervised machine learning algorithms is that you have sequential data, with each state dependent on the previous or next state. NER would output the dimensions along with the corresponding names. After that your search becomes a vertical search problem, and you can simply search for the identified words in different Solr/Lucene fields and set your boosts accordingly.
Now coming to the implementation part: I assume you know Java as you are working with Lucene, so Mahout is a good choice. Mahout has an HMM built in, and you can train and test the model on your data set. I am also assuming you have a large data set.
Solution 2: Try to model this problem as a property graph problem. Check out something like Neo4j. I suggest this because your problem falls in a schema-less domain: your schema is not fixed, and the problem can very well be modelled as a graph where each node is a set of key-value pairs.
Solution 3: Since you said you have all possible values of the dimensions, why not simply convert your unstructured text into structured data using regular expressions and, as you do not have a fixed schema, store the data in a NoSQL key-value database? Most of them provide Lucene integrations for full-text search; then simply search on that database.
What you need to do is calculate the similarity between the query and the document set you are searching in. Measures like cosine similarity should serve your need. A shortcut you can use is to calculate TF-IDF for the documents and build an index using those scores, from which you can choose the appropriate one. I would recommend looking into the Vector Space Model to find a method that serves your need.
Give this algorithm a look as well:
http://en.wikipedia.org/wiki/Okapi_BM25
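For reference, here is a compact sketch of the Okapi BM25 formula from that link, in plain Python (the toy corpus is made up; k1 and b are the commonly used defaults):

```python
import math

docs = [doc.split() for doc in [
    "information retrieval papers at ieee",
    "design patterns in java",
    "multi threaded programming tutorial",
]]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N   # average document length
k1, b = 1.2, 0.75                       # common BM25 defaults

def idf(term):
    n = sum(1 for d in docs if term in d)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for term in query.split():
        f = doc.count(term)             # term frequency in this document
        score += idf(term) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

for d in docs:
    print(" ".join(d), "->", round(bm25("information retrieval", d), 3))
```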

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass values (well, m/z, but that is not so important) and a time. I extract the data from a file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be group together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am not familiar with clustering algorithms sadly.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering and also look into DBSCAN if you really don't want to specify how many clusters in advance. You will need to define a distance metric and in that step is where you would determine which of the features or combination of features you will be clustering on.
Why don't you just set a threshold?
If successive values (by time) do not differ by at least ±0.1 (in m/z) they are grouped together. Alternatively, use a relative threshold: differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
For the simple one-dimensional case, K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. The best way to select a good K is to plot K versus residual variance and select the K beyond which adding clusters no longer reduces the variance "dramatically". Another strategy is to use an information criterion (e.g. the Bayesian Information Criterion).
You can extend K-means to multi-dimensional data easily, but you should beware of the scaling of the individual dimensions. For example, among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) at these scales. But once you express the second dimension in metres, the opposite is probably true.
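A short sketch of the K-selection and scaling points above, assuming scikit-learn is available (the 512.x rows are invented to give the data a second cluster):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two-dimensional data: (m/z, time). Scaling matters because the two
# dimensions have very different ranges.
X = np.array([[337.65, 1524.6], [337.65, 1524.6], [337.65, 1604.3],
              [512.20, 1524.6], [512.21, 1610.0]])
X_scaled = StandardScaler().fit_transform(X)

# "Elbow" selection: plot K against the within-cluster sum of squares
# (the residual variance) and pick the K where it stops dropping dramatically.
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(k, round(km.inertia_, 3))
```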

N-gram text categorization category size difference compensation

Lately I've been mucking about with text categorization and language classification based on Cavnar and Trenkle's article "N-Gram-Based Text Categorization" as well as other related sources.
For doing language classification I've found this method to be very reliable and useful. The size of the documents used to generate the N-gram frequency profiles is fairly unimportant as long as they are "long enough" since I'm just using the most common n N-grams from the documents.
On the other hand well-functioning text categorization eludes me. I've tried with both my own implementations of various variations of the algorithms at hand, with and without various tweaks such as idf weighting and other peoples' implementations. It works quite well as long as I can generate somewhat similarly-sized frequency profiles for the category reference documents but the moment they start to differ just a bit too much the whole thing falls apart and the category with the shortest profile ends up getting a disproportionate number of documents assigned to it.
Now, my question is. What is the preferred method of compensating for this effect? It's obviously happening because the algorithm assumes a maximum distance for any given N-gram that equals the length of the category frequency profile but for some reason I just can't wrap my head around how to fix it. One reason I'm interested in this fix is actually because I'm trying to automate the generation of category profiles based on documents with a known category which can vary in length (and even if they are the same length the profiles may end up being different lengths). Is there a "best practice" solution to this?
If you are still interested, and assuming I understand your question correctly, the answer to your problem would be to normalise your n-gram frequencies.
The simplest way to do this, on a per document basis, is to count the total frequency of all n-grams in your document and divide each individual n-gram frequency by that number. The result is that every n-gram frequency weighting now relates to a percentage of the total document content, regardless of the overall length.
Using these percentages in your distance metrics will discount the size of the documents and instead focus on the actual make up of their content.
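A minimal sketch of that per-document normalisation in plain Python (character trigrams as in Cavnar and Trenkle; the sample strings are made up):

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Relative n-gram frequencies: each weight is a fraction of the total count."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

short = ngram_profile("text categorization")
long_ = ngram_profile("text categorization " * 50)

# Similar weights despite the 50x difference in document length:
print(short["tex"], long_["tex"])
```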
It might also be worth noting that the n-gram representation only makes up a very small part of an entire categorisation solution. You might also consider using dimensional reduction, different index weighting metrics and obviously different classification algorithms.
See here for an example of n-gram use in text classification
As I understand it, the task is to compute the probability that some text was generated by a language model M.
Recently I was working on measuring the readability of texts using semantic, syntactic and lexical properties. It can also be measured with a language model approach.
To answer properly you should consider these questions:
Are you using a log-likelihood approach?
What order of n-grams are you using: unigrams, bigrams or higher?
How big are the language corpora that you use?
Using only bigrams and unigrams I managed to classify some documents with nice results. If your classification is weak, consider building a bigger language corpus or using n-grams of lower order.
Also remember that classifying some text into the wrong category can happen simply because of the length of the text (by chance a few of its words may appear in another language model).
Just consider making your language corpora bigger, and remember that analysing short texts carries a higher probability of misclassification.
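As an illustration of the log-likelihood idea, here is a sketch with add-one smoothing on word bigrams (the tiny "corpora" are made up and far too small for real use):

```python
import math
from collections import Counter

def bigram_model(corpus):
    tokens = corpus.lower().split()
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def log_likelihood(text, bigrams, unigrams):
    """Sum of log P(w_i | w_{i-1}) with add-one smoothing."""
    tokens = text.lower().split()
    vocab = len(unigrams)
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return score

english = bigram_model("the cat sat on the mat the dog sat on the rug")
spanish = bigram_model("el gato se sienta en la alfombra el perro tambien")

text = "the cat sat on the rug"
print(log_likelihood(text, *english))  # higher (less negative): closer to this model
print(log_likelihood(text, *spanish))
```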
