Best way to compute a correlation coefficient for nominal data similarity - cluster-computing

I hope someone can help me with this one (PLEASE):
I want to compute the similarity between articles based on some of their features (author, category, year, impact factor, citations).
I don't have a clue how to do it for the nominal data. For the numerical features I can use cosine similarity, but how can I do it for the nominal ones?
Thanks in advance, everybody!

While I don't want to recommend this approach, it seems to be very popular:
encode your categories as binary attributes. i.e.:
A1=Car -> (1,0,0)
A1=Truck -> (0,1,0)
A1=Bike -> (0,0,1)
Then you can continue as you would with text; this is effectively the same as treating them as three different words.
It will work, but IMHO there is just no notion of "correlation" outside of continuous numerical values. Even on text it is more of a hack to make things work than a good approach.
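As a minimal sketch of the binary encoding described above (the helper names one_hot and cosine are made up for illustration, not taken from any library), the categories can be turned into vectors and compared like this:

```python
import math

def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

CATEGORIES = ["Car", "Truck", "Bike"]

car = one_hot("Car", CATEGORIES)      # [1.0, 0.0, 0.0]
truck = one_hot("Truck", CATEGORIES)  # [0.0, 1.0, 0.0]

same = cosine(car, one_hot("Car", CATEGORIES))  # identical categories
different = cosine(car, truck)                  # different categories
```

For a single nominal attribute this reduces to "match or no match" (1.0 or 0.0), which is one way to see why it is not a correlation in the usual sense. The encoding only becomes informative once several encoded attributes are concatenated into one vector.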

Related

Relation between two texts with different tags

I'm currently having a problem with the conception of an algorithm.
I want to create a WYSIWYG editor that goes along the current [bbcode] editor I have.
To do that, I use a div with contenteditable set to true for the WYSIWYG editor and a textarea containing the associated bbcode. So far, no problem. My concern is that if a user wants to add a tag (for example, the [b] tag), I need to know where they want to include it.
For that, I need to know exactly where in the bbcode I should insert the tags. I thought of comparing the two texts (one with html tags like <span>, the other with bbcode tags like [b]), and that's where I'm struggling.
I did some research but couldn't find anything that would help me, or I did not understand it correctly (maybe I searched for the wrong thing). What I could find is the Jaccard index, but I don't really know how to make it work correctly.
I also thought of an alternative. I could take the code in the WYSIWYG editor before the cursor location and split it every time I encounter an HTML tag. That way, in the bbcode editor, I can search for the first fragment, then search for the second fragment starting at the last index found, and so on until I reach the place the cursor is pointing at.
I'm not sure if it would work, and I find that solution a bit dirty. Am I totally wrong or should I do it this way?
Thanks for the help.
A popular way of determining the level of similarity between two texts is to compute the Jaccard similarity you mentioned. Citing Wikipedia:
The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A,B) = |A ∩ B| / |A ∪ B|
If you have a large number of texts, though, computing the exact Jaccard index for every possible pair of texts is computationally very expensive. The index can instead be approximated by minhashing. It applies several (e.g. 100) independent hash functions to each set and keeps, for each function, the minimum hash value over the set's elements; the list of these minima is the set's signature. This process has the nice property that, for a single hash function, the probability (over all permutations) that the two sets get the same minimum value, T1 = T2, is the same as J(A,B), so the fraction of matching signature positions estimates the index.
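A rough sketch of minhashing, next to the exact Jaccard index for comparison. The seeded SHA-1 hash family, the whitespace tokenization, and the function names are assumptions made for this illustration and are not part of the original answer:

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(token, seed):
    """One member of a family of hash functions, selected by `seed` (an assumption)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(tokens, num_hashes=100):
    """For each hash function keep the minimum hash value over the (non-empty) token set."""
    tokens = set(tokens)
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

text_a = "the quick brown fox jumps over the lazy dog".split()
text_b = "the quick brown cat jumps over the lazy dog".split()

exact = jaccard(text_a, text_b)  # 7 shared words out of 9 distinct words
approx = estimate_jaccard(minhash_signature(text_a), minhash_signature(text_b))
```

The signatures have a fixed length regardless of text size, so comparing two texts costs num_hashes comparisons instead of a full set intersection. The estimate gets tighter as num_hashes grows.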
Another way to cluster similar texts (or any other data) together is to use Locality Sensitive Hashing, which by itself is an approximation of what KNN does and is usually worse than that, but is definitely faster to compute. The basic idea is to project the data into a low-dimensional binary space (that is, each data point is mapped to an N-bit vector, the hash key). Each hash function h must satisfy the locality-sensitive hashing property prob[h(x)=h(y)]=sim(x,y), where sim(x,y) in [0,1] is the similarity function of interest. For dot products it can be visualized as follows:
We can now ask what the hash of the indicated point would be (in this case it's 101); everything that is close to this point has the same hash.
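One common way to get such N-bit keys from dot products is random-hyperplane hashing, where each bit records on which side of a random hyperplane the point lies. The sketch below assumes that particular hash family; the function names and the 2-D points are made up for illustration:

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random normal vector per hash bit (seed fixed for reproducibility)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_key(point, hyperplanes):
    """Bit i is '1' if the dot product with hyperplane i is non-negative, else '0'."""
    bits = []
    for normal in hyperplanes:
        dot = sum(p * n for p, n in zip(point, normal))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

planes = make_hyperplanes(num_bits=3, dim=2)

key = lsh_key([1.0, 2.0], planes)
scaled_key = lsh_key([2.0, 4.0], planes)      # same direction, so same side of every hyperplane
opposite_key = lsh_key([-1.0, -2.0], planes)  # opposite direction, so the signs flip
```

For this family, two points agree on a given bit with probability 1 - angle/pi, so points separated by a small angle usually, but not always, land in the same bucket. In practice several independent tables are used to reduce misses.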
EDIT to answer the comment
No, you asked about text similarity, so that is what I answered. What you are basically asking is how to predict the position of a character in text 2. It depends on whether you analyze the writer's style or just pure syntax. In either case, IMHO you need some sort of statistics that will tell you where this character is likely to occur given all the other data/text. You can go with n-grams, RNNs, LSTMs, Markov chains or any other form of sequential data analysis.

How does "Addressing missing data" help KNN function better?

Source: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
This page has a section containing the following passage:
Best Prepare Data for KNN
Rescale Data: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.
Address Missing Data: Missing data will mean that the distance between samples cannot be calculated. These samples could be excluded or the missing values could be imputed.
Lower Dimensionality: KNN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space.
Please, can someone explain the Second point, i.e. Address Missing Data, in detail?
Missing data in this context means that some samples do not have all the existing features.
For example:
Suppose you have a database with age and height for a group of individuals.
This would mean that for some persons either the height or the age is missing.
Now, why does this affect KNN?
Given a test sample, KNN finds the samples that are closest to it (i.e. the individuals with similar age and height).
KNN does this to make some inference about the test sample based on its nearest neighbors.
If you want to find these neighbors, you must be able to compute the distance between samples. To compute the distance between 2 samples you must have all the features for these 2 samples.
If some of them are missing, you won't be able to compute the distance.
So implicitly you would be losing the samples with missing data.
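A small sketch of the problem and of the two remedies named in the quoted passage (excluding incomplete samples, or imputing the missing values). The age/height numbers are invented, and mean imputation is just one simple choice among many:

```python
import math

def euclidean(a, b):
    """Euclidean distance; raises TypeError if any feature is None."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def impute_mean(samples):
    """Replace each None by the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    num_features = len(samples[0])
    means = []
    for j in range(num_features):
        observed = [row[j] for row in samples if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if row[j] is None else row[j] for j in range(num_features)]
        for row in samples
    ]

# (age in years, height in cm); None marks a missing value
people = [[25, 180.0], [30, None], [35, 170.0]]

try:
    euclidean(people[0], people[1])
    distance_failed = False
except TypeError:
    distance_failed = True  # the distance is undefined when a feature is missing

complete_only = [row for row in people if None not in row]  # option 1: exclude
imputed = impute_mean(people)                               # option 2: impute
distance_after_imputation = euclidean(imputed[0], imputed[1])
```

Excluding samples shrinks the pool of candidate neighbors, while imputation keeps every sample at the cost of using an estimated value in the distance.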

Find the probability that a given data set is bad

I have an issue with a data set. It has a good and a bad category, and a few elements can belong to both the good and the bad category.
You can see the Venn diagram I attached to get a view of the data set I have. I would be really glad if you could help me out.
I am really new to probability and math, yet I have a project in which I have to find a way to say whether a given data set is bad or good depending on its data.
What probability theory can I use?
How do I use it? Please give an example using my data set. Thank you.
E.g. if I get a data set containing the elements A, D and E, with what probability can I say it is bad?
A function which gives a good / bad result is called a classification function. For any data set, there are many ways to construct a classification function. See, for example, "Pattern Recognition and Neural Networks" by Brian Ripley.
One way which is easy to understand is the so-called quadratic discriminant. It is easy to describe: (1) construct a Gaussian density for each category (good, bad, etc). (2) output the category for which a new input has the greatest probability.
For (1), just compute the mean and covariance matrix of the data in each category. That gives you p(x | category).
For (2), choose the category for which p(category | x) is greatest. Note p(category | x) = p(x | category) p(category) / sum_i (p(x | category_i) p(category_i)), where p(category) is just (number of data in category) / (number of all data). If you work with logarithms, you can simplify the calculations somewhat.
Such a function can be constructed in a very few lines of a programming language which has matrix operations, such as Octave or R.
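The two steps can be sketched with NumPy as follows. This is an illustration under assumptions: the two-dimensional numeric features and the good/bad labels are invented (the question's own data set is not reproduced here), and fit_qda, log_gaussian and posterior are hypothetical helper names:

```python
import numpy as np

def fit_qda(X, y):
    """Step (1): estimate a prior, mean and covariance matrix for each category."""
    model = {}
    for label in sorted(set(y.tolist())):
        rows = X[y == label]
        model[label] = {
            "log_prior": np.log(len(rows) / len(X)),
            "mean": rows.mean(axis=0),
            "cov": np.cov(rows, rowvar=False),
        }
    return model

def log_gaussian(x, mean, cov):
    """Log of the multivariate Gaussian density p(x | category)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet + maha)

def posterior(model, x):
    """Step (2): p(category | x) via Bayes' rule, computed with logarithms."""
    labels = list(model)
    log_joint = np.array([
        model[l]["log_prior"] + log_gaussian(x, model[l]["mean"], model[l]["cov"])
        for l in labels
    ])
    log_joint -= log_joint.max()  # subtract the max for numerical stability
    probs = np.exp(log_joint)
    probs /= probs.sum()
    return dict(zip(labels, probs.tolist()))

# Invented training data: four "good" and four "bad" samples with two features each
X = np.array([[1.0, 1.2], [0.8, 0.9], [1.3, 0.7], [1.1, 1.4],
              [4.0, 3.8], [3.7, 4.2], [4.3, 4.1], [3.9, 3.5]])
y = np.array(["good"] * 4 + ["bad"] * 4)

model = fit_qda(X, y)
p_near_good = posterior(model, np.array([1.0, 1.0]))
p_near_bad = posterior(model, np.array([4.0, 4.0]))
```

The predicted category is the one with the largest posterior, e.g. max(p_near_good, key=p_near_good.get). Each category needs enough samples for its covariance matrix to be invertible.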

most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.
Question: which of these measures would be most relevant for detecting two different person names that are actually the same (different only because of a misspelling)? The trick is that it should minimize false positives. Example:
Obaama
Obama
=> should probably be merged
Obama
Ibama
=> should not be merged.
This is just an oversimplified example. Are there programmers and computer scientists who have worked out this issue in more detail?
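For reference, a plain Levenshtein distance (one standard edit distance, written out here from scratch as a sketch) assigns the same value to both example pairs, which illustrates why the raw distance alone cannot separate them:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]

d_should_merge = levenshtein("Obaama", "Obama")  # one deletion
d_should_not = levenshtein("Obama", "Ibama")     # one substitution
```

Both distances are 1, so any rule that merges the first pair on distance alone would also merge the second.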
I can suggest an information-retrieval technique for doing so, but it requires a large collection of documents in order to work properly.
Index your data using standard IR techniques. Lucene is a good open source library that can help you with it.
Once you get a name (Obaama, for example), retrieve the set of documents the word Obaama appears in. Let this set be D1.
Now, for each word w in D1 [1], search for Obaama AND w (using your IR system). Let the resulting set be D2.
The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and most likely will be close to 1 for w=Obama [2].
You can manually label a set of examples and find the threshold value above which words are expected to be misspellings.
Using a standard lexicographical similarity technique you can choose to filter out words that are definitely not spelling mistakes (like Barack).
Another solution that is often used requires a query log: find correlations between searched words; if obaama is correlated with obama in the query log, they are connected.
[1]: You can improve performance by doing the second (lexicographical) filter first and checking only candidates that are "similar enough" lexicographically.
[2]: Usually a normalization is also used, because more frequent words are more likely to appear in the same documents as any word, regardless of whether they are related.
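The |D2|/|D1| score can be sketched on a toy in-memory collection. This stands in for a real index such as Lucene, whose API is not shown here; the documents and the function name cooccurrence_score are invented for illustration:

```python
def cooccurrence_score(documents, name, candidate):
    """Return |D2| / |D1|: the share of documents containing `name`
    that also contain `candidate`. Each document is a set of tokens."""
    d1 = [doc for doc in documents if name in doc]
    if not d1:
        return 0.0
    d2 = [doc for doc in d1 if candidate in doc]
    return len(d2) / len(d1)

# Invented toy corpus, already tokenized and lowercased
documents = [
    {"obaama", "obama", "president"},
    {"obaama", "obama", "senate"},
    {"obaama", "barack"},
    {"obama", "barack", "president"},
]

score_obama = cooccurrence_score(documents, "obaama", "obama")    # 2 of 3 documents
score_senate = cooccurrence_score(documents, "obaama", "senate")  # 1 of 3 documents
```

On a real collection the candidates would first be narrowed with the lexicographical filter, and the score would be normalized by how frequent each candidate word is overall.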
You can check NerSim (demo) which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.

Algorithms for matching based on keywords intersection

Suppose we have buyers and sellers that are trying to find each other in a market. Buyers can tag their needs with keywords; sellers can do the same for what they are selling. I'm interested in finding algorithm(s) that rank-order sellers in terms of their relevance for a particular buyer on the basis of their two keyword sets.
Here is an example:
buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"}
and then we have two potential sellers that we need to rank order in terms of their relevance:
seller_keywords[1] = {"furry", "four legs", "arctic circle", "white"}
seller_keywords[2] = {"likes catnip", "furry",
"hates mice", "yarn-lover", "whiskers"}
If we just use the intersection of keywords, we do not get much discrimination: both intersect on 2 keywords. If we divide the intersection count by the size of the set union, seller 2 actually does worse because of the greater number of keywords. This would seem to introduce an automatic penalty for any method that does not correct for keyword set size (and we definitely don't want to penalize adding keywords).
To put a bit more structure on the problem, suppose we have some truthful measure of intensity of keyword attributes (which have to sum to 1 for each seller), e.g.:
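The two naive scores from this paragraph, computed on the example keyword sets (a sketch; the function names are made up):

```python
buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"}
seller_keywords = {
    1: {"furry", "four legs", "arctic circle", "white"},
    2: {"likes catnip", "furry", "hates mice", "yarn-lover", "whiskers"},
}

def overlap(buyer, seller):
    """Number of keywords the buyer and seller share."""
    return len(buyer & seller)

def overlap_over_union(buyer, seller):
    """Intersection count divided by the size of the set union."""
    return len(buyer & seller) / len(buyer | seller)

overlaps = {s: overlap(buyer_keywords, kw) for s, kw in seller_keywords.items()}
ratios = {s: overlap_over_union(buyer_keywords, kw) for s, kw in seller_keywords.items()}
# overlaps: both sellers share 2 keywords with the buyer
# ratios: seller 1 gets 2/6, seller 2 gets 2/7
```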
seller_keywords[1] = {"furry":.05,
"four legs":.05,
"arctic circle":.8,
"white":.1}
seller_keywords[2] = {"likes catnip":.5,
"furry":.4,
"hates mice":.02,
"yarn-lover":.02,
"whiskers":.06}
Now we could sum up the value of hits: so now Seller 1 only gets a score of .1, while Seller 2 gets a score of .9. So far, so good, but now we might get a third seller with a very limited, non-descriptive keyword set:
seller_keywords[3] = {"furry":1}
This catapults them to the top for any hit on their sole keyword, which isn't good.
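The weighted-hit scoring, including the degenerate third seller, can be reproduced like this (a sketch using the intensities given above; weighted_hits is a made-up name):

```python
buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"}

seller_weights = {
    1: {"furry": 0.05, "four legs": 0.05, "arctic circle": 0.8, "white": 0.1},
    2: {"likes catnip": 0.5, "furry": 0.4, "hates mice": 0.02,
        "yarn-lover": 0.02, "whiskers": 0.06},
    3: {"furry": 1.0},
}

def weighted_hits(buyer, weights):
    """Sum the seller's intensities over the keywords the buyer also lists."""
    return sum(w for keyword, w in weights.items() if keyword in buyer)

scores = {s: weighted_hits(buyer_keywords, w) for s, w in seller_weights.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
# scores are roughly 0.1, 0.9 and 1.0, so seller 3 ranks first
```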
Anyway, my guess (and hope) is that this is a fairly general problem and that there exist different algorithmic solutions with known strengths and limitations. This is probably something covered in CS101, so I think a good answer to this question might just be a link to the relevant references.
I think you're looking to use cosine similarity; it's a basic technique that gets you pretty far as a first hack. Intuitively, you create a vector where every tag you know about has a particular index:
terms[0] --> aardvark
terms[1] --> anteater
...
terms[N] --> zuckerberg
Then you create vectors in this space for each person:
person1[0] = 0 # this person doesn't care about aardvarks
person1[1] = 0.05 # this person cares a bit about anteaters
...
person1[N] = 0
Each person is now a vector in this N-dimensional space. You can then use cosine similarity to calculate the similarity between pairs of them. Computationally, this is basically the same as asking for the angle between the two vectors. You want a cosine close to 1, which means that the vectors are roughly collinear, i.e. that they have similar values for most of the dimensions.
To improve this metric, you may want to use tf-idf weighting on the elements in your vector. Tf-idf will downplay the importance of popular terms (e.g., 'iPhone') and promote the importance of unpopular terms that this person seems particularly associated with.
Combining tf-idf weighting and cosine similarity does well for most applications like this.
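A minimal sketch of tf-idf weighting plus cosine similarity on keyword sets. It assumes binary term frequency (a tag is either present or not) and the idf variant log(N/df) + 1, which is only one of several common definitions; the profiles reuse the buyer/seller example from the question, and all function names are made up:

```python
import math

def idf_weights(keyword_sets):
    """idf(t) = log(N / df(t)) + 1, where df(t) counts the sets containing t."""
    n = len(keyword_sets)
    vocabulary = set().union(*keyword_sets)
    return {
        t: math.log(n / sum(1 for s in keyword_sets if t in s)) + 1.0
        for t in vocabulary
    }

def tfidf_vector(keywords, idf):
    """Sparse vector as a dict; tf is 1 for every tag present (an assumption)."""
    return {t: idf[t] for t in keywords}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

buyer = {"furry", "four legs", "likes catnip", "has claws"}
sellers = {
    1: {"furry", "four legs", "arctic circle", "white"},
    2: {"likes catnip", "furry", "hates mice", "yarn-lover", "whiskers"},
    3: {"furry"},
}

# idf is estimated over all profiles; with so few of them it is only illustrative
idf = idf_weights([buyer, *sellers.values()])
buyer_vec = tfidf_vector(buyer, idf)
scores = {s: cosine(buyer_vec, tfidf_vector(kw, idf)) for s, kw in sellers.items()}
```

Here "furry" appears in every profile and therefore gets the smallest weight, while rarer tags count for more. With only four tiny profiles the resulting ranking should not be read as evidence for or against the method.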
What you are looking for is called taxonomy: tagging content and ordering it by relevance.
You may not find a ready-to-go algorithm, but you can start with a practical case: the Drupal documentation for taxonomy provides some guidelines, and you can check the sources of the search module.
Basically, the rank is based on term frequency. If a product is defined with a small number of tags, each of them will carry more weight. A tag which appears on only a few products' pages is very specific. You shouldn't define your words' intensity statically; instead, examine them in their context.
Regards
