Automatically linking categories to each other when categorizing text - algorithm

I've been working on a project to data-mine a large number of short texts and categorize them based on a pre-existing large list of category names. To do this I had to figure out how to first create a good text corpus from the data, in order to have reference documents for the categorization, and then to get the quality of the categorization up to an acceptable level. That part I am finished with (luckily, categorizing text is something a lot of people have done a lot of research into).
Now for my next problem: I'm trying to figure out a good way of linking the various categories to each other computationally. That is to say, to figure out how to recognize that "cars" and "chevrolet" are related in some way. So far I've tried utilizing the N-gram categorization methods described by, among others, Cavnar and Trenkle for comparing the various reference documents I've created for each category. Unfortunately, the best I've been able to get out of that method is approximately 50-55% correct relations between categories, and those are only the strongest relations; overall it's around 30-35%, which is miserably low.
I've tried a couple of other approaches as well, but I've been unable to get much higher than 40% relevant links. (An example of a non-relevant relation would be the category "trucks" being strongly related to the category "makeup" or the category "diapers" while weakly, or not at all, related to "chevy".)
Now, I've tried looking for better methods for doing this, but it just seems like I can't find any (yet I know others have done better than I have). Does anyone have any experience with this? Any tips on usable methods for creating relations between categories? Right now the methods I've tried either don't give enough relations at all or contain far too high a percentage of junk relations.

Obviously, the best way of doing this matching depends heavily on your taxonomy, the nature of your "reference documents", and the kinds of relationships you expect to be created.
However, based on the information provided, I'd suggest the following:
Start by building a word-based (rather than letter-based) unigram or bigram model for each of your categories, based on the reference documents. If there are only a few of these per category (it seems you might have only one), you could use a semi-supervised approach and also throw in the automatically categorized documents for each category. A relatively simple tool for building the models is the CMU SLM toolkit.
Calculate the mutual information (infogain) of each term or phrase in your model with respect to the other categories. If your categories are similar, you might need to use only neighboring categories to get meaningful results. This step gives the best-separating terms higher scores.
Correlate the categories to each other based on the top-infogain terms or phrases. This can be done either by using Euclidean or cosine distance between the category models, or by using somewhat more elaborate techniques, such as graph-based algorithms or hierarchical clustering.
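For what it's worth, here is a minimal Python sketch of those three steps. I'm using a chi-squared score as a simple stand-in for the full mutual-information computation, and the documents, term cutoff, and threshold are all made up for illustration:

    # A minimal sketch of the three steps above. The chi-squared score is a
    # simple stand-in for infogain; all data and cutoffs are illustrative.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical data: one reference document per category.
    reference_docs = {
        "cars": "engine wheel sedan chevrolet horsepower dealership",
        "chevrolet": "chevy camaro corvette engine dealership",
        "makeup": "lipstick mascara foundation eyeliner",
    }
    categories = list(reference_docs)

    # Step 1: word-based unigram/bigram model per category.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(reference_docs[c] for c in categories)

    # Step 2: score how well each term separates the categories and
    # keep only the top-scoring terms (500 here, or fewer if unavailable).
    scores, _ = chi2(X, np.arange(len(categories)))
    top = np.argsort(np.nan_to_num(scores))[-500:]
    X_top = X[:, top]

    # Step 3: correlate categories via cosine similarity of the models.
    sim = cosine_similarity(X_top)
    for i, a in enumerate(categories):
        for j, b in enumerate(categories):
            if i < j and sim[i, j] > 0.1:  # arbitrary threshold
                print(f"{a} <-> {b}: {sim[i, j]:.2f}")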

Related

How are keyword clouds constructed?
I know there are a lot of NLP methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use NLP methods to detect proper nouns, people, places, and possibly subjects. This will be a very large list given a sufficiently sized article, but I will assume that I can winnow the list down using some method of comparing articles. How to do this properly is what I am confused about.)
Each item can have a list of keywords, but how do they pick keywords such that the keywords aren't overly specific or overly general between each item?
For example, trivially, "the" can be a keyword that is in a lot of items.
While "supercalifragilistic" could only be in one.
I suppose I could create a heuristic: if a word exists in n% of the items, where n is sufficiently small but still returns a nice sublist (say, 5% of 1000 articles is 50, which seems reasonable), then I could just use that. However, the issue I take with this approach is that given two entirely different sets of items, there is most likely some difference in interrelatedness between the items, and I'm throwing away that information.
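(To make that concrete, the heuristic I have in mind would look roughly like this; the 5% threshold and all names are made up:)

    # A rough sketch of the n% document-frequency heuristic described above.
    from collections import Counter

    def candidate_keywords(items, max_df_fraction=0.05):
        """Keep words that appear in fewer than max_df_fraction of the items."""
        doc_freq = Counter()
        for words in items:  # items: a list of sets of words, one per article
            doc_freq.update(set(words))
        cutoff = max_df_fraction * len(items)
        return {w for w, df in doc_freq.items() if df <= cutoff}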
This is very unsatisfying.
I feel that given the popularity of keyword clouds, there must already be a solution to this. I don't want to use a library, however, as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is OK, by the way, but the issue is that the weighting needs to be determined a priori. Given that two distinct collections of documents will have a different inherent similarity between documents, assuming an a priori weighting does not feel correct.
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but still considering.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the IDF vector, as you say, determined a priori; it is estimated from Wikipedia). A popular alternative is the TextRank algorithm, based on PageRank, which has an off-the-shelf implementation in Gensim.
If you decide on your own implementation, note that all of these algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is remove stopwords that you know can never be keywords (prepositions, articles, pronouns, etc.). If you want something fancier, you can use, for instance, spaCy to keep only the desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (Gensim has a good function for automatic collocation detection) and named entities (spaCy can do this). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvement.
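For example, a bare-bones tf-idf extractor along these lines might look like this (a sketch only; here the IDF is estimated from the input collection itself rather than from Wikipedia, and all names are illustrative):

    # A minimal tf-idf keyword extractor sketch.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def top_keywords(documents, k=5):
        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(documents)
        terms = vectorizer.get_feature_names_out()
        keywords = []
        for row in tfidf:
            weights = row.toarray().ravel()
            best = weights.argsort()[::-1][:k]  # indices of the k highest scores
            keywords.append([terms[i] for i in best if weights[i] > 0])
        return keywords

Stopword removal is handled by the vectorizer here; part-of-speech filtering, collocations, and coreference resolution would all slot in as preprocessing before the vectorizer is fit.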

Detecting Similarity in Strings

If I search for something on Google News, I can click on the "Explore in depth" button and get the same news article from multiple sources. What kind of algorithm is used to compare articles of text and then determine that they are about the same thing? I have seen the question here:
Is there an algorithm that tells the semantic similarity of two phrases
However, I feel that if there were articles that were similar in nature but about different stories, the methods mentioned there would group them together. Is there a standard way of detecting strings that are about the same thing and grouping them, while keeping strings that are merely similar separate? E.g., if I search "United States Border" I might get stories about problems at the USA's border, but what would prevent these from all getting grouped together? All I can think of is the date of publication, but what if many stories were published very close to each other?
One standard way to determine the similarity of two articles is to create a language model for each of them and then measure the similarity between the models.
The language model is usually a probability function, built by assuming the article was created by a model that randomly selects tokens (words/bigrams/.../n-grams).
The simplest language model is over unigrams (words): P(w|d) = count(w, d) / |d|, i.e., the number of times the word appears in the document relative to the total length of the document. Smoothing techniques are often used to prevent unseen words from having zero probability.
After you have a language model, all you have to do is compare the two models. One way to do this is with cosine similarity or Jensen-Shannon divergence.
This gives you an absolute similarity score for two articles. It can be combined with many other signals, like your suggestion to compare publication dates.
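A minimal sketch of this approach, using add-one-smoothed unigram models compared with Jensen-Shannon divergence (the smoothing choice and all names are illustrative, not prescribed above):

    # Smoothed unigram language models + Jensen-Shannon divergence
    # (lower divergence = more similar).
    import math
    from collections import Counter

    def unigram_model(text, vocab, alpha=1.0):
        """P(w|d) with add-alpha smoothing over a shared vocabulary."""
        counts = Counter(text.lower().split())
        total = sum(counts.values()) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def js_divergence(p, q):
        def kl(a, b):
            return sum(a[w] * math.log(a[w] / b[w]) for w in a)
        m = {w: 0.5 * (p[w] + q[w]) for w in p}
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    doc1 = "border agents stopped trucks at the border"
    doc2 = "trucks stopped at the border on tuesday"
    vocab = set(doc1.split()) | set(doc2.split())
    print(js_divergence(unigram_model(doc1, vocab), unigram_model(doc2, vocab)))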

Algorithm to recognize keywords' categories in a One-search-box-for-all model query

I'm aiming to provide a one-search-box-for-everything model in a search engine project, like LinkedIn's.
I've tried to express my problem using an analogy.
Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.
Some sample queries:
"information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}
"ACM paper by authoABC on design patterns" : three dimensions {conf-name, author, topic}
"Multi-threaded programming at javaranch" : two dimensions {topic, website}
I have to identify those dimensions and the corresponding keywords in a big query before I can retrieve the final results from the database.
Points
I have access to all the possible values for all the dimensions. For example, I have all the conference names, author names, etc.
There's very little overlap of terms across dimensions.
My approach (naive)
Using Lucene, index all the keywords in each dimension with a dedicated field called "dimension" and another field with the actual value.
Ex:
1) {name:IEEE, dimension:conference}, etc.
2) {name:ooad, dimension:topic}, etc.
3) {name:xyz, dimension:author}, etc.
Search the index with the query as-is.
Iterate through the results up to some point and recognize the first document with a new dimension.
Problems
Not sure when to stop recognizing dimensions from the result set. For example, the query may contain only two dimensions, but the results may match three dimensions.
If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.
References to papers, articles, or pointing-out the right terminology that describes my problem domain, etc. would certainly help.
Any guidance is highly appreciated.
Solution 1: How about solving your problem with Named Entity Recognition (NER) from natural language processing? NER can be done with simple regular expressions (in cases where the data is fairly static), or you can use a machine learning technique such as Hidden Markov Models to actually figure out the named entities in your sequence data. The reason I stress HMMs over other supervised machine learning algorithms is that you have sequential data, with each state dependent on the previous or next state. NER would output the dimensions for you along with the corresponding names. After that, your search becomes a vertical search problem: you can simply search for the identified words in different Solr/Lucene fields and set your boosts accordingly.
Now, coming to the implementation part: I assume you know Java, as you are working with Lucene, so Mahout is a good choice. Mahout has an HMM built in, and you can train and test the model on your data set. I am also assuming you have a large data set.
Solution 2: Try to model this as a property graph problem. Check out something like Neo4j. I suggest this because your problem falls in a schemaless domain: your schema is not fixed, and the problem can very well be modelled as a graph where each node is a set of key-value pairs.
Solution 3: Since you said you have all the possible values for each dimension, why not, before anything else, simply convert your unstructured text into structured data using regular expressions? Since you do not have a fixed schema, store the data in any NoSQL key-value database; most of them provide Lucene integrations for full-text search, and then you can simply search against that database.
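To illustrate Solution 3's idea, here is a rough Python sketch. It assumes, as the question states, that all possible values per dimension are known up front and that terms rarely overlap across dimensions; all names are made up:

    # Dictionary-based dimension tagging for a one-search-box query.
    known_values = {
        "conference": {"ieee", "acm", "javaranch"},
        "author": {"authorxyz", "authoabc"},
        "topic": {"information retrieval", "design patterns",
                  "multi-threaded programming"},
    }

    def tag_dimensions(query):
        """Map each recognized keyword in the query to its dimension."""
        q = query.lower()
        tags = {}
        for dimension, values in known_values.items():
            for value in values:
                if value in q:  # naive substring match; a real system
                    tags.setdefault(dimension, []).append(value)  # needs tokenization
        return tags

    print(tag_dimensions("ACM paper by authoABC on design patterns"))
    # {'conference': ['acm'], 'author': ['authoabc'], 'topic': ['design patterns']}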
What you need to do is calculate the similarity between the query and the document set you are searching in. Measures like cosine similarity should serve your need. A hack you can use is to calculate tf-idf for each document and build an index from those scores, from which you can choose the appropriate one. I would recommend looking into the Vector Space Model to find a method that serves your need.
Give this algorithm a look as well:
http://en.wikipedia.org/wiki/Okapi_BM25
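For reference, a bare-bones implementation of the BM25 scoring function from that link might look like this (a sketch; k1 and b are the usual defaults, and the +1 inside the log is the common variant that avoids negative IDF; all data is illustrative):

    # Okapi BM25 scoring over a small in-memory corpus.
    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.5, b=0.75):
        """Score each document (a list of tokens) against the query tokens."""
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter()
        for d in docs:
            df.update(set(d))  # document frequency of each term
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for q in query:
                if q in tf:
                    idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
                    norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
                    score += idf * tf[q] * (k1 + 1) / norm
            scores.append(score)
        return scores

    docs = [["ieee", "paper", "retrieval"], ["acm", "design", "patterns"]]
    print(bm25_scores(["design", "patterns"], docs))  # second doc scores higher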

Organizing large collection of things - usability and UI point of view

In our application we have a repository that contains things (they are called methods and queries, but this is not particularly relevant for this question). Each thing has a title, description (though some may lack both) and some other data. Users save things to repository and load and use things from repository.
I wonder what the best way is to organize the repository from a usability point of view. There seem to be two major approaches. The first approach is to put things in folders, subfolders and so on, in a hierarchical structure similar to a filesystem. The second approach (which has become fashionable) is to have a flat space and assign zero or more tags to each thing, so that users can view a list of things for a particular tag.
Currently we use flat space, tags and search. It appears to be somewhat unmanageable. I am not sure if switching to folders/subfolders will make it better.
I would like to learn more about the pros and cons of each approach and what properties of the collection and the things themselves suggest using one or another approach or a combination of both. If anybody can point me to some studies or discussions of those, I would really appreciate that.
There is no reason you can't use both methods. To some extent, finding things depends on what the thing is and why it is being looked for. A hierarchical design can work well when somebody knows what they are looking for, and a tag/keyword-based system can work better when the structure is less obvious.
A network structure that links similar things can also work very well, as you can see with the internet or Wikipedia.
I use the law of symmetry to help me in this situation.
First you build the tree-like structure in the back end, and then build the tagging system for the front end.
You use both to organize your data collection.
A tag cloud works better than a hierarchy if
the taxonomy is uncertain
("Now is this a small car or a large truck?")
there is no central authority for classification
there is no obvious or natural order between the classes
(cars can be classified by color or by size, there is no obvious rank between color and size)
new categories may be created on the fly
Otherwise, a hierarchy gives more confidence in completeness, as every item has exactly one obviously correct location: did I find all documents about birds? Is there really no document about five-story houses?
Tag clouds need some maintenance, and I am not sure this can be completely user-provided:
dealing with synonyms, tag synonyms, merging tags, clarifying tags (e.g., is "blue" a feeling or a color?).
Another option is attribute-value pairs. They can be built upon a well-maintained tag cloud, e.g., grouping "red / black / blue" tags under "color". They can also work with continuous values, and search can be extended to similar values when there are not enough results (such as age, date, or even multidimensional values like color).
However, this requires knowing ahead of time what search criteria users will need. If you have to introduce a new category, you need to re-tag the entire body of documents.
See also my request for clarification: what are the problems? Not enough tagging? Tagging too distinct? Users not finding what they are looking for? Users not confident in the search results?

What's the best way to classify a list of sites?

I have a list of X sites that I need to classify in some way. Is the site about cars, health, or products, or is it about everything (wikiHow, About.com, etc.)? What are some of the better ways to classify sites like this? Should I get the keywords that bring traffic to the site and use those? Should I read the content of some random pages and judge it off of that?
Well, if the site is well designed, there will be meta tags in the header specifically for this.
Yahoo has an API to extract terms: http://developer.yahoo.com/search/content/V2/termExtraction.html
"The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content. It is one of the technologies used in Y!Q."
Maybe I'm a bit biased (disclaimer: I have a degree in library science, and this topic is one of the reasons I got the degree), so the easiest answer is that there is no best way.
Consider this like you would database design -- once you have your system populated, what sort of questions are you going to ask of it?
Is the fact that the site is run by the government significant? Or that it uses Flash? Or that the pages are blue? Or that it's a hobbyist site? Or that the intended audience is children?
Then we get the question of whether we're going to have a hierarchical category for any of the facets we're concerned with: if it's about both cars and motorcycles, should we use the term 'vehicles' instead? And if we do that, will we use keyword expansion so that 'motorcycle' matches the broader terms (i.e., vehicles) as well?
So ... the point is ... figure out what your needs are, and work towards that. 'Best' will never come, even with years of refinement (if anything, it gets more difficult, as terms start changing meanings. Remember when 'weblog' was related to web server metrics?)
This is a tough question to answer. Consider:
How granular do you want your classification to be?
Do you want to classify sites based on your own criteria or the criteria provided by the sites? In other words, if a site classifies itself as "a premier source for motorcycle maintenance", do you want to create a "motorcycle maintenance" category just for that site? This, of course, will cause your list to become inconsistent. However, if you pigeon-hole the sites to follow your own classification scheme, there is a loss of information, and a risk that the site will not match any of the categories you've defined.
Do you allow subcategories? The problem becomes much more complicated if so.
Can a site belong to more than one category? If so, is there an ordering or a weight (i.e., primary category, secondary categories, etc.), or do you follow a scheme similar to SO's tags?
As an initial stab at the problem, I think I'd define a set of categories, and then spider each site, keeping track of the number of occurrences of each category name, or a mutation thereof. Then, you can choose the name that had the greatest number of "hits."
For instance, given the following categories:
{ "Cars", "Motorcycles", "Video Games" }
Spidering the following blocks of text from a site:
The title is an incongruous play on the title of the book Zen in the Art of Archery by Eugen Herrigel. In its introduction, Pirsig explains that, despite its title, "it should in no way be associated with that great body of factual information relating to orthodox Zen Buddhist practice. It's not very factual on motorcycles, either."
and:
Most motorcycles made since 1980 are pretty reliable if properly maintained but that's a big if. To some extent the high reliability of today's motorcycles has worked to the disadvantage of many riders. Some riders have been lulled into believing that motorcycles are like modern cars and require essentially no maintenance. This is not the case (even with cars). Modern bikes require less maintenance than they did in the 60's and 70's but they still need a lot more maintenance than a car. This higher reliability also means that there are a whole bunch of motorcyclists out there who haven't a clue how to work on their bikes or what really needs to be done to ensure reliability.
We get the following scores:
{ "Cars" : 3, "Motorcycles" : 4, "Video Games" : 0 }
And we can thus categorize the site as being related mostly to "Motorcycles".
Note that I said "mutations thereof" with regard to category names, so "motorcycle" or "car" are both detected. We can see from this that you should perhaps also consider using a list of related words. For instance, perhaps we should detect the word "motorcyclists" when searching for instances of "Motorcycles". Perhaps we should've seen "modern bikes", too.
You could also save those hits, perhaps combine them with some other data, and use Bayesian probability to determine which category the site is most likely to fit into.
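A rough sketch of that counting approach (the category keywords, their "mutations", and the sample text are all illustrative):

    # Count occurrences of each category's keyword variants in spidered text.
    import re

    categories = {
        "Cars": ["car", "cars"],
        "Motorcycles": ["motorcycle", "motorcycles", "motorcyclists",
                        "bike", "bikes"],
        "Video Games": ["video game", "video games"],
    }

    def score_site(text):
        """Count hits for each category's keyword variants in the text."""
        text = text.lower()
        return {
            category: sum(len(re.findall(r"\b" + re.escape(v) + r"\b", text))
                          for v in variants)
            for category, variants in categories.items()
        }

    sample = ("Most motorcycles made since 1980 are reliable; modern bikes "
              "need less maintenance than a car.")
    scores = score_site(sample)
    print(scores)  # {'Cars': 1, 'Motorcycles': 2, 'Video Games': 0}
    print(max(scores, key=scores.get))  # Motorcycles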
