How to classify a large collection of user entered company names? [closed] - algorithm

Our site allows users to enter the company they work for as free-form text.
Historically we have gathered a few million unique entries. Since we put no constraints on the input, we ended up with a lot of variations and typos (e.g. over 1000 distinct entries just for McDonald's).
We realized we could provide our users with a great feature if only we could tie these variations together. As a starting point, we compiled a clean list of companies from various online sources [Dictionary].
Now we're trying to find the best way to deal with the user data. We thought about assigning a similarity score by:
- comparing each entry with [Dictionary], calculating a lexical distance (possibly in Hadoop job)
- taking advantage of some search database (e.g. Solr)
and associating the user-entered text this way.
What we're wondering is: has anyone gone through a similar "classification" exercise and can share any tips?
Thanks,
Piotr

I'd use simple Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance).
A few million entries - you should be able to process them easily on one machine (no Hadoop or other heavyweight tools).
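A minimal sketch of that approach in Python, assuming a small in-memory dictionary of clean company names; the names, threshold, and helper functions below are illustrative, not from the question:

```python
# Minimal sketch: match free-form entries against a clean company dictionary
# using plain dynamic-programming Levenshtein distance. COMPANY_DICTIONARY
# and the max_distance threshold are illustrative assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            replace = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, replace))
        previous = current
    return previous[-1]

def best_match(entry: str, dictionary: list[str], max_distance: int = 3):
    """Return the closest dictionary name, or None if nothing is close enough."""
    entry = entry.strip().lower()
    best, best_dist = None, max_distance + 1
    for name in dictionary:
        d = levenshtein(entry, name.lower())
        if d < best_dist:
            best, best_dist = name, d
    return best

COMPANY_DICTIONARY = ["McDonald's", "Microsoft", "IBM"]
print(best_match("mc donalds", COMPANY_DICTIONARY))   # -> "McDonald's"
```

In practice you would precompute candidate buckets (e.g. by first letter or a phonetic key) so each entry is only compared against a small slice of the dictionary.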

Related

Index a large number of URLs as a directory structure [closed]

We have a scenario where we get tons of URLs from customers. The URLs are organized in arbitrary levels, like xxx.com/levelA/levelB/levelC/...levels.../xxxx. We are trying to use this data to build a query system that can answer which URLs sit under any given level. For example, getAll("abc.com/test/sub/") should return all recorded URLs that have "abc.com/test/sub/" as a prefix: abc.com/test/sub/a.data, abc.com/test/sub/sub2/data, etc.
This is similar to a file directory structure. My question is: is there an existing open-source project that can handle such a scenario? The requirements are:
A real-time system.
High write/read throughput.
Distributed and reliable.
Some questions you didn't answer:
What is a high write/read throughput? Anything an RDBMS couldn't handle? What is your expected ratio of reads to writes?
Why do you want to have a distributed system? Any particular reason?
How long are the strings, on average and at maximum?
Are you sure a simple MySQL or PostgreSQL setup, or a commercial database (Oracle, SQL Server, ...), won't be enough?
Here is a question about MySQL varchar index length. I've encountered the same 255-character limitation in SQL Server as well, so I assume similar restrictions exist for other RDBMSs. However, nothing is easier than simply running:
SELECT url FROM url_list WHERE url like 'abc.com/test/sub/%'
There is also MongoDB, which can easily be distributed and allows regular expressions in queries. Together with an index you could perform a request similar to the SQL above. You would need to benchmark this specific case yourself to see whether there is a performance difference and which system comes out ahead.
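As a rough illustration of that MongoDB approach, here is a hedged pymongo sketch of an anchored (prefix) regex query; the database, collection, and field names ("urldb", "urls", "url") are assumptions:

```python
# Hedged sketch of the MongoDB approach described above, using pymongo.
# Database/collection/field names are assumptions, not from the question.
import re
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
urls = client["urldb"]["urls"]

# An index on the url field lets an anchored (prefix) regex use that index.
urls.create_index([("url", ASCENDING)])

def get_all(prefix: str):
    """Return all stored URLs that start with the given prefix."""
    pattern = "^" + re.escape(prefix)
    return [doc["url"] for doc in urls.find({"url": {"$regex": pattern}})]

# Mirrors getAll("abc.com/test/sub/") from the question:
# get_all("abc.com/test/sub/")
```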
Otherwise, there are still Couchbase and CouchDB, which offer views; these are essentially made for something like this, since they are built via MapReduce. However, those views take a few seconds up to a minute to update, so they aren't a good fit if you want to query for a URL right after inserting it.

Looking for product reviews dataset [closed]

I'm working on a school project on product analysis based on sentiment analysis. I've been looking for a training dataset for quite some time now, and what I've found so far is a dataset of movie reviews. My question is: can I use this dataset for training the classifier, i.e. will it affect the accuracy of classification? If so, does anyone here know where I can get a free dataset of product reviews?
I am assuming you are using some textual model like the bag of words model.
From my experiments, you usually don't get good results when moving from one domain to another (even if both the training set and the test set are products, just of different categories!).
Think of it logically: an oven that gets hot quickly usually indicates a good product. Is the same true for laptops?
When I experimented with this a few years ago I used Amazon comments both as the training set and to test my algorithms.
The comments are short and informative and were enough to get ~80% accuracy. The ground truth was the star rating, where 1-2 stars were 'negative', 3 stars 'neutral', and 4-5 stars 'positive'.
I used a Perl script from esuli.it to crawl Amazon's comments.
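As a minimal illustration of that setup (bag-of-words features, labels derived from star ratings), here is a hedged Python/scikit-learn sketch; the reviews.csv file and its column names are assumptions:

```python
# Minimal sketch of the approach above: bag-of-words features, labels derived
# from star ratings (1-2 = negative, 3 = neutral, 4-5 = positive).
# The input file "reviews.csv" with columns "text" and "stars" is an assumption.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def stars_to_label(stars: int) -> str:
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

df = pd.read_csv("reviews.csv")            # columns: text, stars
labels = df["stars"].apply(stars_to_label)

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], labels, test_size=0.2, random_state=42)

vectorizer = CountVectorizer(min_df=2)     # bag-of-words features
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

predictions = clf.predict(vectorizer.transform(X_test))
print("accuracy:", accuracy_score(y_test, predictions))
```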

Journal / Proceeding about comparing the similarity of 2 images? [closed]

I am a newbie in the Matlab field, and I want to learn more about methodologies for comparing 2 images to determine the similarity between them.
I need more information from international journals, international proceedings, books or other reports that describe this.
I will use it as my literature study.
Are there any suggestions for journals, books or proceedings that have discussed this? If so, please include their titles and links.
Thank you for your attention.
For journals I would recommend the IEEE Transactions on Image Processing:
http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=83
This is a good general intro from MIT:
http://www.mit.edu/~ka21369/Imaging2012/tannenbaum.pdf
You need to define "similarity" better.
In the image compression sense, similarity is a function of the pixel-wise difference between the images (PSNR, and other metrics).
In a computer vision sense, you would want to see if the two images contain similar content such as objects or scenes. I would recommend using Google Scholar for that.
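If you go the pixel-wise route, here is a minimal PSNR sketch in Python/NumPy (the question asks about MATLAB, but the formula is the same); the example images are random placeholders:

```python
# Illustration of the pixel-wise similarity (PSNR) mentioned above, using NumPy.
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two same-sized images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10((max_value ** 2) / mse)

# Example with two random 8-bit grayscale images:
a = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
b = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print(psnr(a, b))
```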

Google N-Gram Web API [closed]

I wish to use Google 2-grams for my project, but the data size makes searching expensive in terms of both speed and storage.
Is there a Web API available for this purpose (in any language)? The website http://books.google.com/ngrams/graph renders an image; can I get the data values instead?
Well, I found a roundabout way of doing this, using Google BigQuery.
There, the trigrams are available as a public dataset. Command-line access did the job for me.
I found a great alternative: Microsoft Web N-Gram
It can be queried in different ways, including a straightforward GET call through the REST interface.
For instance, calling the URL:
http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp?u={YOUR_TOKEN}&p=red+panda
returns
-9.005
which is the log likelihood of the phrase red panda.
Furthermore, it is handier than Google N-Grams: for a given phrase it does not simply output an absolute frequency, but can return its joint probability, its conditional probability, and even the most likely words that follow.
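A small sketch of that GET call in Python with the requests library; the URL pattern is taken from the answer above, YOUR_TOKEN stays a placeholder, and the plain-text response handling is an assumption based on the example output shown:

```python
# Sketch of the REST GET call quoted above, using requests.
# The URL pattern comes from the answer; treating the response body as a
# single plain-text log probability is an assumption based on the -9.005 example.
import requests

def joint_log_probability(phrase: str, token: str) -> float:
    url = "http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp"
    response = requests.get(url, params={"u": token, "p": phrase})
    response.raise_for_status()
    return float(response.text)

# print(joint_log_probability("red panda", "YOUR_TOKEN"))   # e.g. -9.005
```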
Disclaimer: I am not a Microsoft employee, I simply think that I just found an awesome service.

Are there any online user group meetings? [closed]

I must admit that I am incredibly jealous of those developers who happen to live near active user groups (e.g. the ALT.NET guys in Austin). I often read blog posts and listen to podcasts that reference these in-person meetings and find myself wishing that I could sit in and participate as well. But it just isn't realistic to fly across the country to meet a few guys for a couple hours in a pub to talk about patterns and practices.
So I was wondering if there is a similar discussion forum for those who don't happen to live near an active user group. After all, blogs and books only go so far, and for the most part are a one-way avenue of communication. True, you can use comments, e-mails, tweets, and IM to get some interaction, but there is something to be said for face-to-face, real-time interaction that gets lost in all of these mediums.
I guess what I'm looking for is some sort of video-conferencing deal where people who share an interest in a specific field of software development can get together to talk and interact without having to live right next door to each other. Does anything like this exist?
There's a .NET user group in Second Life. Of course, this depends on how you feel about Second Life.
I haven't had a chance to check it out, but the Virtual ALT.NET group sounds promising.
