I wish to use Google 2-grams for my project, but the data size makes searching expensive in terms of both speed and storage.
Is there a web API available for this purpose (in any language)? The website http://books.google.com/ngrams/graph renders an image; can I get the underlying data values?
Well, I found a roundabout way of doing this using Google BigQuery: the trigrams are available there as a public dataset, and command-line access did the job for me.
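For reference, here is a minimal sketch of the same kind of query from Python with the BigQuery client library. The table path bigquery-public-data.samples.trigrams and the ngram column are assumptions based on the public samples dataset, so verify them in the BigQuery console first.
# Minimal sketch: querying the public Google Books trigrams sample from Python.
# The table path and the `ngram` column are assumptions; check the schema first.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials/project
query = """
    SELECT ngram
    FROM `bigquery-public-data.samples.trigrams`
    WHERE ngram LIKE 'red panda%'
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.ngram)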
I found a great alternative: Microsoft Web N-Gram
It can be queried in different ways, including a straightforward GET call through the REST interface.
For instance, calling the URL:
http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp?u={YOUR_TOKEN}&p=red+panda
returns
-9.005
which is the log likelihood of the phrase red panda.
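The same call is easy to make from a script. Here is a minimal sketch in Python with the requests library, where YOUR_TOKEN is a placeholder and the plain-text log-likelihood response is assumed to look exactly like the example above.
# Minimal sketch: querying the Web N-Gram REST interface with Python's requests.
# YOUR_TOKEN is a placeholder; the plain-text response format is assumed from above.
import requests

url = "http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp"
params = {"u": "YOUR_TOKEN", "p": "red panda"}
response = requests.get(url, params=params)
print(float(response.text))  # e.g. -9.005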
Furthermore, it is handier than Google N-Grams: for a given phrase it does not simply output the absolute frequency, but can return its joint probability, conditional probability, and even the most likely following words.
Disclaimer: I am not a Microsoft employee; I simply think I found an awesome service.
I've been told about car plate image databases that are available on the web for free download, for developing image processing and automatic number plate recognition algorithms. Does anyone have a download link, or at least some keywords to search for on the web?
If this is not legal, or if there are any ethical issues, I would appreciate it if you let me know.
It's perfectly legal to do so, as long as the images are CC (Creative Commons) licensed or you have the website owner's permission.
A quick search for number plate image database yields some results:
Examples of test images
Academic Document (More examples of number plates)
Good small library of number plates.
We have a scenario where we get tons of URLs from customers. The URLs are organized in arbitrary levels, like xxx.com/levelA/levelB/levelC/...levels.../xxxx. We are trying to use this data to build a query system that can answer which URLs are under any given level. For example, getAll("abc.com/test/sub/") should return all recorded URLs that have "abc.com/test/sub/" as a prefix, such as abc.com/test/sub/a.data, abc.com/test/sub/sub2/data, etc.
This appears to be similar to a file directory structure. My question is: is there any existing open source project that can help handle such a scenario? The requirements are:
Real-time system.
High write/read throughput.
Distributed and reliable.
Some questions you didn't answer:
What is a high write/read throughput? Anything that an RDBMS couldn't handle? What is your expected ratio of reads vs. writes?
Why do you want to have a distributed system? Any particular reason?
How long are the strings on average, and what is the maximum length?
Are you sure a simple MySQL, PostgreSQL, or any other commercial database (Oracle, SQL Server, ...) won't be enough?
Here is a question about MySQL varchar index length. I've encountered the same 255-character limitation in SQL Server as well, so I assume similar restrictions exist for other RDBMSs. However, there is nothing easier than just calling
SELECT url FROM url_list WHERE url LIKE 'abc.com/test/sub/%'
There is also MongoDB, which can easily be distributed and allows regular expressions in queries. Together with an index, you could perform a request similar to the SQL above. You would need to benchmark this specific case yourself to see whether there is a performance difference, and in which system's favor.
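As a rough sketch of that approach, an anchored regular expression gives you the same prefix query in MongoDB and can use an index on the field; the database, collection, and field names below are made up for illustration.
# Rough sketch of the prefix query in MongoDB; the database, collection, and
# field names ("urldb", "urls", "url") are placeholders for illustration.
import re
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
urls = client.urldb.urls
urls.create_index([("url", ASCENDING)])  # anchored prefix regexes can use this index

prefix = "abc.com/test/sub/"
for doc in urls.find({"url": {"$regex": "^" + re.escape(prefix)}}):
    print(doc["url"])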
Otherwise, there are still Couchbase and CouchDB, which offer views that are basically made for something similar, since they are built via MapReduce. However, those take a few seconds, up to a minute, to be updated, so they aren't a good fit if you want to query for a URL right after you've inserted it.
I'm working on a school project on product analysis based on sentiment analysis. I've been looking for a training dataset for quite some time now, and all I've been able to find so far is a dataset of movie reviews. My question is: can I use this dataset for training the classifier, i.e., will it affect the accuracy of classification? If so, does anyone here know where I can get a free dataset of product reviews?
I am assuming you are using some textual model like the bag-of-words model.
In my experiments, you usually don't get good results when changing from one domain to another (even if the training set and the test set are both products, just of different categories!).
Think of it logically: an oven that gets hot quickly usually indicates a good product. Is the same true for laptops?
When I experimented with this a few years ago, I used Amazon reviews as both the training set and the test set for my algorithms.
The reviews are short and informative and were enough to get ~80% accuracy. The ground truth was the star rating, where 1-2 stars were 'negative', 3 stars 'neutral', and 4-5 stars 'positive'.
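A minimal sketch of that setup is below, assuming a list of (review_text, stars) pairs from your own crawl; the bag-of-words classifier here is just one reasonable choice, not the exact model used.
# Minimal sketch: map star ratings to sentiment labels and train a
# bag-of-words classifier. The (text, stars) pairs are placeholders and
# should come from your own crawl of product reviews.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def stars_to_label(stars):
    # 1-2 stars -> negative, 3 stars -> neutral, 4-5 stars -> positive
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

reviews = [
    ("Gets hot quickly, great oven", 5),
    ("Broke after a week, would not buy again", 1),
    ("Does the job, nothing special", 3),
    ("Fast shipping and works perfectly", 5),
]
texts = [text for text, _ in reviews]
labels = [stars_to_label(stars) for _, stars in reviews]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["broke right away, terrible product"]))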
I used a Perl script from esuli.it to crawl Amazon's reviews.
Our site allows users to enter the company they work for as a free-form text entry.
Historically we have gathered a few million unique entries. Since we put no constraints on the input, we ended up with a lot of variations and typos (e.g. over 1000 distinct entries just for McDonald's).
We realized we could provide our users with a great feature if only we could tie these variations together. As a starting point, we compiled a clean list of companies from various online sources [Dictionary].
Now, we're trying to find the best way to deal with the user data source. We thought about assigning some similarity score:
- comparing each entry with [Dictionary] and calculating a lexical distance (possibly in a Hadoop job)
- taking advantage of some search database (e.g. Solr)
and associating the user-entered text this way.
What we're wondering is: has anyone gone through a similar "classification" exercise and could share any tips?
Thanks,
Piotr
I'd use simple Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance).
A few million entries - you should be able to process them easily on one computer (no Hadoop or other heavyweight tools).
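A minimal sketch of that approach is below; the company names are placeholders, and in practice you would want to normalize case and punctuation first and only accept matches under some distance threshold.
# Minimal sketch: match each user entry to the closest dictionary name by
# Levenshtein distance. Names are placeholders; normalize the text and tune
# an acceptance threshold on real data.
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

dictionary = ["McDonald's", "Microsoft", "Google"]
user_entries = ["mcdonalds", "Mircosoft", "googel"]

for entry in user_entries:
    best = min(dictionary, key=lambda name: levenshtein(entry.lower(), name.lower()))
    print(entry, "->", best)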
I am a newbie in the MATLAB field, and I want to learn more about methodologies for comparing two images to determine the similarity between them.
I need more information from international journals, conference proceedings, books, or other reports that describe this.
I will use it for my literature study.
Are there any suggestions for journals, books, or proceedings that have discussed this? If so, please include their titles and links.
Thank you for your attention.
For journals I would recommend the IEEE Transactions on Image Processing:
http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=83
This is a good general intro from MIT:
http://www.mit.edu/~ka21369/Imaging2012/tannenbaum.pdf
You need to define "similarity" better.
In the image compression sense, similarity is a function of the pixel-wise difference between the images (PSNR, and other metrics).
In a computer vision sense, you would want to see if the two images contain similar content such as objects or scenes. I would recommend using Google Scholar for that.
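For the pixel-wise sense mentioned above, here is a minimal sketch of computing PSNR in Python, assuming two 8-bit images of the same size already loaded as NumPy arrays (use whatever imaging library you prefer to load them).
# Minimal sketch: PSNR between two images, assumed to be 8-bit arrays
# of the same shape. Random arrays stand in for real images here.
import numpy as np

def psnr(img_a, img_b, max_value=255.0):
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

a = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
b = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print(psnr(a, b))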