We have a scenario where we receive tons of URLs from customers. The URLs are organized in arbitrary levels, like xxx.com/levelA/levelB/levelC/...levels.../xxxx. We are trying to use this data to build a query system that can answer which URLs fall under any given level. For example, getAll("abc.com/test/sub/") should return every recorded URL that has "abc.com/test/sub/" as a prefix: abc.com/test/sub/a.data, abc.com/test/sub/sub2/data, etc.
This appears to be similar to a file directory structure. My question is: is there any existing open source project that can handle such a scenario? The requirements are:
Real-time system.
High write/read throughput.
Distributed and reliable.
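For concreteness, the lookup being asked for is just a prefix range scan over sorted keys. A minimal in-memory Python sketch of getAll (illustrative only, with none of the distribution or throughput requirements):

```python
import bisect

# Sorted list of all recorded URLs; kept sorted on every insert.
urls = []

def add(url):
    bisect.insort(urls, url)

def get_all(prefix):
    """Return every stored URL that starts with the given prefix."""
    lo = bisect.bisect_left(urls, prefix)  # first candidate >= prefix
    results = []
    for url in urls[lo:]:
        if not url.startswith(prefix):
            break  # sorted order: once one URL fails, none after it can match
        results.append(url)
    return results

add("abc.com/test/sub/a.data")
add("abc.com/test/sub/sub2/data")
add("abc.com/other/x")
print(get_all("abc.com/test/sub/"))  # prints the two matching URLs
```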
Some questions you didn't answer:
What counts as high write/read throughput? Anything an RDBMS couldn't handle? What is your expected ratio of reads to writes?
Why do you want to have a distributed system? Any particular reason?
How long are the strings on average, and at maximum?
Are you sure a simple MySQL, PostgreSQL, or any other commercial database (Oracle, SQL Server, ...) won't be enough?
Here is a question about MySQL varchar index length. I've encountered the same 255-character limitation in SQL Server as well, so I assume similar restrictions exist in other RDBMSs. However, there is nothing easier than simply running
SELECT url FROM url_list WHERE url like 'abc.com/test/sub/%'
There is also MongoDB, which can easily be distributed and allows the use of regular expressions in queries. Together with an index, you could perform a request similar to the SQL one above. You would need to benchmark this specific case yourself to see whether there is a performance difference, and which system is faster.
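For illustration, a prefix query in MongoDB is an anchored regular expression, which can still use an ordinary index as a range scan. A minimal sketch with the pymongo driver (the database, collection, and field names are made up):

```python
import re
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
urls = client["mydb"]["urls"]  # hypothetical database/collection names

# An index on the field lets the anchored regex run as a range seek.
urls.create_index([("url", ASCENDING)])

# Patterns anchored with ^ and without leading wildcards can use the index.
prefix = re.escape("abc.com/test/sub/")
for doc in urls.find({"url": {"$regex": "^" + prefix}}):
    print(doc["url"])
```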
Otherwise, there are still Couchbase and CouchDB, which offer views that are basically made for something like this, since they are built via MapReduce. However, those take a few seconds, up to a minute, to be updated, so they aren't a good fit if you need to query a URL right after you've inserted it.
I'm working on a school project on product analysis, which is based on sentiment analysis. I've been looking for a training dataset for quite some time now, and what I've found so far is a dataset of movie reviews. My question is: can I use this dataset to train the classifier, i.e. will it affect the classification accuracy? If so, does anyone here know where I can get a free dataset of product reviews?
I am assuming you are using a textual model like bag of words.
From my experiments, you usually don't get good results when moving from one domain to another (even if the training set and the test set are both product reviews, just from different categories!).
Think of it logically: an oven that gets hot quickly usually indicates a good product. Is the same true for laptops?
When I experimented with this a few years ago, I used Amazon comments as both the training set and the test set for my algorithms.
The comments are short and informative, and were enough to get ~80% accuracy. The ground truth was the star rating: 1-2 stars were 'negative', 3 stars 'neutral', and 4-5 stars 'positive'.
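As a sketch of that setup, here is a minimal bag-of-words classifier in Python with scikit-learn, using the star-to-label mapping described above (the reviews are toy stand-ins, not the actual Amazon data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def label(stars):
    # Map the star rating ("ground truth") to a sentiment class.
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

# Toy stand-ins for crawled comments: (text, star rating).
reviews = [
    ("broke after two days, terrible", 1),
    ("does the job, nothing special", 3),
    ("excellent quality, highly recommend", 5),
    ("heats up quickly, great oven", 5),
]

texts = [text for text, _ in reviews]
labels = [label(stars) for _, stars in reviews]

# Bag of words + naive Bayes, a common baseline for short reviews.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["awful, would not buy again"]))
```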
I used a Perl script from esuli.it to crawl Amazon's comments.
Our site allows users to enter the company they work for as free-form text.
Historically we have gathered around a few million unique entries. Since we put no constraints on the input, we ended up with a lot of variations and typos (e.g. over 1000 distinct entries just for McDonald's).
We realized we could provide our users with a great feature if only we could tie these variations together. As a starting point, we compiled a clean list of companies from various online sources [Dictionary].
Now we're trying to find the best way to deal with the user data source. We thought about assigning a similarity score:
- comparing each entry with [Dictionary], calculating a lexical distance (possibly as a Hadoop job)
- taking advantage of a search database (e.g. Solr)
and associating the user-entered text this way.
What we're wondering is: has anyone gone through a similar "classification" exercise, and could you share any tips?
Thanks,
Piotr
I'd use simple Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance).
A few million entries - you should be able to process them easily on one computer (no Hadoop or other heavyweight tools needed).
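A minimal Python sketch of that approach (the dictionary entries and the distance threshold are placeholders; a plain loop like this over a few million entries is workable on a single machine):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

dictionary = ["McDonald's", "Burger King"]  # the clean [Dictionary] list

def best_match(entry, max_distance=3):
    """Return the closest dictionary entry, or None if nothing is close enough."""
    best = min(dictionary, key=lambda name: levenshtein(entry.lower(), name.lower()))
    return best if levenshtein(entry.lower(), best.lower()) <= max_distance else None

print(best_match("mcdonalds"))  # -> McDonald's
```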
I wish to use Google 2-grams for my project, but the data size makes searching expensive in terms of both speed and storage.
Is there a web API available for this purpose (in any language)? The website http://books.google.com/ngrams/graph renders an image; can I get the underlying data values?
Well, I found a roundabout way of doing this, using Google BigQuery: the trigrams are available there as a public dataset. Command-line access did the job for me.
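For example, with the google-cloud-bigquery client library the query looks roughly like this (the public table name and column names are from memory, so treat them as assumptions and verify against the current public datasets):

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires a GCP project with BigQuery enabled

# `bigquery-public-data.samples.trigrams` holds a Google Books trigram sample.
query = """
    SELECT first, second, third
    FROM `bigquery-public-data.samples.trigrams`
    WHERE first = 'red' AND second = 'panda'
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.first, row.second, row.third)
```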
I found a great alternative: Microsoft Web N-Gram
It can be queried in different ways, including a straightforward GET call through the REST interface.
For instance, calling the URL:
http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp?u={YOUR_TOKEN}&p=red+panda
returns
-9.005
which is the log likelihood of the phrase "red panda".
Furthermore, it is handier than Google N-Grams: for a given phrase it does not simply output its absolute frequency, but can output its joint probability, conditional probability, and even the most likely words that follow.
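A programmatic call is just a plain GET request; a minimal Python sketch of the joint-probability call shown above (YOUR_TOKEN stands for the access token you register for with the service):

```python
import urllib.parse
import urllib.request

token = "YOUR_TOKEN"  # the access token issued by the service
phrase = "red panda"
url = ("http://weblm.research.microsoft.com/weblm/rest.svc/bing-body/apr10/1/jp?"
       + urllib.parse.urlencode({"u": token, "p": phrase}))

with urllib.request.urlopen(url) as response:
    # The body is the bare number, e.g. -9.005, the log likelihood of the phrase.
    print(response.read().decode())
```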
Disclaimer: I am not a Microsoft employee, I simply think that I just found an awesome service.
There are a lot of tutorials on the Internet which claim to teach you how to use memcacheD, but most of them are actually about memcache (hence the emphasis on the D).
In PHP, memcached doesn't even have a connect method. Also, a lot of these tutorials just teach you how to connect and add values, but I can figure that out by reading the manual. So please help me create a one-stop reference for memcached. What strategies would you recommend? What are the best practices? How would you cache something like a forum, or a social site with ever-changing data?
The trouble I seem to have is: I know how to connect, add and remove values, but what exactly am I supposed to cache? (I'm just experimenting; this is not for a project, so I can't really give an example.)
but what exactly am I supposed to cache?
You're supposed to cache data that doesn't change often and is read many times. For example, take a forum: you'd cache the forum's front page, which displays the available forums, each forum's description, and the forum IDs that let you browse topics under the various categories.
Since it's unlikely that you create, delete or update forums every second, it's safe to assume that the read:write ratio favors reads, which makes it safe to cache that front page. By doing so, you alleviate the load on your database, since it doesn't have to be accessed for most visits to your site.
You can also take this caching one step further: cache everything your site has to offer and set the cache expiry time to 5 minutes. Assuming your database isn't huge (hundreds of gigabytes) and the cached data fits in available RAM, you'd effectively query your database only once every 5 minutes.
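That strategy is the classic cache-aside pattern: try the cache, fall back to the database on a miss, and store the result with an expiry. A minimal sketch in Python using pymemcache (the same logic maps directly onto PHP's Memcached class; load_front_page is a hypothetical database call):

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def load_front_page():
    # Hypothetical expensive database query that builds the forum index.
    return b"<rendered forum front page>"

def get_front_page():
    page = cache.get("forum:front_page")
    if page is None:  # cache miss: rebuild, then store for 5 minutes
        page = load_front_page()
        cache.set("forum:front_page", page, expire=300)
    return page
```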
Assuming you have a lot of visits per day (say, 20,000 unique visits), you can calculate how much this saves in database connections and data extraction: with a 5-minute expiry, a cached page is rebuilt at most 288 times a day (24 × 60 / 5), regardless of how many visitors request it.
I need some sort of node-graph editor, ideally one that works on both Mac and other platforms, for user-created node collections with properties. The graph data will then be used in a data-driven application I'm working on, so kudos if the editor can save the graphs in some easy-to-process format. So far I was using XML with a tree editor, but since the graphs can be cyclic according to the requirements, the tree editor no longer cuts it.
Plugins for other applications would also be ok!
GraphViz's graph-drawing software is pretty much the best there is: cross-platform, with a very simple file format and lots of output formats. It is especially good at automatically calculating a layout for graphs. A GUI for OS X is available.
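As a taste of how simple the format is, here is a minimal sketch using the graphviz Python package to build a cyclic graph with node properties and emit plain DOT text (the node names and attributes are made up):

```python
from graphviz import Digraph

g = Digraph("pipeline")
g.node("A", label="Source", rate="10")  # arbitrary key=value properties
g.node("B", label="Sink")
g.edge("A", "B")
g.edge("B", "A")  # cycles are fine, unlike in a tree format

print(g.source)        # the DOT text: an easy-to-process plain-text format
g.save("pipeline.gv")  # write the same DOT text to a file
```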
Have a look at yEd (http://www.yworks.com/en/products_yed_about.html); it is free to use but places a logo on all output.
It comes with Mac OS binaries, and you might be able to include the graphing engine it is based on in your own project ($$ required).
I've used it (with limited success) to document enterprise data flows.
You might want to look at JHotDraw (on SourceForge). It is one of the design-pattern demo projects converted from Smalltalk. It is (or was, before it was put on SourceForge) very well documented and easy to extend. A similar (but less well documented) framework is GEF in Eclipse.
You can take a look at OmniGraffle: http://www.omnigroup.com/omnigraffle