I'm studying Redis. I'm not going to build a Redis cluster that uses more than 256 TB, but I'm curious about the question in the title.
I think it is impossible to use more than 256 TB, because current operating systems use 64-bit addressing, and in practice only 48 of those bits are usable, so the limit is 2^48 bytes = 256 TB.
I know Redis can scale out by forming a cluster, but it seems to run into this limitation.
What do you think? Please share your ideas.
Yes, it can. A Redis Cluster is not hosted on a single OS instance, so it is not bound by the per-machine address-space limit you are describing.
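The arithmetic behind the question, and the reason a cluster escapes it, can be sketched with back-of-the-envelope numbers (the node count and per-node RAM below are made up purely for illustration):

```python
# 48-bit virtual addressing limits a single process / OS instance:
per_node_limit_bytes = 2 ** 48
print(per_node_limit_bytes // 2 ** 40, "TB per node at most")  # 256 TB

# A Redis Cluster shards the keyspace across many independent nodes,
# so the aggregate capacity is roughly nodes * per-node memory.
# Hypothetical example: 100 nodes with 4 TB of RAM each.
nodes, ram_per_node_tb = 100, 4
print(nodes * ram_per_node_tb, "TB aggregate")  # 400 TB > 256 TB
```

Each node only sees its own address space; no single machine ever has to map the whole data set.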
Related
I would like to try testing NLP tools against dumps of the web and other corpora, sometimes larger than 4 TB.
If I run this on a Mac it's very slow. What is the best way to speed this process up?
deploying to EC2/Heroku and scaling up servers
buying hardware and creating a local setup
Just want to know how this is usually done (processing terabytes in a matter of minutes/seconds), and whether it's cheaper/better to experiment with this in the cloud or whether I need my own hardware setup.
Regardless of the brand of your cloud, the whole idea of cloud computing is being able to scale up and scale down flexibly.
In a corporate environment you might have a scenario where you consistently need the same amount of computing resources; if you already have them, the cloud is a hard sell, because you simply don't need the flexibility it provides.
On the other hand, if your processing tasks are not very predictable, your best solution is the cloud, because you can pay more when you use more computing power and pay less when you don't need as much.
Take into account, though, that not all cloud solutions are the same. For instance, a web role is a highly web-dedicated node whose main purpose is to serve web requests: the more requests it serves, the more you pay.
A virtual machine role, on the other hand, gives you near-exclusive use of a computer system, running either Linux or Windows, that you can use for anything you want; the system keeps running (and billing) even when you are not using it to its fullest.
Overall, the costs depend on your own scenario and how well it fits to your needs.
I suppose it depends quite a bit on what kind of experimenting you are wanting to do, for what purpose and for how long.
If you're looking into buying the hardware and running your own cluster, then you probably want something like Hadoop or Storm to manage the compute nodes. I don't know how feasible it is to go through 4 TB of data in a matter of seconds, but again, that really depends on the kind of processing you want to do. Counting the frequency of words in the 4 TB corpus should be pretty easy (even on your Mac), but building SVMs or doing something like LDA on the lot won't be. One issue you'll run into is that you won't have enough memory to fit all of it, so you'll want a library whose methods can run off disk.
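As an illustration of the "easy" end of that spectrum, word frequencies can be counted in a single streaming pass, so memory use is bounded by the vocabulary size rather than the corpus size (a sketch; the file path is a placeholder):

```python
from collections import Counter

def word_frequencies(path):
    """Stream a large text file and count word frequencies.

    Memory usage is proportional to the number of distinct words,
    not the size of the file, so this works on files far larger
    than RAM. For a multi-TB corpus you would shard this step
    across machines (classic map-reduce) and merge the Counters.
    """
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts
```

The per-machine Counters merge with `+`, which is exactly the reduce step a Hadoop-style job would do for you.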
If you don't know exactly what your requirements are, then I would use EC2 to set up a test rig to gain a better understanding of what it is that you want to do and how much grunt/memory it needs to get done in the amount of time you require.
We recently bought two compute nodes, 128 cores each with 256 GB of memory and a few terabytes of disk space, for I think around £20k. These are AMD Interlagos machines. That said, the compute cluster already had Infiniband storage, so we just had to hook up to that and buy the two compute nodes, not the whole infrastructure.
The obvious thing to do here is to start off with a smaller data set, say a few gigabytes. That'll get you started on your Mac; you can experiment with the data and different methods to get an idea of what works and what doesn't, and then move your pipeline to the cloud and run it with more data. If you don't want to start the experimentation with a single sample, you can always take multiple samples from different parts of the full corpus, just keeping the sample sizes down to something you can manage on your own workstation to start off with.
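Sampling from different parts of a big corpus doesn't require reading the whole file: you can seek to random byte offsets and resync to the next line boundary. A sketch, with illustrative parameter values:

```python
import os
import random

def sample_lines(path, n_samples=5, chunk_bytes=4096, seed=0):
    """Grab small text chunks from random positions in a large file.

    Seeks to a random byte offset, discards the (probably partial)
    first line, and returns the next chunk. Cheap even on a multi-TB
    file because only the sampled regions are ever read.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    samples = []
    with open(path, "rb") as f:
        for _ in range(n_samples):
            f.seek(rng.randrange(max(size - chunk_bytes, 1)))
            f.readline()  # skip the partial line at the seek point
            samples.append(f.read(chunk_bytes).decode("utf-8", "ignore"))
    return samples
```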
As an aside, I highly recommend the scikit-learn project on GitHub for machine learning. It's written in Python, but most of the matrix operations are done in Fortran or C libraries, so it's pretty fast. The developer community is also extremely active on the project. Another good library, perhaps a bit more approachable (depending on your level of expertise), is NLTK. It's nowhere near as fast but makes a bit more sense if you're not used to thinking about everything as a matrix.
UPDATE
One thing I forgot to mention is how long your project will be running, or to put it another way, how long you will get some use out of your specialty hardware. If it's a project that is supposed to serve the EU parliament for the next 10 years, then you should definitely buy the hardware. If it's a project for you to get familiar with NLP, then forking out the money might be overkill, unless you're also planning on starting your own cloud computing rental service :).
That said, I don't know what the real world costs of using EC2 are for something like this. I've never had to use them.
For example, I have a database with 20 GB of data and only 2 GB of RAM, and swap is off. Will I be able to find and insert data? How bad would performance be?
It's best to google this, but many sources say that when your working set outgrows your RAM, performance will drop significantly.
Sharding might be an interesting option rather than adding more RAM:
http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
http://highscalability.com/blog/2011/9/13/must-see-5-steps-to-scaling-mongodb-or-any-db-in-8-minutes.html
http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/
http://groups.google.com/group/mongodb-user/browse_thread/thread/37f80ff39258e6f4
Can MongoDB work when size of database larger then RAM?
What does it mean to fit "working set" into RAM for MongoDB?
You might also want to read up on the Foursquare outage last year:
http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html
http://groups.google.com/group/mongodb-user/browse_thread/thread/528a94f287e9d77e
http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/
Side note:
You said "swap is off"? Why? You should always have sufficient swap space on a UNIX system! A swap size of 1 to 2 times the RAM size is a good idea, and putting it on a fast partition is a good idea too. Really bad things happen if your UNIX system runs out of RAM and has no swap: processes just die inexplicably. That is a very bad thing, especially in production. Disk is cheap: add a generous swap partition! :-)
It really depends on the size of your working set.
MongoDB can handle a very large database and still be very fast if your working set is less than your RAM size.
The working set is the set of documents you are working with at a given time, plus the indexes.
Here is a link which might help you understand this: http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
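The working-set cliff is easy to see with a toy model: treat RAM as an LRU page cache and watch the hit rate collapse as soon as a cyclically scanned working set grows even slightly larger than the cache (illustrative numbers only, not a MongoDB benchmark):

```python
from collections import OrderedDict

def hit_rate(working_set_pages, ram_pages, accesses=10_000):
    """Simulate an LRU page cache under a cyclic scan of the
    working set. When the working set fits in RAM, only the first
    pass misses; once it is even slightly larger, LRU evicts every
    page just before it is needed again, so nearly every access
    misses."""
    cache = OrderedDict()
    hits = 0
    for i in range(accesses):
        page = i % working_set_pages
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            cache[page] = True
            if len(cache) > ram_pages:
                cache.popitem(last=False)  # evict least recently used
    return hits / accesses

print(hit_rate(800, 1000))   # 0.92: fits in RAM, only the first pass misses
print(hit_rate(1200, 1000))  # 0.0: 20% too big and every access misses
```

This is of course a worst-case access pattern; real workloads degrade less brutally, but the cliff is real.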
I currently have a quad-core, single-processor dedicated host with 4 GB of RAM at SoftLayer. I am contemplating upgrading to a dual-processor dual-core (or quad-core) machine. Comparing prices with a reserved large instance at Amazon, the price seems quite comparable to similar dedicated hosting (maybe EC2 is a little cheaper, like for like).
Does anyone have another point of view or experience that can shed more light on this? I want to keep the server running 24x7, and my concerns are processor speed (I'm not sure what Amazon's compute unit is capable of) and RAM. For the hard disk, I guess I will have to use elastic storage (EBS) to avoid loss in case of a server breakdown!
If you want to have a server running all the time, I usually find dedicated servers cheaper than cloud ones. In the cloud you pay a bit more for the flexibility of starting and stopping servers whenever you want.
As for ECU: it is a pity that Amazon does not say how they actually measure it. There is a pretty decent attempt to measure what it means with multiple benchmarks in this article, but they ended up with a strongly non-linear scale. Another source says that ECU is directly proportional to UnixBench results; see the first question on that page. The second link is actually for a service that compares cloud computing prices, where you may find that Amazon does not necessarily have the cheapest CPU. Be careful, though: that CPU measure is based on the mentioned ECU measurement, which does not necessarily reflect your application's actual performance later.
I'm developing a client/server application where the server holds large pieces of data, such as big images or video files, which are requested by the client, and I need to create an in-memory client-side caching system to hold a few of those large items to speed up the process. To be clear, each individual image or video is not that big, but the overall size of all of them can be really big.
But I'm faced with the "how much data should I cache" problem and was wondering if there are some kind of golden rules on Windows about what strategy I should adopt. The caching is done on the client, I do not need caching on the server.
Should I stay under x% of global memory usage at all times? And how much would that be? What happens if another program is launched and takes up a lot of memory? Should I empty the cache?
Should I request how much free memory is available prior to caching and use a fixed percentage of that memory for my needs ?
I hope I do not have to go there, but should I ask the user how much memory he is willing to allocate to my application? If so, how do I calculate the default value for that setting, for the users who will never touch it?
Rather than creating your own caching algorithm, why not write the data to a file with the FILE_ATTRIBUTE_TEMPORARY attribute and make use of the client machine's own file-system cache?
Although this approach implies using a file, if there is memory available in the system then the file will never leave the cache and will remain in memory the whole time.
Some advantages:
You don't need to write any code.
The system cache takes account of all the other processes running. It would not be practical for you to take that on yourself.
On 64 bit Windows the system can use all the memory available to it for the cache. In a 32 bit Delphi process you are limited to the 32 bit address space.
Even if your cache fills up and your files get flushed to disk, local disk access is much faster than querying the database and then transmitting the files over the network.
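A rough cross-platform sketch of the idea in Python: `os.O_TEMPORARY` (which maps to the Windows CRT's delete-on-close temporary-file semantics, related to FILE_ATTRIBUTE_TEMPORARY) only exists on Windows, so it is applied conditionally here; the file name is made up for illustration.

```python
import os
import tempfile

def open_cache_file():
    """Open a scratch file intended to live in the OS file cache.

    On Windows, os.O_TEMPORARY requests delete-on-close
    temporary-file semantics; on other systems the flag does not
    exist, so getattr() yields 0 and we get a plain temp file.
    """
    path = os.path.join(tempfile.gettempdir(), "imgcache_%d.tmp" % os.getpid())
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_TEMPORARY", 0)
    return path, os.open(path, flags, 0o600)

path, fd = open_cache_file()
os.write(fd, b"big image bytes...")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)
os.close(fd)
if os.path.exists(path):  # on Windows, O_TEMPORARY deletes it on close
    os.remove(path)
```

As long as the data stays warm in the system cache, those reads never touch the disk.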
It depends on what other software runs on the server. I would make it possible to configure it manually at first. Develop a system that can use a specific amount of memory. If you can, build it so that you can change that value while it is running.
If you have those capabilities, you can try some tweaking to see what works best. I don't know any golden rules, but I'd figure you should be able to set a percentage of total memory, or of total available memory, with a specific minimum amount of memory kept free for the system at all times. If you reserve a minimum of, say, 500 MB for the server OS, you can use the rest, or 90% of the rest, for your cache. But those numbers depend on the version of the OS and the other applications running on the server.
I think it's best to make the numbers configurable from the outside and create a management tool that lets you set the values manually first. Then, once you have found out what works best, you can derive formulas to calculate those values and integrate them into your management tool. This tool should not be an integral part of the cache program itself (which will probably be a service without a GUI anyway).
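One way to sketch such a configurable cache is an LRU bounded by a byte budget that can be changed while the service is running (class and method names are made up for illustration):

```python
from collections import OrderedDict

class ByteBudgetCache:
    """LRU cache bounded by total payload bytes, not item count.

    The budget can be changed at runtime (e.g. from a management
    tool); shrinking it evicts least-recently-used items at once.
    """
    def __init__(self, budget_bytes):
        self._items = OrderedDict()
        self._used = 0
        self._budget = budget_bytes

    def set_budget(self, budget_bytes):
        self._budget = budget_bytes
        self._evict()

    def put(self, key, payload):
        if key in self._items:
            self._used -= len(self._items.pop(key))
        self._items[key] = payload
        self._used += len(payload)
        self._evict()

    def get(self, key):
        payload = self._items.get(key)
        if payload is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return payload

    def _evict(self):
        while self._used > self._budget and self._items:
            _, payload = self._items.popitem(last=False)
            self._used -= len(payload)
```

The management tool would then just call `set_budget` with whatever value the formula (or the administrator) produces.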
Questions:
Can one image be requested by multiple clients? Or can one image be requested multiple times in a short interval?
How short is that interval?
Is the network speed really high, higher than the speed of the hard drive? If you have a normal network, then the hard drive will be able to read the files from disk and deliver them over the network in real time, especially since Windows already does some good caching, so the most recent files will already be in cache.
Is the main purpose of the computer running the server app to run the server, or is it a normal computer also used for other tasks? In other words, is it a dedicated server or a normal workstation/desktop?
"but should I ask the user how much memory he is willing to allocate to my application?"
I would definitely go there!
If the user thinks the server application is not an important application, he will probably give it a low priority (a small cache). If he thinks it is the most important running app, he will allow it to allocate all the RAM it needs at the expense of other, less important applications.
Just deliver the application with that setting set by default to an acceptable value (something like x% of the total amount of RAM). I would use about 70% of total RAM if the main purpose of the computer is to host this server application, and about 40-50% if its purpose is general use.
A server application usually needs resources set aside for its own use by its administrator. I would not worry much about other applications' behaviour; I would care about being a "polite" application, so the memory cache size and so on should be configurable by the administrator, who is the only one who knows how to configure his systems properly (usually...).
Default values should in any case take into consideration how much memory is available overall, especially on 32-bit systems with less than 4 GB of memory (as long as Delphi delivers only 32-bit apps), to leave something free for the operating system and avoid too-frequent swapping. Asking the user to select it at setup is also advisable.
If the application is the only one running on a server, a value between 40% and 75% of available memory could be OK (depending on how much memory is needed beyond the cache), but again, ask the user, because it's almost impossible to know what other running applications may need. You can also have a minimum and a maximum cache size: start by allocating the lower value, then grow it when and if needed, and shrink it if necessary.
On a 32-bit system, this is the kind of memory usage that could benefit from PAE/AWE to access more than 3 GB of memory.
Update: you can also monitor cache hits/misses and calculate which cache size best fits the user's needs (it could be too small, but too large as well), and then advise the user about that.
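That monitoring can be as simple as two counters and a ratio check; the thresholds below are arbitrary illustrations, not recommended values:

```python
class CacheStats:
    """Track cache hits/misses and give a crude sizing hint."""
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def advice(self, min_samples=1000):
        total = self.hits + self.misses
        if total < min_samples:
            return "not enough data"
        ratio = self.hits / total
        if ratio < 0.5:
            return "consider growing the cache"
        if ratio > 0.95:
            return "cache may be larger than needed"
        return "cache size looks reasonable"
```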
To be honest, the questions you ask would not be my main concern. I would be more concerned with how effective my cache would be. If your files are really that big, how many can you hold in the cache? And if your client server app has many users, what are the chances that your cache will actually cache something someone else will use?
It might be worth doing an analysis before you burn too much time on the fine details.
I have a large scientific computing task that parallelizes very well with SMP, but at too fine-grained a level to be easily parallelized via explicit message passing. I'd like to parallelize it across address spaces and physical machines. Is it feasible to create a scheduler that would parallelize already multithreaded code across multiple physical computers under the following conditions?
The code is already multithreaded and can scale pretty well on SMP configurations.
The fact that not all of the threads are running in the same address space or on the same physical machine must be transparent to the program, even if this comes at a significant performance penalty in some use cases.
You may assume that all of the physical machines involved are running operating systems and CPU architectures that are binary compatible.
Things like locks and atomic operations may be slow (having network latency to deal with and all) but must "just work".
Edits:
I only care about throughput, not latency.
I'm using the D programming language, and I'm almost sure there's no canned solution. I'm more interested in whether this is feasible in principle than in a particular canned solution.
My first thought is to use Apache Hadoop. It provides distributed storage and distributed computing. You can synchronize across processes by using files as locks.
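The "files as locks" trick relies on atomic exclusive creation: every worker tries to create the lock file, and only one succeeds. A local-filesystem sketch (note that O_EXCL is not reliably atomic on all network filesystems):

```python
import os

def try_acquire(lock_path):
    """Atomically create a lock file; returns a file descriptor on
    success, or None if another process already holds the lock."""
    try:
        return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None

def release(lock_path, fd):
    """Release the lock by closing and removing the lock file."""
    os.close(fd)
    os.remove(lock_path)
```

A worker that gets `None` back simply retries after a delay; crash recovery (stale lock files) needs extra care, e.g. a timestamp written into the file.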
It sounds like you want something like SCRAMNet, although that requires custom hardware. I don't know if there is a software-only solution. Also, it's likely that even if you got it working, you'd find your networked version was actually running slower than when it was previously on a single machine. You may just have to bite the bullet and re-design your app.
Since your point 2 suggests that you can live with some performance degradation, you might want to consider a hybrid approach: SMP within individual machines, message passing between machines. I'm not familiar with D, so I can offer no specific advice. Further, I've seen mixed reviews of the hybrid OpenMP+MPI approach, but it might suit you and your application.
EDIT: You might want to Google around for 'partitioned global address space' which seems to describe your desired approach quite accurately. As before, I have no advice on using D for this.
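The hybrid structure can be sketched as nested fan-out/fan-in: an SMP layer inside each node, plus an explicit reduce step standing in for the message-passing boundary. Python threads stand in for SMP threads here (because of the GIL they won't actually speed up CPU-bound work); in D you would use native threads locally and something like MPI between machines:

```python
from concurrent.futures import ThreadPoolExecutor

def local_work(chunk):
    """The SMP part: one fine-grained task per core on one machine."""
    return sum(x * x for x in chunk)

def node_job(chunks):
    """One machine's share of the work: fan out across local
    threads/cores, then reduce locally. In the hybrid scheme, the
    per-node result would then travel to the other machines via
    message passing (MPI, sockets, a queue, ...)."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(local_work, chunks))

data = list(range(1000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]
# Pretend there are 2 machines, each taking alternate chunks:
per_node = [node_job(chunks[k::2]) for k in range(2)]
total = sum(per_node)  # the coarse, message-passing reduce step
print(total)
```

The point of the split is that only the coarse per-node results cross the network, so the fine-grained sharing your threads rely on stays inside one address space.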