Loading PetaBytes of data at scale

Loading PetaBytes of data at scale - performance

I need to load petabytes of text data into a storage (RAM/SSD) within a second.
Below are some of the question to solve the above problem.
1) Is it practically/theoretically possible to load petabytes of data in a second?
2) What will be the best design approach in order to achive fast loading of petabyte scale data in sub seconds.
3) Any benchmark approach available?.
I am okay to implement with any kind of technologies like Hadoop, spark, HPCC etc...

"petabytes .... within a second". seriously? Please check wikipedia Petabyte: it is 1.000.000 GB!
Also check wikipedia Memory bandwidth. Even the fastest RAM cannot handle more than a few 10 GB / s (in practice this is far lower).
Just curious: what is your use-case?

No, it is not technically possible at this time. Not even RAM memory is fast enough (not to mention the obvious capacity constraints). The fastest SSD (M.2 drives) you can get write speed around 1.2GB/s and with raid 0, you might achieve speeds just around 3GB/s at most. There are also economical constraints, as those drives by themselves are quite expensive. So to answer your question, those speeds are technically impossible at current time.

From HPCC perspective...
Thor is designed to load data and support multiple servers. However the biggest cluster I heard about is about 4000 servers. Thor is designed to load a lot of data over long time (even a week).
In the other hand Roxie is designed to serve data quickly but is not what you are asking for...nor it could serve Petabytes under a second.

Related

Redis vs memcached vs Scylla Cache - Which one to choose?

I'm designing an application where I want to cache million data each around 10kb.. I did some analysis and on the fence between using Redis vs memcached vs Scylla as Cache.. Can some experts suggests which might best suits my needs?
Highly performant
High availability
High Throughput
Low pricing?

Full disclosure - I work on the Scylla project.
I think it is a question of latency and HA vs cost. As a RAM-based system, Redis will be the lowest latency. If you need < 1 millisecond response, then Redis or memcached are the choice.
Scylla is a disk-based system. Those values that are in Scylla's RAM will be low latency, but those that need to pull from disk will be slower. So your 99p latency is likely to be slower. How slow? Depends on your disk. NVME can be 99p 3-5 ms. SSD, maybe 5-10 ms. If this is an acceptable latency, then Scylla will be much less expensive, as even NVME is much cheaper than RAM.
As for HA - Redis and memcached are intended as a cache. While there are some features and frameworks that you can use to replicate data around, these are all bolt-ons and increase complexity. Scylla is a distributed system by design. So the replication to allow for multiple layers of HA is built-in (node, rack and DC-availability)
Redis (and to a lesser extend, memcached) are phenomenal caches. But, depending upon your use case, Scylla might be the right choice.

All three options you mentioned are open-source software, so the pricing is the same - zero :-) However, both Scylla and Redis are written and backed by companies (ScyllaDB and RedisLabs, respectively), so if your use case is mission-critical you may choose to pay these companies for enteprise-level support, you can inquire with these companies what are their prices.
The more interesting difference between the three is in the technology.
You described a use case where you have 10 GB of data in the cache. This amount can be easily held in memory, so a completely in-memory database like Memcached or Redis is a natural choice. However, there are still questions you need to ask yourself, which may lead you to a distributed database, such as Scylla depending on your answers:
Would you be using powerful many-core machines? If so, you should probably rule out Memcached - my experience (and others' - see
Can memcached make full use of multi-core?) suggests that it does not scale well with many cores. On an 8-core machine you will not get anywhere close to 8 times the performance of a one-core machine.
Redis is also not really meant for multi-core use - https://redis.io/topics/benchmarks says that Redis "is not designed to benefit from multiple CPU cores. People are supposed to launch several Redis instances to scale out on several cores if needed.". Scylla, on the other hand, thrives on multi-core machines. You should probably test the performance of all three products on your use case before making a decision.
How much of a disaster would be to suddenly lose the entire content of your cache? In some use cases, it just means you would need to query some slightly-slower backend server, so suddenly losing the cache on reboot is acceptable. In such cases, a memory-only cache like Memached or Redis is probably exactly what you need. However, in other cases, there may be a big penalty for starting from scratch with an empty cache - the backend server might be very slow, or maybe the original content is stored on a far-away server with a slow and expensive WAN. In such a case you would want a disk-backed cache, so if the memory cache is lost, you can still refresh it from disk and not from the backend server. Redis has a disk backing option, and in Scylla disk backing is the main way.
You mentioned a working set of 10 GB, which can easily fit memory of a single server. But is it possible this will grow and in a year you'll find yourself needing to cache 100 GB or 1 TB, which no longer fits the memory of a single server? In memcached you'll be out of luck. Redis used to have a "virtual memory" solution for this purpose, but it is deprecated and https://redis.io/topics/virtual-memory now states that Redis is "without considering at least for now the support for databases bigger than RAM". Scylla does handle this issue in two ways. First, your cache would be stored on disk which can be much larger than memory (and whatever amount of memory you have will be used to further speed up that cache, but it doesn't need to fit memory). Second, Scylla is a distributed server. It can distribute a 100 GB working set to 10 different nodes. Redis also has "replication", but it copies the entire data to all nodes - while Scylla can optionally store different subsets of the data on different nodes.

In-memory is actually a bad thing since RAM is expensive and not persistent.
So Scylla will be a better option for K/V or columnar workloads.
Scylla also has a limited Redis api with good results [1], using the CQL
api will result in better results.
[1] https://medium.com/#siddharthc/redis-on-nvme-with-scylladb-5e12afd38dbc

Rethinkdb Scalability

How scalable is rethinkdb? Can it be used for TB's of data?
I have around 400 GB's of data, which are bound to increase by 10-25 GB's per week. Any suggestions, would be of great help.

Rethinkdb is very scalable and you should be able to use it with TBs of data no problem, that's what its designed for. As for its stability it should be very stable as of version 2.0 and fully production.
http://www.rethinkdb.com/stability/

Note that a NoSQL-Database does increase the amount of storage required massively, so be sure to have enough disk space on your nodes.
But some terabytes are no issue for RethinkDB, it's designed to handle a nearly infinite amount of data if your nodes have a lot of RAM it also provides nearly Memory-alike performance because the caching algorithms are very good.
You can easily spread out your data on many nodes, by sharding, which also increases the absolute throughput of your cluster, but slightly increases the latency of queries, because of the round trips between the cluster-members.

Riak performance - unexpected results

In the last days I played a bit with riak. The initial setup was easier then I thought. Now I have a 3 node cluster, all nodes running on the same vm for the sake of testing.
I admit, the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM) but still I am a quite surprised by the slow performance of riak.
Map Reduce
Playing a bit with map reduce I had around 2000 objects in one bucket, each about 1k - 2k in size as json. I used this map function:
function(value, keyData, arg) {
var data = Riak.mapValuesJson(value)[0];
if (data.displayname.indexOf("max") !== -1) return [data];
return [];
}
And it took over 2 seconds just for performing the http request returning its result, not counting the time it took in my client code to deserialze the results from json. Removing 2 of 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in bytesize and 2000 objects in one bucket isnt that much, either.
Insert
Batch inserting of around 60.000 objects in the same size as above took rather long and actually didnt really work.
My script which inserted the objects in riak died at around 40.000 or so and said it couldnt connect to the riak node anymore. In the riak logs I found an error message which indicated that the node ran out of memory and died.
Question
This is really my first shot at riak, so there is definately the chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem - how can i better test the performance of riak?
Is map reduce really that slow? I read about the performance hit that map reduce has on the riak mailing list, but if Map Reduce is slow, how are you supposed to perform "queries" for data needed nearly in realtime? I know that riak is not as fast as redis.
It would really help me a lot if anyone with more experience in riak could help me out with some of these questions.

This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large you bucket is, enumerating the keys in a single bucket of size n will take m time, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.

A recommendation I'd have now that some time has passed and several new versions of Riak have come about is this. Never rely on full bucket map/reduce, that's not an optimized operation, and chances are very good there are other ways to optimize your map/reduce so you don't have to look through so much data to pull out the singlets you need.
Secondary indices now available in newer versions of Riak are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce a secondary index with hundreds of thousands of keys in microseconds which a full bucket scan would have taken multiple seconds to consider.

I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar Key-Value stores are designed for high write performance, high read performance for simple lookups, no complex querying at all.
Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.

Does it make sense to optimize queries for less i/o pressure?

I have a read only database (product) that recides on its own Sql Server 2008.
I already optimized queries by looking at most expensive queries in activity monitor - report. I ordered the report by CPU-cost. I now have something like 50 queries/second and no query is longer than 300ms.
CPU-Time is ok (30%) and Memory is only used by 20% (out of 64GB).
There is one issue: disk time is at steady 100% (I looked at idle time performance counter and used ideras SQL diagnostic manager). I can see that the product db behaves different than my order db which is on a different machine and has smaller tables: If I look at a profiler trace I have queries in product-db that show a value in column "read" higher than 50.000. In my order DB these values are never higher than 1000. The queries in product-db use a lot of Common table expressions, work on large tables (some are around 5 Million entries).
I am not shure if I should invest time in optimizing queries for i/o performance or if I should just add a server. By otimizing for query duration I already added the missing indexes. Is optimizing for i/o something that is usually done?

In short, yes. Optimize for both CPU and IO.
Queries with high CPU tend to be doing unnecessary in-memory sorts, (sometimes inefficient) hash joins, or complex logic.
Queries with high IO (Page Reads) tend to be doing full table scans or working in other inefficient ways.
9 times out of 10, the same queries will be near the top of the list, but if you've worked on the high CPU and you still are unhappy with performance, then by all means, work on the high IO procs next.

There's always a next bottleneck.
they say.
Now that you've tuned CPU usage, it's only natural that I/O load emerges as dominant. Is your performance already acceptable? If yes stop, if no you have to estimate how many hours you will have to invest in further tuning and if buying another server or more hard disks might be cheaper.
Regarding the I/O tuning again, try to see what you can achieve with easy measures. Sometimes you can trade CPU for I/O and vice versa. Compression is an example for this. You would then tune that component that is your current bottlneck.
Before you seek to make the I/O faster try to reduce the I/O that is generated.

Look for obvious IO performance improvements for your query, but more importantly, look at how you can improve your IO performance at the server level.
If your other resources (CPU and memory) aren't overloaded, you probably don't need a new server. Consider adding an SSD for logs and temp files, and/or consider if you can affordably fit your whole DB onto an array of SSDs.
Of course, clearing out your disk IO bottleneck is likely to raise CPU usage, but if your performance is close to acceptable, this will probably improve things to the point that you can stop optimizing for now.

Unless you are using SSDs or a DB optimized SAN then IO is almost always the limit in database applications.
So yes, optimize to get rid of it as much as possible.
Table indexes are the first thing to do.
Then, add as much RAM as you possibly can, up to the complete size of your DB files.
Then partition your data tables (if that is a reasonable thing to do) so that any necessary table or index scans are done on only one or two table partitions.
Then I suppose you either buy bigger machines with even more RAM and/or buy SSDs or a SAN or a SAN with SSDs.
Alternatively you rebuild your entire database application to use something like NoSQL or database sharding, and implement all your relations, joins, constraints, etc in a middle interface layer.

How to create a system with 1500 servers that deliver results instantaneously?

I want to create a system that delivers user interface response within 100ms, but which requires minutes of computation. Fortunately, I can divide it up into very small pieces, so that I could distribute this to a lot of servers, let's say 1500 servers. The query would be delivered to one of them, which then redistributes to 10-100 other servers, which then redistribute etc., and after doing the math, results propagate back again and are returned by a single server. In other words, something similar to Google Search.
The problem is, what technology should I use? Cloud computing sounds obvious, but the 1500 servers need to be prepared for their task by having task-specific data available. Can this be done using any of the existing cloud computing platforms? Or should I create 1500 different cloud computing applications and upload them all?
Edit: Dedicated physical servers does not make sense, because the average load will be very, very small. Therefore, it also does not make sense, that we run the servers ourselves - it needs to be some kind of shared servers at an external provider.
Edit2: I basically want to buy 30 CPU minutes in total, and I'm willing to spend up to $3000 on it, equivalent to $144,000 per CPU-day. The only criteria is, that those 30 CPU minutes are spread across 1500 responsive servers.
Edit3: I expect the solution to be something like "Use Google Apps, create 1500 apps and deploy them" or "Contact XYZ and write an asp.net script which their service can deploy, and you pay them based on the amount of CPU time you use" or something like that.
Edit4: A low-end webservice provider, offering asp.net at $1/month would actually solve the problem (!) - I could create 1500 accounts, and the latency is ok (I checked), and everything would be ok - except that I need the 1500 accounts to be on different servers, and I don't know any provider that has enough servers that is able to distribute my accounts on different servers. I am fully aware that the latency will differ from server to server, and that some may be unreliable - but that can be solved in software by retrying on different servers.
Edit5: I just tried it and benchmarked a low-end webservice provider at $1/month. They can do the node calculations and deliver results to my laptop in 15ms, if preloaded. Preloading can be done by making a request shortly before the actual performance is needed. If a node does not respond within 15ms, that node's part of the task can be distributed to a number of other servers, of which one will most likely respond within 15ms. Unfortunately, they don't have 1500 servers, and that's why I'm asking here.

[in advance, apologies to the group for using part of the response space for meta-like matters]
From the OP, Lars D:
I do not consider [this] answer to be an answer to the question, because it does not bring me closer to a solution. I know what cloud computing is, and I know that the algorithm can be perfectly split into more than 300,000 servers if needed, although the extra costs wouldn't give much extra performance because of network latency.
Lars,
I sincerely apologize for reading and responding to your question at a naive and generic level. I hope you can see how both the lack of specifity in the question itself, particularly in its original form, and also the somewhat unusual nature of the problem (1) would prompt me respond to the question in like fashion. This, and the fact that such questions on SO typically emanate from hypotheticals by folks who have put but little thought and research into the process, are my excuses for believing that I, a non-practionner [of massively distributed systems], could help your quest. The many similar responses (some of which had the benefits of the extra insight you provided) and also the many remarks and additional questions addressed to you show that I was not alone with this mindset.
(1) Unsual problem: An [apparently] mostly computational process (no mention of distributed/replicated storage structures), very highly paralellizable (1,500 servers), into fifty-millisecondish-sized tasks which collectively provide a sub-second response (? for human consumption?). And yet, a process that would only be required a few times [daily..?].
Enough looking back!
In practical terms, you may consider some of the following to help improve this SO question (or move it to other/alternate questions), and hence foster the help from experts in the domain.
re-posting as a distinct (more specific) question. In fact, probably several questions: eg. on the [likely] poor latency and/or overhead of mapreduce processes, on the current prices (for specific TOS and volume details), on the rack-awareness of distributed processes at various vendors etc.
Change the title
Add details about the process you have at hand (see many questions in the notes of both the question and of many of the responses)
in some of the questions, add tags specific to a give vendor or technique (EC2, Azure...) as this my bring in the possibly not quite unbuyist but helpful all the same, commentary from agents at these companies
Show that you understand that your quest is somewhat of a tall order
Explicitly state that you wish responses from effective practionners of the underlying technologies (maybe also include folks that are "getting their feet wet" with these technologies as well, since with the exception of the physics/high-energy folks and such, who BTW traditionnaly worked with clusters rather than clouds, many of the technologies and practices are relatively new)
Also, I'll be pleased to take the hint from you (with the implicit non-veto from other folks on this page), to delete my response, if you find that doing so will help foster better responses.
-- original response--
Warning: Not all processes or mathematical calculations can readily be split in individual pieces that can then be run in parallel...
Maybe you can check Wikipedia's entry from Cloud Computing, understanding that cloud computing is however not the only architecture which allows parallel computing.
If your process/calculation can efficitively be chunked in parallelizable pieces, maybe you can look into Hadoop, or other implementations of MapReduce, for an general understanding about these parallel processes. Also, (and I believe utilizing the same or similar algorithms), there also exist commercially available frameworks such as EC2 from amazon.
Beware however that the above systems are not particularly well suited for very quick response time. They fare better with hour long (and then some) data/number crunching and similar jobs, rather than minute long calculations such as the one you wish to parallelize so it provides results in 1/10 second.
The above frameworks are generic, in a sense that they could run processes of most any nature (again, the ones that can at least in part be chunked), but there also exist various offerings for specific applications such as searching or DNA matching etc. The search applications in particular can have very short response times (cf Google for example) and BTW this is in part tied to fact that such jobs can very easily and quickly be chunked for parallel processing.

Sorry, but you are expecting too much.
The problem is that you are expecting to pay for processing power only. Yet your primary constraint is latency, and you expect that to come for free. That doesn't work out. You need to figure out what your latency budgets are.
The mere aggregating of data from multiple compute servers will take several milliseconds per level. There will be a gaussian distribution here, so with 1500 servers the slowest server will respond after 3σ. Since there's going to be a need for a hierarchy, the second level with 40 servers , where again you'll be waiting for the slowest server.
Internet roundtrips also add up quickly; that too should take 20 to 30 ms of your latency budget.
Another consideration is that these hypothethical servers will spend much of their time idle. That means they're powered on, drawing electricity yet not generating revenue. Any party with that many idle servers would turn them off, or at the very least in sleep mode just to conserve electricity.

MapReduce is not the solution! Map Reduce is used in Google, Yahoo and Microsoft for creating the indexes out of the huge data (the whole Web!) they have on their disk. This task is enormous and Map Reduce was built to make it happens in hours instead of years, but starting a Master controller of Map Reduce is already 2 seconds, so for your 100ms this is not an option.
Now, from Hadoop you may get advantages out of the distributed file system. It may allow you to distribute the tasks close to where the data is physically, but that's it. BTW: Setting up and managing an Hadoop Distributed File System means controlling your 1500 servers!
Frankly in your budget I don't see any "cloud" service that will allow you to rent 1500 servers. The only viable solution, is renting time on a Grid Computing solution like Sun and IBM are offering, but they want you to commit to hours of CPU from what I know.
BTW: On Amazon EC2 you have a new server up in a couple of minutes that you need to keep for an hour minimum!
Hope you'll find a solution!

I don't get why you would want to do that, only because "Our user interfaces generally aim to do all actions in less than 100ms, and that criteria should also apply to this".
First, 'aim to' != 'have to', its a guideline, why would u introduce these massive process just because of that. Consider 1500 ms x 100 = 150 secs = 2.5 mins. Reducing the 2.5 mins to a few seconds its a much more healthy goal. There is a place for 'we are processing your request' along with an animation.
So my answer to this is - post a modified version of the question with reasonable goals: a few secs, 30-50 servers. I don't have the answer for that one, but the question as posted here feels wrong. Could even be 6-8 multi-processor servers.

Google does it by having a gigantic farm of small Linux servers, networked together. They use a flavor of Linux that they have custom modified for their search algorithms. Costs are software development and cheap PC's.

It would seem that you are indeed expecting at least 1000-fold speedup from distributing your job to a number of computers. That may be ok. Your latency requirement seems tricky, though.
Have you considered the latencies inherent in distributing the job? Essentially the computers would have to be fairly close together in order to not run into speed of light issues. Also, the data center in which the machines would be would again have to be fairly close to your client so that you can get your request to them and back in less than 100 ms. On the same continent, at least.
Also note that any extra latency requires you to have many more nodes in the system. Losing 50% of available computing time to latency or anything else that doesn't parallelize requires you to double the computing capacity of the parallel portions just to keep up.
I doubt a cloud computing system would be the best fit for a problem like this. My impression at least is that the proponents of cloud computing would prefer to not even tell you where your machines are. Certainly I haven't seen any latency terms in the SLAs that are available.

You have conflicting requirements. You're requirement for 100ms latency is directly at odds with your desire to only run your program sporadically.
One of the characteristics of the Google-search type approach you mentioned in your question is that the latency of the cluster is dependent on the slowest node. So you could have 1499 machines respond in under 100ms, but if one machine took longer, say 1s - whether due to a retry, or because it needed to page you application in, or bad connectivity - your whole cluster would take 1s to produce an answer. It's inescapable with this approach.
The only way to achieve the kinds of latencies you're seeking would be to have all of the machines in your cluster keep your program loaded in RAM - along with all the data it needs - all of the time. Having to load your program from disk, or even having to page it in from disk, is going to take well over 100ms. As soon as one of your servers has to hit the disk, it is game over for your 100ms latency requirement.
In a shared server environment, which is what we're talking about here given your cost constraints, it is a near certainty that at least one of your 1500 servers is going to need to hit the disk in order to activate your app.
So you are either going to have to pay enough to convince someone to keep you program active and in memory at all times, or you're going to have to loosen your latency requirements.

Two trains of thought:
a) if those restraints are really, absolutely, truly founded in common sense, and doable in the way you propose in the nth edit, it seems the presupplied data is not huge. So how about trading storage for precomputation to time. How big would the table(s) be? Terabytes are cheap!
b) This sounds a lot like a employer / customer request that is not well founded in common sense. (from my experience)
Let´s assume the 15 minutes of computation time on one core. I guess thats what you say.
For a reasonable amount of money, you can buy a system with 16 proper, 32 hyperthreading cores and 48 GB RAM.
This should bring us in the 30 second range.
Add a dozen Terabytes of storage, and some precomputation.
Maybe a 10x increase is reachable there.
3 secs.
Are 3 secs too slow? If yes, why?

Sounds like you need to utilise an algorithm like MapReduce: Simplified Data Processing on Large Clusters
Wiki.

Check out Parallel computing and related articles in this WikiPedia-article - "Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers." ... http://en.wikipedia.org/wiki/Parallel_computing

Although Cloud Computing is the cool new kid in town, your scenario sounds more like you need a cluster, i.e. how can I use parallelism to solve a problem in a shorter time.
My solution would be:
Understand that if you got a problem that can be solved in n time steps on one cpu, does not guarantee that it can be solved in n/m on m cpus. Actually n/m is the theoretical lower limit. Parallelism is usually forcing you to communicate more and therefore you'll hardly ever achieve this limit.
Parallelize your sequential algorithm, make sure it is still correct and you don't get any race conditions
Find a provider, see what he can offer you in terms of programming languages / APIs (no experience with that)

What you're asking for doesn't exist, for the simple reason that doing this would require having 1500 instances of your application (likely with substantial in-memory data) idle on 1500 machines - consuming resources on all of them. None of the existing cloud computing offerings bill on such a basis. Platforms like App Engine and Azure don't give you direct control over how your application is distributed, while platforms like Amazon's EC2 charge by the instance-hour, at a rate that would cost you over $2000 a day.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio