Service architecture using technologies which provide parallelism and high scalability - hadoop

I'm working on a booking system with a single RDBMS. This system has units (products) with several characteristics (attributes) like: location, size [m2], has sea view, has air conditioner…
On the top of that there is pricing with its prices for different periods e.g. 1/1/2018 – 1/4/2018 -> 30$ ... Also, there is capacity with its own periods 1/8/2017 – 1/6/2018… Availability which is the same as capacity.
Each price can have its own type: per person, per stay, per item… There are restrictions for different age groups, extra bed, …
We are talking about 100k potential units. The end user can make request to search all units in several countries, for two adults and children of 3 and 7 years, for period 1/1/2018 – 1/8/2018, where are 2 rooms with one king size bed and one single bed + one extra bed. Also, there can be other rules which are handled by rule engine.
In classical approach filtering would be done in several iterations, trying to eliminate as much as possible in each iteration. There could be done several tables with semi results which must be synchronized with every change when something has been changed through administration.
Recently I was reading about Hadoop and Storm which are highly scalable and provide parallelism. I was wondering if this kind of technology is suitable for solving described problem. Main idea is to write “one method” which validates each unit, if satisfies given filter search. Later this function is easy to extend with additional logic. Each cluster could take its own portion of the load. If there are 10 cluster, each of them could process 10k units.
In Cloudera tutorial there is a moment when with Sqoop, content from RDBMS has been transferred to HDFS. This process takes some time, so it seems it’s not a good approach to solve this problem. Given problem is highly deterministic and it requires to have immediate synchronization and to operates with fresh data. Maybe to use in some streaming service and to parallelly write into HDFS and RDBMS? Do you recommend some other technology like Storm?
What could be possible architecture, starting point, to satisfy all requirements to solve this problem.
Please point me into right direction if this problem is improper for the site.

Related

How does Anomaly detection in Elasticsearch work?

I have a question about the Anomaly Detection module provided by elastic stack. As per my understanding of Machine Learning the more data being fed to the model the better learning it will do provided the data is proper. Now I want to use the Anomaly Detection Module in kibana. I did some testing with that and with some reading I found that basically it is better that we have at least 3 weeks of data or 20 buckets worth. Now lets say we receive about 40 million records a day. This will take a whole lot of time for the model to train for a day itself now if get about 3 weeks worth of this amount of data this will put a lot of pressure on the node. But if I feed the model less data and reduce the bucket span it will make my model more sensitive. So what is my best bet for this. How is that I can make the most out of the Anomaly Detection module.
Just FYI: I do have a dedicated Machine learning Node with equipped with more than enough memory but it still takes a whole lot of time to process records for a day so my concern is it will take a whole whole lot of time to process 3 weeks worth of data.
So My question is that if we give large amount of data for short amount of time say 1 week to the model for training and if we give large amount of data for a slightly longer amount of time say 3 weeks to the model for training will these two models detect anomalies with the same accuracy.
If you have a dedicated ML node with ample memory, I don't see what the problem could be. Common sense has it that the more data you have, the better the model can learn, and the more accurate your prediction model will be. Also seasonality might not be well captured with just one week of data. If you have the data and are not using it out of fear that it will take a "some time" to analyze it, what's the point of gathering it in the first place?
It is true that it will take "some time" to build the model initially, but afterwards, the ML process will run more frequently depending on your chosen bucket span size (configurable) and process the new documents that arrived in the meantime, it's really fast. Regarding sensitivity, your mileage may vary, but it's not dependent only on the amount of data you feed, but also on the size of the bucket span you choose.

Data communication in a scalable microservice architecture

We are working on a project to gain some knowledge about microservices and automatically scalable architectures. In this project we are building a small game where a user can fly a plane and shoot down other players online, hosted on the Amazon Web Services. The duration of a game should be about 10 minutes, a million games should (theoretically) be able to be played at the same time and about a thousand players should be able to play in a single game. So the application must really be scalable.
We are now hitting a hard part in the architecture. We want the server to calculate the positions of the players. Meaning that server gets key input requests with which it recalculates positions. Problem is that, because the application is scalable and there isn't just one server doing all the calculations and holding all the data, the input events will probably end up in different locations. We expect that constantly writing all positions to a database and reading it back to the client is too slow nor scalable enough. Also we don't want dedicated servers for single games as that could just waist the computation power (and money)
We have searches for different implementations by other game architectures, like messaging, but to no avail, we could not find any method that seamed fitting. We would like to know if there is any kind of pattern that could make this kind of implementation work? All we really need is a sense of direction for some possible patterns.
Try ElasticCache http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
This makes it easy to share positions between nodes
They discuss using it for a score table but it might be possible to use it for positional data
Combine ElasticCache with autoscaling http://docs.aws.amazon.com/autoscaling/latest/userguide/WhatIsAutoScaling.html and you should be able to expand the environment with demand
Your example sounds like a prime use case for a streaming platform such as Apache Kafka. It is a scalable cluster itself and acts as a large queue of events (your game inputs) that are stored and made available to stream consumers (all your game servers). This has a very high performance and should be able to handle millions of inputs per second with a low latency.
You should also make sure to split up your game world into broader "zones" as to make sure that not every server requires the data from all others always. I'm sure that no player has all other players on his screen at any point in time.
Look into the Kafka examples
And the performance measurements with comparison to traditional DBs.

How to distribute data and computation to maximize locality?

Please bear with me, this is a basic architectural question for my first attempt at a "big data" project, but I believe your answers will be of general interest to anyone who is starting out in this field.
I've googled and read the high-level descriptions of Kafka, Storm, Memcached, MongoDB, etc., but now that I'm ready to dig in to start designing my app, I still need some further insight on how in fact the data should be distributed and shared.
The performance of my app is critical, so one objective is to somehow maximize the locality of the data in the RAM of the machines doing the distributed calculations. I need advice for this part of the design.
If my app had some clear criteria for a priori sharding the data and distributing the calculations (such as geographical regions or company divisions) then the solution would be obvious. But unfortunately my app's data access patterns are dynamic and depend on the results of previous calculations.
My app is an analysis program with distinct stages. In the first stage, all the data is accessed once and a metric is calculated for each data object. In the second stage, a subset of the data objects may be accessed, with the probability of access being proportional to each data object's metric that was calculated in the previous stage. In the final stage, a relatively small subset of data objects will be accessed many times for many calculations.
At all stages, it is required that the calculations be distributed across several servers. The calculations are embarassingly parallel, and each distributed calculation only needs to access a few data objects. It is also required that the number of servers can be specified before the app runs (for example, run on one server, or run on fifty servers).
It seems to me that I need some mechanism that distributes the appropriate data objects to the appropriate compute servers, as opposed to just blindly fetching the data from some database service (whether centralized or distributed). Also, it seems to me that some sort of smart caching system might be appropriate, since the data access pattern depends on the previous calculations and cannot be predicted a priori. But as far as I can tell, Memcached is not such a system because the sharding is determined a priori.
I've read many times that the operating system cache performs better than any monkeying around that we may try. I think the ideal solution is that each compute server's RAM cache somehow captures the data objects' dynamic access patterns, but it's not clear to me how this would work with a NoSQL or Memcached service.
Thanks for bearing with me this far. I realize this is a basic question, but the answer eludes me so far. I can't resolve the dynamic access patterns of my app with the a priori sharding of the NoSQL/Memcached packages. Any advice would be greatly appreciated.
I recommend you to take a look at http://tarantool.org. Shard to maximize locality for the most common data access pattern, use Lua for local computations, and net.box to issue a remote RPC when calculation needs to continue on another node. All data is stored in RAM, if you write your computation code carefully it could take advantage of the Just In Time compiler.

What distributed message queues support millions of queues?

I'm looking for a distributed message queue that will support millions of queues, with each queue handling tens of messages per second.
The messages will be small (tens of bytes), and I don't expect the queues to get very long--on the order of tens of messages per queue at maximum, but when the system is humming along, the queues should stay fairly empty.
I'm not sure how many nodes to expect in the cluster--probably depends on the specific solution, but if I had to guess, I would say ten nodes. I would prefer that queues were relatively resilient to individual node failures within the cluster, but a few lost messages here and there won't make me lose sleep.
Does such a message queue exist? Seems like most of the field is optimized toward handling hundreds of queues with high throughput. But what is SQS built on? Surely not magic.
Update:
By request, it may indeed help to shed light on my problem domain. (I'd left details out before so as not to muddy the waters.) I'm experimenting with distributed cellular automata, with an initial target of a million cells in simulation. In some CA models, it's useful to add an event model, so that a cell can send events to its neighbors. Hence, a million queues, each with one consumer and 8 or so producers.
Costs are a concern for now, as I'm funding the experiments myself. (Thus Amazon's SQS is probably out of reach.)
From your description, it looks like OMG's Data Distribution Service could be a good fit. It is related to message queueing technologies, but I would rather call it a distributed data management infrastructure. It is completely distributed and supports advanced features that give you a lot of control over how the data is distributed, by means of a rich set of Quality of Service settings.
Not knowing much about your problem, I could guess what an approach might be. DDS is about distributing the state of strongly-typed data-items, as structures with typed attributes. You could create a data-type describing the state of an automaton. One of its attributes could be an ID uniquely identifying the automaton in the system. If possible, that would be assigned according to a scheme such that every automaton knows what the ID's of its neighbors are (if they are present). Each automaton would publish its state as needed, resulting in a distributed data-space containing the current state of all automatons. DDS supports so-called partitioning of that data-space. If you took advantage of that, then each of the nodes in your machine would be responsible for a well-defined subset of all automatons. Communication over the wire would only happen for those automatons neighboring a different partition. Since automatons know the ID's of their neighbors, they would be able to query the data-space for the states of the automatons it's interested in.
It is a bit hard to explain without a white board, but the end-result would be a single instance (which are a sort of very light-weight message queues) for most automatons, and two or three instances for those automatons at the border of a partition. If you had ten nodes and one million automata, then each node would have to be able to hold administration for approximately hundred thousand automata. I have seen systems being built with DDS of that scale, and larger, with tens of updates per second for each instance. The nice thing is that this technology scales well with the number of nodes, so you could bring down the resource load per node by adding more nodes.
If this is a research project, then you might even be able to use a commercial product without charge. Just google on dds research license.

How to create a system with 1500 servers that deliver results instantaneously?

I want to create a system that delivers user interface response within 100ms, but which requires minutes of computation. Fortunately, I can divide it up into very small pieces, so that I could distribute this to a lot of servers, let's say 1500 servers. The query would be delivered to one of them, which then redistributes to 10-100 other servers, which then redistribute etc., and after doing the math, results propagate back again and are returned by a single server. In other words, something similar to Google Search.
The problem is, what technology should I use? Cloud computing sounds obvious, but the 1500 servers need to be prepared for their task by having task-specific data available. Can this be done using any of the existing cloud computing platforms? Or should I create 1500 different cloud computing applications and upload them all?
Edit: Dedicated physical servers does not make sense, because the average load will be very, very small. Therefore, it also does not make sense, that we run the servers ourselves - it needs to be some kind of shared servers at an external provider.
Edit2: I basically want to buy 30 CPU minutes in total, and I'm willing to spend up to $3000 on it, equivalent to $144,000 per CPU-day. The only criteria is, that those 30 CPU minutes are spread across 1500 responsive servers.
Edit3: I expect the solution to be something like "Use Google Apps, create 1500 apps and deploy them" or "Contact XYZ and write an asp.net script which their service can deploy, and you pay them based on the amount of CPU time you use" or something like that.
Edit4: A low-end webservice provider, offering asp.net at $1/month would actually solve the problem (!) - I could create 1500 accounts, and the latency is ok (I checked), and everything would be ok - except that I need the 1500 accounts to be on different servers, and I don't know any provider that has enough servers that is able to distribute my accounts on different servers. I am fully aware that the latency will differ from server to server, and that some may be unreliable - but that can be solved in software by retrying on different servers.
Edit5: I just tried it and benchmarked a low-end webservice provider at $1/month. They can do the node calculations and deliver results to my laptop in 15ms, if preloaded. Preloading can be done by making a request shortly before the actual performance is needed. If a node does not respond within 15ms, that node's part of the task can be distributed to a number of other servers, of which one will most likely respond within 15ms. Unfortunately, they don't have 1500 servers, and that's why I'm asking here.
[in advance, apologies to the group for using part of the response space for meta-like matters]
From the OP, Lars D:
I do not consider [this] answer to be an answer to the question, because it does not bring me closer to a solution. I know what cloud computing is, and I know that the algorithm can be perfectly split into more than 300,000 servers if needed, although the extra costs wouldn't give much extra performance because of network latency.
Lars,
I sincerely apologize for reading and responding to your question at a naive and generic level. I hope you can see how both the lack of specifity in the question itself, particularly in its original form, and also the somewhat unusual nature of the problem (1) would prompt me respond to the question in like fashion. This, and the fact that such questions on SO typically emanate from hypotheticals by folks who have put but little thought and research into the process, are my excuses for believing that I, a non-practionner [of massively distributed systems], could help your quest. The many similar responses (some of which had the benefits of the extra insight you provided) and also the many remarks and additional questions addressed to you show that I was not alone with this mindset.
(1) Unsual problem: An [apparently] mostly computational process (no mention of distributed/replicated storage structures), very highly paralellizable (1,500 servers), into fifty-millisecondish-sized tasks which collectively provide a sub-second response (? for human consumption?). And yet, a process that would only be required a few times [daily..?].
Enough looking back!
In practical terms, you may consider some of the following to help improve this SO question (or move it to other/alternate questions), and hence foster the help from experts in the domain.
re-posting as a distinct (more specific) question. In fact, probably several questions: eg. on the [likely] poor latency and/or overhead of mapreduce processes, on the current prices (for specific TOS and volume details), on the rack-awareness of distributed processes at various vendors etc.
Change the title
Add details about the process you have at hand (see many questions in the notes of both the question and of many of the responses)
in some of the questions, add tags specific to a give vendor or technique (EC2, Azure...) as this my bring in the possibly not quite unbuyist but helpful all the same, commentary from agents at these companies
Show that you understand that your quest is somewhat of a tall order
Explicitly state that you wish responses from effective practionners of the underlying technologies (maybe also include folks that are "getting their feet wet" with these technologies as well, since with the exception of the physics/high-energy folks and such, who BTW traditionnaly worked with clusters rather than clouds, many of the technologies and practices are relatively new)
Also, I'll be pleased to take the hint from you (with the implicit non-veto from other folks on this page), to delete my response, if you find that doing so will help foster better responses.
-- original response--
Warning: Not all processes or mathematical calculations can readily be split in individual pieces that can then be run in parallel...
Maybe you can check Wikipedia's entry from Cloud Computing, understanding that cloud computing is however not the only architecture which allows parallel computing.
If your process/calculation can efficitively be chunked in parallelizable pieces, maybe you can look into Hadoop, or other implementations of MapReduce, for an general understanding about these parallel processes. Also, (and I believe utilizing the same or similar algorithms), there also exist commercially available frameworks such as EC2 from amazon.
Beware however that the above systems are not particularly well suited for very quick response time. They fare better with hour long (and then some) data/number crunching and similar jobs, rather than minute long calculations such as the one you wish to parallelize so it provides results in 1/10 second.
The above frameworks are generic, in a sense that they could run processes of most any nature (again, the ones that can at least in part be chunked), but there also exist various offerings for specific applications such as searching or DNA matching etc. The search applications in particular can have very short response times (cf Google for example) and BTW this is in part tied to fact that such jobs can very easily and quickly be chunked for parallel processing.
Sorry, but you are expecting too much.
The problem is that you are expecting to pay for processing power only. Yet your primary constraint is latency, and you expect that to come for free. That doesn't work out. You need to figure out what your latency budgets are.
The mere aggregating of data from multiple compute servers will take several milliseconds per level. There will be a gaussian distribution here, so with 1500 servers the slowest server will respond after 3σ. Since there's going to be a need for a hierarchy, the second level with 40 servers , where again you'll be waiting for the slowest server.
Internet roundtrips also add up quickly; that too should take 20 to 30 ms of your latency budget.
Another consideration is that these hypothethical servers will spend much of their time idle. That means they're powered on, drawing electricity yet not generating revenue. Any party with that many idle servers would turn them off, or at the very least in sleep mode just to conserve electricity.
MapReduce is not the solution! Map Reduce is used in Google, Yahoo and Microsoft for creating the indexes out of the huge data (the whole Web!) they have on their disk. This task is enormous and Map Reduce was built to make it happens in hours instead of years, but starting a Master controller of Map Reduce is already 2 seconds, so for your 100ms this is not an option.
Now, from Hadoop you may get advantages out of the distributed file system. It may allow you to distribute the tasks close to where the data is physically, but that's it. BTW: Setting up and managing an Hadoop Distributed File System means controlling your 1500 servers!
Frankly in your budget I don't see any "cloud" service that will allow you to rent 1500 servers. The only viable solution, is renting time on a Grid Computing solution like Sun and IBM are offering, but they want you to commit to hours of CPU from what I know.
BTW: On Amazon EC2 you have a new server up in a couple of minutes that you need to keep for an hour minimum!
Hope you'll find a solution!
I don't get why you would want to do that, only because "Our user interfaces generally aim to do all actions in less than 100ms, and that criteria should also apply to this".
First, 'aim to' != 'have to', its a guideline, why would u introduce these massive process just because of that. Consider 1500 ms x 100 = 150 secs = 2.5 mins. Reducing the 2.5 mins to a few seconds its a much more healthy goal. There is a place for 'we are processing your request' along with an animation.
So my answer to this is - post a modified version of the question with reasonable goals: a few secs, 30-50 servers. I don't have the answer for that one, but the question as posted here feels wrong. Could even be 6-8 multi-processor servers.
Google does it by having a gigantic farm of small Linux servers, networked together. They use a flavor of Linux that they have custom modified for their search algorithms. Costs are software development and cheap PC's.
It would seem that you are indeed expecting at least 1000-fold speedup from distributing your job to a number of computers. That may be ok. Your latency requirement seems tricky, though.
Have you considered the latencies inherent in distributing the job? Essentially the computers would have to be fairly close together in order to not run into speed of light issues. Also, the data center in which the machines would be would again have to be fairly close to your client so that you can get your request to them and back in less than 100 ms. On the same continent, at least.
Also note that any extra latency requires you to have many more nodes in the system. Losing 50% of available computing time to latency or anything else that doesn't parallelize requires you to double the computing capacity of the parallel portions just to keep up.
I doubt a cloud computing system would be the best fit for a problem like this. My impression at least is that the proponents of cloud computing would prefer to not even tell you where your machines are. Certainly I haven't seen any latency terms in the SLAs that are available.
You have conflicting requirements. You're requirement for 100ms latency is directly at odds with your desire to only run your program sporadically.
One of the characteristics of the Google-search type approach you mentioned in your question is that the latency of the cluster is dependent on the slowest node. So you could have 1499 machines respond in under 100ms, but if one machine took longer, say 1s - whether due to a retry, or because it needed to page you application in, or bad connectivity - your whole cluster would take 1s to produce an answer. It's inescapable with this approach.
The only way to achieve the kinds of latencies you're seeking would be to have all of the machines in your cluster keep your program loaded in RAM - along with all the data it needs - all of the time. Having to load your program from disk, or even having to page it in from disk, is going to take well over 100ms. As soon as one of your servers has to hit the disk, it is game over for your 100ms latency requirement.
In a shared server environment, which is what we're talking about here given your cost constraints, it is a near certainty that at least one of your 1500 servers is going to need to hit the disk in order to activate your app.
So you are either going to have to pay enough to convince someone to keep you program active and in memory at all times, or you're going to have to loosen your latency requirements.
Two trains of thought:
a) if those restraints are really, absolutely, truly founded in common sense, and doable in the way you propose in the nth edit, it seems the presupplied data is not huge. So how about trading storage for precomputation to time. How big would the table(s) be? Terabytes are cheap!
b) This sounds a lot like a employer / customer request that is not well founded in common sense. (from my experience)
Let´s assume the 15 minutes of computation time on one core. I guess thats what you say.
For a reasonable amount of money, you can buy a system with 16 proper, 32 hyperthreading cores and 48 GB RAM.
This should bring us in the 30 second range.
Add a dozen Terabytes of storage, and some precomputation.
Maybe a 10x increase is reachable there.
3 secs.
Are 3 secs too slow? If yes, why?
Sounds like you need to utilise an algorithm like MapReduce: Simplified Data Processing on Large Clusters
Wiki.
Check out Parallel computing and related articles in this WikiPedia-article - "Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers." ... http://en.wikipedia.org/wiki/Parallel_computing
Although Cloud Computing is the cool new kid in town, your scenario sounds more like you need a cluster, i.e. how can I use parallelism to solve a problem in a shorter time.
My solution would be:
Understand that if you got a problem that can be solved in n time steps on one cpu, does not guarantee that it can be solved in n/m on m cpus. Actually n/m is the theoretical lower limit. Parallelism is usually forcing you to communicate more and therefore you'll hardly ever achieve this limit.
Parallelize your sequential algorithm, make sure it is still correct and you don't get any race conditions
Find a provider, see what he can offer you in terms of programming languages / APIs (no experience with that)
What you're asking for doesn't exist, for the simple reason that doing this would require having 1500 instances of your application (likely with substantial in-memory data) idle on 1500 machines - consuming resources on all of them. None of the existing cloud computing offerings bill on such a basis. Platforms like App Engine and Azure don't give you direct control over how your application is distributed, while platforms like Amazon's EC2 charge by the instance-hour, at a rate that would cost you over $2000 a day.

Resources