What are the scalability benefits of splitting a monolithic codebase into microservices? [closed] - microservices

Let's say that Facebook runs as a single binary. Deploying N copies of this service on N servers splits the workload evenly, fine. Now, if I split the Facebook codebase in half, how exactly is that more scalable? (in the sense of y-axis scaling from this article)
If I allocate 1 server to the first half and 2 servers to the second half, it will certainly be faster than one monolithic server, because now we have 3. That's exactly like x-axis scaling, only now with uneven load balancing.
But consider servers at 25% of the original size. Right from startup, these servers have a higher percentage of their RAM in use, because splitting the code in half doesn't halve the RAM footprint. Each server wastes more RAM on duplicated library code, etc.
I wonder if there is any benefit to using microservices from this performance/computing-resource perspective.

A microservices architecture lets developers build an application as a combination of small, independent services, each implementing a separate component of the application...
Now... if we have a monolithic application and identify that one specific feature receives more traffic/requests than the others, we have to scale the whole application, including the parts with little or zero traffic; and if that one part crashes, the whole application crashes! In a microservices architecture, every small feature can be deployed as its own small service. In that scenario we can scale only the part of the system we need to, and even if that service crashes, all the other parts of the system stay alive... This scenario shows us two benefits: isolation and scalability.
When you think about RAM consumption, take the scenario above into consideration: we can spawn a service only when we need it, and when we don't, we can reduce it to a lower number of instances, and this is where the gains from a performance/computing-efficiency perspective come from... Looking at storage, and still thinking about footprint, in a monolithic application one group of features may use certain libraries while another group uses others; here the problem remains! Scaling a monolithic application means scaling a lot of libraries and resources that we don't need to scale.
If we think about the life cycle of an application, new features can be delivered by developing or updating small services, and the functionality of a single microservice is far easier to understand than an entire monolithic app. The microservice approach also lets developers choose the right tool for the right task: they can build each service using whatever language or framework they need without affecting communication with the other microservices. This scenario shows us two more benefits: productivity and flexibility.

Related

Data communication in a scalable microservice architecture

We are working on a project to gain some knowledge about microservices and automatically scalable architectures. In this project we are building a small game where a user can fly a plane and shoot down other players online, hosted on Amazon Web Services. The duration of a game should be about 10 minutes, a million games should (theoretically) be playable at the same time, and about a thousand players should be able to play in a single game. So the application must really be scalable.
We are now hitting a hard part in the architecture. We want the server to calculate the positions of the players, meaning that the server gets key-input requests from which it recalculates positions. The problem is that, because the application is scalable and there isn't just one server doing all the calculations and holding all the data, the input events will probably end up in different locations. We expect that constantly writing all positions to a database and reading them back to the client would be too slow and not scalable enough. Also, we don't want dedicated servers for single games, as that would just waste computing power (and money).
We have searched for implementations used by other game architectures, like messaging, but to no avail; we could not find any method that seemed fitting. We would like to know if there is any kind of pattern that could make this kind of implementation work? All we really need is a sense of direction towards some possible patterns.
Try ElastiCache: http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
It makes it easy to share positions between nodes.
The documentation discusses using it for a score table, but it should be possible to use it for positional data as well.
Combine ElastiCache with Auto Scaling (http://docs.aws.amazon.com/autoscaling/latest/userguide/WhatIsAutoScaling.html) and you should be able to expand the environment with demand.
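As a rough illustration (not from the answer above), here is what sharing positions through an ElastiCache node might look like from Java, assuming the Redis engine and the Jedis client; the key layout and class names are hypothetical:

import java.util.Map;
import redis.clients.jedis.Jedis;

public class PositionStore {
    private final Jedis jedis;

    public PositionStore(String host, int port) {
        // ElastiCache with the Redis engine exposes a plain Redis endpoint
        this.jedis = new Jedis(host, port);
    }

    // Write one player's position; a hash per game keeps all positions
    // for that game under a single key.
    public void savePosition(String gameId, String playerId, double x, double y) {
        jedis.hset("game:" + gameId + ":positions", playerId, x + "," + y);
    }

    // Read every player's last known position for a game in one round trip,
    // so any node can rebuild the game state it needs.
    public Map<String, String> loadPositions(String gameId) {
        return jedis.hgetAll("game:" + gameId + ":positions");
    }
}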
Your example sounds like a prime use case for a streaming platform such as Apache Kafka. It is a scalable cluster itself and acts as a large queue of events (your game inputs) that are stored and made available to stream consumers (all your game servers). It offers very high performance and should be able to handle millions of inputs per second with low latency.
You should also make sure to split your game world into broader "zones", so that not every server needs the data from all the others at all times. Surely no player has every other player on their screen at any point in time.
Look into the Kafka examples, and the performance measurements comparing Kafka to traditional databases.
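For concreteness, a minimal sketch of publishing game inputs to Kafka from Java; the broker addresses, topic name and JSON payload are made up, and keying by zone is one way to realize the zoning idea above:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class InputEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092"); // hypothetical brokers
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "1"); // trade a little durability for lower latency

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by zone routes all inputs for one zone to the same partition,
            // so the game server consuming that partition sees them in order.
            String zoneKey = "zone-42";
            String event = "{\"player\":\"p7\",\"key\":\"UP\",\"ts\":1700000000}";
            producer.send(new ProducerRecord<>("game-inputs", zoneKey, event));
        }
    }
}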

What are the cons of a purely stream-based architecture versus a Lambda architecture? [closed]

Disclaimer: I'm not a real-time architecture expert; I'd just like to throw out a couple of personal considerations and evaluate what others would suggest or point out.
Let's imagine we'd like to design a real-time analytics system. Following Nathan Marz's definition of the Lambda architecture, in order to serve the data we would need a batch processing layer (e.g. Hadoop), continuously recomputing views from a dataset of all the data, and a so-called speed layer (e.g. Storm) that constantly processes a subset of the views (made from the events coming in after the last full recomputation of the batch layer). You query your system by merging the results of the two together.
The rationale behind this choice makes perfect sense to me, and it's a combination of software engineering and systems engineering observations. Having an ever-growing master dataset of immutable, timestamped facts makes the system resilient to human errors in computing the views (if you make an error, you just fix it and recompute the views in the batch layer) and enables the system to answer virtually any query that might come up in the future. Also, such a datastore would only need to support random reads and batch inserts, whereas the datastore for the speed/real-time part would need to efficiently support random reads and random writes, increasing its complexity.
My objection, and the trigger for this discussion, is that in certain scenarios this approach might be overkill. For the sake of discussion, assume we make a couple of simplifications:
Let's assume that in our analytics system we can define beforehand an immutable set of use cases/queries that our system needs to be able to serve, and that they won't change in the future.
Let's assume that we have a limited amount of resources (engineering power, infrastructure, etc.) to implement it. Storing the whole set of elementary events coming into our system, instead of precomputed views/aggregates, may simply be too expensive.
Let's assume that we successfully minimize the impact of human mistakes (...).
The system would still need to be scalable and handle ever-increasing traffic and data.
Given these observations, I'd like to know what would stop us from designing a fully stream-oriented architecture. What I imagine is an architecture where events (e.g. page views) are pushed into a stream, which could be RabbitMQ + Storm or Amazon Kinesis, and where the consumers of such streams directly update the needed views through random writes/updates to a NoSQL database (e.g. MongoDB).
To a first approximation, it looks to me like such an architecture could scale horizontally. Storm can be clustered, and Kinesis' expected QoS can be reserved upfront. More incoming events would mean more stream consumers, and as they are totally independent, nothing stops us from adding more. As for the database, sharding it with a proper policy would let us distribute the increasing number of writes over an increasing number of shards. To keep reads from being affected, each shard could have one or more read replicas.
In terms of reliability, Kinesis promises to reliably store your messages for up to 24 hours, and a distributed RabbitMQ (or whatever queue system you choose) with proper use of acknowledgement mechanisms could probably satisfy the same requirement.
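As a sketch of those acknowledgement mechanisms, a RabbitMQ publisher in Java can combine a durable queue, persistent messages and publisher confirms; the host and queue name below are placeholders:

import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class ReliablePublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // hypothetical broker address

        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.confirmSelect();                                   // enable publisher confirms
            channel.queueDeclare("events", true, false, false, null);  // durable queue

            channel.basicPublish("", "events",
                    MessageProperties.PERSISTENT_TEXT_PLAIN,           // message survives broker restart
                    "pageview:/home".getBytes(StandardCharsets.UTF_8));

            channel.waitForConfirms(); // block until the broker acknowledges persistence
        }
    }
}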
Amazon's documentation on Kinesis deliberately (I believe) avoids locking you into a specific architectural solution, but my overall impression is that they would like to push developers to simplify the Lambda architecture and arrive at a fully stream-based solution similar to the one I've described.
To be slightly more compliant with the Lambda architecture's requirements, nothing stops us from having, in parallel with the consumers constantly updating our views, a set of consumers that process the incoming events and store them as atomic, immutable units in a different datastore that could be used in the future to produce new views (via Hadoop, for instance) or recompute faulty data.
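A minimal sketch of this dual-path idea, using Kafka as a stand-in for the stream (the question mentions RabbitMQ/Storm or Kinesis) and MongoDB for both the views and the immutable event log; all topic, database and collection names are illustrative:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;

public class PageViewConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "view-updaters");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        MongoClient mongo = MongoClients.create("mongodb://localhost");
        MongoCollection<Document> views = mongo.getDatabase("analytics").getCollection("pageview_counts");
        MongoCollection<Document> rawEvents = mongo.getDatabase("analytics").getCollection("raw_events");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    String url = rec.key();
                    // 1) random-write update of the precomputed view
                    views.updateOne(Filters.eq("url", url),
                                    Updates.inc("count", 1),
                                    new UpdateOptions().upsert(true));
                    // 2) append the immutable fact for future recomputation
                    rawEvents.insertOne(new Document("url", url)
                                            .append("payload", rec.value())
                                            .append("ts", rec.timestamp()));
                }
            }
        }
    }
}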
What's your opinion on this reasoning? I'd like to know in which scenarios a purely stream-based architecture would fail to scale, and if you have any other observations or pros/cons of a Lambda architecture vs. a stream-based architecture.

High Frequency Database to use with Ruby [closed]

I want to scrape a large number of webpages (1000/second) and save 1-2 numbers from each of these pages into a database. I want to manage the workers with RabbitMQ, but I also have to write the data somewhere.
Heroku PostgreSQL has a concurrency limit of 60 requests in their cheapest production tier.
Is PostgreSQL the best solution for this job?
Is it possible to set up a Postgres database to perform 1000 writes per second in development on my local machine?
Try it and see. If you've got an SSD, or don't require crash safety, then you almost certainly can.
You will find that, with anything you choose, you have to make trade-offs between durability and write latency.
If you want to commit each record individually, in strict order, you should be able to achieve that on a laptop with a decent SSD. You won't get it on something like a cheap AWS instance, or a server with a spinning-rust hard drive, though, as they don't have good enough disk flush rates (pg_test_fsync is a handy tool for checking this). This is true of anything that does genuine atomic commits of individual records to durable storage, not just PostgreSQL: about the best rate you're going to get is the max disk flush rate / 2, unless it's a purely append-only system, in which case the commit rate can be equal to the disk flush rate.
If you want higher throughput, you'll need to batch writes together and commit them in groups to amortize the disk-sync overhead. In the case of PostgreSQL, the commit_delay option can be useful to batch commits together. Better still, buffer a few changes client-side and do multi-valued inserts. You can also turn off synchronous_commit for a transaction if you don't need a hard guarantee that it's committed before control returns to your program.
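A hedged illustration of that batching advice, using JDBC for concreteness (the same SET command and multi-row batching apply equally from Ruby's pg gem); the table and connection details are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class BatchWriter {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/scrape", "scraper", "secret");

        // Relax durability for this session only: a crash may lose the last
        // few transactions, but committed data is never corrupted.
        try (Statement s = conn.createStatement()) {
            s.execute("SET synchronous_commit = off");
        }

        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO samples (url, value) VALUES (?, ?)")) {
            for (int i = 0; i < 1000; i++) {   // buffer many rows client-side
                ps.setString(1, "http://example.com/" + i);
                ps.setDouble(2, Math.random());
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit(); // one commit amortizes the disk flush over 1000 rows
        }
        conn.close();
    }
}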
I haven't tested it, but I expect Heroku will allow you to set both of these parameters on your sessions using SET synchronous_commit = off or SET commit_delay = .... You should test it and see; in fact, you should run a simulated workload benchmark and see if you can make it go fast enough for your needs.
If you can't, you'll be able to use alternative hosting that will, with appropriate configuration.
See also: How to speed up insertion performance in PostgreSQL
PostgreSQL is perfectly capable of handling such a job. Just to give you an idea, PostgreSQL 9.2 is expected to handle up to 14,000 writes per second, but this largely depends on how you configure, design, and manage the database, and on the available hardware (disk performance, RAM, etc.).
I assume the limit imposed by Heroku is there to avoid potential overloads. You may want to consider installing PostgreSQL on your own server, or alternative solutions. For instance, Amazon recently announced support for PostgreSQL on RDS.
Finally, I just want to mention that for the majority of standard tasks, the "best solution" is largely dependent on your knowledge. An efficiently configured MySQL is better than a badly configured PostgreSQL, and vice-versa.
I know companies that were able to reach unexpected results with a specific database by highly optimizing the setup and the configuration of the engine. There are exceptions, indeed, but I don't think they apply to your case.

In-Memory Data Grid for Java Project [closed]

I'm looking to use an in memory data grid for my java project. I know there are a few relevant products such as VMWare GemFire, GigaSpaces XAP, IBM eXtreme Scale and others. Can someone elaborate from their experience with any of these tools and how they compare to one another? Thanks, Alex
(disclaimer - I work for GigaSpaces)
Hi Alex
There are many criteria to compare by; it really depends on what you're trying to do. In-memory data grids have a lot of use cases, e.g. caching, OLTP, high-throughput event processing, etc.
In general, the main criteria you should be looking at are:
Programming model: support for popular Java frameworks such as Spring (XAP and GemFire support it natively).
Querying and indexing: if you want more than trivial key/value data access. Most people need SQL like semantics, or even full text search, and if the data grid can provide that out of the box it's a big advantage.
Ability to execute code on the grid nodes, and even colocate your code with them and handle events that are injected into the grid (e.g. objects written or updated). This is a massive scalability benefit and allows you to implement very efficient shared-nothing architectures.
Languages and APIs support: most data grids support at least Java and JVM-based languages (e.g. Scala), but many also support other languages and allow you to access the same data from various programming languages. For example, XAP natively supports Java, .NET and C++, plus other languages through its REST and memcached interfaces. As far as APIs go, some grids support more than one API; at GigaSpaces we support Map, Spring/POJO, JPA, JDBC and others.
Transactions: this is also a big one if you want to go anywhere beyond caching. When using memory as your system of record, you should be able to roll back state in case of an error or a bug, otherwise you end up with corrupt data. Another important consideration is which types of transactions are supported. A lot of data grids only support "local" transactions, i.e. within the boundaries of a single node / partition / shard (which is probably what you want in most cases for performance reasons). But more advanced grids also support distributed transactions and know how to seamlessly upgrade from local to distributed when needed.
Replication: there are various models here (synchronous, asynchronous, hybrid) and you need to decide which one of them is best for your use case. Some grids also have explicit support for cross cluster replication over WAN which is important if you're implementing DR.
Data partitioning and scalability: how the grid partitions data (fixed vs. consistent hashing — see the sketch after this list), what level of control the user has over it, and whether it supports dynamically adding servers to the grid to increase capacity.
Administration and monitoring: Last but not least - what kind of facilities are provided out of the box, such as monitoring and administration hooks (JMX or another administrative API), user interfaces and integration with other 3rd party systems.
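To make the partitioning criterion concrete, here is a minimal, grid-agnostic sketch of consistent hashing in Java (no particular product's API); real grids layer replication and stronger hash functions on top of this idea:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    // Adding a node claims `virtualNodes` points on the ring, so when the
    // cluster grows, only a fraction of the keys move to the new node.
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    // A key maps to the first node clockwise from its hash position.
    public String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes in ring");
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // Take 4 bytes of an MD5 digest as an unsigned ring position.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                 | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}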
The following links are a good place to start:
http://gojko.net/2009/06/01/oracle-coherence-vs-gigaspaces-xap/. Also read the comments
http://www.neovise.com/neovise-data-caching-performance-technical-white-paper - a recent comparison between GigaSpaces and GemFire which I think speaks for itself :)
HTH,
Uri
There's a summary of the In-Memory Data Grid market by Gartner called "Competitive Landscape: In-Memory Data Grids". You can see a copy at: http://www.gartner.com/technology/reprints.do?id=1-1HCCIMJ&ct=130718&st=sb
Oracle Coherence (previously Tangosol Coherence) is and has long been the market leader in this space, but it is a commercial product. As I explained elsewhere recently:
Pros:
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
RESTful APIs available from any language. Native APIs and client libraries for C/C++, C#, .NET and Java.
In addition to simple key-value (K/V) caching, also support queries (including some SQL), parallel queries, indexes (including custom indexes), a rich eventing model (for event-driven systems like exchanges), transactions (including MVCC), parallel execution of both scalar (EntryProcessor) and aggregate (ParallelAwareAggregator) functions, cache triggers, etc.
Easy to integrate with a database via read-through, read-ahead, write-through and write-behind caching. Automatically refreshes just the changed data when changes occur to the database (leveraging Oracle GoldenGate technology).
Coherence Incubator: The Coherence Incubator
memcache protocol support: Memcached interface for Oracle Coherence project on github
Thousands of customers, some using Coherence in production now for over a decade.
Cons:
As of Coherence 12.1.2, the cache itself is not persistent.
It costs money.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

Performance impact of having a data access layer/service layer?

I need to design a system which has these basic components:
A webserver which will be getting ~100 requests/sec; it only needs to dump data into the raw data repository.
A raw data repository with a single table that receives 100 rows/s from the webserver.
A raw data processing unit (simple processing, not much: removing invalid raw data, inserting missing components into damaged raw data, etc.).
A processed data repository.
Does it make sense in such a system to have a service layer on which all components would be built? All inter-component interaction will go through the service layers. While this would make the system easily upgradeable and maintainable, would it not also have a significant performance impact since I have so much traffic to handle?
Here's what can happen unless you guard against it.
In the communication between layers, some format is chosen, like XML. Then you build it and run it and find out the performance is not satisfactory.
Then you mess around with profilers which leave you guessing what the problem is.
When I worked on a problem like this, I used the stackshot technique and quickly found the problem. You would have thought it was I/O. NOT. It was that converting data to XML, and parsing XML to recover data structure, was taking roughly 80% of the time. It wasn't too hard to find a better way to do that. Result - a 5x speedup.
What do you see as the costs of having a separate service layer?
How do those costs compare with the costs you must incur? In your case that seems to be at least
a network read for the request
a database write for raw data
a database read of raw data
a database write of processed data
Plus some data munging.
What sort of services do you have in mind? Perhaps
saveRawData()
getNextRawData()
writeProcessedData()
Why is the overhead any more than a procedure call? A service does not need to imply a "separate process" or "web service marshalling".
I contend that structure is always of value; separation of concerns in your application really matters. In comparison with database activities, a few procedure calls will rarely cost much.
In passing: persisting the raw data might best be done via a queuing system. You can then get some natural scaling by having many queue readers on separate machines if you need them. In effect, the queuing system naturally introduces some service-like concepts. A minimal sketch of both ideas follows.
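Here is that sketch (all names hypothetical): services defined as plain Java interfaces cost no more than a method call, and a queue-backed implementation of the raw-data path could later be swapped for an external broker without touching callers:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Service boundaries as plain interfaces: callers stay decoupled, yet
// every call is an in-process method invocation with no marshalling cost.
interface RawDataService {
    void saveRawData(String record);
    String getNextRawData() throws InterruptedException;
}

class QueueBackedRawDataService implements RawDataService {
    // A bounded queue gives natural back-pressure; it could be replaced by
    // RabbitMQ/SQS later, since callers only see the interface.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100_000);

    @Override
    public void saveRawData(String record) {
        if (!queue.offer(record)) {
            throw new IllegalStateException("raw data buffer full");
        }
    }

    @Override
    public String getNextRawData() throws InterruptedException {
        return queue.take(); // blocks until a record is available
    }
}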
Personally, I feel that you might be focusing too much on low-level implementation details when designing the system. Before looking at how to lay out the components, assemblies or services, you should be thinking about how to architect the system.
You could start with the following high level statements from which to build your system architecture around:
Confirm the technical skill set of the development team and the operations/support team.
Agree on an initial finite list of systems that will integrate with your service, the protocols they support, and some SLAs.
Decide on the messaging strategy.
Understand how you will deploy your service/system.
Decide on the choice of middleware (ESBs, Message Brokers, etc), databases (SQL, Oracle, Memcache, DB2, etc) and 3rd party frameworks/tools.
Decide on your caching and data latency strategy.
Break your application into the various areas of business responsibility - this will allow you to split up the work and make it easier to communicate milestones during development, testing and implementation.
Design each component as required to meet its area of responsibility. The areas of responsibility should automatically lead you to decide how to design each component, assembly or service.
Obviously not all of the above will match your specific case but I would suggest that they should at least be given some thought.
Good luck.
Abstraction and tiering will introduce latency, but the real question is, what are you GAINING to make the cost(s) worthwhile? Loose coupling, governance, scalability, maintainability are worth real $.
Even the best-designed layered app will exhibit more latency than an app talking directly to a DB. Users who know the original system will feel the difference. They may not like it, so this can be a political issue as much as a technical one.
