Dealing with tons of queries and avoiding duplicates [closed] - go

My project involves concurrency and database management, meaning that I have to edit a database simultaneously from many threads. To be more specific, I read a line from the database and then insert a line to mark that I grabbed that line. This could work with transactions, but because I will be running this program on multiple machines, I will have a different database connection on each one. Is there a better way for me to accomplish the task above?

Applying optimistic concurrency using transactions and a version field/column (which could be a timestamp, a timestamp plus an incrementing version number, or some other versioning mechanism) is a must here.
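For illustration, a minimal sketch of such a version check in Go with database/sql, assuming a hypothetical rows table with id, version and claimed_by columns and PostgreSQL-style placeholders:

    package worker

    import "database/sql"

    // claimRow tries to claim a row using optimistic concurrency: the UPDATE
    // only succeeds if the version we read earlier is still the current one.
    // Table and column names here are hypothetical.
    func claimRow(db *sql.DB, id, seenVersion int64, workerID string) (bool, error) {
        res, err := db.Exec(
            `UPDATE rows
                SET claimed_by = $1, version = version + 1
              WHERE id = $2 AND version = $3`,
            workerID, id, seenVersion,
        )
        if err != nil {
            return false, err
        }
        n, err := res.RowsAffected()
        if err != nil {
            return false, err
        }
        // 0 rows affected means another machine claimed or changed the row first.
        return n == 1, nil
    }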
But since you are doing this on different machines, a substantial number of repeated failed transactions may occur.
To prevent this, you could use a queuing mechanism. A dispatcher program reads the unprocessed records from the database and dispatches them to workers - using a queue or a job dispatcher. Each worker then takes an id from that queue and processes it in a transaction.
This way:
if a transaction fails, the dispatcher queues it again
if a worker goes down, the other workers continue (detecting the failure is a matter of monitoring)
workers can easily scale out, and new workers can be added at any time (as long as your database is not your bottleneck)
A request/reply scheme would work best in this case to prevent queue congestion. I've used NATS successfully (and happily) for similar cases. Of course you could use another tool of your choice, but remember that you have to take care of the request/reply part. Just throwing things at queues does not solve all problems, and the amount of queued work should be controlled!
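To make the dispatcher/worker split concrete, here is a minimal in-process sketch in Go; a channel stands in for the real queue (NATS or similar), and processRow is a placeholder for the transactional work described above:

    package main

    import (
        "fmt"
        "sync"
    )

    // processRow stands in for the real work: claim the row in a transaction
    // (e.g. with the optimistic-concurrency check above) and handle it.
    func processRow(id int64) error {
        fmt.Println("processing", id)
        return nil
    }

    func main() {
        ids := make(chan int64)         // stand-in for the queue fed by the dispatcher
        failed := make(chan int64, 100) // ids the dispatcher should re-queue

        var wg sync.WaitGroup
        for w := 0; w < 4; w++ { // a handful of workers; scale out by adding more
            wg.Add(1)
            go func() {
                defer wg.Done()
                for id := range ids {
                    if err := processRow(id); err != nil {
                        failed <- id
                    }
                }
            }()
        }

        // Dispatcher: in reality this reads unprocessed record ids from the database.
        for id := int64(1); id <= 10; id++ {
            ids <- id
        }
        close(ids)
        wg.Wait()
    }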

Related

What are the scalability benefits of splitting a monolithic codebase into microservices? [closed]

Let's say that Facebook runs as a single binary. Publishing N copies of this service on N servers splits the workload evenly, fine. Now, if I split Facebook's codebase in half, how exactly is it more scalable? (in the sense of y-axis scaling from this article).
If I allocate 1 server for the first half and 2 servers for the second half, it will certainly be faster than one monolithic server, because now we have 3. That's exactly like x-axis scaling, only now you have uneven load balancing.
But consider servers that are 25% of the original size. Right from startup, these servers have a higher percentage of used RAM. This is because splitting the code in half doesn't imply halving the RAM footprint: each server will be wasting more RAM on duplicated library code, etc.
I wonder if there is any benefit to using microservices from this performance/computing-resource perspective.
A microservices architecture allows developers to create separate components of an application by building the application from a combination of small services...
Now... if we have a monolithic application and identify that one specific feature receives more traffic/requests than the others, we need to scale the whole application, including the parts with little or zero traffic, and if that part of the application crashes, the whole application crashes! In a microservices architecture, every small feature can be deployed as a small microservice; in this scenario we can scale only the part of the system that we need, and even if that service crashes, all the other parts of the system stay alive... This scenario shows us two benefits: isolation and scalability.
When you think about RAM consumption, you need to take the scenario above into consideration: we can spawn a service only when we need it, and when we don't, we can reduce it to a lower number of instances, and here we can obtain gains from the performance/computing-efficiency perspective... Looking at storage and still thinking about footprint, a monolithic application may use some libraries for one specific group of features and other libraries for another group... here the problem remains! In a monolithic application we may end up scaling a lot of libraries and resources that we do not need to scale.
If we think about the life cycle of an application, new features can be delivered by developing or updating small services, so another quality associated with the microservice architecture is that each service is easier to understand compared to an entire monolithic app. The microservice approach lets developers choose the right tools for the right task: they can build each service using whatever language or framework they need without affecting the communication between other microservices. This scenario shows us two more benefits: productivity and flexibility.

What are the cons of a purely stream-based architecture compared to a Lambda architecture? [closed]

Disclaimer: I'm not a real-time architectures expert, I'd like only to throw a couple of personal considerations and evaluate what others would suggest or point out.
Let's imagine we'd like to design a real-time analytics system. Following Nathan Marz's definition of the Lambda architecture, in order to serve the data we would need a batch processing layer (e.g. Hadoop) continuously recomputing views from a dataset of all the data, and a so-called speed layer (e.g. Storm) that constantly processes a subset of the views (built from the events coming in after the last full recomputation of the batch layer). You query your system by merging the results of the two together.
The rationale behind this choice makes perfect sense to me, and it's a combination of software engineering and systems engineering observations. Having an ever-growing master dataset of immutable timestamped facts makes the system resilient to human errors in computing the views (if you make an error, you just fix it and recompute the views in the batch layer) and enables the system to answer virtually any query that might come up in the future. Also, such a datastore would only need to support random reads and batch inserts, whereas the datastore for the speed/real-time part would need to support efficient random reads and random writes, increasing its complexity.
My objection/trigger for a discussion about this is that, in certain scenarios, this approach might be overkill. For the sake of discussion, assume we make a couple of simplifications:
Let's assume that in our analytics system we can define beforehand an immutable set of use cases/queries that our system needs to be able to serve, and that they won't change in the future.
Let's assume that we have a limited amount of resources (engineering power, infrastructure, etc.) to implement it. Storing the whole set of elementary events coming into our system, instead of precomputed views/aggregates, may just be too expensive.
Let's assume that we successfully minimize the impact of human mistakes (...).
The system would still need to be scalable and handle ever-increasing traffic and data.
Given these observations, I'd like to know what would stop us from designing a fully stream-oriented architecture. What I imagine is an architecture where the events (e.g. page views) are pushed into a stream, which could be RabbitMQ + Storm or Amazon Kinesis, and where the consumers of such streams would directly update the needed views through random writes/updates to a NoSQL database (e.g. MongoDB).
To a first approximation, it looks to me like such an architecture could scale horizontally. Storm can be clustered, and Kinesis' expected QoS could also be reserved upfront. More incoming events would mean more stream consumers, and since they are totally independent nothing stops us from adding more. Regarding the database, sharding it with a proper policy would let us distribute the increasing number of writes to an increasing number of shards. To keep reads from being affected, each shard could have one or more read replicas.
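For what it's worth, here is a minimal sketch of such a consumer in Go using the official MongoDB driver; the channel stands in for the incoming stream (a Kinesis shard, a RabbitMQ queue, ...) and the database/collection names are made up:

    package main

    import (
        "context"
        "log"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx := context.Background()
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        views := client.Database("analytics").Collection("pageviews")

        // events stands in for the stream of incoming page-view events.
        events := make(chan string, 16)
        events <- "/home"
        events <- "/home"
        events <- "/pricing"
        close(events)

        // Each consumer updates the precomputed view directly with a random write.
        for url := range events {
            _, err := views.UpdateOne(ctx,
                bson.M{"_id": url},
                bson.M{"$inc": bson.M{"views": 1}},
                options.Update().SetUpsert(true),
            )
            if err != nil {
                log.Println("update failed:", err)
            }
        }
    }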
In terms of reliability, Kinesis promises to reliably store your messages for up to 24 hours, and a distributed RabbitMQ (or whatever queuing system you choose) with proper use of acknowledgement mechanisms could probably satisfy the same requirement.
Amazon's documentation on Kinesis deliberately (I believe) avoids locking you into a specific architectural solution, but my overall impression is that they would like to push developers to simplify the Lambda architecture and arrive at a fully stream-based solution similar to the one I've described.
To be slightly more compliant with the Lambda architecture's requirements, nothing stops us from having, in parallel with the consumers constantly updating our views, a set of consumers that process the incoming events and store them as atomic immutable units in a different datastore that could be used in the future to produce new views (via Hadoop, for instance) or to recompute faulty data.
What's your opinion on this reasoning? I'd like to know in which scenarios a purely stream-based architecture would fail to scale, and if you have any other observations or pros/cons of a Lambda architecture vs. a stream-based architecture.

High-frequency database to use with Ruby [closed]

I want to scrape a large number of webpages (1000/second) and save 1-2 numbers from these pages into a database. I want to manage the workers with RabbitMQ, but I also have to write the data somewhere.
Heroku PostgreSQL has a concurrency limit of 60 requests in their cheapest production tier.
Is PostgreSQL the best solution for this job?
Is it possible to setup a Postgres Database to perform 1000 writes per second in development on my local machine?
Is it possible to setup a Postgres Database to perform 1000 writes per second in development on my local machine?
Try it and see. If you've got an SSD, or don't require crash safety, then you almost certainly can.
You will find that with anything you choose, you have to make trade-offs between durability and write latency.
If you want to commit each record individually, in strict order, you should be able to achieve that on a laptop with a decent SSD. You will not get it on something like a cheap AWS instance or a server with a spinning-rust hard drive, though, as they don't have good enough disk flush rates (pg_test_fsync is a handy tool for looking at this). This is true of anything that does genuine atomic commits of individual records to durable storage, not just PostgreSQL - about the best rate you're going to get is the max disk flush rate / 2, unless it's a purely append-only system, in which case the commit rate can equal the disk flush rate.
If you want higher throughput, you'll need to batch writes together and commit them in groups to spread out the disk sync overhead. In the case of PostgreSQL, the commit_delay option can be useful to batch commits together. Better still, buffer a few changes client-side and do multi-valued inserts. You can also turn off synchronous_commit for a transaction if you don't need a hard guarantee that it's committed before control returns to your program.
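As a rough, untested sketch of that batching idea in Go (the samples table and its columns are hypothetical; SET LOCAL keeps the relaxed durability setting scoped to this one transaction):

    package writer

    import (
        "database/sql"
        "fmt"
    )

    type sample struct {
        PageID int64
        Value  float64
    }

    // flushBatch writes a buffered batch in a single transaction using one
    // multi-valued INSERT, trading a small durability window for throughput.
    func flushBatch(db *sql.DB, batch []sample) error {
        if len(batch) == 0 {
            return nil
        }
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        defer tx.Rollback()

        // Only this transaction skips waiting for the WAL flush on commit.
        if _, err := tx.Exec(`SET LOCAL synchronous_commit = off`); err != nil {
            return err
        }

        query := `INSERT INTO samples (page_id, value) VALUES `
        args := make([]interface{}, 0, len(batch)*2)
        for i, s := range batch {
            if i > 0 {
                query += ", "
            }
            query += fmt.Sprintf("($%d, $%d)", 2*i+1, 2*i+2)
            args = append(args, s.PageID, s.Value)
        }
        if _, err := tx.Exec(query, args...); err != nil {
            return err
        }
        return tx.Commit()
    }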
I haven't tested it, but I expect Heroku will allow you to set both of these parameters on your sessions using SET synchronous_commit = off or SET commit_delay = .... You should test it and see. In fact, you should do a simulated workload benchmark and see if you can make it go fast enough for your needs.
If you can't, you'll be able to find alternative hosting that can, with appropriate configuration.
See also: How to speed up insertion performance in PostgreSQL
PostgreSQL is perfectly capable of handling such a job. Just to give you an idea, PostgreSQL 9.2 is expected to handle up to 14,000 writes per second, but this largely depends on how you configure, design and manage the database, and on the available hardware (disk performance, RAM, etc.).
I assume the limit imposed by Heroku is to avoid potential overloads. You may want to consider an installation of PostgreSQL on a custom server or alternative solutions. For instance, Amazon recently announced the support for PostgreSQL on RDS.
Finally, I just want to mention that for the majority of standard tasks, the "best solution" is largely dependent on your knowledge. An efficiently configured MySQL is better than a badly configured PostgreSQL, and vice-versa.
I know companies that were able to reach unexpected results with a specific database by highly optimizing the setup and the configuration of the engine. There are exceptions, indeed, but I don't think they apply to your case.

Is application caching in a server farm worth the complexity? [closed]

I've inherited a system where data from a SQL RDBMS that is unlikely to change is cached on a web server.
Is this a good idea? I understand the logic of it - I don't need to query the database for this data with every request because it doesn't change, so just keep it in memory and save a database call. But, I can't help but think this doesn't really give me anything. The SQL is basic. For example:
SELECT StatusId, StatusName FROM Status WHERE Active = 1
This gives me fewer than 10 records. My database is located in the same data center as my web server. Modern databases are designed to store and recall data. Is my application cache really that much more efficient than the database call?
The problem comes when I have a server farm and have to come up with a way to keep the caches in sync between the servers. Maybe I'm underestimating the cost of a database call. Is the performance benefit gained from keeping data in memory worth the complexity of keeping each server's cache synchronized when the data does change?
The benefits of caching depend on how often you need the cached item and how much it costs to get it. Your status table, even though it is only 10 rows long, can be "costly" to fetch if you have to run a query every time: establish a connection if needed, execute the query, pass the data over the network, and so on. If it is used frequently enough, the savings add up and become significant: say you need to check some status 1000 times a second, or on every website request; you have saved 1000 queries, your database can do something more useful, and your network is not loaded with chatter. For your web server, the cost of retrieving the item from the cache is usually minimal (unless you're caching tens or hundreds of thousands of items), so pulling something from the cache will be quicker than querying a database almost every time. If your database is the bottleneck of your system (which it is in a lot of systems), then caching is definitely useful.
But bottom line is, it is hard to say yes or no without running benchmarks or knowing the details of how you're using the data. I just highlighted some of the things to consider.
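As an illustration of what "pulling from the cache" looks like relative to the query, here is a minimal sketch of a read-through cache for the Status table that each server refreshes on its own schedule (Go is used only to illustrate the pattern; the names and refresh interval are made up):

    package cache

    import (
        "database/sql"
        "sync"
        "time"
    )

    // statusCache keeps the small, rarely-changing Status table in memory and
    // re-reads it at a fixed interval instead of querying on every request.
    type statusCache struct {
        mu       sync.RWMutex
        statuses map[int]string
    }

    func newStatusCache(db *sql.DB, every time.Duration) *statusCache {
        c := &statusCache{statuses: map[int]string{}}
        c.refresh(db)
        go func() {
            for range time.Tick(every) {
                c.refresh(db) // each server refreshes independently; no cross-server sync
            }
        }()
        return c
    }

    func (c *statusCache) refresh(db *sql.DB) {
        rows, err := db.Query(`SELECT StatusId, StatusName FROM Status WHERE Active = 1`)
        if err != nil {
            return // keep serving the last good copy
        }
        defer rows.Close()

        fresh := map[int]string{}
        for rows.Next() {
            var id int
            var name string
            if err := rows.Scan(&id, &name); err == nil {
                fresh[id] = name
            }
        }
        c.mu.Lock()
        c.statuses = fresh
        c.mu.Unlock()
    }

    // Name returns the cached status name without touching the database.
    func (c *statusCache) Name(id int) (string, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        name, ok := c.statuses[id]
        return name, ok
    }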
There are other factors which might come into play; for example, the use of EF can add considerable extra processing to a simple data retrieval. The quantity of requests, not just the volume of data, could be a factor.
Future design might influence your decision - perhaps the cache gets moved elsewhere and is no longer co-located.
There's no right answer to your question. In your case, maybe there is no advantage. Though there is already a disadvantage to not using a cache - you have to change existing code.

Do partial instance-hours appear frequently with EC2 on-demand instances? [closed]

Pricing is per instance-hour consumed for each instance, from the time an instance is launched until it is terminated. Each partial instance-hour consumed will be billed as a full hour.
Here is my question:
Do partial instance-hours occur frequently or rarely?
Or in what kind of context do partial instance-hours occur frequently?
Does anyone have experience with this?
Partial hours happen most frequently when using systems that scale often. For example, in my system I launch 10-20 extra servers each Saturday and Sunday to handle the extra traffic. When these servers are stopped I am charged a partial hour. Amazon has a new feature for auto scaling groups that tells it to terminate (if it has to) the servers closest to the hour marker in order to save money.
Other possible uses are for services like MapReduce where a large number of instances will be started and then when the job is complete they will be terminated.
My experience, though, is that the actual cost of partial hours is insignificant for me. Maybe if you're using larger servers it costs a lot, but I'm using the c1.medium and I barely notice the $5 I get charged on a weekend for partial hours.