I saw that RethinkDB has realtime capabilities, which made me think it would be great for a chat application. However, I saw a caveat on the RethinkDB website saying that apps requiring high write throughput should consider Riak instead.
What is this write limit it mentions, and is RethinkDB still suitable for a standard chat application that would support many thousands of concurrent users?
RethinkDB is a good choice for a chat application. In fact, its realtime changefeeds are specifically designed to make it easy to build these kinds of realtime applications.
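For a rough idea of what that looks like, here is a minimal changefeed sketch using the Python driver (the `messages` table, `room_id` field, and connection details are assumptions for illustration; newer driver versions use `from rethinkdb import RethinkDB` instead of the module-level `r`):

```python
import rethinkdb as r

# Assumed connection details and schema; adjust to your deployment.
conn = r.connect(host="localhost", port=28015, db="chat")

# Every insert/update/delete on matching rows is pushed to this cursor
# as it happens -- no polling needed on the chat client's behalf.
feed = r.table("messages").filter(r.row["room_id"] == "lobby").changes().run(conn)
for change in feed:
    print(change["new_val"])  # None when a message is deleted
```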
The FAQ actually states:
In some cases RethinkDB trades off write availability in favor of data consistency. If high write availability is critical and you don’t mind dealing with conflicts you may be better off with a Dynamo-style system like Riak.
Write availability is not the same as write throughput. RethinkDB's write throughput is more than capable of handling thousands of concurrent users (most databases will do fine in this respect).
Regarding write availability: RethinkDB favors consistency, whereas Riak favors availability. This set of tradeoffs is commonly referred to as the CAP theorem, which states that a distributed system cannot simultaneously guarantee all three properties: consistency, availability, and partition tolerance.
You can read more about what this means in the RethinkDB architecture FAQ.
Does MarkLogic support backpressure or allow sending data in chunks, i.e. a reactive approach?
'Reactive' is a fairly new term describing a particular incarnation of old concepts that are common in server and database technologies but new to modern client and middle-tier programming.
I am assuming the question is prompted by the need or desire to work within an existing 'reactive' framework (such as Vert.x or RxJava). For that question, the answer is 'no': there is no 'official' API that integrates directly with these frameworks, to my knowledge. There are community APIs which I have not personally used; one example is https://github.com/etourdot/vertx-marklogic (a reactive Vert.x MarkLogic API).
MarkLogic is a 'reactive' design internally, in that it implements the functionality the modern 'reactive' term is used to describe, but it does not expose any standard 'reactive' APIs for this (there are very few standards in this area). Code running within MarkLogic Server (XQuery, JavaScript) implicitly benefits from this: although there is no explicit backpressure API, a side effect of single-threaded blocking IO (from the app's perspective) is that the equivalent of 'backpressure' is implemented by the implicit flow control of the IO APIs. You cannot overdrive a properly configured MarkLogic server on a single thread doing blocking IO; connections to an overloaded server will take longer and eventually time out ('backpressure' :)
Similarly, (most of) the external APIs (REST, XCC) are also blocking and single-threaded.
The server core manages rate control via a variety of methods, such as actively managing the TCP connection queue size, keep-alive times, the number of active threads, and so on.
In general the server does a very good job at this without explicit low-level programming, balancing latency across all clients. If this needs improving, the administration guides give good guidance on how to tune the various parameters so the system behaves well on its own.
If you want to implement a per-connection, client-aware 'reactive' API, you will need to implement it yourself. This can be done using the same techniques used for other blocking IO APIs: either use multiple threads or non-blocking IO. Some of the MarkLogic SDKs have provision for non-blocking IO or control over timeouts, which can be used to implement a 'reactive' API.
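For example, a crude client-side equivalent of backpressure can be built with nothing more than a bounded semaphore, a thread pool, and request timeouts. This is only a sketch: the endpoint, document URIs, and the choice of the MarkLogic REST API are assumptions, and authentication and error handling are omitted:

```python
import concurrent.futures
import threading
import requests

# Rough sketch of client-side flow control ("backpressure") against the
# MarkLogic REST API. The endpoint and document URIs are placeholders;
# real code needs auth (usually digest) and proper error handling.
MAX_IN_FLIGHT = 8
ML_URL = "http://localhost:8000/v1/documents"

slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
pool = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT)

def put_document(uri, body):
    try:
        # A timeout turns an overloaded server into an explicit error
        # instead of an ever-growing backlog.
        return requests.put(ML_URL, params={"uri": uri}, data=body, timeout=30)
    finally:
        slots.release()

futures = []
for i in range(100):
    slots.acquire()  # block the producer when too many requests are in flight
    futures.append(pool.submit(put_document, "/doc/%d.json" % i, "{}"))

for f in concurrent.futures.as_completed(futures):
    f.result()       # re-raise timeouts/HTTP errors here
pool.shutdown()
```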
Similarly, code running in the server itself (XQuery or JavaScript) can implement 'reactive'-type behaviour by making use of the task queue, as exposed by the xdmp:spawn-xxx APIs. This is done in many libraries to manage bulk ingest. Care must be taken to control the amount of concurrency, as you can easily overload the server by spawning too many concurrent requests. Managing state is a bit tricky, as there is an interaction/opposition between the transaction model and task creation: the former generally presents an idempotent view of data that can be incongruous with the concept of 'current' with respect to asynchronous tasks.
There is a technical requirement to scale a new system easily. The new system consists of three tiered applications (batch processors). Each tier will contain at least two servers, with the same application residing on each server.
So, when one of the tiers reaches peak load, we can scale out easily by adding another server running the same application to offload some of the processing.
The problem is that one or two of the three tiers require heavy caching (about 3 million records and increasing).
I'm thinking of using a distributed caching system to overcome this problem, but a new distributed cache also means an additional point of failure, since the applications now need to interact with an additional caching system for their processing.
I'm currently looking at NCache, but I'm wondering if there are alternatives. Is there any other comparable distributed caching system that is similar to or better than NCache and also provides enterprise support?
Thanks,
Chen
You can find the main actors in the DCP (Distributed Caching Platforms) space in this IBM article (link expired).
The alternative we are using (not free) is GigaSpaces XAP.
Chen -
It sounds like you could definitely use a distributed caching system, or even an in-memory data grid (IMDG). Here are some highlights of Oracle Coherence (previously Tangosol Coherence):
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
RESTful APIs available from any language. Native APIs and client libraries for C/C++, C#, .NET and Java.
In addition to simple key-value (K/V) caching, also support queries (including some SQL), parallel queries, indexes (including custom indexes), a rich eventing model (for event-driven systems like exchanges), transactions (including MVCC), parallel execution of both scalar (EntryProcessor) and aggregate (ParallelAwareAggregator) functions, cache triggers, etc.
Easy to integrate with a database via read-through, read-ahead, write-through and write-behind caching. Automatically refreshes just the changed data when changes occur to the database (leveraging Oracle GoldenGate technology).
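To make the read-through / write-behind idea concrete, here is a generic sketch of the pattern (this is not the Coherence API; the `db` object with `load`/`store` methods is a stand-in for whatever backing store you use):

```python
import queue
import threading

# Generic illustration of read-through / write-behind caching -- not the
# Coherence API. "db" stands in for any backing store.
class ReadThroughWriteBehindCache:
    def __init__(self, db):
        self.db = db
        self.data = {}
        self.pending_writes = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def get(self, key):
        # Read-through: on a miss, load from the database and cache the result.
        if key not in self.data:
            self.data[key] = self.db.load(key)
        return self.data[key]

    def put(self, key, value):
        # Write-behind: update the cache now, persist to the database later.
        self.data[key] = value
        self.pending_writes.put((key, value))

    def _flush_loop(self):
        while True:
            key, value = self.pending_writes.get()
            self.db.store(key, value)
```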
There's a summary of the In-Memory Data Grid market by Gartner called "Competitive Landscape: In-Memory Data Grids". You can see a copy at: http://www.gartner.com/technology/reprints.do?id=1-1HCCIMJ&ct=130718&st=sb
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
We are implementing a system at the company I work for whereby we will need to install the system at various sites belonging to the same client (warehouses). Users at all sites should see the same information, and the system should be able to keep working at each site when the network is down. What design/architecture would be most suitable?
I suggest you consider CouchDB. Its robust replication feature is designed specifically for this sort of use case. It supports both continuous replication, which could keep the data in the various warehouses in sync in near-real-time during normal operation, and occasional replication, which could be used to sync data after a network outage.
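For example, replication can be triggered (or left running continuously) with a single call to CouchDB's HTTP API. This is just a sketch: the hosts, database name, and credentials are placeholders, and you would run one such replication per direction for bidirectional sync:

```python
import requests

CENTRAL = "http://admin:password@central.example.com:5984"  # placeholder host/credentials

requests.post(
    CENTRAL + "/_replicate",
    json={
        "source": "inventory",                                     # local database name
        "target": "http://warehouse1.example.com:5984/inventory",  # remote database URL
        "continuous": True,    # keep sites in near-real-time sync while the network is up
        "create_target": True,
    },
)
```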
There's a really good free O'Reilly book: CouchDB: The Definitive Guide, which has a chapter on replication.
We're in the process of creating a new API for our product, which will be exposed via web services. We have an internal debate about whether the API should be as easy as possible to use (at the price of making more calls) or as efficient as possible (making it harder to use). For example, here are two issues that came up:
Should we manage session info on the server, or should we pass that info back to the client and expect it to be sent back to us when needed? (Please ignore the security implications of a session.)
Should we combine calls that are likely to be consecutive, in order to save the round trip in between, even if they don't really share the same logical functionality?
Basically, the desktop people are in favor of a clear, easy to use API, while the internet people would like to make it as efficient as possible. This is the first public API we're providing, and we need a strategy.
Personally, I'm in favor of making the API as usable as possible. Other components of the system are probably going to have a much larger effect on performance, and hard-to-use APIs are much more error-prone. But I'm a desktop programmer…
So, what should be our strategy? What are the common practices when creating such an API?
What are the typical usage scenarios? Will the latency of your API determine the UI responsiveness? How big will the performance trade-off be? It's hard to suggest anything without knowing your circumstances.
But
My guess is that passing session info to the client will scale better. In-proc session management won't allow you to share the state between service instances. Managing sessions in DB will make your services more complex. Anyway this all depends on your bandwidth/memory/computational power capabilities.
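As an illustration of the "pass session info to the client" option, the state can be handed out as a signed token so any stateless service instance can verify it without a shared session store. This is only a sketch (the payload fields are invented, and in practice you would also add an expiry and keep the secret in configuration):

```python
import base64
import hashlib
import hmac
import json

# Hand session state to the client, signed so it can't be tampered with.
SECRET = b"server-side-secret"  # placeholder; load from configuration in real code

def issue_token(state):
    payload = base64.urlsafe_b64encode(json.dumps(state).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def read_token(token):
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered or invalid token")
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"user_id": 42, "preferences": {"lang": "en"}})
print(read_token(token))  # any stateless service instance can verify this
```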
I would start from the most granular operations and only provide composite methods when performance problem becomes obvious.
IMO, the best approach is to have the simplest possible API for the people who will have to use it, while still giving them deep customization options to make it efficient, even if those are harder to use.
But simplicity for users > all.
I would recommend you create your web service following the RESTful model and that means stateless. Efficiency is more critical than ease of use. You can always create a framework on top of the API later that eases implementation headaches.
You're debating a circular argument. This is all usability (ease of use).
You're likely going to factor in a bit of both aspects, as they both influence user performance.
It's a case of (i) the extent of customizable features versus (ii) efficiency in manipulating those features. There will be an intersection between the two.
I'd say give a simple API that puts control in the users' hands; this is the primary purpose of a CMS. The more detailed aspects you could combine initially, introducing them as added control later on.
This way you will manage the users' learning curve so as (i) not to bombard them with excessive options initially and (ii) to let them adopt your system quickly at first. Then extend your users' control (system functionality from a user perspective) later on by making more of the API's features available.
Another good tip would be to ask your users upfront and as you go.
My 2 cents:
First, start with a very granular low-level API that gives high performance.
Then create an easy-to-use high-level API that is "composed" of the low-level API above.
This way clients can customize their behavior as they want. Clients can start with the high-level API, but if high performance is needed for certain user actions, they can use the faster-but-harder low-level API on a case-by-case basis.
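A minimal sketch of that layering, with all of the names and in-memory data invented purely for illustration:

```python
# Granular low-level calls plus a convenience call composed from them.
ORDERS = {1: {"id": 1, "customer_id": 7, "total": 99.0}}
CUSTOMERS = {7: {"id": 7, "name": "Acme Corp"}}

def fetch_order(order_id):
    # Low-level: one focused, cheap call (one round trip in a web API).
    return ORDERS[order_id]

def fetch_customer(customer_id):
    # Low-level: another focused call.
    return CUSTOMERS[customer_id]

def fetch_order_with_customer(order_id):
    # High-level: simpler for callers, composed from the calls above.
    # Latency-sensitive clients can drop down to the low-level calls
    # and batch or cache as they see fit.
    order = fetch_order(order_id)
    return {"order": order, "customer": fetch_customer(order["customer_id"])}

print(fetch_order_with_customer(1))
```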
One more important point to consider in service design. Try to keep them as stateless as possible. The advantage of stateless web services is that they can be easily distributed using Network Load Balancing.
We have a new project for a web app that will display banner ads on websites (as a network), and our estimate is for it to handle 20 to 40 billion impressions a month.
Our current language is ASP... but we are moving to PHP. Does PHP 5 have limits when scaling a web application? Or should I have our team invest in picking up JSP?
Or, is it a matter of the app server and/or DB? We plan to use Oracle 10g as the database.
No offense, but I strongly suspect you're vastly overestimating how many impressions you'll serve.
That said:
PHP or other languages used in the application tier really have little to do with scalability. Since the application tier delegates its state to the database or equivalent, it's straightforward to add as much capacity as you need behind appropriate load balancing. The choice of language does influence per-server efficiency and hence cost, but that's different from scalability.
It's scaling the state/data storage that gets more complicated.
For your app, you have three basic jobs:
what ad do we show?
serving the ad
logging the impression
Each of these will require thought and likely different tools.
The second, serving the ad, is the simplest: use a CDN. If you actually serve the volume you claim, you should be able to negotiate favorable rates.
Deciding which ad to show is going to be very specific to your network. It may be as simple as reading a few rows from a database that give the ad placements for a given property for a given calendar period, or it may be complex contextual advertising like Google's. Assuming it's more the former, and that the database of placements is small, this is the simple task of scaling database reads. You can use replication trees or, alternatively, a caching layer like memcached.
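A cache-aside sketch of that read path, assuming memcached via the pymemcache client (the key layout, TTL, and `query_placements_from_db` helper are invented for illustration):

```python
import json
from pymemcache.client.base import Client  # any memcached client would do

cache = Client(("localhost", 11211))

def query_placements_from_db(property_id):
    # Placeholder for the real query against a read replica.
    return [{"ad_id": 123, "slot": "top"}]

def placements_for(property_id):
    key = "placements:%s" % property_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    rows = query_placements_from_db(property_id)
    cache.set(key, json.dumps(rows), expire=60)  # absorb most reads for 60 seconds
    return rows
```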
The last will ultimately be the most difficult: how to scale the writes. A common approach would be to still use databases, but to adopt a sharding scaling strategy. More exotic options might be to use a key/value store supporting counter instructions, such as Redis, or a scalable OLAP database such as Vertica.
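A sketch of the counter approach with redis-py (the key layout and per-minute bucketing are assumptions; a separate batch job would roll these counters up into durable storage):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def log_impression(ad_id):
    # One cheap in-memory increment per impression, bucketed per ad per minute.
    minute = int(time.time() // 60)
    key = "impressions:%s:%d" % (ad_id, minute)
    r.incr(key)
    r.expire(key, 86400)  # keep a day of raw buckets for the roll-up job
```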
All of the above assumes that you're able to secure data center space and network provisioning capable of serving this load, which is not trivial at the numbers you're talking.
You do realize that 40 billion per month is roughly 15,500 per second, right? (There are about 2.6 million seconds in a month: 30 × 24 × 3,600.)
Scaling isn't going to be your problem; infrastructure, period, is going to be your problem. No matter what technology stack you choose, you are going to need an enormous amount of hardware, as others have said, in the form of a farm or a cloud.
This question (and the entire subject) is a bit subjective. You can write a dog slow program in any language, and host it on anything.
I think your best bet is to see how your current implementation works under load. Maybe just a few tweaks will make things work for you - but changing your underlying framework seems a bit much.
That being said - your infrastructure team will also have to be involved as it seems you have some serious load requirements.
Good luck!
I think it is not a matter of language, but it can be a matter of database speed as well as CPU processing speed. Have you considered a web farm? That way you can have more than one machine serving your application. There are several ways to implement this solution; you can start with two servers and add more as the app requires more processing volume.
On another point, Oracle 10g is a very good database server; in my humble opinion you only need a standalone Oracle server to handle this volume of requests. Remember that a SQL server is faster when people request more or less the same things each time, and that happens in web applications if you plan your database schema carefully.
You should also check out the existing ad-server solutions; there are some very good ones. Just try a Google search for "Open Source Ad servers".
PHP will be capable of serving your needs. However, as others have said, your first limit will be your network infrastructure.
But your second limit will be writing scalable code. You will need good abstraction and isolation so that resources can easily be added at any level: things like a fast data-object mapper, multiple data caching mechanisms, separate configuration files, and so on.