Will using IMDG on top of NoSql really boost the performance of the application? - imdb

Does using a IMDG layer (Gemfire, HazelCast) on top of the NoSql db (MongoDB) improve the performance and scaleability of the application?

Maybe. As always, "It Depends".
On what? Primarily the characteristics of your data and your data access patterns. A few considerations (of many!) might be:
How much data do you have, and is it feasible and/or cost-effective to even attempt to cache a portion of it in memory?
Are there "hot" subsets of the data which are candidates for caching? For example, in social apps the last x hours of data drive virtually all the load.
What cache hit % do you need to achieve in order to benefit from the cache? (This is driven by your app's sensitivity to latency)
What is the ratio between reads/writes?
What are the consistency and reliability requirements?
What types of reads does your app do? Complex queries? Simple GETs? In which mix?
Caching layers can be incredibly effective when used correctly. But they aren't a silver bullet. Don't expect to just slap one over your existing store and magically get "faster" or "more scalable"...

Related

Which caching mechanism to use in my spring application in below scenarios

We are using Spring boot application with Maria DB database. We are getting data from difference services and storing in our database. And while calling other service we need to fetch data from db (based on mapping) and call the service.
So to avoid database hit, we want to cache all mapping data in cache and use it to retrieve data and call service API.
So our ask is - Add data in Cache when it gets created in database (could add up-to millions records) and remove from cache when status of one of column value is "xyz" (for example) or based on eviction policy.
Should we use in-memory cache using Hazelcast/ehCache or Redis/Couch base?
Please suggest.
Thanks
I mostly agree with Rick in terms of don't build it until you need it, however it is important these days to think early of where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it into a non-prepared system is always possible but much more expensive (in terms of hours) and complicated.
Ok to the actual question; disclaimer: Hazelcast employee
In general for caching Hazelcast, ehcache, Redis and others are all good candidates. The first question you want to ask yourself though is, "can I hold all necessary records in the memory of a single machine. Especially in terms for ehcache you get replication (all machines hold all information) which means every single node needs to keep them in memory. Depending on the size you want to cache, maybe not optimal. In this case Hazelcast might be the better option as we partition data in a cluster and optimize the access to a single network hop which minimal overhead over network latency.
Second question would be around serialization. Do you want to store information in a highly optimized serialization (which needs code to transform to human readable) or do you want to store as JSON?
Third question is about the number of clients and threads that'll access the data storage. Obviously a local cache like ehcache is always the fastest option, for the tradeoff of lots and lots of memory. Apart from that the most important fact is the treading model the in-memory store uses. It's either multithreaded and nicely scaling or a single-thread concept which becomes a bottleneck when you exhaust this thread. It is to overcome with more processes but it's a workaround to utilize todays systems to the fullest.
In more general terms, each of your mentioned systems would do the job. The best tool however should be selected by a POC / prototype and your real world use case. The important bit is real world, as a single thread behaves amazing under low pressure (obviously way faster) but when exhausted will become a major bottleneck (again obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

How to distribute data and computation to maximize locality?

Please bear with me, this is a basic architectural question for my first attempt at a "big data" project, but I believe your answers will be of general interest to anyone who is starting out in this field.
I've googled and read the high-level descriptions of Kafka, Storm, Memcached, MongoDB, etc., but now that I'm ready to dig in to start designing my app, I still need some further insight on how in fact the data should be distributed and shared.
The performance of my app is critical, so one objective is to somehow maximize the locality of the data in the RAM of the machines doing the distributed calculations. I need advice for this part of the design.
If my app had some clear criteria for a priori sharding the data and distributing the calculations (such as geographical regions or company divisions) then the solution would be obvious. But unfortunately my app's data access patterns are dynamic and depend on the results of previous calculations.
My app is an analysis program with distinct stages. In the first stage, all the data is accessed once and a metric is calculated for each data object. In the second stage, a subset of the data objects may be accessed, with the probability of access being proportional to each data object's metric that was calculated in the previous stage. In the final stage, a relatively small subset of data objects will be accessed many times for many calculations.
At all stages, it is required that the calculations be distributed across several servers. The calculations are embarassingly parallel, and each distributed calculation only needs to access a few data objects. It is also required that the number of servers can be specified before the app runs (for example, run on one server, or run on fifty servers).
It seems to me that I need some mechanism that distributes the appropriate data objects to the appropriate compute servers, as opposed to just blindly fetching the data from some database service (whether centralized or distributed). Also, it seems to me that some sort of smart caching system might be appropriate, since the data access pattern depends on the previous calculations and cannot be predicted a priori. But as far as I can tell, Memcached is not such a system because the sharding is determined a priori.
I've read many times that the operating system cache performs better than any monkeying around that we may try. I think the ideal solution is that each compute server's RAM cache somehow captures the data objects' dynamic access patterns, but it's not clear to me how this would work with a NoSQL or Memcached service.
Thanks for bearing with me this far. I realize this is a basic question, but the answer eludes me so far. I can't resolve the dynamic access patterns of my app with the a priori sharding of the NoSQL/Memcached packages. Any advice would be greatly appreciated.
I recommend you to take a look at http://tarantool.org. Shard to maximize locality for the most common data access pattern, use Lua for local computations, and net.box to issue a remote RPC when calculation needs to continue on another node. All data is stored in RAM, if you write your computation code carefully it could take advantage of the Just In Time compiler.

Data consistency for NoSQL + Distributed Cache in very concurrent environment

On the slide you can see very rough architecture of booking system. It's very concurrent environment, where many users at once may try to book the same hotel/room.
At bottom we have NoSQL database, for quick response/request there is distributed cache and application which requests data.
The idea of this slide is that when you use NoSQL + Distributed Cache you'll get sync problems, means data consistency problems. You need to sync distributed cache with NoSQL db.
Question: What the solutions/techniques already exists for such case besides IMDG? That could be both frameworks or/and best practices. Is there any specific distributed caches that solves this problem?
Question2[updated]: What are the reasons we do write to the NoSQL db instead of cache? Are that transactions, node fail possibility or anything else?
P.S. That's not my slide, and author claimed that is a great use case for IMDG.
Do you really need the distributed cache? NoSQL solutions are by nature very performant, approaching the performance of stand-alone caches (like memcached).
I can get ~10ms access times out of Cassandra, which is not much slower than most caches.
I'll bet that by the time you put in cache validation overhead, and network overhead of missed cache hits, you are going to be better off going straight to your database.
You can still use caches for things that are less transient, like room types, prices, etc.

Performance impact of having a data access layer/service layer?

I need to design a system which has these basic components:
A Webserver which will be getting ~100 requests/sec. The webserver only needs to dump data into raw data repository.
Raw data repository which has a single table which gets 100 rows/s from the webserver.
A raw data processing unit (Simple processing, not much. Removing invalid raw data, inserting missing components into damaged raw data etc.)
Processed data repository
Does it make sense in such a system to have a service layer on which all components would be built? All inter-component interaction will go through the service layers. While this would make the system easily upgradeable and maintainable, would it not also have a significant performance impact since I have so much traffic to handle?
Here's what can happen unless you guard against it.
In the communication between layers, some format is chosen, like XML. Then you build it and run it and find out the performance is not satisfactory.
Then you mess around with profilers which leave you guessing what the problem is.
When I worked on a problem like this, I used the stackshot technique and quickly found the problem. You would have thought it was I/O. NOT. It was that converting data to XML, and parsing XML to recover data structure, was taking roughly 80% of the time. It wasn't too hard to find a better way to do that. Result - a 5x speedup.
What do you see as the costs of having a separate service layer?
How do those costs compare with the costs you must incur? In your case that seems to be at least
a network read for the request
a database write for raw data
a database read of raw data
a database write of processed data
Plus some data munging.
What sort of services do you have a mind? Perhaps
saveRawData()
getNextRawData()
writeProcessedData()
why is the overhead any more than a procedure call? Service does not need to imply "separate process" or "web service marshalling".
I contend that structure is always of value, separation of concerns in your application really matters. In comparison with database activities a few procedure calls will rarely cost much.
In passing: the persisting of Raw data might best be done to a queuing system. You can then get some natural scaling by having many queue readers on separate machines if you need them. In effect the queueing system is naturally introducing some service-like concepts.
Personally feel that you might be focusing too much on low level implementation details when designing the system. Before looking at how to lay out the components, assemblies or services you should be thinking of how to architect the system.
You could start with the following high level statements from which to build your system architecture around:
Confirm the technical skill set of the development team and the operations/support team.
Agree on an initial finite list of systems that will integrate to your service, the protocols they support and some SLAs.
Decide on the messaging strategy.
Understand how you will deploy your service/system.
Decide on the choice of middleware (ESBs, Message Brokers, etc), databases (SQL, Oracle, Memcache, DB2, etc) and 3rd party frameworks/tools.
Decide on your caching and data latency strategy.
Break your application into the various areas of business responsibility - This will allow you to split up the work and allow easier communication of milestones during development/testing and implementation.
Design each component as required to meet the areas of responsibility. The areas of responsibility should automatically lead you to decide on how to design component, assembly or service.
Obviously not all of the above will match your specific case but I would suggest that they should at least be given some thought.
Good luck.
Abstraction and tiering will introduce latency, but the real question is, what are you GAINING to make the cost(s) worthwhile? Loose coupling, governance, scalability, maintainability are worth real $.
Even the best-designed layered app will exhibit more latency than an app talking directly to a DB. Users who know the original system will feel the difference. They may not like it, so this can be a political issue as much as a technical one.

Storing images in NoSQL stores

Our application will be serving a large number of small, thumbnail-size images (about 6-12KB in size) through HTTP. I've been asked to investigate whether using a NoSQL data store is a viable solution for data storage. Ideally, we would like our data store to be fault-toerant and distributed.
Is it a good idea to store blobs in NoSQL stores, and which one is good for it? Also, is NoSQL a good solution for our problem, or would we be better served storing the images in the file system and serving them directly from the web server (as an aside, CDN is currently not an option for us)?
Whether or not to store images in a DB or the filesystem is sometime one of those "holy war" type of debates; each side feels their way of doing things is the one right way. In general:
To store in the DB:
Easier to manage back-up/replicate everything at once in one place.
Helps with your data consistency and integrity. You can set the BLOB field to disallow NULLs, but you're not going to be able to prevent an external file from being deleted. (Though this isn't applicable to NoSQL since there aren't the traditional constraints).
To store on the filesystem:
A filesystem is designed to serve files. Let it do it's job.
The DB is often your bottleneck in an application. Whatever load you can take off it, the better.
Easier to serve on a CDN (which you mentioned isn't applicable in your situation).
I tend to come down on the side of the filesystem because it scales much better. But depending on the size of your project, either choice will likely work fine. With NoSQL, the differences are even less apparent.
Mongo DB should work well for you. I haven't used it for blobs yet, but here is a nice FLOSS Weekly podcast interview with Michael Dirolf from the Mongo DB team where he addresses this use case.
I was looking for a similar solution for a personal project and came across Riak, which, to me, seems like an amazing solution to this problem. Basically, it distributes a specified number of copies of each file to the servers in the network. It is designed such that a server coming or going is no big deal. All the copies on a server that leaves are distributed amongst the others.
With the right configuration, Riak can deal with an entire datacenter crashing.
Oh, and it has commercial support available.
Well CDN would be the obvious choice. Since that's out, I'd say your best bet for fault tolerance and load balancing would be your own private data center (whatever that means to you) behind 2 or more load balancers like an F5. This will be your easiest management system and you can get as much fault tolerance as your hardware budget allows. You won't need any new software expertise, just XCOPY.
For true fault tolerance you're going to need geographic dispersion or you're subject to anyone with a backhoe.
(Gravatars?)
If you are in a Python environment, consider the y_serial module: http://yserial.sourceforge.net/
In under 10 minutes, you will be able to store and access your images (in fact, any arbitrary Python object including webpages) -- in compressed form; NoSQL.

Resources