In-Memory Data Grid for Java Project [closed] - caching

I'm looking to use an in-memory data grid for my Java project. I know there are a few relevant products, such as VMware GemFire, GigaSpaces XAP, IBM eXtreme Scale and others. Can someone elaborate from their experience with any of these tools and how they compare to one another? Thanks, Alex

(disclaimer - I work for GigaSpaces)
Hi Alex
There are many criteria to compare by; it really depends on what you're trying to do. In-memory data grids have a lot of use cases, e.g. caching, OLTP, high-throughput event processing, etc.
In general, the main criteria you should be looking at are:
Programming model: support for popular Java frameworks such as Spring (XAP and GemFire support it natively).
Querying and indexing: important if you want more than trivial key/value data access. Most people need SQL-like semantics, or even full-text search, and if the data grid provides that out of the box it's a big advantage.
Ability to execute code on the grid nodes, colocate your code with the data, and handle events that are injected into the grid (e.g. objects written or updated). This is a massive scalability benefit and allows you to implement very efficient shared-nothing architectures.
Languages and APIs support: most data grids support at least Java and JVM-based languages (e.g. Scala), but many also support other languages and allow you to access the same data from various programming languages. For example, XAP natively supports Java, .NET and C++, and other languages via its REST and memcached interfaces. As far as APIs go, some grids support more than one API; at GigaSpaces we support Map, Spring/POJO, JPA, JDBC and others (see the JCache-based sketch after this list).
Transactions: this is also a big one if you want to go anywhere beyond caching. When using memory as your system of record, you should be able to roll back state in case you have an error or a bug; otherwise you end up with corrupt data. Another important thing is what types of transactions are supported. A lot of data grids only support "local" transactions, i.e. within the boundaries of a single node / partition / shard (which is probably what you want in most cases for performance reasons). But more advanced grids also support distributed transactions and know how to seamlessly upgrade from local to distributed when needed.
Replication: there are various models here (synchronous, asynchronous, hybrid) and you need to decide which of them is best for your use case. Some grids also have explicit support for cross-cluster replication over WAN, which is important if you're implementing disaster recovery (DR).
Data partitioning and scalability: how the grid partitions data (fixed / consistent hashing), what level of control the user has over it, and whether it supports dynamic addition of servers to the grid to increase capacity.
Administration and monitoring: Last but not least - what kind of facilities are provided out of the box, such as monitoring and administration hooks (JMX or another administrative API), user interfaces and integration with other 3rd party systems.
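As a rough illustration of the programming-model and API criteria above, here is a minimal sketch using the standard JCache (JSR-107) API, which several of the grids mentioned implement; the cache name and key/value types are purely hypothetical:

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class GridCacheSketch {
    public static void main(String[] args) {
        // Obtain whatever JCache provider is on the classpath.
        CacheManager manager = Caching.getCachingProvider().getCacheManager();

        // "orders" is a hypothetical cache name.
        MutableConfiguration<String, String> config =
                new MutableConfiguration<String, String>()
                        .setTypes(String.class, String.class);
        Cache<String, String> orders = manager.createCache("orders", config);

        // Plain key/value access; grid-specific APIs add querying,
        // indexing and colocated code execution on top of this.
        orders.put("order-42", "NEW");
        System.out.println(orders.get("order-42"));
    }
}

This only shows the lowest common denominator; the criteria above (querying, transactions, colocated code) are where the products differ.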
The following links are a good place to start:
http://gojko.net/2009/06/01/oracle-coherence-vs-gigaspaces-xap/. Also read the comments
http://www.neovise.com/neovise-data-caching-performance-technical-white-paper - a recent comparison between GigaSpaces and GemFire which I think speaks for itself :)
HTH,
Uri

There's a summary of the In-Memory Data Grid market by Gartner called "Competitive Landscape: In-Memory Data Grids". You can see a copy at: http://www.gartner.com/technology/reprints.do?id=1-1HCCIMJ&ct=130718&st=sb
Oracle Coherence (previously Tangosol Coherence) is and has long been the market leader in this space, but it is a commercial product. As I explained elsewhere recently:
Pros:
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
RESTful APIs available from any language. Native APIs and client libraries for C/C++, C#, .NET and Java.
In addition to simple key-value (K/V) caching, it also supports queries (including some SQL), parallel queries, indexes (including custom indexes), a rich eventing model (for event-driven systems like exchanges), transactions (including MVCC), parallel execution of both scalar (EntryProcessor) and aggregate (ParallelAwareAggregator) functions, cache triggers, etc. (a minimal EntryProcessor sketch follows after this answer).
Easy to integrate with a database via read-through, read-ahead, write-through and write-behind caching. Automatically refreshes just the changed data when changes occur to the database (leveraging Oracle GoldenGate technology).
Coherence Incubator: see the Coherence Incubator open-source project.
Memcached protocol support: see the Memcached interface for Oracle Coherence project on GitHub.
Thousands of customers, some using Coherence in production now for over a decade.
Cons:
As of Coherence 12.1.2, the cache itself is not persistent.
It costs money.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
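For readers unfamiliar with the EntryProcessor model mentioned in the pros above, here is a minimal sketch assuming the classic com.tangosol Coherence API; the cache name, key and increment logic are purely illustrative:

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.InvocableMap;
import com.tangosol.util.processor.AbstractProcessor;

public class IncrementProcessor extends AbstractProcessor {
    @Override
    public Object process(InvocableMap.Entry entry) {
        // Executes on the storage node that owns the key, so the
        // read-modify-write is atomic and no value ships over the wire.
        Integer current = (Integer) entry.getValue();
        int next = (current == null ? 0 : current) + 1;
        entry.setValue(next);
        return next;
    }

    public static void main(String[] args) {
        NamedCache counters = CacheFactory.getCache("counters"); // hypothetical cache name
        Object newValue = counters.invoke("page-views", new IncrementProcessor());
        System.out.println(newValue);
    }
}

The same pattern (ship the function to the data rather than the data to the function) exists in most of the grids discussed here, under different names.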

Related

Highload data update architecture

I'm developing a parcel tracking system and thinking about how to improve its performance.
Right now we have one table in Postgres named parcels, containing things like id, last known position, etc.
Every day about 300,000 new parcels are added to this table. The parcel data is taken from an external API. We need to track all parcel positions as accurately as possible and reduce the time between API calls for a specific parcel.
Given such requirements what could you suggest about project architecture?
Right now the only solution I can think of is the producer-consumer pattern: having one process select all records from the parcels table in an infinite loop and then distribute the data-fetching tasks with something like Celery.
Major downsides of this solution are:
possible deadlocks, as fetching data for the same task can be executed at the same time on different machines;
the need to control the queue size.
This is a very broad topic, but I can give you a few pointers. Once you reach the limits of vertical scaling (scaling based on picking more powerful machines) you have to scale horizontally (scaling based on adding more machines to the same task). So for being able to design scalable architectures you have to learn about distributed systems. Here some topics to look into:
Infrastructure & processes for hosting distributed systems, such as Kubernetes, Containers, CI/CD.
Scalable forms of persistence. For example different forms of distributed NoSQL like key-value stores, wide-column stores, in-memory databases and novel scalable SQL stores.
Scalable forms of data flows and processing. For example event driven architectures using distributed message- / event-queues.
For your specific problem with parcels I would recommend considering a key-value store for your position data. Those can scale to billions of insertions and retrievals per day (when querying by key).
It also sounds like your data is somewhat temporary and could be kept in in-memory hot storage while the parcel has not yet been delivered (and archived afterwards). A distributed in-memory DB could scale even further in terms of insertions and queries.
You also probably want to decouple data extraction (through your API) from processing and persistence. For that you could consider introducing stream processing systems; a rough sketch of this decoupling follows below.
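Here is a rough sketch of that decoupling, using only the JDK's concurrency utilities as stand-ins for a real queue and key-value store; the ParcelPosition type and the stubbed API call are hypothetical:

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ParcelPipelineSketch {
    // Hypothetical event type: parcel id plus last known position.
    record ParcelPosition(String parcelId, double lat, double lon) {}

    // Stand-ins for a distributed queue and a key-value store.
    private static final BlockingQueue<ParcelPosition> queue = new LinkedBlockingQueue<>();
    private static final Map<String, ParcelPosition> hotStore = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        // Producer: polls the external tracking API (stubbed here) and enqueues events.
        Thread producer = new Thread(() ->
                queue.offer(new ParcelPosition("P-1", 52.52, 13.40)));

        // Consumer: persists positions keyed by parcel id; scaling out means
        // simply running more consumers against the same queue.
        Thread consumer = new Thread(() -> {
            try {
                ParcelPosition p = queue.take();
                hotStore.put(p.parcelId(), p);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}

In production the queue would be a message broker or stream and the map a distributed key-value store, but the shape of the flow stays the same.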

Which caching mechanism to use in my spring application in below scenarios

We are using a Spring Boot application with a MariaDB database. We get data from different services and store it in our database. When calling another service, we need to fetch data from the DB (based on a mapping) and then call that service.
To avoid the database hit, we want to keep all mapping data in a cache and use it to retrieve the data and call the service API.
So our requirement is: add data to the cache when it gets created in the database (this could add up to millions of records) and remove it from the cache when the value of a status column is "xyz" (for example), or based on an eviction policy.
Should we use an in-memory cache such as Hazelcast/Ehcache, or Redis/Couchbase?
Please suggest.
Thanks
I mostly agree with Rick in terms of "don't build it until you need it"; however, it is important these days to think early about where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it to an unprepared system is always possible but much more expensive (in terms of hours) and complicated.
OK, on to the actual question; disclaimer: I'm a Hazelcast employee.
In general, for caching, Hazelcast, Ehcache, Redis and others are all good candidates. The first question you want to ask yourself, though, is: "Can I hold all necessary records in the memory of a single machine?" Especially with Ehcache you get replication (all machines hold all information), which means every single node needs to keep everything in memory. Depending on the size you want to cache, that may not be optimal. In this case Hazelcast might be the better option, as we partition data in a cluster and optimize access down to a single network hop, with minimal overhead beyond network latency.
The second question is around serialization. Do you want to store information in a highly optimized serialization format (which needs code to transform it into something human-readable), or do you want to store it as JSON?
The third question is about the number of clients and threads that will access the data store. Obviously a local cache like Ehcache is always the fastest option, with the tradeoff of lots and lots of memory. Apart from that, the most important factor is the threading model the in-memory store uses: either it is multithreaded and scales nicely, or it uses a single-thread concept that becomes a bottleneck when you exhaust that thread. The latter can be worked around with more processes, but that remains a workaround when trying to utilize today's systems to the fullest.
In more general terms, each of the systems you mention would do the job. The best tool, however, should be selected via a POC / prototype and your real-world use case. The important bit is real world: a single thread behaves amazingly under low pressure (obviously way faster), but when exhausted it becomes a major bottleneck (again, obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
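To make the provider question concrete: with Spring's cache abstraction the choice of Hazelcast, Ehcache or Redis stays a configuration concern, and the application code can look roughly like this (the MappingService, its repository call, and the cache name are hypothetical):

import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class MappingService {

    // Loaded from MariaDB on a cache miss, then served from the cache
    // (backed by Hazelcast, Ehcache, Redis, ... depending on configuration).
    @Cacheable(value = "mappings", key = "#id")
    public Mapping findMapping(String id) {
        return loadFromDatabase(id);
    }

    // Evict when the row reaches its terminal status, mirroring the
    // "remove from cache when status is xyz" requirement above.
    @CacheEvict(value = "mappings", key = "#id")
    public void evictMapping(String id) {
        // no-op: the annotation removes the entry
    }

    private Mapping loadFromDatabase(String id) {
        return new Mapping(id); // placeholder for the real repository call
    }

    public record Mapping(String id) {}
}

A configuration class annotated with @EnableCaching plus the chosen provider's dependency is what actually wires in the concrete cache, so switching providers during the POC does not touch this code.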
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

Which are the cons of a purely stream-based architecture against a Lambda architecture? [closed]

Disclaimer: I'm not a real-time architectures expert, I'd like only to throw a couple of personal considerations and evaluate what others would suggest or point out.
Let's imagine we'd like to design a real-time analytics system. Following Nathan Marz's definition of the Lambda architecture, in order to serve the data we would need a batch processing layer (e.g. Hadoop), continuously recomputing views from a dataset of all the data, and a so-called speed layer (e.g. Storm) that constantly processes a subset of the views (made from the events coming in after the last full recomputation of the batch layer). You query your system by merging the results of the two together.
The rationale behind this choice makes perfect sense to me, and it's a combination of software engineering and systems engineering observations. Having an ever-growing master dataset of immutable timestamped facts makes the system resilient to human errors in computing the views (if you make an error, you just fix it and recompute the views in the batch layer) and enables the system to answer virtually any query that might come up in the future. Also, such a datastore would only need to support random reads and batch inserts, whereas the datastore for the speed/real-time part would need to support efficient random reads and random writes, increasing its complexity.
My objection/trigger for a discussion about this is that, in certain scenarios, this approach might be overkill. For the sake of discussion, assume we make a couple of simplifications:
Let's assume that in our analytics system we can define beforehand an immutable set of use cases/queries that our system needs to be able to serve, and that they won't change in the future.
Let's assume that we have a limited amount of resources (engineering power, infrastructure, etc.) to implement it. Storing the whole set of elementary events coming into our system, instead of precomputed views/aggregates, may just be too expensive.
Let's assume that we successfully minimize the impact of human mistakes (...).
The system would still need to be scalable and handle ever-increasing traffic and data.
Given these observations, I'd like to know what would stop us from designing a fully stream-oriented architecture. What I imagine is an architecture where the events (e.g. page views) are pushed into a stream, which could be RabbitMQ + Storm or Amazon Kinesis, and where the consumers of such streams directly update the needed views through random writes/updates to a NoSQL database (e.g. MongoDB); a rough sketch of such a consumer follows below.
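As a framework-agnostic illustration of what such a view-updating consumer does (the event shape and the in-memory map standing in for the NoSQL view store are hypothetical):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PageViewConsumerSketch {
    // Stand-in for the materialized view kept in the NoSQL store:
    // page URL -> running view count.
    private final Map<String, Long> viewCounts = new ConcurrentHashMap<>();

    // Called by the stream framework (Storm bolt, Kinesis record processor, ...)
    // for every incoming page-view event.
    public void onEvent(String pageUrl) {
        viewCounts.merge(pageUrl, 1L, Long::sum);
    }

    public long countFor(String pageUrl) {
        return viewCounts.getOrDefault(pageUrl, 0L);
    }

    public static void main(String[] args) {
        PageViewConsumerSketch consumer = new PageViewConsumerSketch();
        consumer.onEvent("/home");
        consumer.onEvent("/home");
        System.out.println(consumer.countFor("/home")); // 2
    }
}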
At a first approximation, it looks to me that such an architecture could scale horizontally. Storm can be clustered, and Kinesis' expected QoS can also be reserved upfront. More incoming events would mean more stream consumers, and as they are totally independent, nothing stops us from adding more. Regarding the database, sharding it with a proper policy would let us distribute the increasing number of writes across an increasing number of shards. In order to avoid reads being affected, each shard could have one or more read replicas.
In terms of reliability, Kinesis promises to reliably store your messages for up to 24 hours, and a distributed RabbitMQ (or whatever queue system of your choice) with proper use of acknowledgement mechanisms could probably satisfy the same requirement.
Amazon's documentation on Kinesis deliberately (I believe) avoids locking you into a specific architectural solution, but my overall impression is that they would like to push developers to simplify the Lambda architecture and arrive at a fully stream-based solution similar to the one I've described.
To be slightly more compliant with the Lambda architecture requirements, nothing stops us from having, in parallel with the consumers constantly updating our views, a set of consumers that process the incoming events and store them as atomic immutable units in a different datastore that could be used in the future to produce new views (via Hadoop, for instance) or recompute faulty data.
What's your opinion on this reasoning? I'd like to know in which scenarios a purely stream-based architecture would fail to scale, and if you have any other observations, pros/cons of a Lambda architecture vs. a stream-based architecture.

Performance impact of having a data access layer/service layer?

I need to design a system which has these basic components:
A web server which will be getting ~100 requests/sec. The web server only needs to dump data into the raw data repository.
A raw data repository, which has a single table that gets 100 rows/s from the web server.
A raw data processing unit (simple processing, not much: removing invalid raw data, inserting missing components into damaged raw data, etc.)
Processed data repository
Does it make sense in such a system to have a service layer on which all components are built, with all inter-component interaction going through the service layer? While this would make the system easily upgradeable and maintainable, would it not also have a significant performance impact, given that I have so much traffic to handle?
Here's what can happen unless you guard against it.
In the communication between layers, some format is chosen, like XML. Then you build it and run it and find out the performance is not satisfactory.
Then you mess around with profilers which leave you guessing what the problem is.
When I worked on a problem like this, I used the stackshot technique and quickly found the problem. You would have thought it was I/O. NOT. It was that converting data to XML, and parsing XML to recover data structure, was taking roughly 80% of the time. It wasn't too hard to find a better way to do that. Result - a 5x speedup.
What do you see as the costs of having a separate service layer?
How do those costs compare with the costs you must incur? In your case that seems to be at least
a network read for the request
a database write for raw data
a database read of raw data
a database write of processed data
Plus some data munging.
What sort of services do you have in mind? Perhaps:
saveRawData()
getNextRawData()
writeProcessedData()
Why is the overhead any more than a procedure call? "Service" does not need to imply "separate process" or "web service marshalling" (see the interface sketch below).
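A minimal sketch of what "service" can mean here: plain in-process interfaces, so each call costs no more than a virtual method dispatch. The type names are hypothetical:

// Hypothetical data types.
record RawData(String payload) {}
record ProcessedData(String payload) {}

// The "service layer" as plain interfaces; implementations can stay
// in-process today and move behind a queue or network boundary later
// without changing the callers.
interface RawDataService {
    void saveRawData(RawData data);
    RawData getNextRawData();
}

interface ProcessedDataService {
    void writeProcessedData(ProcessedData data);
}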
I contend that structure is always of value; separation of concerns in your application really matters. In comparison with database activities, a few procedure calls will rarely cost much.
In passing: persisting the raw data might best be done via a queueing system. You can then get some natural scaling by having many queue readers on separate machines if you need them. In effect, the queueing system naturally introduces some service-like concepts.
Personally, I feel that you might be focusing too much on low-level implementation details when designing the system. Before looking at how to lay out the components, assemblies or services, you should be thinking about how to architect the system.
You could start with the following high level statements from which to build your system architecture around:
Confirm the technical skill set of the development team and the operations/support team.
Agree on an initial finite list of systems that will integrate with your service, the protocols they support and some SLAs.
Decide on the messaging strategy.
Understand how you will deploy your service/system.
Decide on the choice of middleware (ESBs, Message Brokers, etc), databases (SQL, Oracle, Memcache, DB2, etc) and 3rd party frameworks/tools.
Decide on your caching and data latency strategy.
Break your application into the various areas of business responsibility - This will allow you to split up the work and allow easier communication of milestones during development/testing and implementation.
Design each component as required to meet the areas of responsibility. The areas of responsibility should automatically lead you to decide how to design each component, assembly or service.
Obviously not all of the above will match your specific case but I would suggest that they should at least be given some thought.
Good luck.
Abstraction and tiering will introduce latency, but the real question is, what are you GAINING to make the cost(s) worthwhile? Loose coupling, governance, scalability, maintainability are worth real $.
Even the best-designed layered app will exhibit more latency than an app talking directly to a DB. Users who know the original system will feel the difference. They may not like it, so this can be a political issue as much as a technical one.

Recommendation for a large-scale data warehousing system

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this. Obviously it needs to be reliable and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part, you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production-quality, enterprise-ready database, e.g. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
@Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional data and your data warehouse.
Seriously consider a periodic ETL from transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
I'm surprised none of the answers here cover Hadoop and HDFS; I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you would use HDFS (a distributed file system that runs well on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run MapReduce queries against your data to produce reports; a rough sketch of such a job follows below.
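A minimal sketch of the kind of batch report this enables, using the standard Hadoop MapReduce Java API to count events per day; the input format (one event per line, prefixed by its date) is an assumption:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventsPerDay {
    public static class DayMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumes each log line starts with "yyyy-MM-dd ".
            String day = line.toString().split(" ", 2)[0];
            ctx.write(new Text(day), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text day, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            ctx.write(day, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "events-per-day");
        job.setJarByClass(EventsPerDay.class);
        job.setMapperClass(DayMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // report output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}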
Wow, this is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definite difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this in two ways:
Throw money at the problem: buy best-in-class software (databases, reporting software) and hire a few slick tech people to help.
Take the homegrown approach: build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As for the EC2 approach, I'm not sure how it would fit into a data storage strategy. The processing requirements are limited, which is where EC2 is strong; your primary goal is efficient storage and retrieval.
