Frequent Updates on Apache Ignite - caching

I hope someone experienced with Apache Ignite can help guide my team towards the right answer for a new setup.
Overall Setup
Data is continuously generated from many distributed sensors and streamed into our database. Each sensor may deliver multiple updates every second, but generally generates fewer than 10 updates/sec.
The daily data volume is approx. 50 million records, per site.
Data Description
Each record consists of the following values
Sensor ID
Point ID
Timestamp
Proximity
where Sensor ID is our ID of the sensor, Point ID is an ID of some point on the site, and Proximity is a proximity measurement from the sensor to the point.
Each second there is approx. 1000 such new records. A record is never updated.
Query Workload
Queries are fairly complex with significant (and dynamic) look-back in time. A query may require data from several sensors in one site, but the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but frequently it is necessary to query over many days.
Generally, we therefore have a write-once query-many scenario.
Initial Strategy
If we load the data into primitive integer arrays in, e.g., Java, the space consumption for a week approaches 5 GB. Because that is "peanuts" on today's platforms, we intend to load all data onto all nodes in the Ignite cluster/distributed cache. In other words, use a replicated cache.
However, the continuous updates keep puzzling me. If I update the entire cache, I imagine quite substantial amounts of data need to be transferred across the network every second.
Creating chunks for, say, each minute/hour is not necessarily going to work (well) either as each sensor can be temporarily offline, which will make it deliver stale data at some later point in time.
My question is therefore how to efficiently handle this stream of updates, while maintaining a consistent view of the data for the last 7-10 days.
My current, local implementation chunks the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated one. This works well on a single machine but is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation yet, so I have not been able to test this.
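For reference, here is a simplified Python sketch of that chunking idea (names are illustrative, not from any Ignite API): records are keyed by (sensor, hour), so a late-arriving record only touches its original hour's chunk.

```python
from collections import defaultdict

CHUNK_SECONDS = 3600  # 1-hour chunks

def chunk_key(sensor_id, timestamp):
    """Key a record by sensor and the hour it falls into."""
    return (sensor_id, timestamp // CHUNK_SECONDS)

class ChunkStore:
    """Toy in-memory chunk store; in a cluster the dict would be a cache."""
    def __init__(self):
        self.chunks = defaultdict(list)

    def add(self, sensor_id, point_id, timestamp, proximity):
        # A late (stale) record lands in its original hour's chunk,
        # so only that one chunk needs to be replaced/republished.
        self.chunks[chunk_key(sensor_id, timestamp)].append(
            (point_id, timestamp, proximity))

    def query(self, sensor_id, t_from, t_to):
        """Scan only the chunks overlapping [t_from, t_to]."""
        out = []
        for h in range(t_from // CHUNK_SECONDS, t_to // CHUNK_SECONDS + 1):
            out.extend(r for r in self.chunks.get((sensor_id, h), [])
                       if t_from <= r[1] <= t_to)
        return out
```

The point of the sketch is that a stale record arriving hours late invalidates only one hour-sized chunk, not the whole dataset.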
Ideally, each node in the Ignite cluster would maintain its own copy of all data within the last X days, and apply the small update workload continuously.
So my question is, how would fellow Igniters approach this problem?

It sounds like you want to scale the load across multiple servers, but that's not possible with replicated caches, because each update will always hit all nodes, and the more nodes you have, the more network traffic you will get. I think you should use partitioned caches instead and try adding nodes until the system is capable of handling the load.
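To illustrate why a partitioned cache spreads the write load, here is a toy sketch (not Ignite's actual affinity function) of hashing keys into partitions and assigning partitions to nodes:

```python
def partition_for(key, num_partitions=1024):
    # Hash the key into a fixed partition space.
    return hash(key) % num_partitions

def node_for(partition, nodes):
    # Round-robin partition ownership: each update touches only one
    # primary node (plus backups), instead of every node as in a
    # replicated cache.
    return nodes[partition % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
# Each sensor's updates land on a single primary node:
owners = {s: node_for(partition_for(("sensor", s)), nodes)
          for s in range(6)}
```

Adding a node then takes over a share of the partitions, so per-node write traffic goes down as the cluster grows, rather than up.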

Related

extremely high SSD write rate with multiple concurrent writers

I'm using QuestDB as a backend for storing collected data, using the same script for different data sources.
My problem is the extremely high disk (SSD) usage. Over 4 days it has written 335 MB per second.
What am I doing wrong?
Inserting data using the ILP interface:
sender.row(
    metric,
    symbols=symbols,
    columns=data,
    at=row['ts']
)
I don't know how much data you are ingesting, so I'm not sure if 335 MB per second is a lot or not. But since you are surprised by it, I am going to assume your throughput is lower than that. It might be the case that your data is out of order, especially if you are ingesting from multiple data sources.
QuestDB keeps the data per table always in incremental order by designated timestamp. If data arrives out of order, the whole partition needs to be rewritten. This might lead to write amplification where you see your data is being rewritten very often.
Until literally a few days ago, to fine-tune this you would need to change the default config, but since version 6.6.1 this is adjusted dynamically.
Maybe you want to give version 6.6.1 a try. Alternatively, if data from different sources is arriving out of order (relative to each other), you might want to create separate tables for the different sources, so data is always in order within each table.
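As a rough sketch of that per-source routing (table names here are illustrative; the actual ingestion would still go through the ILP sender), plus a quick way to check whether a batch is out of order:

```python
def table_for(source_id):
    # One table per data source, so each table's timestamps stay
    # monotonic even when sources lag relative to each other.
    return f"metrics_{source_id}"

def is_out_of_order(last_ts, timestamps):
    """Return True if any timestamp goes backwards, i.e. would force
    QuestDB to rewrite (O3) the affected partition."""
    out = False
    for ts in timestamps:
        if ts < last_ts:
            out = True
        last_ts = max(last_ts, ts)
    return out
```

If `is_out_of_order` fires constantly on a shared table but never on per-source streams, splitting the tables should remove the write amplification.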
I have been experimenting a lot and it seems that you're absolutely right. I was ingesting 14 different clients into a single table. After splitting this into 14 tables, one for each client, the problem disappeared.
Another advantage is that I need one symbol fewer, as I no longer have to distinguish the rows by source.
By the way - thank you and your team for this marvellous tool you gave us! It makes my work so much easier!!
Regards

microservices, caching and load balancing design patterns

I have a real-time, data-intensive application that uses a local in-app/in-memory cache.
40,000 vehicles send data to 1 server (every 5 secs), and I have to work out the distance travelled between the previous and current locations.
To do this I cache each vehicle's previous lat/lon; when a new piece of data arrives, I take the new lat/lon, work out the distance travelled between the points (e.g. 5 feet), and add this to the accumulating odometer on the vehicle (e.g. 60,000 miles).
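The per-vehicle update is roughly this (a simplified sketch using haversine distance, with a plain dict standing in for the in-memory cache):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

last_position = {}   # vehicle_id -> (lat, lon)
odometer_m = {}      # vehicle_id -> accumulated metres

def on_update(vehicle_id, lat, lon):
    # Read previous position from the cache, accumulate distance,
    # then write the new position back.
    prev = last_position.get(vehicle_id)
    if prev is not None:
        odometer_m[vehicle_id] = (odometer_m.get(vehicle_id, 0.0)
                                  + haversine_m(prev[0], prev[1], lat, lon))
    last_position[vehicle_id] = (lat, lon)
```

The key observation for load balancing is that each update reads and writes only that one vehicle's cache entry, so state is naturally partitionable by vehicle ID.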
I need to start load balancing this to handle scale.
A local cache would obviously be out of date once requests hit 2 different servers.
However, using a distributed cache seems like it would massively slow down processing due to the network hop to a shared cache, especially with the volumes and frequency mentioned above.
One solution could be using sticky sessions, so car A always goes to server A, and periodically updating the in-memory cache in case a server goes down.
However, I'm sure this problem has been solved in the past.
Are there industry caching patterns to use in this scenario ?
I am wondering how this went for you. I would have started with the sticky-session, in-memory-cache option, given the nature of the load. It appears that one vehicle can be assigned to a single server, and a local cache can track the previous lat/lng. The only thing is that once a car stops sending data, you need to be able to recognize that and release the server for the next car. Anyway, curious to know how it worked out. Interesting problem.
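One common industry pattern here is consistent hashing rather than plain sticky sessions: each vehicle maps to a point on a hash ring, so adding or removing a server only remaps a fraction of vehicles. A minimal sketch (names illustrative, not from any particular library):

```python
import bisect
import hashlib

def _h(s):
    # Stable hash (Python's built-in hash() is salted per process).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Place several virtual nodes per server for an even spread.
        self.ring = sorted((_h(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def server_for(self, vehicle_id):
        # First ring point clockwise from the vehicle's hash owns it.
        i = bisect.bisect(self.keys, _h(vehicle_id)) % len(self.ring)
        return self.ring[i][1]
```

With the load balancer routing by `server_for(vehicle_id)`, each server only needs a local cache for the vehicles it owns, avoiding the shared-cache network hop.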

Apache Kylin fault tolerance

Apache Kylin looks like a great tool that will meet the needs of a lot of data scientists. It's also a very complex system. We are developing an in-house solution with exactly the same goal in mind: a multidimensional OLAP cube with low query latency.
Among the many issues, the one I'm most concerned about right now is fault tolerance.
With large volumes of incoming transactional data, the cube must be incrementally updated, and some of the cuboids are updated over long periods of time, such as those with a time dimension at the scale of a year. Over such a long period, some piece of the complex system is guaranteed to fail, so how does the system ensure all the raw transactional records are aggregated into the cuboids exactly once, no more, no less? Even if each of the pieces has its own fault tolerance mechanism, it doesn't mean they will play together automatically.
For simplicity, we can assume all the input data are saved in HDFS by another process, and can be "played back" in any way you want to recover from any interruption, voluntary or forced. What are Kylin's fault tolerance considerations, or is it not really an issue?
There are data faults and system faults.
Data fault tolerance: Kylin partitions a cube into segments and allows rebuilding an individual segment without impacting the whole cube. For example, assume a new segment is built on a daily basis and merged into a weekly segment at the weekend; weekly segments merge into a monthly segment, and so on. When there is a data error (or whatever change) within the past week, you need to rebuild only one day's segment. Data changes further back will require rebuilding a weekly or monthly segment.
The segment strategy is fully customizable, so you can balance data error tolerance against query performance. More segments means more tolerance to data changes, but also more scans to execute for each query. Kylin provides a RESTful API; an external scheduling system can invoke the API to trigger segment builds and merges.
A cube is still online and can serve queries while some of its segments are being rebuilt.
System fault tolerance: Kylin relies on Hadoop and HBase for most system redundancy and fault tolerance. In addition, every build step in Kylin is idempotent, meaning you can safely retry a failed step without any side effects. This ensures the final correctness, no matter how many failures and retries the build process has been through.
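To sketch why idempotence matters, a retry loop like the following is only safe because re-running a completed or half-completed step yields the same final state (this is an illustration, not Kylin's actual job engine):

```python
def run_with_retries(steps, max_retries=3):
    """Run build steps in order; retry a failed step from scratch.
    Safe only because each step is idempotent: re-running it after a
    partial failure leaves the same state as running it once."""
    for step in steps:
        for attempt in range(max_retries + 1):
            try:
                step()
                break  # step succeeded, move to the next one
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries, surface the failure
```

Without idempotence, the same loop could double-count records on retry, which is exactly the exactly-once concern the question raises.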
(I'm also an Apache Kylin co-creator and committer. :-)
Note: I'm an Apache Kylin co-creator and committer.
The fault tolerance point is a really good one, which we have actually been asked about in some cases where users have extremely large datasets. Recalculating from the beginning would require huge computing resources, network traffic and time.
But from a product perspective, the question is: which is more important, precise results or resources? For transactional data, I believe the exact number is more important, but for behavioral data it should be fine; for example, the distinct count value in Kylin is currently an approximate result. It depends on what kind of case you will leverage Kylin for to serve business needs.
We will put this idea into our backlog and will update the Kylin dev mailing list if we have a clearer plan for this later.
Thanks.

Strategy for "user data" in couchbase

I know that a big part of Couchbase's performance comes from serving in-memory documents, and for many of my data types that seems like an entirely reasonable aspiration. But considering how user data scales and is used, I'm wondering if it's reasonable to plan for only a small percentage of the user documents to be in memory at any given time; I'm thinking maybe only 10-15%. Is this a reasonable assumption considering:
At any given time, only a fraction of users will be using the system.
In this case, users only access their own data (or predominantly so)
Recently entered data is exponentially more likely to be viewed than historical user documents
UPDATE:
Some additional context:
Let's assume there's a user base of 1 million customers, of whom 20% rarely if ever access the site, 40% access it once a week, and 40% access it every day.
At any given moment, only 5-10% of the user population would be logged in
When a user logs in, they are likely to re-query certain documents within a single session (although the client does do some object caching to minimise this)
For any user, the most recent records are very active, the very old records very inactive
In summary, I would say the majority of user-triggered transactional documents are queried quite infrequently, but there is a core set -- records produced in the last 24-48 hours and relevant to the currently "logged in" group -- that would benefit significantly from being in memory.
Two sub-questions are:
Is there a way to indicate, on a per-document basis (e.g. via a timestamp), that a document needs to be kept in memory?
How does Couchbase cope with the growing list of document IDs in memory? It is my understanding that all IDs must always be in memory; isn't this too memory-intensive for some apps?
First, one of the major benefits of CB is the fact that it is spread across multiple nodes. This also means your queries are spread across multiple nodes, and you get a performance gain as a result (I know several other similar NoSQL stores are also spread across nodes, so maybe this is not relevant for your comparison?).
Next, I believe this question is a little too broad, as the answer will really depend on your usage. Does a given user only query their data once, at random? If so, then according to your numbers there will only be an in-memory benefit 10-15% of the time. If instead, once a user is on the site, they might query their data multiple times, there is a definite performance benefit.
Regardless, Couchbase has pretty fast disk-access performance, particularly on SSDs, so it probably doesn't make much difference either way; but again, without specifics there is no way to be sure. If it's a relatively small document size, and a user is waiting for one of them to load, the user certainly will not notice whether the document is loaded from RAM or disk.
Here is an interesting article on benchmarks for CB against similar nosql platforms.
Edit:
After reading your additional context, I think your scenario lines up pretty much exactly with how Couchbase was designed to operate. From an eviction standpoint, CB keeps the newest and most frequently accessed items in RAM. As RAM fills up with new and/or hot items, the oldest and least frequently accessed are "evicted" to disk. This link from the Couchbase Manual explains more about how this works.
I think you are on the right track with Couchbase - in any regard, its flexibility with scaling will easily allow you to tune the database to your application. I really don't think you can go wrong here.
Regarding your two questions:
Not in Couchbase 2.2
You should use relatively small document IDs. While it is true that they are stored in RAM, your deployment is not "right-sized" if you are using a significant percentage of the available cluster RAM to store keys. This link talks about keys and gives details relevant to key size (e.g. the 250-byte limit on size, metadata, etc.).
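As a rough sizing sketch for the key-RAM question (the 56-byte per-key metadata overhead used here is an assumption based on Couchbase 2.x-era sizing guidance; check the manual for your version):

```python
METADATA_BYTES_PER_KEY = 56  # assumed per-key metadata overhead

def key_ram_gb(num_documents, avg_key_bytes):
    """RAM needed just to hold keys + metadata for every document,
    since Couchbase keeps all keys resident regardless of value
    residency ratio."""
    total = num_documents * (avg_key_bytes + METADATA_BYTES_PER_KEY)
    return total / 1024 ** 3

# e.g. 1M users x ~50 documents each, with 30-byte keys:
ram = key_ram_gb(50_000_000, 30)
```

So even with only 10-15% of document values resident, the keys themselves claim a fixed slice of cluster RAM, which is why small IDs matter.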
Basically, the decision point you are facing is sizing the Couchbase cluster's bucket RAM, allowing a reduced residency ratio (% of document values in RAM), and using cache misses to pull from disk.
However, there are caveats in this scenario as well. You will basically also have relatively constant "cache eviction", where "not recently used" values are removed from the RAM cache as you pull cache-missed documents from disk into RAM. This is because you will always be floating at the high-water mark for the bucket RAM quota. If you simultaneously have a high write velocity (new/updated data), those writes will also need to be persisted. These two processes can compete for disk I/O if the write velocity exceeds your capacity to evict/retrieve, and your SDK client will receive a temporary OOM error if you cannot evict fast enough to open up RAM for new writes. As you scale horizontally, this becomes less likely, as you have more disk I/O capacity spread across more machines all doing this process simultaneously.
If by "queried" you mean querying indexes (i.e. views), that is a separate data structure on disk, and getting results back is of course not subject to eviction/NRU; but if you follow the view query with a multi-get, the above still applies. (Don't emit entire documents into your index!)

Cassandra scaling cheat-sheet

Of course, you can only know the performance of your system, with your load and your use-cases, by ... actually implementing it! That aside, before embarking on a prototype, I'm searching for some very rough estimates of how Cassandra performs.
For various configurations of nodes and data centres, and for various read and write consistency levels, what are the chances of reading a stale value? What kind of key reads and writes per second would you expect to sustain, and what kind of latency would each read and write have?
Cassandra benchmarking presented at VLDB earlier this year: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Performance/consistency tradeoffs: http://www.datastax.com/dev/blog/your-ideal-performance-consistency-tradeoff
We run an application with 500 datapoints posted per second per web node (we have 6 Cassandra nodes). We could probably get 1000 datapoints per second per node if we cached 100M of data in the client to avoid the read.
The profile of that is using PlayOrm with one findAll(List keys) and one putAll(List entities) on each request, where each key in that list is a single data point, as the clients send a batch of datapoints over HTTP so we don't have as much HTTP overhead... maybe that gives you some idea, at least, though I'm not sure.
We have not yet tested the correct ratio of web nodes to Cassandra nodes, but I suspect it is like my last project, where it was near one-to-one, though it changes with the profile.
We run 4 web nodes and get 2000 datapoints per second right now.
