I'm collecting analytics data into a master map that holds many other nested maps.
Since maps are immutable, many new maps are going to be allocated (though structural sharing keeps that reasonably efficient in Clojure).
The basic operation I'm using is update-in, which is very convenient for updating a value at a given path, or for creating the binding when the path doesn't exist yet.
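For illustration, a typical update looks something like this (the keys here are made up):

    ;; bump a counter at a nested path; update-in creates the intermediate
    ;; maps when the path doesn't exist yet, and fnil supplies a 0 for a
    ;; brand-new binding
    (def stats (atom {}))

    (swap! stats update-in [:sensor-7 "2024-01-01" :hits] (fnil inc 0))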
Once I reach a certain point, I'm going to save that data structure to the database.
What would be a better way to collect this data more efficiently in Clojure? A transient data structure?
As with all optimizations, measure first. If the map updates are a bottleneck, then switching to a transient map is a rather unintrusive code change. If you find that GC overhead is the real culprit, as it often is with persistent data structures, and transients don't help enough, then a larger but potentially more effective change is to collect the data into a list and batch-add it into a transient map, which is made persistent and saved into the DB at the end. Adding to a list produces very little GC overhead because, unlike adding to a map, the old head does not need to be discarded and GC'd.
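A minimal sketch of that list-then-batch approach (save-to-db! is a hypothetical stand-in for your persistence call):

    ;; collect cheaply into a list: conj only allocates one cons cell,
    ;; and the old head stays shared, so GC pressure stays minimal
    (def samples (atom '()))

    (defn record! [sample]
      (swap! samples conj sample))

    ;; at save time, build the map once through a transient, then persist
    (defn build-index [samples]
      (persistent!
        (reduce (fn [m {:keys [sensor ts value]}]
                  (assoc! m [sensor ts] value))  ; flat key: transients don't support update-in
                (transient {})
                samples)))

    ;; (save-to-db! (build-index @samples))   ; save-to-db! is hypothetical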
I need a fixed-size cache of objects that keeps track of how many times each object was requested. When it is full and a new object is added, the object with the lowest usage score gets removed.
So this is different from an LRU cache of size N in that if some object is heavily requested, then even adding N new objects won't push it out of the cache.
Some kind of mix of a cache and a priority queue. Is there a name for that?
Thanks!
What you're describing is usually called an LFU (least-frequently-used) cache. Without a time element, this kind of cache clogs up with things that were used a lot in the past but aren't used currently. Replacement becomes impossible, because everything in the cache has been used more than once, so you won't evict anything in favor of a new item.
You could write some code that degrades the value of the count over time (i.e., takes into account the time since last use), but doing so is just a really complicated way of simulating an LRU cache. I experimented with that at one point, but found that it didn't perform any better than the simple LRU cache. At least not in my application.
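For reference, a minimal LFU sketch (plain Clojure maps, no concurrency; eviction is a linear scan, which is fine for small caches):

    ;; one map for entries, one for hit counts; lookup bumps the count,
    ;; insert evicts the least-used key when the cache is full
    (defn lfu-hit [cache k]
      (update-in cache [:counts k] (fnil inc 0)))

    (defn lfu-insert [{:keys [entries counts limit] :as cache} k v]
      (let [cache (if (and (>= (count entries) limit)
                           (not (contains? entries k)))
                    (let [victim (key (apply min-key val counts))]
                      (-> cache
                          (update :entries dissoc victim)
                          (update :counts dissoc victim)))
                    cache)]
        (-> cache
            (assoc-in [:entries k] v)
            (assoc-in [:counts k] 1))))

    ;; (-> {:entries {} :counts {} :limit 2}
    ;;     (lfu-insert :a 1) (lfu-insert :b 2) (lfu-hit :a)
    ;;     (lfu-insert :c 3) :entries)   ;=> {:a 1, :c 3}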
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over? I believe row_cache_size_in_mb is supposed to cache frequently used records in memory, but setting it to say 10MB seems to make no difference.
I tried cassandra-stress of course, but the highest read throughput it achieves with 1KB records (-col size=UNIFORM\(1000..1000\)) is ~15K/s.
With low numbers like above, I can easily write an in-memory hashmap based cache that will give me at least a million reads per second for a small working set size. How do I make cassandra do this automatically for me? Or is it not supposed to achieve performance close to an in-memory map even for a tiny working set size?
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over?
There are a couple of options for this scenario.
One idea is to use the row cache, but be careful: any update/delete of a single column invalidates the whole partition in the cache, so you lose all the benefit. The row cache is best used for small datasets that are read frequently but almost never modified. Note also that the row cache has to be enabled per table (via the caching table property); row_cache_size_in_mb on its own only sets the global capacity.
Are you sure that your cassandra-stress scenario never updates or writes to the same partition over and over again?
Here are my findings: when I set row_cache, counter_cache, and key_cache all to sizable values, I am able to verify using "top" that Cassandra does no disk I/O at all; all three seem necessary to ensure no disk activity. Yet, despite zero disk I/O, the throughput is <20K/s even for reading a single record over and over. This likely confirms (as also alluded to in my comment) that Cassandra incurs the cost of serialization and deserialization even when its operations are completely in-memory, i.e., it is not designed to compete with native hashmap performance. So, if you want native hashmap speeds for a small-working-set workload, but the ability to spill to disk if the map grows big, you would need to write your own cache on top of Cassandra (or, for that matter, any of the other key-value stores like mongo, redis, etc.).
For those interested, I also verified that redis is the fastest among cassandra, mongo, and redis for a simple get/put small-working-set workload, but even redis gets at best ~35K/s read throughput (largely independent, by design, of the request size), which hardly comes anywhere close to native hashmap performance that simply returns pointers and can do so comfortably at over 2 million/s.
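To illustrate the "own cache on top of the store" idea, a minimal read-through sketch (fetch-row is a hypothetical function that performs the actual Cassandra/Redis read):

    ;; in-process read-through cache: hits are a plain map lookup,
    ;; misses fall through to the slow store and are remembered
    (def ^:private row-cache (atom {}))

    (defn cached-read [fetch-row k]
      (if-let [hit (find @row-cache k)]   ; find distinguishes cached nils
        (val hit)
        (let [v (fetch-row k)]
          (swap! row-cache assoc k v)
          v)))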
My question is similar to this one. I need a data structure to store and access a large amount of time-series data. In my case the insert rate is very high: 10-100k inserts per second. Data items are tuples containing a timestamp, a sensor id, and a sensor value, and I have a very large number of sensors. Values older than some point in time must be erased.
I need to query the dataset by sensor id and time range. All the data must be stored in external memory; there is no way to fit it in main memory.
I already know about the TSB-tree, but the TSB-tree is hard to implement and there is no guarantee that it will do the job. I suspect that the TSB-tree doesn't behave very well under a high insert rate.
Is there any alternative? Maybe something like an LSM-tree, but for multidimensional data?
Because you're using external memory, you may want to read through the chapter on B-trees in Henrik Jonsson's thesis. B-trees themselves are a very popular way to index data in external memory and you should be able to find implementations in any language, and Jonsson discusses how to adapt them to store time series data.
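To make the key layout concrete, here's an in-memory illustration using a Clojure sorted map; a disk-backed B-tree would use the same [sensor-id timestamp] composite key, so that "sensor + time range" becomes a single range scan and expiring old data becomes a range delete:

    ;; vectors compare element-wise, so all entries for one sensor are
    ;; contiguous and time-ordered in the index
    (def ts-index
      (sorted-map [:sensor-1 100] 20.5
                  [:sensor-1 150] 21.0
                  [:sensor-2 120] 18.3))

    (defn query [idx sensor t0 t1]
      (subseq idx >= [sensor t0] <= [sensor t1]))

    ;; (query ts-index :sensor-1 100 150)
    ;; => ([[:sensor-1 100] 20.5] [[:sensor-1 150] 21.0])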
My scenario is as follows. I have a data table with a million rows of tuples (say first name and last name), and a client that needs to retrieve a small subset of rows whose first name or last name begins with the query string. Caching this seems like a catch-22, because:
On the one hand, I can't store and retrieve the entire data set on every request (would overwhelm the network)
On the other hand, I can't just store each row individually, because then I'd have no way to run a query.
Storing ranges of values in the cache, with a local "index" or directory, would work... except that you'd have to essentially duplicate the data for each index, which defeats the purpose of even using a distributed cache.
What approach is advisable for this kind of thing? Is it possible to get the benefits of using a distributed cache, or is it simply not feasible for this kind of scenario?
Distributed caching is feasible for queryable data sets.
But for this scenario, a native database function or stored procedure should give much faster results. If a narrower scope (session or application) isn't possible, a lot of server-side iteration would be required to fetch the data for each request.
Rebuilding the database's indexes on the caching side is never a good idea.
If network load is still an issue, you could consider a document-oriented or column-oriented NoSQL database, if that's feasible.
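If you do want to keep a distributed cache in the picture, one common compromise is to cache result sets keyed by the query itself rather than caching individual rows; a sketch (cache-get, cache-put!, and db-prefix-query are hypothetical stand-ins for your cache client and DB call):

    (require '[clojure.string :as str])

    ;; key the cache by the normalized prefix, so each distinct search
    ;; hits the database at most once and no local index is needed
    (defn rows-for-prefix [cache-get cache-put! db-prefix-query prefix]
      (let [k (str "prefix:" (str/lower-case prefix))]
        (or (cache-get k)
            (let [rows (db-prefix-query prefix)]
              (cache-put! k rows)
              rows))))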
We have an external service that continuously sends us data. For the sake of simplicity, let's say this data consists of three strings in tab-delimited fashion.
datapointA datapointB datapointC
This data is received by one of our servers and then is forwarded to a processing engine where something meaningful is done with this dataset.
One of the requirements of the processing engine is that duplicate records will not be processed. So for instance, on day 1 the processing engine received A B C, and on day 243 the same A B C was received by the server. In this situation, the processing engine will emit a warning, "record already processed", and not process that particular record.
There may be a few ways to solve this issue:
Store the incoming data in an in-memory HashSet, where set membership indicates whether a particular record has already been processed. Problems will arise when this service has to run with zero downtime: depending on the surge of data, the collection can exceed the bounds of memory. Also, in case of system outages, this data needs to be persisted someplace.
Store the incoming data in a database, and only process the next set of data if it is not already present in the database. This helps with the durability of the history in case of some catastrophe, but there's the overhead of maintaining proper indexes and of aggressive sharding if performance becomes an issue.
...or some other technique.
Can somebody point out some case-studies or established patterns or practices to solve this particular issue?
Thanks
You need some kind of backing store for persistence, whatever the solution, so that much has to be implemented in any case. But it doesn't have to be an SQL database for something so simple; see, for example, alternative to memcached that can persist to disk.
In addition to that, you could consider a Bloom filter to reduce the in-memory footprint. Bloom filters can give false positives, so on a hit you would need to fall back to a second (slower but reliable) layer, which could be the disk store. A sketch follows below.
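A minimal Bloom-filter sketch on top of java.util.BitSet (illustrative only; a real system would use a library such as Guava's BloomFilter, with more and better hash functions):

    (import 'java.util.BitSet)

    (def ^:private num-bits 1000003)

    (defn- bit-positions [^String record]
      ;; two cheap hashes derived from String.hashCode; clojure.core/mod
      ;; returns a non-negative result for a positive divisor
      [(mod (.hashCode record) num-bits)
       (mod (.hashCode (str record "#salt")) num-bits)])

    (defn bloom-add! [^BitSet bs record]
      (doseq [i (bit-positions record)] (.set bs i)))

    (defn bloom-maybe-seen? [^BitSet bs record]
      ;; false = definitely new; true = "check the slow, reliable layer"
      (every? #(.get bs %) (bit-positions record)))

    ;; (def seen (BitSet. num-bits))
    ;; (bloom-add! seen "datapointA\tdatapointB\tdatapointC")
    ;; (bloom-maybe-seen? seen "datapointA\tdatapointB\tdatapointC") ;=> true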
And finally, the need for idempotent behaviour is really common in messaging/enterprise systems, so a search like this turns up more papers and ideas (not sure if you're aware that "idempotent" is a useful search term).
You could create a hash of each record and store that in the backing store instead, which would be smaller than the actual data (provided your records aren't smaller than a hash).
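For example (a sketch; SHA-256 gives a fixed 32-byte digest per record, regardless of record size):

    (import 'java.security.MessageDigest)

    ;; hash the raw record text to a fixed-size hex digest and store the
    ;; digest instead of the record itself
    (defn record-digest [^String record]
      (let [md (MessageDigest/getInstance "SHA-256")]
        (apply str (map #(format "%02x" %)
                        (.digest md (.getBytes record "UTF-8"))))))

    ;; (record-digest "datapointA\tdatapointB\tdatapointC")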