Duplicates from a stream

Duplicates from a stream - algorithm

We have an external service that continuously sends us data. For the sake of simplicity lets say this data has three strings in tab delimited fashion.
datapointA datapointB datapointC
This data is received by one of our servers and then is forwarded to a processing engine where something meaningful is done with this dataset.
One of the requirements of the processing engine is that duplicate results will not be processed by the processing engine. So for instance on day1, the processing engine received
A B C, and on day 243, the same A B C was received by the server. In this particular situation, the processing engine will spit out a warning,"record already processed" and not process that particular record.
There may be a few ways to solve this issue:
Store the incoming data in an in-memory HashSet, and set exculsion
will indicate the processing status of the particular record.
Problems will arise when we have this service running with zero
downtime and depending on the surge of data, this collection can
exceed the bounds of memory. Also, in case of system outages, this
data needs to be persisted someplace.
Store the incoming data in the database and the next set of data will
only be processed if the data is not present in the database. This
helps with the durability of the history in case of some catastrophe
but there's the overhead of maintaing proper-indexes and aggressive
sharding in the case of performance related issues.
....or some other technique
Can somebody point out some case-studies or established patterns or practices to solve this particular issue?
Thanks

you need some kind of backing store, for persistence, whatever the solution. so no matter how much work that has to be implemented. but it doesn't have to be an sql database for something so simple - alternative to memcached that can persist to disk
in addition to that, you could consider bloom filters for reducing the in-memory footprint. these can give false positives, so then you would need to fall back to a second (slower but reliable) layer (which could be the disk store).
and finally, the need for idempotent behaviour is really common in messaging/enterprise systems, so a search like this turns up more papers/ideas (not sure if you're aware that "idempotent" is a useful search term).

You could create a hash of the data and store that in a backing store which would be smaller than the actual data (provided your data isn't smaller than a hash)

Related

Which caching mechanism to use in my spring application in below scenarios

We are using Spring boot application with Maria DB database. We are getting data from difference services and storing in our database. And while calling other service we need to fetch data from db (based on mapping) and call the service.
So to avoid database hit, we want to cache all mapping data in cache and use it to retrieve data and call service API.
So our ask is - Add data in Cache when it gets created in database (could add up-to millions records) and remove from cache when status of one of column value is "xyz" (for example) or based on eviction policy.
Should we use in-memory cache using Hazelcast/ehCache or Redis/Couch base?
Please suggest.
Thanks

I mostly agree with Rick in terms of don't build it until you need it, however it is important these days to think early of where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it into a non-prepared system is always possible but much more expensive (in terms of hours) and complicated.
Ok to the actual question; disclaimer: Hazelcast employee
In general for caching Hazelcast, ehcache, Redis and others are all good candidates. The first question you want to ask yourself though is, "can I hold all necessary records in the memory of a single machine. Especially in terms for ehcache you get replication (all machines hold all information) which means every single node needs to keep them in memory. Depending on the size you want to cache, maybe not optimal. In this case Hazelcast might be the better option as we partition data in a cluster and optimize the access to a single network hop which minimal overhead over network latency.
Second question would be around serialization. Do you want to store information in a highly optimized serialization (which needs code to transform to human readable) or do you want to store as JSON?
Third question is about the number of clients and threads that'll access the data storage. Obviously a local cache like ehcache is always the fastest option, for the tradeoff of lots and lots of memory. Apart from that the most important fact is the treading model the in-memory store uses. It's either multithreaded and nicely scaling or a single-thread concept which becomes a bottleneck when you exhaust this thread. It is to overcome with more processes but it's a workaround to utilize todays systems to the fullest.
In more general terms, each of your mentioned systems would do the job. The best tool however should be selected by a POC / prototype and your real world use case. The important bit is real world, as a single thread behaves amazing under low pressure (obviously way faster) but when exhausted will become a major bottleneck (again obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.

Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

How is Hazelcast distributed cache faster than a call to the DB?

Lets say there I have 2 servers that are using Hazelcasts distributed cache. If on server #1, I store 2 items in a map in that distributed cache. One of those items will be saved in the local back up, and the other will be stored in the backup of the other servers Hazelcast instance(Please correct me if that is incorrect).
My question is, if I try to retrieve the second item from the cache(stored in the backup on server #2), a TCP call will be made to retrieve that data. How is this faster than just calling the DB?

First of all let me correct how data is stored on Hazelcast.
Hazelcast uses a distribution algorithm based on consistent hashing, meaning the hashing algorithm returns the same output for the same input all the time. This distribution is not 100% equal distribution but for high number of elements pretty good and cost effective. That said it doesn't mean you'll have one element on each node in the worst case.
By default Hazelcast also keeps on backup, that means each node will have both elements (in a 2 node setup), either owned data or as a backup for failure case. You can make backups readable (read-from-backup=true), however that introduces a slight chance to read staled data (time between owner is updated but backup is not yet).
In addition data in Hazelcast, again by default, is stored in serialized form, means binary streamable representation.
Ok so how can all this be faster than a TCP connection to your database?
The answer is twofold:
Hazelcast is a key-value store. Therefore it is optimized for requesting data by key and answering with the value as quickly as possible.
Data is already serialized, therefore the byte stream is just "smashed" into the socket without any real further work to be done.
Your database on the other hand has to really query data from a table. The internal data structures to hold the information is optimized for complex queries but not to access on a key base. But, and this is important, current database implementation optimize internally (in RAM) for fast access too. So the effect will only happen for databases that serve under high load. Caches (local or distributed) are designed to speed up slow operations, resulting in: if your database is blazingly fast you won't see a benefit.
Anyways designing a system you expect to grow exponentially you should consider caching right from the start. A comprehensive introduction into caching and the behind ideas is available in a caching whitepaper and article I wrote some time ago: https://dzone.com/articles/caching-why-you-should-care
I hope this answers your question :-)

Distributed cache with huge objects ~1-2GB

I have a need to stream a huge dataset, around 1-2GB, but only on demand when they explore the data. For example, if they don't explore parts of the data, I don't want to send it out.
So now, I have a solution that effectively returns JSON only for things they need. The need for a cache arises because these 1-2GB objects are actually constructed in memory from a file or files on disk, so the latency is ~30 seconds if you have to read the file to return this data.
How do I manage such a cache? Basically I think the solution is something like ZooKeeper or such where I store the physical machine name which holds the cache and then forward my rest request to that.
Would you guys also consider this to be the right model? I wonder what kind of checks will I have to do such that if the node that has the cache goes down, I can still fullfil the request without an error, but higher latencies.
Has anybody developed such a system? All the solutions out there are for seemingly small rows or objects.

https://github.com/golang/groupcache is used for bigger things, but although it's used by http://dl.google.com, I'm not sure how it would do with Multi-gigabyte objects.
On the other hand, HTTP can do partial transfers and will be very efficient at that. Take a look ar Varnish, or Nginx.

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 234 – 237 bytes of storage.

Ultimately, it all boils down on how you define efficiency needed on part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory hog requirement, some form of persistent storage seems a reaonsable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears very RAM-centric
to make me feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory hog problem, unless you set up a cluster of machines that each holds a part of the overall data. This makes sense only if keeping all data in-memory is needed due to low response times requirements. Yet, given the nature of your jobs, taking tens of seconds to complete, response times, respective to workers, hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
*Requirements
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requestes/per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more more data per job?)
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.

I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/

Is it feasible to use a distributed cache for queryable data sets?

My scenario is as follows. I have a data table with a million rows of tuples (say first name and last name), and a client that needs to retrieve a small subset of rows whose first name or last name begins with the query string. Caching this seems like a catch-22, because:
On the one hand, I can't store and retrieve the entire data set on every request (would overwhelm the network)
On the other hand, I can't just store each row individually, because then I'd have no way to run a query.
Storing ranges of values in the cache, with a local "index" or directory would work... except that, you'd have to essentially duplicate the data for each index, which defeats the purpose of even using a distributed cache.
What approach is advisable for this kind of thing? Is it possible to get the benefits of using a distributed cache, or is it simply not feasible for this kind of scenario?

Distributed Caching, is feasible for query-able data sets.
But for this scenario there should either be native function or procedure that would give much faster results. If different scope are not possible like session or application then it would be much of iteration required on server side for fetching the data for each request.
Indexing on server side then of Database is never a good idea.
If still there are network issues. You could go ahead for Document Oriented or Column Oriented NoSQL DB. If feasible.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio