Sequential data output without writing to disk - algorithm

4kb memory
1gb data
8 bytes per data
Sequential data output without writing to disk.

First, you can accumulate and output at most 4 KB per pass through the data. So you will need at least something like 250,000 passes (1 GB / 4 KB).
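As a rough check on that pass count, here is the arithmetic written out. It assumes binary units (1 GB = 2^30 bytes, 4 KB = 2^12 bytes) and ignores the few bytes needed for bookkeeping, so the real number of passes would be slightly higher.

```python
DATA_BYTES = 1 * 1024**3      # 1 GB of input (binary units assumed)
MEMORY_BYTES = 4 * 1024       # 4 KB of working memory
ITEM_BYTES = 8                # each datum is 8 bytes

total_items = DATA_BYTES // ITEM_BYTES        # number of values in the dataset
items_per_pass = MEMORY_BYTES // ITEM_BYTES   # values that fit in memory at once
min_passes = -(-total_items // items_per_pass)  # ceiling division

print(total_items, items_per_pass, min_passes)
```

With decimal units the same division gives the 250,000 figure quoted above.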
How can you accumulate stuff in a pass?
In pseudocode the idea is like this.
while not done:
    for each data (8 bytes) in dataset:
        if this data has never been output:
            if it might belong in the current batch:
                add to current batch (evicting something else if needed)
    if current_batch not empty:
        sort current batch
        emit current batch
        update "never been output" filter
    else:
        done
What does that filter look like? It needs to know three things:
What is the maximum value so far emitted?
How many times has it been emitted?
How many times has it been seen on this pass?
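Any value below the maximum value gets ignored. A value equal to the maximum is ignored until it has been seen on this pass more times than it has already been emitted; after that, further copies can be added to the current batch. Any value above the maximum is always a candidate.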
Now how about the current batch you're accumulating? That can be a max-heap, which tells you the maximum value in the batch. If the heap is not full, you add the current value to the batch. If the heap is full and the current value is below the maximum in the batch, you add it to the batch and lose the current max.
If the heap is stored in an array, when the batch is done you can remove the max, which will free up the last slot (that's how heaps work), and put the max there. Keep doing that and you'll heapsort the batch in place, with the smallest value first. Now you can easily update the filter, and then emit the batch.
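The filter and the bounded max-heap described above can be sketched in Python as follows. This is only an illustration under some assumptions: `read_all` is a hypothetical callable that returns a fresh iterator over the dataset on each pass, `capacity` stands in for however many 8-byte values fit in the 4 KB budget, and the batch is sorted with `sorted()` rather than the in-place heapsort, for brevity. Python's `heapq` is a min-heap, so values are negated to get max-heap behaviour.

```python
import heapq

def sorted_passes(read_all, capacity):
    """Yield every value in ascending order, holding at most `capacity` values."""
    last_max = None   # maximum value emitted so far
    emitted = 0       # how many copies of last_max have been emitted
    while True:
        heap = []     # current batch, stored negated so heap[0] is the max
        seen = 0      # copies of last_max seen on this pass
        for value in read_all():
            if last_max is not None:
                if value < last_max:
                    continue          # already output on an earlier pass
                if value == last_max:
                    seen += 1
                    if seen <= emitted:
                        continue      # this copy was already output
            if len(heap) < capacity:
                heapq.heappush(heap, -value)
            elif value < -heap[0]:
                heapq.heapreplace(heap, -value)   # evict the batch maximum
        if not heap:
            return                    # nothing new was found: done
        batch = sorted(-x for x in heap)
        top = batch[-1]
        copies = batch.count(top)
        if top == last_max:
            emitted += copies
        else:
            last_max, emitted = top, copies
        yield from batch

data = [5, 3, 5, 1, 5, 2, 5, 5, 3]
print(list(sorted_passes(lambda: iter(data), capacity=2)))
```

Duplicates of the batch maximum that do not fit are picked up on later passes through the `emitted` counter, which is why the filter needs a count and not just the maximum value.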
I don't think you can get significantly more efficient than this.
If I was asked this in an interview, I'd know the answer, but I'd also see being asked the question as a sign that the company's hiring process is suboptimal. This would make me less inclined to be hired there unless there was some purpose I could see to why they hired this way. (I know why FAANGs do. But at most companies I'd call it a red flag.)

Related

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input. Specifically, it will receive a stream of integers of up to nine decimal digits and has to write each of them to a log file. Input data is totally random, and one of the requirements is that the application should not write duplicate items to the log file, and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to find a proper solution to keep track of the duplicate entries, as checking the whole log (text) file every time it writes is certainly not a suitable solution. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the volume of input data can be really high, I don't think that is the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a bitmap in memory with one bit per possible value. For integers of up to nine decimal digits that is 10^9 bits, or about 125 MB (roughly 119 MiB); ten-digit values would need 10 billion bits, about 1.2 GiB. Since this data structure is much larger than the CPU caches, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using atomic bitwise operations (for example an atomic fetch-or).
To check whether a value has been seen before, you read its bit in the bitmap and then set it (as a single atomic operation if done in parallel).
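A minimal single-threaded sketch of that bitmap test-and-set, in Python. The class name `SeenBitmap` and its `num_values` parameter are made up for illustration, and the atomic variant mentioned above is not shown, since plain Python has no atomic fetch-or on a `bytearray`.

```python
class SeenBitmap:
    """One bit per possible value; sequential use only (no atomics here)."""

    def __init__(self, num_values):
        # Round up to a whole number of bytes.
        self.bits = bytearray((num_values + 7) // 8)

    def test_and_set(self, value):
        """Return True if `value` was already marked, then mark it."""
        byte_index = value >> 3
        mask = 1 << (value & 7)
        was_set = bool(self.bits[byte_index] & mask)
        self.bits[byte_index] |= mask
        return was_set

bitmap = SeenBitmap(1000)
print(bitmap.test_and_set(42))  # first time this value is seen
print(bitmap.test_and_set(42))  # second time: a duplicate
```

A value is written to the log only when `test_and_set` returns False; every True result increments the duplicate counter.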
If you know that your stream contains fewer than about one million integers, or that the stream of random integers is not uniformly distributed, you can use a hash-set data structure, as it stores the data more compactly in that case (when used sequentially).
Bloom filters could help you speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
Here is an example using hash-sets in Python:
seen = set()                      # Set of values seen so far
for value in inputStream:         # Iterate over the stream values
    if value not in seen:         # O(1) average-case lookup
        log.write(f"{value}\n")   # First occurrence: write it to the log
        seen.add(value)           # O(1) average-case insertion
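For the Bloom filter option mentioned above, here is a small sketch of the data structure itself. The class, the `num_bits` and `num_hashes` parameters, and the choice of BLAKE2b with double hashing are all assumptions made for illustration, not something the answer prescribes. A negative answer is definite; a positive answer may be a false positive and still has to be confirmed by an exact structure.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive two 64-bit hashes from one digest, then combine them
        # (double hashing) to get num_hashes bit positions.
        digest = hashlib.blake2b(str(value).encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, value):
        # False means "definitely never added"; True means "possibly added".
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(value))

bloom = BloomFilter(num_bits=1 << 16, num_hashes=4)
bloom.add(123456789)
print(bloom.might_contain(123456789))
```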

Redis Stream Managing a time frame

I am writing events into Redis Stream.
But I would like to keep only the last 48 hours of events.
According to the Redis documentation, I can only manage the stream size using MAXLEN, which limits by record count and not by time frame.
Is there any way to use XADD so that, on insertion, records older than the last 48 hours are dropped?
Thanks for the help!
This is yet not clear. I don't like the vanilla way of time capping a stream, that is, "trim by <seconds>", because it means that if there is a delay in the process XADD-ing items, later the next XADD will have to evict things potentially for seconds, causing latency spikes. Moreover it does not make a lot of sense semantically. Your real "capped resource" is memory, so it's not really so important how many items you want to store in the past VS how many items you can store, so the number of items limit makes more sense. Yet in certain applications where there are multiple streams with insertion rates that vary a lot between different producers, it makes sense to cap by time, to avoid wasting memory in certain producers that emit very few entries per unit of time. Maybe at some point I'll add some "best effort" time capping that does not do more work than a given amount, but that eventually will be able to trim the stream, given enough XADD calls.
AFAIK not yet. There were discussions about adding a timestamp cap (to XADD, and possibly to XTRIM as well), but it doesn't look like this feature has been implemented in the latest release candidates.
A possible solution in Node.js based on trimming to a specified key (not on time per se).
https://gist.github.com/jakelowen/22cb8a233ac0cdbb8e77808e17e0e1fc
Proof of concept. Not battle tested.
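Redis versions released after these answers (6.2 and later) added a MINID trimming strategy to XADD and XTRIM, which evicts entries whose ID is below a threshold. Because auto-generated stream IDs start with a millisecond timestamp, a 48-hour window can be expressed as a MINID threshold. The sketch below only builds the command arguments; the function name and parameters are hypothetical, and it assumes the client clock is close enough to the server clock.

```python
import time

WINDOW_MS = 48 * 60 * 60 * 1000  # 48 hours in milliseconds

def xadd_with_time_cap(stream, fields, window_ms=WINDOW_MS, now_ms=None):
    """Build XADD arguments that also trim entries older than the window."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    threshold = now_ms - window_ms
    # "~" asks Redis for approximate (cheaper) trimming.
    args = ["XADD", stream, "MINID", "~", str(threshold), "*"]
    for name, value in fields.items():
        args += [name, value]
    return args

print(xadd_with_time_cap("events", {"type": "click"}))
```

With a client library such as redis-py, the resulting list could be passed to the server with something like `r.execute_command(*args)`; that call is not exercised here.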

Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold on the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the value could vary strongly: it could be anything from a few bytes up to tens of MB. Yet, the majority of values will have a length between 500 and 1000 bytes.
Having read a bit about ChronicleMap, I am amazed about its features and am wondering why I can't find articles describing it being used as a general key-value database. To me there seem to be a lot of advantages of using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in the sorted order by key.
A limitation of the current implementation is that you need to specify in advance the number of elements that are going to be stored in the map. If the actual number isn't close to the specified number, you are going to overuse memory and disk (not very severely though, on Linux systems). If the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, and the performance hit grows linearly as the number of entries grows further. See https://github.com/OpenHFT/Chronicle-Map/issues/105

Given 10 billion URLs with an average length of 100 characters each, check for duplicates

Suppose I have 1 GB of memory available. How can I find the duplicates among those URLs?
I saw one solution in the book "Cracking the Coding Interview". It suggests using hashing to separate these URLs into 4000 files x.txt, where x = hash(u) % 4000, in the first scan. In the second scan, we can check for duplicates in each x.txt file separately.
But how can I guarantee that each file would store about 1 GB of URL data? I think there's a chance that some files would store much more URL data than other files.
My solution to this problem is to apply the file separation trick iteratively until the files are small enough for the memory available to me.
Is there any other way to do it?
If you don't mind a solution which requires a bit more code, you can do the following:
Calculate only the hashcodes. Each hashcode is exactly 4 bytes, so you have perfect control of the amount of memory that will be occupied by each chunk of hashcodes. You can also fit a lot more hashcodes in memory than URLs, so you will have fewer chunks.
Find the duplicate hashcodes. Presumably, they are going to be much fewer than 10 billion. They might even all fit in memory.
Go through the URLs again, recomputing hashcodes, seeing if a URL has one of the duplicate hashcodes, and then comparing actual URLs to rule out false positives due to hashcode collisions. (With 10 billion urls, and with hashcodes only having 4 billion different values, there will be plenty of collisions.)
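The three steps above can be sketched in Python as follows. This is a simplification under stated assumptions: `read_urls` is a hypothetical callable returning a fresh iterator over the URLs, `zlib.crc32` stands in for the 4-byte hashcode, and the hashcode counts are kept in one in-memory `Counter` instead of being processed in fixed-size chunks as the answer suggests for the full 10 billion URLs.

```python
import zlib
from collections import Counter

def hashcode(url):
    # 32-bit hash of the URL; crc32 is used here only as an example.
    return zlib.crc32(url.encode("utf-8"))

def find_duplicate_urls(read_urls):
    # Steps 1 and 2: count hashcodes and keep those seen more than once.
    counts = Counter(hashcode(url) for url in read_urls())
    suspects = {h for h, n in counts.items() if n > 1}

    # Step 3: re-read the URLs and compare actual strings, but only for
    # URLs whose hashcode is suspect, to rule out hash collisions.
    seen, duplicates = set(), set()
    for url in read_urls():
        if hashcode(url) in suspects:
            if url in seen:
                duplicates.add(url)
            else:
                seen.add(url)
    return duplicates

urls = ["a.com/x", "b.com/y", "a.com/x", "c.com/z", "b.com/y", "a.com/x"]
print(sorted(find_duplicate_urls(lambda: iter(urls))))
```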
This is a bit long for a comment.
The truth is, you cannot guarantee that a file is going to be smaller than 1 Gbyte. I'm not sure where the 4,000 comes from. The total data volume is about 1,000 Gbytes, so the average file size would be 250 Mbytes.
It is highly unlikely that you would ever be off by a factor of 4 in size. Of course, it is possible. In that case, just split the file again into a handful of other files. This adds a negligible amount to the complexity.
What this doesn't account for is a simple case. What if one of the URLs has a length of 100 and appears 10,000,000 times in the data? Ouch! In that case, you would need to read a file and "reduce" it by combining each value with a count.
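A small sketch of the file-partitioning approach, including the "reduce with a count" step for heavily repeated URLs. The function name, the `workdir` parameter and the use of `zlib.crc32` are assumptions for illustration. It also assumes URLs contain no newline characters and that the operating system allows `num_buckets` files to be open at once; re-splitting an oversized bucket is left out.

```python
import os
import tempfile
import zlib
from collections import Counter

def duplicates_by_partition(urls, num_buckets, workdir):
    paths = [os.path.join(workdir, f"{i}.txt") for i in range(num_buckets)]
    files = [open(path, "w", encoding="utf-8") for path in paths]
    try:
        # First scan: identical URLs always land in the same bucket file.
        for url in urls:
            bucket = zlib.crc32(url.encode("utf-8")) % num_buckets
            files[bucket].write(url + "\n")
    finally:
        for f in files:
            f.close()

    # Second scan: reduce each bucket to (url, count) and keep count > 1.
    duplicates = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            counts = Counter(line.rstrip("\n") for line in f)
        duplicates.update({u: n for u, n in counts.items() if n > 1})
    return duplicates

with tempfile.TemporaryDirectory() as tmp:
    sample = ["x", "y", "x", "z", "x", "y"]
    print(duplicates_by_partition(sample, num_buckets=4, workdir=tmp))
```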

Parse, replacing large (several thousands) number of records

I've got a class in parse with 1-4k records per user. This needs to be replaced from time to time (actually these are records representing multiple timetables).
The problem I'm facing is that deleting and inserting these records takes a ton of requests. Is there maybe a method to delete and insert a bunch of records that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data into one record, but then I hit the size limit for records (128 KB). Using any sub-format (like a db or file inside a record) would be really tedious, because the app targets nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is facing the req/s limit (or rather, as docs state req/min).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
Well, another possible solution would be to redesign my datasets and use Array columns or something, but I'd rather avoid it if possible.
I think you could try Parse.Object.saveAll which batch processes the save() function.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances
I would use a saveAll/DestroyAll (or DeleteAll?) and anything -All that parse provides in its SDK.
You'd still reach a 1000 objects limit, but to counter that you can loop using the .skip property of a request.
Set a limit of 1000 and skip of 0, do the query, then increase the skip value by the previous limit, and so on. And you'd have 2 or 3 requests of a size of 1000 each time. You stop the loop when your results count is smaller than your limit. If it's not, then you query again and set the skip to the limit x loopcount.
Now you say you're facing size issues, maybe you can reduce that query limit to, say, 400, and your loop would just run for longer until your number of results is smaller than your limit (and then you can stop querying/limiting/skipping/looping or anything in -ing).
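The limit/skip loop described above looks roughly like this. `fetch_page` is a hypothetical stand-in for whatever query call the SDK provides; it is assumed to take a limit and a skip and return a list of at most `limit` results.

```python
def fetch_all(fetch_page, limit=1000):
    """Collect every result by paging with limit/skip."""
    results = []
    skip = 0
    while True:
        page = fetch_page(limit, skip)
        results.extend(page)
        if len(page) < limit:
            return results      # short page: nothing left to fetch
        skip += limit           # next page starts where this one ended

# Stand-in data source for demonstration only.
records = list(range(2500))
fake_fetch = lambda limit, skip: records[skip:skip + limit]
print(len(fetch_all(fake_fetch)))
```

Lowering `limit` (for example to 400) only changes how many iterations the loop needs, not the result.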
Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large amount of small records which add up to significant size (up to 500KB JSON [~1.5MB XML] in my current plans).
So I've chosen a middle path - I implemented sort of vertical partitions.
What I have is a master User record which holds an array of pointers to another class (called Entries). Entries have only 2 fields: the ID of the school record, and data, which is of type Array.
I decided to split "partitions" every 1000 records, which is about 60-70KB per record, but by my calculations may go up to ~100KB.
I also made field names in json 1 letter, cause every letter in 1000 records is like 1 or 2 KB, depending on encoding.
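A rough sketch of that partitioning, written in Python for illustration (the original was implemented in PHP). The field names in `SHORT_NAMES` and the helper names are hypothetical, since the actual schema is not given.

```python
# Hypothetical mapping from long field names to one-letter names.
SHORT_NAMES = {"subject": "s", "room": "r", "start": "t"}

def shorten(record):
    """Rename known fields to their one-letter form to save space."""
    return {SHORT_NAMES.get(key, key): value for key, value in record.items()}

def build_entries(school_id, records, size=1000):
    """Split records into Entries-like dicts of at most `size` records each."""
    return [
        {"school_id": school_id,
         "data": [shorten(r) for r in records[i:i + size]]}
        for i in range(0, len(records), size)
    ]

timetable = [{"subject": "math", "room": i} for i in range(2500)]
entries = build_entries("school-1", timetable)
print([len(entry["data"]) for entry in entries])
```

Replacing a user's timetable then means saving or destroying a handful of Entries records instead of thousands of individual ones.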
Actually that approach made PHP code like twice as fast and there is a lot less usage on network and remote database (1000 times less inserts/destroys basically).
So, that is my solution, if anybody has any other ideas, please post it as answer here, cause probably I'm not the only one with such problem and that certainly isn't the only solution.

Resources