Cache distribution exercise as presented at Google's Hash Code 2017 - algorithm

I'm currently trying to find an efficient solution to the problem stated in this document Hash Code 2017 - Streaming Videos.
TL;DR: To minimize the latency of YouTube videos, cache servers with limited
capacity are used. However, not every cache is connected to every
endpoint, and not every endpoint requests the same videos. The goal is
to minimize the overall latency of the whole network.
My approach was to simply iterate through each endpoint and each request block and find the optimal cache with the most latency reduction per video size (I'll just call it request density).
When the optimal cache has already reached its capacity, I try to store the video in exchange for videos with less request density, or use a different cache if there is no other possibility (note that the data center is also a cache in my model).
def distribute_video_requests(endpoint, excluding_caches=None):
    # use None instead of a mutable default so separate top-level calls don't share state
    if excluding_caches is None:
        excluding_caches = set()
    caches = endpoint.cache_connections - excluding_caches
    for vr in endpoint.video_requests:
        optimal_cache = find_optimum(caches, vr)
        exchange = try_put(optimal_cache, vr)
        if exchange["conflicting"]:
            excluding_caches.add(optimal_cache)
            for elm in exchange["affected"]:
                # redistribute the requests that were displaced from the full cache
                distribute_video_requests(elm["from"], excluding_caches)

for ep in endpoints:
    distribute_video_requests(ep)
You could visualize it as the Brazil nut effect, where the video requests are pieces of different density that get sorted in a stack.
The reason I'm explaining all of this is that I can't really tell whether my solution is decent, and if it isn't: what are better approaches for this?

If somebody gives you a proposed solution, one thing you could do is pick one of the cache servers, empty it, and then try to work out the best way to fill it up again to get a solution at least as good as the proposed one.
I think this is the knapsack problem, so it will not be easy to find an efficient exact solution to this, or to the original problem.
There are decent approximations to the knapsack problem so I think it might be worth programming it up and throwing it at the solutions from your method. If it can't improve on the original solution much, congratulations! If it can, you have another solution method - keep running the knapsack problem to adjust the contents of each cache server until you can't find any more improvements.
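To make that refinement loop concrete, here is a rough sketch (not the contest reference solution) of the "empty one cache and re-fill it" step as a 0/1 knapsack, assuming integer video sizes as in the problem statement. The helpers latency_saving(v) and video_size(v) are placeholders for whatever your own model provides; latency_saving(v) should return the total latency saved by putting video v into this cache given the rest of the current solution.

def refill_cache(capacity, candidate_videos, latency_saving, video_size):
    # dp[c] = (best total latency saving using at most c MB, chosen videos)
    dp = [(0, frozenset())] * (capacity + 1)
    for v in candidate_videos:
        size, value = video_size(v), latency_saving(v)
        if size > capacity or value <= 0:
            continue  # skip videos that cannot fit or save nothing
        for c in range(capacity, size - 1, -1):   # iterate capacity downwards (0/1 knapsack)
            candidate = dp[c - size][0] + value
            if candidate > dp[c][0]:
                dp[c] = (candidate, dp[c - size][1] | {v})
    return dp[capacity]   # (best saving, set of videos to store in this cache)

Running this for one cache at a time, and repeating until no cache's contents change, gives the local-search procedure described above.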

I've actually solved this problem using basic OOP, stream-based data reading and writing, and basic loops.
My solution is in fact available at: https://github.com/TheBlackPlague/YouTubeCache .
The solution is coded in PHP simply because I wanted an interpreted language to do this quickly in, rather than a compiled one. However, it can easily be ported to any other language to speed up execution times.

Related

How to estimate the CPU/memory needed for a backend app? How do you do it? Or is this just impossible?

I need to estimate the server (CPU/memory) needed for my backend app. I would appreciate it if I could know how to do that.
If you are interested, here are my (failed) attempts:
Thought 1: Stress testing. The problem is that it seems hard to estimate the following quantities without a large error: for example, DAU, average QPS, peak QPS, request distribution (e.g. how many people are requesting the hot data?), data size, and data distribution (e.g. what is the working set size?). Then, suppose I have a 2x error on every single item, and I have 5 items that are multiplied together: I will end up with a 32x error, which is huge. Even if the stress testing itself is perfect, I already have a 32x error because the estimation of these quantities is coarse.
Thought 2: Look at the resources needed by similar backends. Even if I find such "similar" backends, IMHO it is still hard because: a different language (Java/Go/Python/...) can be a lot faster or slower; they can have a different implementation or complexity; the code may contain different (slow) SQL queries; they can have a different amount of data; the QPS can be different; etc.
Hopefully, my thoughts are wrong or there are other methods!

How could P2P search engines prevent corruption of the distributed index by malicious peers?

As a hobby I'm writing a simple and primitive distributed web search engine, and it occurred to me that it currently has no protection against malicious peers trying to skew search results.
The current architecture of the project stores the inverted index and ranking factors in a Kademlia DHT, with peers updating this inverted index as they crawl the web.
I've used Google Scholar in an attempt to find a solution, but it seems most authors of proposed P2P web search engines ignore the above-mentioned problem.
I think I need some kind of reputation system or trust metric, but my knowledge in this domain is lacking, and I would very much appreciate a few pointers.
One way you could avoid this is to only use reliable nodes for storing and retrieving values. The reliability of a node will have to be computed by known-good nodes, and it could be something like the similarity of a node's last few computed ranking factors compared to the same ranking factors computed by known-good nodes (i.e. compare the node's scores for google.com to known-good scores for google.com). Using this approach, you'll need to avoid the "rogue reliable node" problem (for example, by using random checks or reducing all reliability scores randomly).
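As a rough illustration of that reliability computation (all names here - probe_urls, known_good_score, node.report_score - are hypothetical, not part of any existing system), a node's score could simply be the fraction of probe URLs for which its reported ranking factors agree with the known-good values within some tolerance:

def update_reliability(node, known_good_score, probe_urls, tolerance=0.1):
    # Reliability in [0, 1]: fraction of probes where the peer agrees with trusted nodes.
    agreements = 0
    for url in probe_urls:
        reported = node.report_score(url)    # score the peer claims for this URL
        trusted = known_good_score(url)      # score computed by known-good nodes
        if abs(reported - trusted) <= tolerance * max(abs(trusted), 1e-9):
            agreements += 1
    return agreements / len(probe_urls)

Randomly rotating the probe URLs and occasionally decaying all scores helps with the "rogue reliable node" problem mentioned above.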
Another way you could approach this is to duplicate the computation of ranking factors across multiple nodes, fetch all of the values at search time, and rank them on the client side (using variance, for example). You could also limit searches to sites that have more than 10 duplicate values computed, so that there is some delay before new sites are ranked. Additionally, any nodes with values outside of the normal range could be reported by the client in the background, and their reliability scores could be computed this way. This approach is time-consuming for the end user (unless you replicate known-good results to known-good nodes for faster lookups).
Also, take a look at this paper which describes a sybil-proof weak-trust system (which, as the authors explain, is more robust than the impossible sybil-proof strong-trust system): http://www.eecs.harvard.edu/econcs/pubs/Seuken_aamas14.pdf
The problem you are describing is the Byzantine Generals' problem, or Byzantine fault tolerance. You can read more about it on Wikipedia, and there are plenty of papers written about it.
I don't remember the exact algorithm, but basically it is mathematically proven that to tolerate t traitors (malicious peers) you need at least 3t + 1 peers in total.
My general thought is that this means a huge implementation overhead and a lot of wasted resources on the indexing side, and while there is plenty of research left to do in distributed indexing and distributed search, not many people are tackling it yet. On the other hand, the problem has basically been solved by the Byzantine Generals' work; it "just" needs to be implemented on top of an existing (and working) distributed search engine.
If you don't mind having a time delay on index updates, you could opt for a blockchain algorithm similar to what Bitcoin uses to secure funds.
Changes to the index (deltas only!) can be represented in a text or binary file format, and crunched by peers who accept a given block of deltas. A malicious peer would have to out-compute the rest of the network for a period of time in order to skew the index in their favor.
I believe the Bitcoin hashing algorithm (SHA-256) is flawed in that custom hardware renders the common user's hardware useless. A blockchain using Litecoin's algorithm (scrypt) would work well, because CPUs and GPUs remain effective tools for the computation.
You would adjust the difficulty accordingly, so that new blocks are produced on a fairly regular schedule -- maybe every 2-5 minutes. A user of the search engine could choose to use an index that is at least 30 minutes old, to guarantee that enough users in the network have vouched for its contents.
more info:
https://en.bitcoin.it/wiki/Block_chain
https://en.bitcoin.it/wiki/Block_hashing_algorithm
https://litecoin.info/block_hashing_algorithm
https://www.coinpursuit.com/pages/bitcoin-altcoin-SHA-256-scrypt-mining-algorithms/

Efficient way to represent locations, and query based on proximity?

I'm pondering over how to efficiently represent locations in a database, such that given an arbitrary new location, I can efficiently query the database for candidate locations that are within an acceptable proximity threshold to the subject.
Similar things have been asked before, but I haven't found a discussion based on my criteria for the problem domain.
Things to bear in mind:
Starting from scratch, I can represent data in any way (eg. long&lat, etc)
Any result set is time-sensitive, in that it loses validity within a short window of time (~5-15mins) so I can't cache indefinitely
I can tolerate some reasonable margin of error in results, for example if a location is slightly outside of the threshold, or if a row in the result set has very recently expired
A language agnostic discussion is perfect, but in case it helps I'm using C# MVC 3 and SQL Server 2012
A couple of first thoughts:
Use an external API like Google's; however, this would generate thousands of requests and the latency would be poor
Use the Haversine formula; however, this looks expensive and so should only be applied to a minimal number of candidates (possibly even as a stored procedure!)
Build a graph of postcodes/zipcodes, such that from any node I can find the postcodes/zipcodes that border it; however, this could involve a lot of data to store
Some optimization ideas to reduce possible candidates quickly:
Cache result sets for searches, and when we do subsequent searches, see if the subject is within an acceptable range to a candidate we already have a cached result set for. If so, use the cached result set (but remember, the results expire quickly)
I'm hoping the answer isn't just raw CPU power, and that there are some approaches I haven't thought of that could help me out?
Thank you
ps. Apologies if I've missed previously asked questions with helpful answers, please let me know below.
What about using a geohash? (Refer to http://en.wikipedia.org/wiki/Geohash.)
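To expand on that a little (this is a hedged sketch, not the only way to do it): nearby points usually share a geohash prefix, so you can store a geohash per row and pre-filter candidates with a simple prefix query (e.g. LIKE 'u4pruy%'), then run the exact Haversine check only on the survivors. A minimal hand-rolled encoder looks like this:

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash base32 alphabet

def geohash_encode(lat, lon, precision=7):
    # Alternately bisect the longitude and latitude ranges, emitting one bit
    # per step and packing every 5 bits into one base32 character.
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, chars, even = 0, 0, [], True
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744))   # 'u4pruyd' (Wikipedia's example coordinates)

One caveat: two points that are close together but on opposite sides of a cell boundary can end up with different prefixes, so you normally also query the neighbouring cells before applying the exact distance check.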

What is the difference between an on-line and off-line algorithm?

These terms were used in my data structures textbook, but the explanation was very terse and unclear. I think it has something to do with how much knowledge the algorithm has at each stage of computation.
(Please, don't link to the Wikipedia page: I've already read it and I am still looking for a clarification. An explanation as if I'm twelve years old and/or an example would be much more helpful.)
Wikipedia
The Wikipedia page is quite clear:
In computer science, an online algorithm is one that can process its
input piece-by-piece in a serial fashion, i.e., in the order that the
input is fed to the algorithm, without having the entire input
available from the start. In contrast, an offline algorithm is given
the whole problem data from the beginning and is required to output an
answer which solves the problem at hand. (For example, selection sort
requires that the entire list be given before it can sort it, while
insertion sort doesn't.)
Let me expand on the above:
An offline algorithm requires all information BEFORE the algorithm starts. In the Wikipedia example, selection sort is offline because step 1 is Find the minimum value in the list. To do this, you need to have the entire list available - otherwise, how could you possibly know what the minimum value is? You cannot.
Insertion sort, by contrast, is online because it does not need to know in advance which values it will sort; the information is requested WHILE the algorithm is running. Simply put, it can grab new values at every iteration.
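A tiny illustration of that contrast (a sketch of mine, not taken from the textbook): insertion sort can consume its input one value at a time, while selection sort cannot even start until the whole list is available.

def insertion_sort_online(stream):
    sorted_so_far = []
    for x in stream:                          # values arrive one by one
        i = len(sorted_so_far)
        while i > 0 and sorted_so_far[i - 1] > x:
            i -= 1
        sorted_so_far.insert(i, x)            # place the new value immediately
    return sorted_so_far

def selection_sort_offline(values):
    values = list(values)                     # the entire input must be available up front
    for i in range(len(values)):
        j = min(range(i, len(values)), key=values.__getitem__)
        values[i], values[j] = values[j], values[i]
    return values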
Still not clear?
Think of the following examples (for four year olds!). David is asking you to solve two problems.
In the first problem, he says:
"I'm, going to give you two balls of different masses and you need to
drop them at the same time from a tower.. just to make sure Galileo
was right. You can't use a watch, we'll just eyeball it."
If he gave you only one ball, you'd probably look at him and wonder what you're supposed to be doing. After all, the instructions were pretty clear: you need both balls at the beginning of the problem. This is an offline algorithm.
For the second problem, David says
"Okay, that went pretty well, but now I need you to go ahead and kick
a couple of balls across a field."
He gives you the first ball. You kick it. Then he gives you the second ball and you kick that one. He could also give you a third and a fourth ball (without you even knowing that he was going to give them to you). This is an example of an online algorithm. As a matter of fact, you could be kicking balls all day.
I hope this was clear :D
An online algorithm processes the input only piece by piece and doesn't know about the actual input size at the beginning of the algorithm.
An often-used example is scheduling: you have a set of machines and an unknown workload. Each machine has a specific speed. You want to clear the workload as fast as possible. But since you don't know the whole input from the beginning (you can often see only the next item in the queue), you can only estimate which machine is best for the current item. This can result in a non-optimal distribution of the workload, since you cannot make any assumptions about the input data.
An offline algorithm on the other hand works only with complete input data. All workload must be known before the algorithm starts processing the data.
Example:
Workload:
1. Unit (Weight: 1)
2. Unit (Weight: 1)
3. Unit (Weight: 3)
Machines:
1. Machine (1 weight/hour)
2. Machine (2 weights/hour)
Possible result (Online):
1. Unit -> 2. Machine // 2. Machine has now a workload of 30 minutes
2. Unit -> 2. Machine // 2. Machine has now a workload of one hour
either
3. Unit -> 1. Machine // 1. Machine has now a workload of three hours
or
3. Unit -> 2. Machine // 2. Machine has now a workload of 2.5 hours
==> the work is done after 2.5 hours
Possible result (Offline):
1. Unit -> 1. Machine // 1. Machine has now a workload of one hour
2. Unit -> 1. Machine // 1. Machine has now a workload of two hours
3. Unit -> 2. Machine // 2. Machine has now a workload of 1.5 hours
==> the work is done after 2 hours
Note that the better result in the offline case is only possible because we do not use the faster machine from the start: we already know that a heavy unit (unit 3) is coming, so that unit should be processed by the fastest machine.
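If it helps, here is a small sketch that reproduces the numbers above (the helper code is mine, not the original poster's): the online part assigns each unit greedily to the machine that would finish it soonest (breaking ties towards the faster machine, as in the example), while the offline part simply tries every assignment because it knows the whole workload in advance.

from itertools import product

weights = [1, 1, 3]     # units of work, as in the example above
speeds = [1, 2]         # machine speeds in weight/hour

def makespan(assignment):
    # assignment[i] is the machine index chosen for unit i
    load = [0.0] * len(speeds)
    for unit, machine in zip(weights, assignment):
        load[machine] += unit / speeds[machine]
    return max(load)

# Online: decide for each unit as it arrives, without knowing what comes next.
online, load = [], [0.0] * len(speeds)
for w in weights:
    best = min(range(len(speeds)), key=lambda m: (load[m] + w / speeds[m], -speeds[m]))
    load[best] += w / speeds[best]
    online.append(best)

# Offline: with the whole workload known in advance, brute-force every assignment.
offline = min(product(range(len(speeds)), repeat=len(weights)), key=makespan)

print("online makespan:", makespan(online))    # 2.5 hours
print("offline makespan:", makespan(offline))  # 2.0 hours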
An offline algorithm knows all about its input data the moment it is invoked. An online algorithm, on the other hand, can get parts or all of its input data while it is running.
An algorithm is said to be online if it does not know the data it will be executing on beforehand. An offline algorithm may see all of the data in advance.
An on-line algorithm is one that receives a sequence of requests and performs an immediate action in response to each request.
In contrast, an off-line algorithm performs its actions only after all of the requests have been received.
This paper by Richard Karp gives more insight into on-line and off-line algorithms.
We can differentiate offline and online algorithms based on the availability of the inputs prior to the processing of the algorithm.
Offline Algorithm: All input information is available to the algorithm and is processed by it as a whole. With the complete set of input information, the algorithm finds a way to process the inputs efficiently and obtain an optimal solution.
Online Algorithm: Inputs arrive on the fly, i.e. all input information is not available to the algorithm at once but rather piece by piece, as a sequence or over time. Upon the arrival of an input, the algorithm has to take an immediate decision without any knowledge of future inputs. In this process, the algorithm produces a sequence of decisions that will have an impact on the final quality of its overall performance.
E.g. routing in a communication network:
Data packets from different sources arrive at the nearest router. More than one communication link is connected to the router. When a new data packet arrives at the router, the router has to decide immediately which link the packet should be sent on. (Assume all links lead to the destination, all link bandwidths are the same, and all links are part of the shortest path to the destination.) Here, the objective is to assign each incoming data packet to one of the links, without knowing the future data packets, in such a way that the load on each link stays balanced. No link should be overloaded. This is a load balancing problem.
Here, the scheduler implemented in the router has no idea about future data packets, but it has to take a scheduling decision for each incoming packet.
In contrast, an offline scheduler has full knowledge of all incoming data packets; it can then efficiently assign the packets to the different links and optimally balance the load among them.
Cache miss problem: In a computer system, the cache is a memory unit used to bridge the speed mismatch between the faster processor and the slower primary memory. The objective of using a cache is to minimize the average access time by keeping some frequently accessed pages in the cache, under the assumption that these pages may be requested by the processor in the near future. Generally, when a page request is made by the processor, the page is fetched from primary or secondary memory and a copy is stored in the cache. Suppose the cache is full; then the algorithm implemented in the cache has to take an immediate decision about which cache block to replace, without any knowledge of future page requests. The question arises: which cache block should be replaced? (In the worst case, you may replace a cache block and the very next moment the processor requests the block you just replaced.)
So the algorithm must be designed in such a way that it takes an immediate decision upon the arrival of each request, without advance knowledge of the entire request sequence. Algorithms of this type are known as online algorithms.
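As a minimal sketch of such an online decision rule (LRU here, just as an illustration): on every miss the cache immediately evicts the least recently used block, without any knowledge of future requests.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # key -> page, ordered from least to most recently used

    def access(self, key, page):
        # Process one request; returns True on a hit, False on a miss.
        if key in self.blocks:
            self.blocks.move_to_end(key)      # refresh recency on a hit
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block
        self.blocks[key] = page               # admit the newly fetched page
        return False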

Machine scheduling problem

I have a combinatorial problem as follows:
You are given N testers.
Each tester is one of M different types.
Each tester can be configured to use one of P different configs.
You have L lots of products to test.
Each product can only be tested on a specific tester type.
Each product can only be tested by a tester configured with a specific config. Some configs can be applied to multiple products.
Any tester can change its config during production, but each config change incurs an additional time U.
Each lot has a lot size that determines its test time, Q.
Now I need to come up with a lot-scheduling algorithm such that the time to finish testing all lots is minimized.
What are the best approaches to tackle this kind of problem?
It can be modelled as a job-shop scheduling problem (JSP) with setup times, where each job has size 1. Unfortunately, it gets pretty hard to find an optimum when the number of jobs is greater than about 10.
There are many free solver implementations out there that contain Job-Shop as a sample problem:
If you're using C++, Gecode is good. If you're free to choose, ECLiPSe Prolog contains source code for the JSP.
If you can make do with a good solution (instead of an optimal one), I suggest using a greedy algorithm (for the JSP, greedy algorithms typically give solutions within 10% of the optimum - I have some experience with this). I'm going to think about one and get back here (the tricky part is what are called 'setup time constraints', i.e. the constraints that come from changing the tester configuration).
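In the meantime, here is a rough sketch of the greedy idea under some assumptions of mine (the Lot/Tester fields are just one possible way to model the data, and it assumes every lot has at least one compatible tester): process lots longest-first and put each one on the compatible tester that would finish it earliest, paying the reconfiguration time U whenever the tester's current config differs from the lot's required config.

from dataclasses import dataclass, field

@dataclass
class Lot:
    tester_type: str
    config: str
    test_time: float             # Q for this lot

@dataclass
class Tester:
    tester_type: str
    config: str = ""             # current config ("" = not yet configured)
    finish_time: float = 0.0     # time at which this tester becomes free
    schedule: list = field(default_factory=list)

def greedy_schedule(lots, testers, reconfig_time):
    # Longest-lot-first list scheduling; returns the resulting makespan.
    for lot in sorted(lots, key=lambda l: l.test_time, reverse=True):
        compatible = [t for t in testers if t.tester_type == lot.tester_type]
        def finish(t):
            setup = reconfig_time if t.config != lot.config else 0.0
            return t.finish_time + setup + lot.test_time
        best = min(compatible, key=finish)   # tester that finishes this lot earliest
        best.finish_time = finish(best)
        best.config = lot.config
        best.schedule.append(lot)
    return max(t.finish_time for t in testers)

This will not be optimal, but it respects the setup-time constraint and gives a baseline to compare a proper JSP solver against.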
