Imagine you have a number of accounts, each recording an amount of money, and you want to process a very large number of operations that deduct from and add to these amounts.
The volume of transactions is more than a single machine can handle, even though the list of all accounts and their balances would fit on a single machine.
So we shard the accounts: each machine holds a different partition of each account's balance, and the sum over all machines of account X's partitions equals the overall balance.
If a deduct operation arrives and the locally held portion of the account's balance is greater than or equal to the deduction amount, the operation can be processed locally by that machine without any coordination traffic.
If a machine tries to deduct more than it locally holds, it must communicate and coordinate with the other machines.
I am thinking of a recursive, DNS-like architecture: if a machine doesn't hold enough of the balance, it broadcasts the deduction amount to the other machines, and they then coordinate an update so that each machine ends up with the correct new amount after the operation. This is expensive in round-trip time and adds latency.
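To make the design concrete, here is a rough sketch of the per-shard deduct path I have in mind (the Shard class, the peers list, and the try_release RPC are all hypothetical names, and rolling back partially released funds on failure is omitted):

```python
# Hypothetical sketch of the per-shard deduct path: fast path when the local
# portion covers the deduction, coordination with peers otherwise.

class Shard:
    def __init__(self, local_balances, peers):
        self.local_balances = local_balances  # account_id -> locally held portion
        self.peers = peers                    # the other machines holding partitions

    def deduct(self, account_id, amount):
        local = self.local_balances.get(account_id, 0)
        if local >= amount:
            # Fast path: enough held locally, no coordination traffic at all.
            self.local_balances[account_id] = local - amount
            return True

        # Slow path: consume what we hold and ask peers to release the rest.
        shortfall = amount - local
        for peer in self.peers:
            granted = peer.try_release(account_id, shortfall)  # hypothetical RPC
            shortfall -= granted
            if shortfall == 0:
                break

        if shortfall > 0:
            # Insufficient funds cluster-wide; a real implementation would now
            # roll back whatever the peers already released (omitted here).
            return False

        self.local_balances[account_id] = 0
        return True
```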
How would you implement strongly consistent reads in this design?
What is the difference between effective access time and average access time? (Please explain from an "operating system" and a "computer organization" point of view.)
More often than not, we ignore weights when computing arithmetic means, and that is exactly the subtle difference between effective access time and average access time.
Say I have a memory with an access time of 100, and a cache with a hit rate of 90% and an access time of 10. Now we need to find the 'average' access time for the system.
We know that 90% of the time the access time will be 10, and for the remaining 10% of the time it will be 100***. So, effectively, the access time for the system will be (90/100)*10 + (10/100)*100 = 19. This is referred to as the effective access time; in statistical terms, a weighted average.
Average access time simply means the two weights are equal. In other words, the two events are treated as equally probable and therefore contribute equally to the final mean. In that case the average is
(50/100)*10 + (50/100)*100 = (1/2)(10 + 100) = 55, which is the average we have been using all along (add the two and divide by 2).
*** The access time will be more than 100 since we need to account for the cache search time as well and also the bus latency. The example is just cooked up and does not represent accurate modeling of the access time.
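For concreteness, the two calculations from the example above in a few lines of Python:

```python
# Effective (weighted) vs. plain average access time for the toy numbers above.
hit_rate = 0.90
t_cache, t_memory = 10, 100

effective = hit_rate * t_cache + (1 - hit_rate) * t_memory  # weighted average
average = 0.5 * t_cache + 0.5 * t_memory                    # equal weights

print(effective)  # 19.0
print(average)    # 55.0
```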
What deterministic algorithm is suitable for the following resource allocation/scheduling problem?
Consider a set of players: P1, P2, P3 and P4. Each player receives data from a cell tower (e.g. in a wireless network). The tower transmits data in 1 second blocks. There are 5 blocks. Each player can be scheduled to receive data in an arbitrary number of the blocks.
Now, the amount of data a player receives in a block is a constant (C) divided by the number of other players scheduled in the same block (because the bandwidth must be shared). A greedy approach would allocate every player to every block, but then the data each player receives per block would be reduced.
How can we find an allocation of players to time blocks so that the amount of data delivered by the network is maximised? I have tried a number of heuristic methods on this problem (genetic algorithms, simulated annealing) and they work well. However, I'd like to solve for the optimum schedule.
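Since the instance here is tiny (4 players, 5 blocks), one deterministic option is plain exhaustive search; the sketch below assumes each scheduled player receives C divided by the total number of players in the block, so swap in your exact sharing model if "other players" means something different. For larger instances you would reformulate this as an integer program and hand it to a solver.

```python
# Exhaustive search over every player-to-block assignment.
# 4 players x 5 blocks is only 2^20 (~1M) assignments, so brute force is feasible.
from itertools import product

PLAYERS, BLOCKS, C = 4, 5, 1.0

def total_data(assignment):
    # assignment[p][b] is True if player p is scheduled in block b.
    total = 0.0
    for b in range(BLOCKS):
        scheduled = [p for p in range(PLAYERS) if assignment[p][b]]
        if scheduled:
            # Assumed model: each scheduled player gets C / (players in the block).
            total += len(scheduled) * (C / len(scheduled))
    return total

best_value, best_assignment = -1.0, None
for flat in product((False, True), repeat=PLAYERS * BLOCKS):
    assignment = [flat[p * BLOCKS:(p + 1) * BLOCKS] for p in range(PLAYERS)]
    value = total_data(assignment)
    if value > best_value:
        best_value, best_assignment = value, assignment

print(best_value)
print(best_assignment)
```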
Pretend I have a client service that needs true random Integer values (4 bytes) every 10 seconds.
As such, I acquire a piece of hardware that generates true random values based on atmospheric noise. The device can generate up to 8 bytes of random data per second.
As it stands now, every 10 seconds my client service can query the device and pull 4 bytes out of the 8 generated bytes. The value is used by the client service instantly and is considered truly random.
Now pretend I instantiate 3 new client services (total of 4), running the same algorithm. The services are all synchronized together, so they will query the device at the same time.
What happens now is that, at the 10 second mark, only 2 of the services (out of 4) will receive a random value immediately, and the other 2 services will have to wait up to 1 full second before receiving their value. This is undesirable.
Since I'd rather maximize the use of my expensive device, I come up with this solution: the software sitting on the server (where the device is connected) will actually be collecting all values from the device, and store them in a queue (which will be dequeued automatically if it grows too big for the RAM). Now, when a client service makes a query, the random value will be dequeued from that queue instead of pulled directly from the device. Like before, each random value is only used once, but in this case, some of the values in the queue could have been sitting there for a long time.
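Roughly, the server-side buffering I have in mind looks like this (a minimal sketch; device.read_bytes stands in for whatever the hardware driver actually exposes):

```python
# Sketch of the server-side buffer: the producer drains the device into a
# bounded queue, and each client query dequeues one 4-byte value.
import collections
import threading
import time

MAX_VALUES = 256  # cap so the buffer cannot grow without bound

queue = collections.deque(maxlen=MAX_VALUES)  # one 4-byte value per slot
lock = threading.Lock()

def producer(device):
    # Continuously pull up to 8 random bytes per second from the device.
    while True:
        raw = device.read_bytes(8)  # placeholder for the real driver call
        for i in range(0, len(raw), 4):
            with lock:
                queue.append(raw[i:i + 4])  # oldest values fall off when full
        time.sleep(1)

def next_random_int():
    # Called by a client service; each value is handed out exactly once.
    with lock:
        value = queue.popleft()
    return int.from_bytes(value, "big")
```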
I fear that I might be doing things the wrong way with this solution. I can't shake the nagging feeling that, by using a value that was generated in the past and is not 'fresh', I am somehow turning this back into a pseudo-random generator. Are my fears correct, or unsubstantiated?
At the end of the day, the only thing that matters is that your random function produces a sequence of statistically random values.
To that end, it doesn't matter whether your implementation gets the values one at a time, or all at once and puts them in a queue. As long as they are sufficiently random, they are fine.
I have to agree with Colonel Thirty Two. It's not as if the client services can pick and choose a specific time or value that they would like to use. As long as the value has not already been used it shouldn't be an issue.
Something you could do is clear the queue of generated values after all 4 services have requested a new value from the device. Since you said they are all synchronized and query at the same time, this would happen every 10 seconds, which is ample time to generate at least 4 new random values.
I am very new to the parallel computing world. My group uses Amazon EC2 and S3 to manage all our data, and it has really opened up a new world to me.
My question is how to estimate the cost of a computation. Suppose I have n TB of data in k files on Amazon S3 (for example, 0.5 TB of data in 7000 zip files), and I would like to loop through all the files and perform one regex-matching operation, written in Pig Latin, on each line of the files.
I am very interested in estimating these costs:
How many instances should I select to perform this task? What capacity should each instance have (the size of the master instance and of the map-reduce instances)? Can I deduce these capacities and costs from n and k, as well as from the cost of each operation?
I have designed an example data flow: I used one xlarge instance as my master node and 10 medium instances as my map-reduce group. Would this be enough?
How can I maximize the bandwidth for each of these instances to fetch data from S3? From my designed data flow, it looks like the read speed from S3 is about 250,000,000 bytes per minute. How much data exactly is transferred to each EC2 instance? Would this be the bottleneck of my job flow?
1- IMHO, it depends solely on your needs. Choose the instance types based on the intensity of the computation you are going to perform; you can obviously cut down the cost based on your dataset and on how much computation you will run against it.
2- For how much data? What kind of operations? Latency/throughput? For POCs and small projects it seems good enough.
3- It actually depends on several things, such as whether you're in the same region as your S3 endpoint and the particular S3 node you're hitting at a given point in time. You might be better off using EBS if you need quicker data access, IMHO: you could mount an EBS volume to your EC2 instance and keep the data you need frequently right there. Otherwise, some straightforward options are 10 Gigabit connections between servers or perhaps dedicated (costly) instances. But nobody can guarantee whether data transfer will be a bottleneck; sometimes it may be.
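As a rough sanity check on the numbers in the question (0.5 TB of input, roughly 250,000,000 bytes per minute per instance, 10 medium instances in the example data flow):

```python
# Back-of-envelope: how long just reading the data from S3 would take.
total_bytes = 0.5 * 10**12            # 0.5 TB of input
rate_per_instance = 250_000_000 / 60  # ~250 MB per minute, as bytes per second
instances = 10                        # the medium map-reduce group from the question

seconds = total_bytes / (rate_per_instance * instances)
print(seconds / 3600)                 # roughly 3.3 hours spent on I/O alone
```

At those rates the S3 reads could easily dominate a simple regex pass over the data.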
I don't know if this answers your cost queries completely, but their Monthly Calculator certainly would.
I'm not even sure the following is possible, but it never hurts to ask:
I have two nodes running the same application. Each machine needs a sequence generator that will give it a number between 0 and 1e6. If a node has used a number, the other node must not use it. The generator should reset every night at midnight. No number should be used twice in the same day, even if a machine restarts. We'd like to avoid any solution involving databases, distributed caches or filesystems. Let's assume we will never need more than 1e6 numbers per day. The numbers do not have to be used in sequence.
So far we have thought of the following:
1) Machine A uses odd numbers, machine B uses even numbers.
Pros: no shared state.
Cons: a machine might run out of numbers when there are plenty left. If a machine restarts, it will reuse previously used numbers.
2) Machine A counts up from 0 to 1e6, machine B counts down from 1e6 to 0.
Pros: no shared state. Guarantees that all available numbers will be consumed before running into problems.
Cons: doesn't scale to more than two machines. Same problem when a machine restarts.
What do you think? Is there a magic algorithm that will fulfill our requirements without needing to write anything to disk?
No number should be used twice in the same day, even if a machine restarts.
Since you don't want to use any persistent state, this suggests to me that the number must depend on the time somehow. That is the only way in which the algorithm can tell two distinct startups apart. Can't you just use a combination (node, timestamp) for sufficiently fine timestamps, instead of your numbers?
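For example, something along these lines (a minimal sketch assuming the clock is the only state you trust; a small guard handles two requests landing in the same millisecond):

```python
# Identifiers are (node, milliseconds-since-midnight) pairs, so nothing needs
# to be remembered across restarts and the sequence naturally "resets" at midnight.
import datetime

NODE_ID = 0      # 0 for machine A, 1 for machine B
_last_ms = -1

def next_id():
    global _last_ms
    now = datetime.datetime.now()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    ms = int((now - midnight).total_seconds() * 1000)
    if ms <= _last_ms:          # two requests in the same millisecond
        ms = _last_ms + 1
    _last_ms = ms
    return (NODE_ID, ms)        # unique per node per day, no disk involved
```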
Why not just have a small service that hands out IDs upon request? This scales to more than one machine and doesn't require a change to the clients if you need to change the ID allocation algorithm. It is rather simple to implement and quite easy to maintain going forward.
I really think the best way would be to have one machine that hands out numbers on request (maybe even number ranges, if you want to avoid too many queries) and writes them out to disk.
If you're really against that, you could be rather clever with method 1, provided you can guarantee the rate at which numbers are consumed. For example, a machine could use the current time to determine where in its range to begin, i.e. if it's noon, begin at the middle of the range. This can be tweaked if you can put an upper limit on the number of values generated per second (or per some generic time interval). It still has the problem of wasted numbers and is pretty convoluted just to avoid writing a single number to disk.
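A sketch of that tweak (MAX_PER_SECOND is an assumed worst-case consumption rate, and the nightly reset is assumed to come from restarting the process at midnight):

```python
# Method 1 with a time-derived starting offset: machine A hands out even
# numbers, machine B odd ones, and a restart skips past anything that could
# already have been used earlier in the day.
import datetime

NODE_ID = 0              # 0 -> even numbers, 1 -> odd numbers
RANGE = 1_000_000
MAX_PER_SECOND = 5       # assumed upper bound on consumption rate

def _seconds_since_midnight():
    now = datetime.datetime.now()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((now - midnight).total_seconds())

# On (re)start, jump to the first number that cannot have been used yet today.
_counter = _seconds_since_midnight() * MAX_PER_SECOND

def next_number():
    global _counter
    n = 2 * _counter + NODE_ID   # interleave: A gets evens, B gets odds
    _counter += 1
    if n >= RANGE:
        raise RuntimeError("number range exhausted for today")
    return n
```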