I have a simple question: does the length of the numbers that need to be sorted affect the sorting time?
Example: suppose we need to sort 10 million 6-digit numbers (like 204134) and 10 million 2/3-digit numbers (like 24 or 143), each set individually. Will the set of 6-digit numbers take more time to sort than the one with 2/3-digit numbers?
I assumed the hardware uses one logic gate per digit, so 6 logic gates for 6 digits compared to 2/3 gates for the other set, but I don't know whether this affects the sorting time. Can someone explain this to me?
Any help will be appreciated.
Thanks
The hardware works with bits, not with decimal digits. Furthermore, the hardware always works with the same fixed amount of bits (for a given operation); smaller values are padded. For example, a 32 bit CPU usually has 32 bit comparator units with exactly as much circuitry needed for 32 bit comparisons, and uses those regardless of whether the values currently being compared would fit into fewer bits.
Another issue with your thinking is that the exact number of logic gates doesn't matter much for performance. The propagation time of individual gates is much smaller than a clock cycle; only rather complicated circuits with long dependency chains actually take longer than a single cycle (and even then the work might be pipelined to still get a throughput of 1 op per cycle). A surprisingly large number of logic gates in sequence (and a virtually unlimited number of logic gates in parallel) can easily finish their work within one clock cycle. Hence, a smart 64 bit comparison doesn't take more clock cycles than an 8 bit one.
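As a quick sketch of the point (Python; absolute timings vary by machine, and `sorted` is a comparison sort, so this is illustrative rather than a proof): sorting equally many 2/3-digit and 6-digit integers costs about the same, because each comparison is one full-width machine operation either way.

```python
import random
import time

N = 100_000
small = [random.randint(10, 999) for _ in range(N)]           # 2/3-digit numbers
large = [random.randint(100_000, 999_999) for _ in range(N)]  # 6-digit numbers

t0 = time.perf_counter()
sorted_small = sorted(small)
t1 = time.perf_counter()
sorted_large = sorted(large)
t2 = time.perf_counter()

# Both sorts perform the same number of full-width comparisons; the
# digit count of the values doesn't change the per-comparison cost.
print(f"2/3-digit sort: {t1 - t0:.4f}s, 6-digit sort: {t2 - t1:.4f}s")
```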
The short answer: It depends, but probably not
The longer answer:
It's hard to know, because you haven't said much about the hardware or the sorting algorithm. You mentioned later that you're using an MPI variant of Quicksort. So you're asking if there could be a performance difference between 6-digit and 2/3-digit numbers due to the hardware. Well, if you pack those digits together, then you'll get better bandwidth when transferring the dataset from memory to the processor. Since you haven't mentioned anything about compacted arrays, I'll assume you're not doing this. Once a value is in a register, it will have the same latency and throughput regardless of whether it has 6 digits or 3.
There are algorithms like radix sort that perform differently depending on the number of bits needed for your range of numbers. Since you're not using this, it doesn't apply.
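For contrast, a minimal LSD radix sort sketch (Python; not the questioner's MPI Quicksort) shows how an algorithm's runtime *can* scale with digit count: the number of passes equals the number of digits in the largest value.

```python
def radix_sort(nums, base=10):
    """LSD radix sort: the pass count grows with the number of digits
    in the largest value, so digit length does affect its runtime."""
    if not nums:
        return [], 0
    result, exp, passes = list(nums), 1, 0
    max_val = max(nums)
    while max_val // exp > 0:
        buckets = [[] for _ in range(base)]
        for n in result:
            buckets[(n // exp) % base].append(n)  # stable bucket by current digit
        result = [n for bucket in buckets for n in bucket]
        exp *= base
        passes += 1
    return result, passes

sorted_6, passes_6 = radix_sort([204134, 999999, 100000, 350001])
sorted_3, passes_3 = radix_sort([24, 143, 999, 7])
# The 6-digit input takes 6 passes; the 3-digit input takes 3.
```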
Related
I'm learning about mining, and the first thing that surprised me is that the nonce part of the algorithm, which is supposed to be looped over until you get a hash smaller than the target, is just 32 bits long.
Can you explain why it is then so difficult to loop over an unsigned int, and how come it gets increasingly difficult over time? Thank you.
The task is: try different nonce values in your potential block until you reach a block having a hash value below some given threshold.
I can't find the source right now, but I'm quite sure that since the introduction of special mining ASICs, the 32-bit nonce is no longer enough to keep the miners busy for the planned 10-minute interval between blocks. They are able to compute 4 billion block hashes in less than 10 minutes.
Increasing the difficulty didn't help anymore, as that reached the point where none of the 4 billion possible nonce values gave a hash below the threshold.
So they found some additional fields in the block that are now used as nonce-extension. The principle is still the same: try different values until you reach a block with a hash below the threshold, only now it's more than 32 bits that can be varied, allowing for the threshold to be lowered beyond the former 32-bit-implied barrier.
Because it's not just the 32-bit nonce that is involved in the calculation. The 1 MB of transaction data is also part of the mining input. There is then a non-trivial amount of arithmetic to arrive at the output, which can then be compared with the target.
Bitcoin mining is looping over all 4 billion uints until you find a "right" one.
The way difficulty is increased is that only some of the bits of the output matter. E.g., early on the lowest 11 bits had to be some specific pattern and the remaining 21 bits could be anything. In theory there would then be about 2 million "right" values for each transaction block, uniformly distributed across the range of a uint. Then the "difficulty" is increased so that 13 bits have to match the pattern; now there are 4x fewer "right" answers, so it takes (on average) 4x longer to find one.
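A toy simulation of the idea (Python, with a made-up block payload and a deliberately low difficulty so it finishes quickly; real Bitcoin hashes an 80-byte block header against a finer-grained target):

```python
import hashlib

def mine(block_data: bytes, difficulty_bits: int, max_nonce=2**32):
    """Try nonces until the double-SHA256 of the data starts with
    `difficulty_bits` zero bits (a toy version of the target check)."""
    target = 1 << (256 - difficulty_bits)
    for nonce in range(max_nonce):
        header = block_data + nonce.to_bytes(4, "little")
        digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
    return None  # nonce space exhausted: vary other block fields and retry

nonce, digest = mine(b"toy transaction data", difficulty_bits=16)
# Each extra difficulty bit halves the number of "right" nonces,
# doubling the expected search time.
```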
I have a lookup table with a total of 4G entries; each entry is an arbitrary 32-bit number, and no value ever repeats.
Is there an algorithm that can use the index of each entry together with its 32-bit value to force a fixed bit position of the value to always be zero (so I can use that bit as a flag to log something), such that I can recover the original 32-bit number by a reverse calculation?
Or, stepping back: can I make a fixed bit position of every two consecutive entries always zero?
In short, my question is: is there a universal code that can save 1 bit on each arbitrary 32-bit value, so I can use that bit as a lock flag? Alternatively, is there a way to leverage the index and value of a lookup-table entry by some calculation to save 1 bit of storage per value?
It is not at all clear what you are asking. However I can perhaps find one thing in there that can be addressed, if I am reading it correctly, which is that you have a permutation of all of the integers in 0..2^32-1. Such a permutation can be represented in fewer bits than the direct representation, which takes 32 * 2^32 bits. With a perfect representation of the permutations, each would be ceiling(log2((2^32)!)) bits, since there are (2^32)! possible permutations. That length turns out to be about 95.5% of the bits in the direct representation. So each permutation could be represented in about 30.6 * 2^32 bits, effectively taking off more than one bit per word.
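The 95.5% figure can be checked numerically (a sketch using `math.lgamma`, since `ln(N!) = lgamma(N + 1)` avoids computing the factorial itself):

```python
import math

N = 2 ** 32
# log2(N!) via the log-gamma function: lgamma(N + 1) == ln(N!)
perm_bits = math.lgamma(N + 1) / math.log(2)
direct_bits = 32 * N

ratio = perm_bits / direct_bits
bits_per_word = perm_bits / N
print(f"ratio = {ratio:.3f}, bits/word = {bits_per_word:.1f}")
# ratio comes out near 0.955, i.e. about 30.6 bits per 32-bit word
```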
My understanding is that PRNG's work by using an input seed and an algorithm that converts it to a very unrelated output, so that the next generated number is as unpredictable as possible. But here's the problem I see with it:
Any pseudorandom number generator that I can imagine has to have a finite number of outcomes. Let's say that I'm using a random number generator that can generate any number between 0 and one hundred billion. If I call for an output one hundred billion and one times, I can be certain that one number has been output more than once. If the same seed will always give the same output when put through an algorithm, then I can be sure that the PRNG will begin a loop. Where is my logic flawed here?
In the case that I am correct, if you know the algorithm for a PRNG, and that PRNG is being used for cryptography, can not this approach be used (and are there any measures in place to prevent it?):
Use the PRNG to generate the entire looping set of numbers possible.
Know the timestamp of when a private key was generated, and know the time AND output of the PRNG later on
Based on how long it takes to calculate, determine how many numbers are between the known output and the unknown one
Lookup in the pre-generated list to find the generated number
You are absolutely right that in theory that approach can be used to break a PRNG, since, as you noted, given a sufficiently long sequence of outputs, you can start to predict what comes next.
The issue is that "sufficiently long" might be so long that this approach is completely impractical. For example, the Mersenne twister PRNG, which isn't designed for cryptographic use, has a period of 2^19937 - 1, which is so long that it's completely infeasible to attempt the attack that you're describing.
Generally speaking, imagine that a pseudorandom generator uses n bits of internal storage. That gives 2^n possible internal configurations of those bits, meaning that you may need to see 2^n + 1 outputs before you're guaranteed to see a repeat. Given that most cryptographically secure PRNGs use at least 256 bits of internal storage, this makes your attack infeasible.
One detail worth noting is that there's a difference between "the PRNG repeats a number" and "from that point forward the numbers will always be the same." It's possible that a PRNG will repeat an output multiple times before moving on to output a different number next, provided that the internal state is different each time.
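This distinction is easy to demonstrate with a deliberately tiny generator (a 16-bit linear congruential generator that exposes only 8 bits per output; the parameters are chosen to satisfy the Hull-Dobell full-period conditions):

```python
def lcg_stream(seed, steps):
    """16-bit internal state, 8-bit outputs: individual outputs repeat
    quickly, but the sequence only cycles when the *state* repeats."""
    a, c, m = 4093, 1, 2 ** 16  # Hull-Dobell conditions -> full period m
    x, outputs = seed, []
    for _ in range(steps):
        x = (a * x + c) % m
        outputs.append(x >> 8)  # expose only the top 8 bits of the state
    return outputs

outs = lcg_stream(seed=42, steps=2 ** 16)
# By pigeonhole, some 8-bit output must repeat within the first 257
# draws, yet the sequence's period is the full state space, 2**16.
```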
You are correct, a PRNG produces a long sequence of numbers and then repeats. For ordinary use this is usually sufficient. Not for cryptographic use, as you point out.
For ideal cryptographic numbers, we need to use a true RNG (TRNG), which generates random numbers from some source of entropy (= randomness in this context). Such a source may be a small piece of radioactive material on a card, thermal noise in a disconnected microphone circuit or other possibilities. A mixture of many different sources will be more resistant to attacks.
Commonly such sources of entropy do not produce enough random numbers to be used directly. That is where PRNGs are used to 'stretch' the real entropy to produce more pseudo random numbers from the smaller amount of entropy provided by the TRNG. The entropy is used to seed the PRNG and the PRNG produces more numbers based on that seed. The amount of stretching allowed is limited, so the attacker never gets a long enough string of pseudo-random numbers to do any worthwhile analysis. After the limit is reached, the PRNG must be reseeded from the TRNG.
Also, the PRNG should be reseeded anyway after every data request, no matter how small. There are various cryptographic primitives that can help with this, such as hashes. For example, after every data request, a further 128 bits of data could be generated, XOR'ed with any accumulated entropy available, hashed and the resulting hash output used to reseed the generator.
Cryptographic RNGs are slower than ordinary PRNGs because they use slow cryptographic primitives and because they take extra precautions against attacks.
For an example of a CSPRNG, see Fortuna.
It's possible to create truly random number generators on a PC because PCs are nondeterministic machines.
Indeed, with the complexity of the hierarchical memory levels, the intricacy of the CPU pipelines, the coexistence of innumerable processes and threads activated at arbitrary moments and competing for resources, and the asynchronism of the I/O devices, there is no predictable relation between the number of operations performed and the elapsed time.
So looking at the system time every now and then is a perfect source of randomness.
Any pseudorandom number generator that I can imagine has to have a finite number of outcomes.
I don't see why that's true. Why can't it have gradually increasing state, failing when it runs out of memory?
Here's a trivial PRNG algorithm that never repeats:
1) Seed with any amount of data unknown to an attacker as the seed.
2) Compute the SHA512 hash of the data.
3) Output the first 256 bits of that hash.
4) Append the last byte of that hash to the data.
5) Go to step 2.
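The steps above can be sketched directly in Python (the ever-growing `data` buffer is the state, which is why it never cycles but eventually exhausts memory):

```python
import hashlib

def growing_prng(seed: bytes):
    """Generator following the numbered steps above: the state grows by
    one byte per output, so the state space is unbounded and never cycles."""
    data = bytearray(seed)                             # step 1: seed
    while True:
        digest = hashlib.sha512(bytes(data)).digest()  # step 2: SHA512
        yield digest[:32]                              # step 3: first 256 bits
        data.append(digest[-1])                        # step 4: append last byte
                                                       # step 5: loop

gen = growing_prng(b"secret seed material")
outputs = [next(gen) for _ in range(5)]
# The same seed always reproduces the same stream.
```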
And, for practical purposes, this doesn't matter. With just 128 bits of state, you can build a PRNG that won't repeat for 2^128 (about 3.4 x 10^38) outputs. If you pull a billion outputs a second for a billion years, you won't get through a billionth of them.
Both merge sort and quick sort can work in parallel. Each time we split a problem in two sub-problems we can run those sub-problems in parallel. However it looks sub-optimal.
Suppose we have 4 CPUs. On the 1st iteration we split the problem into only 2 sub-problems and two CPUs are idle. On the 2nd iteration all CPUs are busy, but on the 3rd iteration we do not have enough CPUs. So, we should adapt the algorithm for the case when CPUs << log(N).
Does it make sense? How would you adapt the sorting algorithms to these cases?
First off, the best parallel implementation will depend highly on the environment. Some factors to consider:
Shared Memory (a 4-core computer) vs. Not Shared (4 single-core computers)
Size of data to sort
Speed of comparing two elements
Speed of swapping/moving two elements
Memory available
Is each computer/core identical, or are there differences in speed, network latency between parts, cache effects, etc.?
Fault tolerance: what if one computer/core breaks down in the middle of the operation?
etc.
Now moving back to the theoretical:
Suppose I have 1024 cards, and 7 other people to help me sort them.
Merge Sort
I quickly split the stack into 8 sections of somewhat equal size. It won't be perfectly equal since I am going fast. Actually since my friends can start sorting their part as soon as they get their section, I should give my first friend a stack bigger than the rest and get smaller towards the end.
Each person sorts their part however they like sequentially. (radix sort, quick sort, merge sort, etc.)
Now for the hard part ... merging.
In real life I would probably have the first two people that are ready form a pair and start merging their decks together. Perhaps they could work together, one person merging from the front and the other from the back. Perhaps they could both work from the front while calling their numbers out.
Soon enough other people will be done with their individual sorting, and can start merging. I would have them form pairs as they find convenient and keep going until all the cards are merged.
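The card-sorting scheme above can be sketched in Python (a structural sketch only: CPython threads won't speed up pure-Python comparisons because of the GIL, and `heapq.merge` does a single k-way merge rather than the pairwise merging the friends do):

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_merge_sort(cards, workers=8):
    """Split the deck into `workers` stacks, sort each stack in its own
    worker (one 'friend' per stack), then merge the sorted stacks."""
    if not cards:
        return []
    chunk = -(-len(cards) // workers)  # ceiling division
    stacks = [cards[i:i + chunk] for i in range(0, len(cards), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_stacks = list(pool.map(sorted, stacks))
    # heapq.merge plays the role of the friends pairing up to merge
    return list(heapq.merge(*sorted_stacks))

deck = random.sample(range(10_000), 1024)
result = parallel_merge_sort(deck)
```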
Quick Sort
The real trick here is to try to parallelize the partitioning, since the rest is pretty easy to do.
I will start by breaking the stack into 8 parts, and hand one part out to each friend. While doing this, I will choose one of the cards that looks like it might end up towards the middle of the sorted deck. I call out that number.
Each of my friends will partition their smaller stack into three piles, less than the called out number, equal to the called out number, and greater than the called out number. If one friend is faster than the others, he/she can steal some cards from a neighboring friend.
When they are finished with that, I collect all the less thans into one pile and give that to friends 0 through 3, I set aside the equal to's, and give the greater's to friends 4 through 7.
Friends 0 through 3, will divide their stack into four somewhat equal parts, will choose a card to partition around, and repeat the process amongst themselves.
This repeats until each friend has their own stack.
(Note that if the partitioning card wasn't chosen well, rather than dividing up the work 50-50, maybe I would only assign 2 friends to work on the less thans, and let the other 6 work on the greater thans.)
At the end, I just collect all of the stacks in the right order, along with the partition cards.
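A sequential sketch of this partitioning scheme (Python; the per-slice loop marks the work each 'friend' would do concurrently, and recursing on both outer piles stands in for handing them to the two groups of friends):

```python
def parallel_quicksort(cards, workers=8):
    """Three-way partition done slice by slice (each slice is one
    friend's stack), then recurse on the outer piles. A real parallel
    version would run the slice loop and the two recursions concurrently."""
    if len(cards) <= 1:
        return list(cards)
    pivot = cards[len(cards) // 2]          # the card whose value I call out
    chunk = max(1, len(cards) // workers)
    less, equal, greater = [], [], []
    for i in range(0, len(cards), chunk):   # one slice per friend
        for c in cards[i:i + chunk]:
            if c < pivot:
                less.append(c)
            elif c == pivot:
                equal.append(c)
            else:
                greater.append(c)
    # Collect the piles in order, with the partition cards in the middle.
    return parallel_quicksort(less, workers) + equal + parallel_quicksort(greater, workers)
```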
Conclusion
While it is true that some approaches are faster on a computer than in real life, I think the preceding is a good start. Different computers or cores or threads will perform their work at different speeds, unless you are implementing the sort in hardware. (If you are, you might want to look into "Sorting Networks" and or "Optimal Sorting Networks").
If you are sorting numbers, you will need a large dataset for parallelizing to help.
However, if you are sorting images by comparing the sum of Manhattan distances between corresponding pixel red/green/blue values, you will find it less difficult to get a speed-up of just under k times with k CPUs.
Lastly, you will want to time the sequential version(s) and compare as you go along, since cache effects, memory usage, network costs, etc. just might make a difference.
I have a function that is engineered as follows:
int brutesearch(startNumber,endNumber);
This function performs a linear search and returns the correct number if one in the range matches my criteria, or null if none is found among the searched numbers.
Say that:
I want to search all 6 digits numbers to find one that does something I want
I can run the brutesearch() function multithreaded
I have a laptop with 4 cores
My question is the following:
What is my best bet for optimising this search? Dividing the number space in 4 segments and running 4 instances of the function one on each core? Or dividing for example in 10 segments and running them all together, or dividing in 12 segments and running them in batches of 4 using a queue?
Any ideas?
Knowing nothing about your search criteria (there may be other considerations created by the memory subsystem), the tradeoff here is between the cost of having some processors do more work than others (e.g., because the search predicate is faster on some values than others, or because other threads were scheduled) and the cost of coordinating the work. A strategy that's worked well for me is to have a work queue from which threads grab a constant/#threads fraction of the remaining tasks each time, but with only four processors, it's pretty hard to go wrong, though the really big running-time wins are in algorithms.
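A minimal sketch of that work-queue strategy (Python threads with a shared cursor; note that CPython's GIL means real speed-up needs processes or a predicate that releases the GIL, and `brutesearch_parallel`, the 2x-threads grab fraction, and the 424242 target are all made up for illustration):

```python
import threading

def brutesearch_parallel(start, end, predicate, threads=4):
    """Each thread repeatedly grabs a fraction of the *remaining* range,
    so faster threads naturally end up doing more of the work."""
    lock = threading.Lock()
    state = {"next": start, "found": None}

    def worker():
        while True:
            with lock:
                if state["found"] is not None or state["next"] >= end:
                    return
                remaining = end - state["next"]
                take = max(1, remaining // (2 * threads))  # grab a fraction
                lo, hi = state["next"], state["next"] + take
                state["next"] = hi
            for n in range(lo, hi):            # linear search on this chunk
                if predicate(n):
                    with lock:
                        state["found"] = n
                    return

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return state["found"]

hit = brutesearch_parallel(100000, 999999, lambda n: n == 424242)
```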
There is no general answer. You need to give more information.
If each comparison is completely independent of the others, there are no opportunities for saving computation via a global resource (say, no global hash tables involved), and your operations are all done in a single stage,
then your best bet is to just divide your problem space by the number of cores you have available, in this case 4, and send 1/4 of the data to each core.
For example, if you had 10 million unique numbers that you wanted to test for primality, or 10 million passwords you were trying to hash to find a match, just divide by 4.
If you have a real world problem, then you need to know a lot more about the underlying operations to get a good solution. For example if a global resource is involved, then you won't get any improvement from parallelism unless you isolate the operations on the global resource somehow.