Random number in range without repeating LABVIEW - random

I have this kind of a VI file and I need to add something in order to prevent numbers from repeating. Could you help?

Rather than filtering your list of numbers, you would be better off shuffling an array and taking the first few elements.
Excludes any possibility of duplication
Removes repeated checks
Can be easily adjusted for various ranges
Does not require any loops

Related

Iterative Hash Algorithm for Fast File Check

I want to create a representation of the state of all files in a folder (ignoring order), so that I can send this state to another computer to check if we are in sync. This "state representation" is 3 numbers concatenated by . which are:
sum . product . number of items
The "sum" is the numerical addition all of the file's md5 numerical representations.
The product is the multiplication of all of the file's md5 numerical representations.
The number of items is just the number of files.
The main reason for doing this is that this allows me to create unique states iteratively/quickly when I add or delete a file (a modification being a combination of delete then add). Also, one should end up with the same "state" even if the same set of operations are performed in any random order.
Adding A File
Generate the file's md5
Calculate the md5's numerical value (x).
Add x to the sum
Multiply the product by x
Increment the number of items.
Removing A File
Generate the file's md5
Calculate the md5's numerical value (x).
Subtract x from the sum
Divide the product by x
Decrement the number of items.
Problems
Since the numerical representations of hashes can be quite large, I may have to use a library to generate results using strings rather than integers which may be quite slow.
With the limited testing I have done, I have not been able to create "collisions" where a collision is where two different sets of file hashes could produce the same state (remember that we are ignoring the order of the file hashes).
Question
I'm sure that I can't be the first person to want to achieve such a thing. Is there an algorithm or iterative hash function that aims to do the same thing already, preferably in PHP, Java, or Python? Is there a term for this type of thing, all I could think of was "iterative hash"? Is there a flaw with this algorithm that I haven't spotted, such as with "collisions" making generated state representations non-unique?
How many states can your file system take ? infinity for all practical purposes.
How long is your hash length ? short enough to be efficient, finite in any case.
Will I get collisions ? Yes.
So, your hash approach seems fine, particularly if it spreads correctly points that are close, i.e. the state of the fs differing by content of just one file hashes to very different values.
However, you should depend on your hash to produce collisions in the long run, it's a mathematical certainty that probability goes to one that someday you get a collision, given that collision chance is not 0.
So to be really safe, you probably need a full MD5 exchange, if speed and fast updates are the goal your scheme sounds good, but I would back it up with more infrequent exchanges of longer keys, just to be on the safe side if sync is mission critical.

Algorithm for grouping RESTful routes

Given a list of URLs known to be somewhat "RESTful", what would be a decent algorithm for grouping them so that URLs mapping to the same "controller/action/view" are likely to be grouped together?
For example, given the following list:
http://www.example.com/foo
http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
It would group them as follows:
http://www.example.com/foo
http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
Nothing is known about the order or structure of the URLs ahead of time. In my example, it would be somewhat easy since the IDs are obviously numeric. Ideally, I'd like an algorithm that does a good job even if IDs are non-numeric (as in http://www.example.com/products/rocket and http://www.example.com/products/ufo).
It's really just an effort to say, "Given these URLs, I've grouped them by removing what I think it he 'variable' ID part of the URL."
Aliza has the right idea, you want to look for the 'articulation points' (in REST, basically where a parameter is being passed). Looking only for a single point of change gets tricky
Example
http://www.example.com/foo/1/new
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/bar/1/new
These can be grouped several equally good ways since we have no idea of the URL semantics. This really boils down to the question of this - is this piece of the URL part of the REST descriptor or a parameter. If we know what all the descriptors are, the rest are parameters and we are done.
Give a sufficiently large dataset, we'd want to look at the statistics of all URLs at each depth. e.g., /x/y/z/t/. We would count the number of occurrences in each slot and generate a large joint probability distribution table.
We can now look at the distribution of symbols. A high count in a slot means it's likely a parameter. We would start from the bottom, look for conditional probability events, ie., What is the probability of x being foo, then what is the probability y being something given x, etc. etc. I'd have to think more to determine a systematic way to extracting these, but it seems like a promisign start
split each url to an array of strings with the delimiter being '/'
e.g. http://www.example.com/foo/1/edit will give the array [http:,www.example.com,foo,1,edit]
if two arrays (urls) share the same value in all indecies except for one, they will be in the same group.
e.g. http://www.example.com/foo/1/edit = [http:,www.example.com,foo,1,edit] and
http://www.example.com/foo/2/edit = [http:,www.example.com,foo,2,edit]. The arrays match in all indices except for #3 which is 1 in the first array and 2 in the second array. Therefore, the urls belong to the same group.
It is easy to see that urls like http://www.example.com/foo/3 and http://www.example.com/foo/1/edit will not belong to the same group according to this algorithm.

Is there a method to generate a single key that remembers all the string that we have come across

I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is strong the file path of each file in a lo----ong array, and then checking it every time for duplication.
But, I think that there should be some better way,
Is it possible for me to generate a KEY (which is a number) or something, that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist
hash = md5(F name)
if not hash in storage
process file F
store hash in storage to remember
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that should take on a specific value, and from that value you should be able to deduce which files have been processed already? I don't know if you are going to be able to do that, simply from the point that your space is quite big and generating unique key presentations in such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
Bloom filter can solve your problem.
Idea of bloom filter is simple. It begins with having an empty array of some length, with all its members having zero value. We shall have K number of hash functions.
When ever we need to insert an item to the bloom filter, we has the item with all K hash functions. These hash functions would get K indexes on the bloom filter. For these indexes, we need to change the member value as 1.
To check if an item exists in the bloom filter, simply hash it with all of the K hashes and check the corresponding array indexes. If all of them are 1's , the item is present in the bloom filter.
Kindly note that bloom filter can provide false positive results. But this would never give false negative results. You need to tweak the bloom filter algorithm to address these false positive case.
What you need, IMHO, is a some sort of tree or hash based set implementation. It is basically a data structure that supports very fast add, remove and query operations and keeps only one instance of each elements (i.e. no duplicates). A few hundred thousand strings (assuming they are themselves not hundreds of thousands characters long) should not be problem for such a data structure.
You programming language of choice probably already has one, so you don't need to write one yourself. C++ has std::set. Java has the Set implementations TreeSet and HashSet. Python has a Set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hashtable based sets, O(log(n)) for tree based sets). Other than those, there are lots of free implementations of sets as well as general purpose binary search trees and hashtables that you can use.

Algorithm for generating a 'top list' using word frequency

I have a big collection of human generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this?
Don't reinvent the wheel. Use a full text search engine such as Lucene.
The simple/naive way is to use a hashtable. Walk through the words and increment the count as you go.
At the end of the process sort the key/value pairs by count.
the basic idea is simple -- in executable pseudocode,
from collections import defaultdict
def process(words):
d = defaultdict(int)
for w in words: d[w] += 1
return d
Of course, the devil is in the details -- how do you turn the big collection into an iterator yielding words? Is it big enough that you can't process it on a single machine but rather need a mapreduce approach e.g. via hadoop? Etc, etc. NLTK can help with the linguistic aspects (isolating words in languages that don't separate them cleanly).
On a single-machine execution (net of mapreduce), one issue that can arise is that the simple idea gives you far too many singletons or thereabouts (words occurring once or just a few times), which fill memory. A probabilistic retort to that is to do two passes: one with random sampling (get only one word in ten, or one in a hundred) to make a set of words that are candidates for the top ranks, then a second pass skipping words that are not in the candidate set. Depending on how many words you're sampling and how many you want in the result, it's possible to compute an upper bound on the probability that you're going to miss an important word this way (and for reasonable numbers, and any natural language, I assure you that you'll be just fine).
Once you have your dictionary mapping words to numbers of occurrences you just need to pick the top N words by occurrences -- a heap-queue will help there, if the dictionary is just too large to sort by occurrences in its entirety (e.g. in my favorite executable pseudocode, heapq.nlargest will do it, for example).
Look into the Apriori algorithm. It can be used to find frequent items and/or frequent sets of items.
Like the wikipedia article states, there are more efficient algorithms that do the same thing, but this could be a good start to see if this will apply to your situation.
Maybe you can try using a PATRICIA trie or practical algorithm to retrieve information coded in alphanumeric trie?
Why not a simple map with key as the word and the Counter as the Value.
It will give the top used words, by taking the high value counter.
It is just a O(N) operation.

Log combing algorithm

We get these ~50GB data files consisting of 16 byte codes, and I want to find any code that occurs 1/2% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes - it's possible that every code is different.
EPILOGUE: I've selected Darius Bacon as best answer, because I think the best algorithm is a modification of the majority element he linked to. The majority algorithm should be modifiable to only use a tiny amount of memory - like 201 codes to get 1/2% I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it's a two pass algorithm, not one. You need a second pass to tally the counts of the candidates. It's actually easy to see that any solution to this problem must use at least 2 passes (the first batch of elements you load could all be different and one of those codes could end up being exactly 1/2%)
Thanks for the help!
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
this will depend on the distribution of the codes. if there are a small enough number of distinct codes you can build a http://en.wikipedia.org/wiki/Frequency_distribution in core with a map. otherwise you probably will have to build a http://en.wikipedia.org/wiki/Histogram and then make multiple passes over the data examining frequencies of codes in each bucket.
Sort chunks of the file in memory, as if you were performing and external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.
This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.
Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.
You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).
So, you know the proportion of the input associated with each code.
This is basically the pipeline sort * | uniq -c
If every code appears just once, that's no problem; you just need to be able to count them.
That depends on how many different codes exist, and how much memory you have available.
My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, increasing the counter of the respective code, and counting the overall number. Finally, filter all keys with counters that exceed (* overall-counter 1/200).
If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline then into the next process earlier.

Resources