As I'm well aware, using AppendTo[] on large lists isn't recommended, since the function gets progressively slower with each append.
Lots of suggestions talk about using Reap[] and Sow[]; however, the ones I found don't deal with nested lists.
My question is relatively simple: how do I substitute AppendTo[] with Reap[] and Sow[] as directly as possible?
Specific information regarding my problem: I would like to append data of the form
data = {{x},{y}}
to a list, so after a few iterations, the list would look something like
list = {{{x1},{y1}},{{x2},{y2}},{{x3},{y3}}}
My attempt works for only one iteration and then breaks down because of its increasing use of Flatten[], so it's obviously not a working solution. Of course, I'm open to other alternatives that are faster than AppendTo[].
Thanks!
Related
I have algorithms that work with dynamically growing lists (contiguous memory, like a C++ vector, Java ArrayList, or C# List). Until recently, these algorithms would insert new values into the middle of the lists. Of course, this was usually a very slow operation: every time an item was added, all the items after it had to be shifted to a higher index. Do this a few times for each algorithm and things get really slow.
My realization was that I could add the new items to the end of the list and then rotate them into position later. That's one option!
Another option, when I know how many items I'm adding ahead of time, is to add that many items to the back, shift the existing items, and then perform the algorithm in-place in the hole I've made for myself. The downside is that I have to add some default values to the end of the list and then just overwrite them.
I did a quick analysis of these options and concluded that the second option is more efficient. My reasoning was that the rotation with the first option would result in in-place swaps (requiring a temporary). My only concern with the second option is that I am creating a bunch of default values that just get thrown away. Most of the time, these default values will be null or a mem-filled value type.
However, I'd like someone else familiar with algorithms to tell me which approach would be faster. Or, perhaps there's an even more efficient solution I haven't considered.
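For concreteness, here is a minimal Java sketch of the first option (append to the end, then rotate into place); the values and the insertion index are made up purely for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AppendThenRotate {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(List.of(0, 1, 2, 3, 7, 8, 9));
        List<Integer> newItems = List.of(4, 5, 6); // items to insert
        int insertPos = 4;                         // where they belong

        // Append to the end (cheap)...
        list.addAll(newItems);
        // ...then rotate the tail so the new items land at insertPos.
        Collections.rotate(list.subList(insertPos, list.size()), newItems.size());

        System.out.println(list); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}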
Arrays aren't efficient for lots of insertions or deletions into anywhere other than the end of the array. Consider whether using a different data structure (such as one suggested in one of the other answers) may be more efficient. Without knowing the problem you're trying to solve, it's near-impossible to suggest a data structure (there's no one solution for all problems). That being said...
The second option is definitely the better of the two. A somewhat better variation (avoiding the default-value issue): to insert 456 into the middle of 0123789, simply copy the tail 789 to the end and then overwrite the middle 789 with 456. The only intermediate state would be 0123789789.
Your default-value concern is, however, (generally) not a big issue:
In Java, for one, you cannot (to my knowledge) even allocate memory for an array that isn't 0- or null-filled. C++ STL containers also enforce this, I believe (but not raw C++ arrays).
The size of a pointer is minimal compared to any moderately sized class, so assigning it a default value also takes minimal time. (In Java and C# everything is a reference; in C++ you can use pointers, though something like boost::shared_ptr or a pointer vector is preferable to raw pointers.) This doesn't apply to primitives, but those are small to start with, so they generally aren't a big issue either.
I'd also suggest forcing a reallocation to a specified size before you start inserting at the end of the array (Java's ArrayList.ensureCapacity or C++'s vector::reserve). In case you didn't know: variable-length-array implementations tend to keep an internal array that's bigger than what size() returns or what's accessible, in order to avoid constantly reallocating memory as you insert or delete values.
Also note that there are more efficient methods to copy parts of an array than doing it manually with for loops (e.g. Java's System.arraycopy).
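To make that 0123789789 intermediate step concrete, here is a rough Java sketch using System.arraycopy; the numbers are just the example from this answer, and a real dynamic array would grow via ensureCapacity/reserve rather than Arrays.copyOf:

import java.util.Arrays;

public class HoleInsert {
    public static void main(String[] args) {
        int[] src = {0, 1, 2, 3, 7, 8, 9};
        int[] newItems = {4, 5, 6};
        int insertPos = 4;

        // Grow the array by the number of new items.
        int[] dst = Arrays.copyOf(src, src.length + newItems.length);

        // Copy the tail (7, 8, 9) to the end: intermediate state 0123789789.
        System.arraycopy(src, insertPos, dst, insertPos + newItems.length,
                src.length - insertPos);

        // Overwrite the "hole" in the middle with the new items.
        System.arraycopy(newItems, 0, dst, insertPos, newItems.length);

        System.out.println(Arrays.toString(dst)); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}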
You might want to consider changing your representation of the list from using a dynamic array to using some other structure. Here are two options that allow you to implement these operations efficiently:
An order statistic tree is a modified type of binary search tree that supports insertions and deletions anywhere in O(log n) time, as well as lookups by index in O(log n) time. This will increase your memory usage quite a bit because of the pointer and bookkeeping overhead, but it should dramatically speed up insertions. However, it will slow down lookups a bit compared to an array's O(1) indexing.
If you always know the insertion point in advance, you could consider switching to a linked list instead of an array, and just keep a pointer to the linked list cell where insertions will occur. However, this slows down random access to O(n), which could possibly be an issue in your setup.
Alternatively, if you always know where insertions will happen, you could represent your array as two stacks: one stack holding the contents of the array to the left of the insertion point, and one holding the reverse of the elements to the right of it. This makes insertions fast, and with the right kind of stack implementation it can keep random access fast as well (rough sketch below).
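If it helps, here is a rough Java sketch of that two-stack idea (essentially a gap buffer); the class and method names are mine, just for illustration:

import java.util.ArrayList;
import java.util.List;

class TwoStackList<T> {
    private final List<T> left = new ArrayList<>();   // elements before the insertion point, in order
    private final List<T> right = new ArrayList<>();  // elements after the insertion point, reversed

    // Insert at the current insertion point: O(1) amortized.
    void insert(T value) {
        left.add(value);
    }

    // Move the insertion point one step right: O(1).
    void moveRight() {
        if (!right.isEmpty()) {
            left.add(right.remove(right.size() - 1));
        }
    }

    // Move the insertion point one step left: O(1).
    void moveLeft() {
        if (!left.isEmpty()) {
            right.add(left.remove(left.size() - 1));
        }
    }

    // Random access stays O(1): index into whichever half holds position i.
    T get(int i) {
        return i < left.size()
                ? left.get(i)
                : right.get(right.size() - 1 - (i - left.size()));
    }

    int size() {
        return left.size() + right.size();
    }
}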
Hope this helps!
HashMaps and linked lists were designed for the problem you are having. Given an indexed data structure with numbered items, inserting an item in the middle requires renumbering every item after it.
You need a data structure that is optimized to make inserts constant, O(1), operations. HashMaps were designed to make insert and delete operations lightning quick regardless of dataset size.
I can't pretend to do the HashMap subject justice by describing it. Here is a good intro: http://en.wikipedia.org/wiki/Hash_table
I'm a bit confused about which data structure I can use to do the following tasks rather fast:
Save tuples (can be changed to contain a keyword). Will be something like {UserInfo, Time, TimeLvl}
Remove element knowing the tuple (or the keyword)
Update all contained elements, changing one of the tuple's elements about once a second (TimeLvl will get higher the longer the user waits).
The contained data will change a lot as users come and go.
What would be the best data-structure for this use case?
Take a look at this article: Key-Value stores.
Then decide which of the data structures presented is best suited for you.
The article also provides a benchmark.
I personally like gb_trees, which is quite fast and easy to use.
Take a look at gproc
It should do what you want; it is very efficient and made by one of the creators of Erlang, so it's robust enough.
You can check some of gproc's capabilities here; then you will know if it fits your problem.
EDIT 1 :
After further searching, updating the value of a gproc entry can be done with gproc:set_value(Key, Value).
EDIT 2 :
So you will use:
gproc:reg({n, l, YourKey}, YourValue) %% YourValue will be the tuple
gproc:set_value(YourKey, YourValue)
gproc:unreg({n, l, YourKey})
I fail to understand why using listFindNoCase() and listFind() is the preferred way of doing a series of OR and IS/EQ comparisons. Wouldn't the JVM be able to optimize it and produce efficient code, rather than making a function call that has to deal with tokenizing a string? Or is CF doing something much more inefficient?
Use listFindNoCase() or listFind() instead of the is and or operators
to compare one item to multiple items. They are much faster.
http://www.adobe.com/devnet/coldfusion/articles/coldfusion_performance.html
The answer is simple: type conversion. You can compare 2 EQ "2", or now() EQ "2011-01-01", or true EQ "YES". The cost of converting (to multiple types) and comparing is quite high.
ListFind() does not need to try multiple conversions, so it is much faster.
This is the price of dynamic typing.
I find this odd too. The only thing I can think of is that the list elements are added to a fast collection that checks whether an element exists based on some good hash of the elements it contains. This would indeed be faster for large or very large lists; smaller lists should show little or no speed boost.
I've got a wordlist which is 56GB and I would like to remove duplicates.
I've tried to approach this in Java, but I run out of space on my laptop after 2.5M words.
So I'm looking for an (online) program or algorithm which would allow me to remove all duplicates.
Thanks in advance,
Sir Troll
edit:
What I did in Java was put the words into a TreeSet so they would be ordered and stripped of duplicates.
I think the problem here is the huge amount of data. As a first step I would try to split the data into several files: e.g. make a file for every first character, putting words starting with 'a' into a.txt, words starting with 'b' into b.txt, and so on:
a.txt
b.txt
c.txt
...
Afterwards I would try using default sorting algorithms and check whether they work with the size of the files. After sorting, cleaning out the doubles should be easy.
If the files are still too big, you can split on more than one character (a rough sketch of the bucketing step follows the list below), e.g.:
aa.txt
ab.txt
ac.txt
...
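A rough Java sketch of that bucketing step (the file names and the single-character prefix are just the example above; error handling is minimal, and a 56GB input may well need a longer prefix):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SplitByPrefix {
    public static void main(String[] args) throws IOException {
        Map<Character, BufferedWriter> buckets = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("wordlist.txt"), StandardCharsets.UTF_8)) {
            String word;
            while ((word = reader.readLine()) != null) {
                if (word.isEmpty()) continue;
                char first = Character.toLowerCase(word.charAt(0));
                // One output file per first character: a.txt, b.txt, ...
                BufferedWriter out = buckets.computeIfAbsent(first, c -> {
                    try {
                        return Files.newBufferedWriter(
                                Paths.get(c + ".txt"), StandardCharsets.UTF_8);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                out.write(word);
                out.newLine();
            }
        }
        for (BufferedWriter out : buckets.values()) {
            out.close();
        }
    }
}

Each bucket should then be small enough to sort and dedupe on its own (for example with sort -u, or with an in-memory set).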
Frameworks like MapReduce or Hadoop are perfect for such tasks. You'll need to write your own map and reduce functions, although I'm sure this must've been done before; a quick search on Stack Overflow gave this.
I suggest you use a Bloom Filter for this.
For each word, check whether it's already present in the filter; if not, insert it (or rather, some good hash value of it).
It should be fairly efficient, and you shouldn't need to give it more than a gigabyte or two to keep the false-positive rate practically negligible (a false positive here means a unique word gets dropped as if it were a duplicate; a Bloom filter never produces false negatives). I leave it to you to work out the math.
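If you go this route, here is a minimal sketch assuming Guava's BloomFilter is available; the expected-insertions count and error rate are guesses you'd have to tune, and note that a false positive silently drops a unique word:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BloomDedup {
    public static void main(String[] args) throws IOException {
        // Guesses: tune to the real number of distinct words and the memory you can spare.
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000_000,  // expected distinct words
                0.001);         // target false-positive probability (roughly 1.8 GB of bits)

        try (BufferedReader in = Files.newBufferedReader(
                     Paths.get("wordlist.txt"), StandardCharsets.UTF_8);
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(
                     Paths.get("deduped.txt"), StandardCharsets.UTF_8))) {
            String word;
            while ((word = in.readLine()) != null) {
                // put() returns true only if the word was (probably) not seen before.
                if (seen.put(word)) {
                    out.println(word);
                }
            }
        }
    }
}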
I do like the divide-and-conquer comments here, but I have to admit: if you're running into trouble with 2.5M words, something's going wrong with your original approach. Even if we assume each word is unique within those 2.5M (which basically rules out that we're talking about text in a natural language), and assuming each word is on average 100 Unicode characters long, we're at 500 MB for storing the unique strings, plus some overhead for the set structure. Meaning: you should be doing really fine, since those numbers are already heavily overestimated. Maybe before installing Hadoop you could try increasing your heap size?
You know the functionality in Excel where, when you type 3 rows with a certain pattern and drag the column all the way down, Excel tries to continue the pattern for you.
For example
Type...
test-1
test-2
test-3
Excel will continue it with:
test-4
test-5
test-n...
The same works for some other patterns, such as dates and so on.
I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:
test-blue-somethingelse
test-yellow-somethingelse
test-red-somethingelse
Now, based on these entries, I want to say that the pattern is:
test-[DYNAMIC]-something
Continuing the [DYNAMIC] part with other colours is a whole other deal; I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.
I need to detect this from a large pool of entries. Assume you've got 10,000 strings with these kinds of patterns, and you want to group the strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).
Document classification can be useful in this scenario but I'm not sure where to start.
UPDATE:
I forgot to mention that it's also possible to have multiple [DYNAMIC] parts.
Such as:
test_[DYNAMIC]12[DYNAMIC2]
I don't think it's important, but I'm planning to implement this in .NET; any hint about which algorithms to use would be quite helpful.
As soon as you start considering finding the dynamic parts of patterns of the form <const1><dynamic1><const2><dynamic2>... without any other assumptions, you need to find the longest common subsequence (LCS) of the sample strings you have provided. For example, if I have test-123-abc and test-48953-defg, then the LCS would be test- and -. The dynamic parts would then be the gaps between the pieces of the LCS. You could then look up each dynamic part in an appropriate data structure.
The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.
Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does, which only seems to allow patterns of the form <const><dynamic>.
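For what it's worth, here is a minimal Java sketch of the pairwise LCS step (the standard dynamic-programming formulation over characters); note that a character-level LCS can also pick up stray matches inside the dynamic parts (the shared 3 below), which is part of why the grouping is only approximate:

public class Lcs {
    // Longest common subsequence of two strings via the classic DP table.
    static String lcs(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                len[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? len[i - 1][j - 1] + 1
                        : Math.max(len[i - 1][j], len[i][j - 1]);
            }
        }
        // Walk back through the table to recover one LCS.
        StringBuilder sb = new StringBuilder();
        int i = a.length(), j = b.length();
        while (i > 0 && j > 0) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                sb.append(a.charAt(i - 1));
                i--;
                j--;
            } else if (len[i - 1][j] >= len[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(lcs("test-123-abc", "test-48953-defg")); // prints "test-3-"
    }
}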
Finding [DYNAMIC] isn't that big of a deal; you can do it with 2 strings: just start at the beginning and stop when the characters stop being equal, do the same from the end, and voila, you've got your [DYNAMIC].
Something like this (in Java):
String s1 = "asdf-1-jkl";
String s2 = "asdf-2-jkl";

// scan forward until the strings differ
int start = 0;
while (start < s1.length() && start < s2.length()
        && s1.charAt(start) == s2.charAt(start)) {
    start++;
}

// scan backward until the strings differ (without crossing the common prefix)
int e1 = s1.length(), e2 = s2.length();
while (e1 > start && e2 > start && s1.charAt(e1 - 1) == s2.charAt(e2 - 1)) {
    e1--;
    e2--;
}

String dyn1 = s1.substring(start, e1); // "1"
String dyn2 = s2.substring(start, e2); // "2"
About your 10k entries: you would need to call this (or maybe a slightly more optimized version) for each pair to figure out your pattern (10k x 10k calls), and then sort the results by pattern (i.e. save the beginning and the end and sort by those fields).
I think what you need is to compute something like the Levenshtein distance to find the groups of similar strings, and then within each group identify the dynamic part with a typical diff-like algorithm.
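In case it's useful, here is a minimal Java sketch of the Levenshtein distance (the usual dynamic programming with two rolling rows); the grouping threshold and the diff step are left to you:

public class Levenshtein {
    // Edit distance between a and b: O(|a|*|b|) time, O(|b|) space.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,  // insertion
                                 prev[j] + 1),     // deletion
                        prev[j - 1] + cost);       // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Strings sharing the test-...-somethingelse pattern come out close together.
        System.out.println(distance("test-blue-somethingelse", "test-red-somethingelse"));
    }
}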
Google Docs might be better than Excel for this sort of thing, believe it or not.
Google has collected massive amounts of data on sets; for example, in the case you gave it would recognise blue, red, yellow, ... as part of the set 'colours'. It has far more complete pattern recognition than Excel, so it would stand a better chance of continuing the pattern.