When to use references versus types versus boxes and slices versus vectors as arguments and return types? - memory-management

I've been working with Rust the past few days to build a new library (related to abstract algebra) and I'm struggling with some of the best practices of the language. For example, I implemented a longest common subsequence function taking &[&T] for the sequences. I figured this was Rust convention, as it avoided copying the data (T, which may not be easily copy-able, or may be big). When changing my algorithm to work with simpler &[T]'s, which I needed elsewhere in my code, I was forced to put the Copy type constraint in, since it needed to copy the T's and not just copy a reference.
So my higher-level question is: what are the best-practices for passing data between threads and structures in long-running processes, such as a server that responds to queries requiring big data crunching? Any specificity at all would be extremely helpful as I've found very little. Do you generally want to pass parameters by reference? Do you generally want to avoid returning references as I read in the Rust book? Is it better to work with &[&T] or &[T] or Vec<T> or Vec<&T>, and why? Is it better to return a Box<T> or a T? I realize the word "better" here is considerably ill-defined, but hope you'll understand my meaning -- what pitfalls should I consider when defining functions and structures to avoid realizing my stupidity later and having to refactor everything?
Perhaps another way to put it is, what "algorithm" should my brain follow to determine where I should use references vs. boxes vs. plain types, as well as slices vs. arrays vs. vectors? I hesitate to start using references and Box<T> returns everywhere, as I think that'd get me a sort of "Java in Rust" effect, and that's not what I'm going for!

Related

Programming Chess in Functional programming

A year ago I programmed a chess AI using the Alphabeta prunning algorithm. This was relatively straight forward to do in c++. One of the main issues I considered while doing this was making my code efficient. I did this by using having a data type I called a "game" that I passed around through the search tree made by the algorithm. To increase efficiency I didn't ever copy the "game" data type but rather mutated it while keeping the nessisary information needed to return it to its previous states.
Recently I have been reading about functional programming and the concept of purely using functions that do not change the state of the parameters they are passed appeals to me. I am wondering how I would using the paradigm of functional programming while still taking efficiency of the program into account.
In OOP the solution seems quite straight forward (which is what I implemented) while in functional programming it seems that copying data types is nessisary which decreases efficiency. Is it possible to use functional programming without this loss of efficiency?
In functional programming, data structures are not always copied completely. In many cases, only the part that changes needs to be copied, while the old parts can be referenced (since no mutation is allowed, this is safe).
The article on persistant data structures describes this in more detail.
Jephron's answer points out the important fact that only small parts of a persistent data structure need to get updated, thus the bigger part is shared between the old state and the new state.
To be honest, this would still be slower than a mutation in most cases.
But immutable, persistent data structures have other advantages. Let's assume you have already completed the playing engine. And now, you want to implement a history (for example to allow the player to undo earlier moves). This is dead simple: Just remember all states in a list. You'll find that you need to touch only a few functions to take a list of states instead of just the last state, and you're done. You don't need to worry about compromising your game engine --- there is no global variable or something you could destroy.
Another thing is taking advantage of the many CPU cores you probably have by employing parallelism. Needless to say that you can't let many tasks, threads, fibers or whatever operate on a single mutable data structure. This would just become a synchronization nightmare, and your code would probably go slower even. However, there simply are no synchronization problems on immutable data, as they are read only for all threads.
This could very well speed up your code in such a way that it dwarfs the C++ solution, even if "doing a move" on a functional data structure is much slower than on mutable data.
Here is an example for changing a board game (TTT) from single threaded to parallel: https://dierk.gitbooks.io/fregegoodness/content/src/docs/asciidoc/incremental_episode4.html

Any documentation/article about the `&MyType{}` pattern in golang?

In most golang codebases I look, people are using types by reference:
type Foo struct {}
myFoo := &Foo{}
I usually take the opposite approach, passing everything as copy and only pass by reference when I want to perform something destructive on the value, which allows me to easily spot destructive functions (and which is fairly rare).
But seeing how references are commonplace, I guess it's not just a matter of taste. I get there's a cost in duplicating values, is it that much of a game changer? Or are there other reasons why references are preferred?
It would be great if someone could point me to an article or documentation about why references are preferred.
Thanks!
Go is pass by value. I try to use references like in your example as much as possible to remove the mental process of thinking about not making duplicates of objects. Go is mostly meant for networking & scaling, which makes performance a priority. Obvious downside of this is as you say, receiving methods can destroy the object that the pointer points to.
Otherwise there is no rule as to which you should use. Both are quite ok.
Also, somewhat related to the question, from the Go docs: Pointers vs. Values

When to reuse functions?

I have a function in my program that generates random strings.
func randString(s []rune, l int) string
s is a slice of runes containing the possible characters in the string. I pass
in a rune slice of both capital and lowercase alphabetic characters. l
determines the length of the string. This works great. But I also need to
generate random hex strings for html color codes.
It seems all sources say that it's good programming practice to reuse code. So I
made another []rune that held [1-9a-f] and feed that into randString. That
was before I realized that the stdlib already inclues formatting verbs for int
types that suit me perfectly.
In practice, is it better to reuse my randString function or code a separate
(more efficient) function? I would generate a single random int and Sprintf it
rather than having to loop and generate 6 random ints which randString does.
1) If there is an exact solution in the standard library, you should like always choose to use that.
Because:
The standard library is tested. So it does what it says (or what we expect it to do). Even if there is a bug in it, it will be discovered (by you or by others) and will get fixed without your work/effort.
The standard library is written as idiomatic Go. Chances are it's faster even if it does a little more than what you need compared to the solution you could write.
The standard library is (or may) improve by time. Your program may get faster just because an implementation was improved in a new Go release without any effort from your part.
The solution is presented (which means it's ready and requires no time from you).
The standard library is well and widely known, so your code will be easier to understand by others and by you later on.
If you're already imported the package (or will in the near future), this means zero or minimal overhead as libraries are statically linked, so the function you need is already linked to your program (to the compiled executable binary).
2) If there is a solution provided by the standard library but it is a general solution to similar problems and/or offers more than what you need:
That means it's more likely not the optimal solution for you, as it may use more memory and/or work more slowly as your solution could be.
You need to decide if you're willing to sacrifice that little performance loss for the gains listed above. This also depends how and how many times you need to use it (e.g. if it's a one-time, it shouldn't matter, if it's in an endless loop called very frequently, it should be examined carefully).
3) And at the other end: you should avoid using a solution provided by the standard library if it wasn't designed to solve your problem...
If it just happens that its "side-effect" solves your problem: Even if the current implementation would be acceptable, if it was designed for something else, future improvements to it could render your usage of it completely useless or could even break it.
Not to mention it would confuse other developers trying to read, improve or use your code (you included, after a certain amount of time).
As a side note: this question is exactly about the function you're trying to create: How to generate a random string of a fixed length in golang? I've presented mutiple very efficient solutions.
This is fairly subjective and not go-specific but I think you shouldn't reuse code just for the sake of reuse. The more code you reuse the more dependencies you create between different parts of your app and as result it becomes more difficult to maintain and modify. Easy to understand and modify code is much more important especially if you work in a team.
For your particular example I would do the following.
If a random color is generated only once in your package/application then using fmt.Sprintf("#%06x", rand.Intn(256*256*256)) is perfectly fine (as suggested by Dave C).
If random colors are generated in multiple places I would create function func randColor() string and call it. Note that now you can optimize randColor implementation however you like without changing the rest of the code. For example you could have implemented randColor using randString initially and then switched to a more efficient implementation later.

How do you write data structures that are as efficient as possible in GHC? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
So sometimes I need to write a data structure I can't find on Hackage, or what I find isn't tested or quality enough for me to trust, or it's just something I don't want to be a dependency. I am reading Okasaki's book right now, and it's quite good at explaining how to design asymptotically fast data structures.
However, I am working specifically with GHC. Constant factors are a big deal for my applications. Memory usage is also a big deal for me. So I have questions specifically about GHC.
In particular
How to maximize sharing of nodes
How to reduce memory footprint
How to avoid space leaks due to improper strictness/laziness
How to get GHC to produce tight inner loops for important sections of code
I've looked around various places on the web, and I have a vague idea of how to work with GHC, for example, looking at core output, using UNPACK pragmas, and the like. But I'm not sure I get it.
So I popped open my favorite data structures library, containers, and looked at the Data.Sequence module. I can't say I understand a lot of what they're doing to make Seq fast.
The first thing that catches my eye is the definition of FingerTree a. I assume that's just me being unfamiliar with finger trees though. The second thing that catches my eye is all the SPECIALIZE pragmas. I have no idea what's going on here, and I'm very curious, as these are littered all over the code.
Many functions also have an INLINE pragma associated with them. I can guess what that means, but how do I make a judgement call on when to INLINE functions?
Things get really interesting around line ~475, a section headered as 'Applicative Construction'. They define a newtype wrapper to represent the Identity monad, they write their own copy of the strict state monad, and they have a function defined called applicativeTree which, apparently is specialized to the Identity monad and this increases sharing of the output of the function. I have no idea what's going on here. What sorcery is being used to increase sharing?
Anyway, I'm not sure there's much to learn from Data.Sequence. Are there other 'model programs' I can read to gain wisdom? I'd really like to know how to soup up my data structures when I really need them to go faster. One thing in particular is writing data structures that make fusion easy, and how to go about writing good fusion rules.
That's a big topic! Most has been explained elsewhere, so I won't try to write a book chapter right here. Instead:
Real World Haskell, ch 25, "Performance" - discusses profiling, simple specialization and unpacking, reading Core, and some optimizations.
Johan Tibell is writing a lot on this topic:
Computing the size of a data structure
Memory footprints of common data types
Faster persistent structures through hashing
Reasoning about laziness
And some things from here:
Reading GHC Core
How GHC does optimization
Profiling for performance
Tweaking GC settings
General improvements
More on unpacking
Unboxing and strictness
And some other things:
Intro to specialization of code and data
Code improvement flags
applicativeTree is quite fancy, but mainly in a way which has to do with FingerTrees in particular, which are quite a fancy data structure themselves. We had some discussion of the intricacies over at cstheory. Note that applicativeTree is written to work over any Applicative. It just so happens that when it is specialized to Id then it can share nodes in a manner that it otherwise couldn't. You can work through the specialization yourself by inlining the Id methods and seeing what happens. Note that this specialization is used in only one place -- the O(log n) replicate function. The fact that the more general function specializes neatly to the constant case is a very clever bit of code reuse, but that's really all.
In general, Sequence teaches more about designing persistent data structures than about all the tricks for eeking out performance, I think. Dons' suggestions are of course excellent. I'd also just browse through the source of the really canonical and tuned libs -- Map, IntMap, Set, and IntSet in particular. Along with those, its worth taking a look at Milan's paper on his improvements to containers.

Reimplementing data structures in the real world

The topic of algorithms class today was reimplementing data structures, specifically ArrayList in Java. The fact that you can customize a structure for in various ways definitely got me interested, particularly with variations of add() & iterator.remove() methods.
But is reimplementing and customizing a data structure something that is of more interest to the academics vs the real-world programmers? Has anyone reimplemented their own version of a data structure in a commercial application/program, and why did you pick that route over your particular language's implementation?
Knowing how data structures are implemented and can be implemented is definitely of interest to everyone, not just academics. While you will most likely not reimplement a datastructure if the language already provides an implementation with suitable functions and performance characteristics, it is very possible that you will have to create your own data structure by composing other data structures... or you may need to implement a data structure with slightly different behavior than a well-known data structure. In that case, you certainly will need to know how the original data structure is implemented. Alternatively, you may end up needing a data structure that does not exist or which provides similar behavior to an existing data structure, but the way in which it is used requires that it be optimized for a different set of functions. Again, such a situation would require you to know how to implement (and alter) the data structure, so yes it is of interest.
Edit
I am not advocating that you reimplement existing datastructures! Don't do that. What I'm saying is that the knowledge does have practical application. For example, you may need to create a bidirectional map data structure (which you can implement by composing two unidirectional map data structures), or you may need to create a stack that keeps track of a variety of statistics (such as min, max, mean) by using an existing stack data structure with an element type that contains the value as well as these various statistics. These are some trivial examples of things that you might need to implement in the real world.
I have re-implemented some of a language's built-in data structures, functions, and classes on a number of occasions. As an embedded developer, the main reason I would do that is for speed or efficiency. The standard libraries and types were designed to be useful in a variety of situations, but there are many instances where I can create a more specialized version that is custom-tailored to take advantage of the features and limitations of my current platform. If the language doesn't provide a way to open up and modify existing classes (like you can in Ruby, for instance), then re-implementing the class/function/structure can be the only way to go.
For example, one system I worked on used a MIPS CPU that was speedy when working with 32-bit numbers but slower when working with smaller ones. I re-wrote several data structures and functions to use 32-bit integers instead of 16-bit integers, and also specified that the fields be aligned to 32-bit boundaries. The result was a noticable speed boost in a section of code that was bottlenecking other parts of the software.
That being said, it was not a trivial process. I ended up having to modify every function that used that structure and I ended up having to re-write several standard library functions as well. In this particular instance, the benefits outweighed the work. In the general case, however, it's usually not worth the trouble. There's a big potential for hard-to-debug problems, and it's almost always more work than it looks like. Unless you have specific requirements or restrictions that the existing structures/classes don't meet, I would recommend against re-implementing them.
As Michael mentions, it is indeed useful to know how to re-implement structures even if you never do so. You may find a problem in the future that can be solved by applying the principles and techniques used in existing data structures.

Resources