How does hGetContents achieve memory efficiency?

I want to add Haskell to my toolbox so I'm working my way through Real World Haskell.
In the chapter on Input and Output, in the section on hGetContents, I came across this example:
import System.IO
import Data.Char (toUpper)

main :: IO ()
main = do
    inh <- openFile "input.txt" ReadMode
    outh <- openFile "output.txt" WriteMode
    inpStr <- hGetContents inh
    let result = processData inpStr
    hPutStr outh result
    hClose inh
    hClose outh

processData :: String -> String
processData = map toUpper
Following this code sample, the authors go on to say:
Notice that hGetContents handled all of the reading for us. Also, take a look at processData. It's a pure function since it has no side effects and always returns the same result each time it is called. It has no need to know—and no way to tell—that its input is being read lazily from a file in this case. It can work perfectly well with a 20-character literal or a 500GB data dump on disk. (N.B. Emphasis is mine)
My question is: how does hGetContents or its resultant values achieve this memory efficiency without – in this example – processData "being able to tell", and still maintain all benefits that accrue to pure code (i.e. processData), specifically memoization?
hGetContents inh returns a String, so inpStr is bound to a value of type String, which is exactly the type that processData accepts. But if I understand the authors of Real World Haskell correctly, this string isn't quite like other strings, in that it's not fully loaded into memory (or fully evaluated, if such a thing as a not-fully-evaluated string exists...) by the time of the call to processData.
Therefore, another way to ask my question is: if inpStr is not fully evaluated or loaded into memory at the time of the call to processData, then how can it be used to lookup if a memoized call to processData exists, without first fully evaluating inpStr?
Are there instances of type String that each behave differently but cannot be told apart at this level of abstraction?

The String returned by hGetContents is no different from any other Haskell string. In general, Haskell data is not fully evaluated unless the programmer has taken extra steps to ensure that it is (e.g. seq, deepseq, bang patterns).
Strings are defined as (essentially)
data List a = Nil | Cons a (List a) -- Nil === [], Cons === :
type String = List Char
This means that a string is either empty, or a single character (the head) and another string (the tail). Due to laziness, the tail may not exist in memory, and may even be infinite. Upon processing a String, a Haskell program will typically check if it's Nil or Cons, then branch and proceed as necessary. If the function doesn't need to evaluate the tail, it won't. This function, for example, only needs to check the initial constructor:
safeHead :: String -> Maybe Char
safeHead [] = Nothing
safeHead (x:_) = Just x
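To see that the unevaluated tail really is never touched, the tail can even be undefined. This is a small sketch of my own (not from the book):

```haskell
safeHead :: String -> Maybe Char
safeHead []      = Nothing
safeHead (x : _) = Just x

main :: IO ()
main = do
  -- The tail of this string is undefined, but safeHead never forces it.
  let s = 'h' : undefined
  print (safeHead s)   -- prints: Just 'h'
```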
This is a perfectly legitimate string
allA's = repeat 'a' :: String
that is infinite. You can manipulate this string sensibly, however if you try to print all of it, or calculate the length, or any sort of unbounded traversal your program won't terminate. But you can use functions like safeHead without any problem whatsoever, and even consume some finite initial substring.
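For example (a small sketch of my own): taking a finite prefix of the infinite string terminates, while any unbounded traversal of it would not:

```haskell
allA's :: String
allA's = repeat 'a'   -- an infinite string of 'a' characters

main :: IO ()
main = do
  print (head allA's)     -- prints: 'a'
  print (take 5 allA's)   -- prints: "aaaaa"; note that length allA's would never terminate
```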
Your intuition that something strange is happening is correct, however. hGetContents is implemented using the special function unsafeInterleaveIO, which is essentially a compiler hook into IO behavior. The less said about this, the better.
You're correct that it would be difficult to check if a memoized call to a function exists without having the argument fully evaluated. However, most compilers don't perform this optimization. The problem is that it's very difficult for a compiler to determine when it's worthwhile to memoize calls, and very easy to consume all of your memory by doing so. Fortunately there are several memoizing libraries you can use to add memoization when appropriate.
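To make the idea concrete, here is a minimal hand-rolled memoizer of my own (real libraries such as memoize or MemoTrie are far more general); it caches a function on non-negative Ints in a lazily built list:

```haskell
-- Cache f n in a lazy list, so each value is computed at most once.
memoList :: (Int -> a) -> (Int -> a)
memoList f = (cache !!)
  where
    cache = map f [0 ..]   -- built lazily, shared between all calls

fib :: Int -> Integer
fib = memoList fib'
  where
    fib' 0 = 0
    fib' 1 = 1
    fib' n = fib (n - 1) + fib (n - 2)   -- recursive calls hit the cache

main :: IO ()
main = print (fib 50)   -- prints: 12586269025 (near-instant; naive recursion would not be)
```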

Related

Go Ints and Strings are immutable OR mutable?

What I have read about ints and strings on the internet is that they are immutable by nature.
But the following code shows that after changing the values of these types, they still point to the same address. This contradicts what I learned about the nature of such types in Python.
Can anyone please explain this to me?
Thanks in advance.
package main

import (
    "fmt"
)

func main() {
    num := 2
    fmt.Println(&num)
    num = 3
    fmt.Println(&num) // address of num does not change
    str := "2"
    fmt.Println(&str)
    str = "34"
    fmt.Println(&str) // address of str does not change
}
A number is immutable by nature. 7 is 7, and it won't be 8 tomorrow. That doesn't mean that which number is stored in a variable cannot change. Variables are variable. They're mutable containers for values which may be mutable or immutable.
A Go string is immutable by language design; the string type doesn't support any mutating operators (like appending or replacing a character in the middle of the string). But, again, assignment can change which string a variable contains.
In Python (CPython at least), a number is implemented as a kind of object, with an address and fields like any other object. When you do tricks with id(), you're looking at the address of the object "behind" the variable, which may or may not change depending on what you do to it, and whether or not it was originally an interned small integer or something like that.
In Go, an integer is an integer. It's stored as an integer. The address of the variable is the address of the variable. The address of the variable might change if the garbage collector decides to move it (making the numeric value of the address more or less useless), but it doesn't reveal to you any tricks about the implementation of arithmetic operators, because there aren't any.
Strings are more complicated than integers; they are kind of object-ish internally, being a structure containing a pointer and a size. But taking the address of a string variable with &str doesn't tell you anything about that internal structure, and it doesn't tell you whether the Go compiler decided to use a de novo string value for an assignment, or to modify the old one in place (which it could, without breaking any rules, if it could prove that the old one would never be seen again by anything else). All it tells you is the address of str. If you wanted to find out whether that internal pointer changed you would have to use reflection... but there's hardly ever any practical reason to do so.
When you read that a string is immutable, it means you cannot modify it by index, e.g.:
x := "hello"
x[2] = 'r' // compile error: cannot assign to x[2]
As a comment says, when you assign to the whole variable (rather than modifying part of it via an index), that has nothing to do with mutability, and of course you can do it.

Java-8 stream expression to 'OR' several enum values together

I am aggregating a bunch of enum values (different from the ordinal values) in a foreach loop.
int output = 0;
for (TestEnum testEnum : setOfEnums) {
    output |= testEnum.getValue();
}
Is there a way to do this in streams API?
If I use a lambda like this in a Stream<TestEnum> :
setOfEnums.stream().forEach(testEnum -> output |= testEnum.getValue());
I get a compile time error that says, 'variable used in lambda should be effectively final'.
Predicate represents a boolean-valued function; you need to use the reduce method of Stream to aggregate the enum values.
Assuming you have a HashSet named setOfEnums:
final int initialValue = 0; // effectively final, so it can be referenced inside the stream pipeline
int output = setOfEnums.stream()
        .map(TestEnum::getValue)
        .reduce(initialValue, (e1, e2) -> e1 | e2);
You need to reduce the stream of enums like this:
int output = Arrays.stream(TestEnum.values())
        .mapToInt(TestEnum::getValue)
        .reduce(0, (acc, value) -> acc | value);
I like the recommendations to use reduction, but perhaps a more complete answer would illustrate why it is a good idea.
In a lambda expression, you can reference variables like output that are in scope where the lambda expression is defined, but you cannot modify the values. The reason for that is that, internally, the compiler must be able to implement your lambda, if it chooses to do so, by creating a new function with your lambda as its body. The compiler may choose to add parameters as needed so that all of the values used in this generated function are available in the parameter list. In your case, such a function would definitely have the lambda's explicit parameter, testEnum, but because you also reference the local variable output in the lambda body, it could add that as a second parameter to the generated function. Effectively, the compiler might generate this function from your lambda:
private void generatedFunction1(TestEnum testEnum, int output) {
    output |= testEnum.getValue();
}
As you can see, the output parameter is a copy of the output variable used by the caller, and the OR operation would only be applied to the copy. Since the original output variable wouldn't be modified, the language designers decided to prohibit modification of values passed implicitly to lambdas.
To get around the problem in the most direct way, setting aside for the moment that the use of reduction is a far better approach, you could wrap the output variable in a mutable container (e.g. an int[] array of size 1, or an AtomicInteger). The container's reference would be passed by value to the generated function, and since you would now update the contents of output, not the value of output, output remains effectively final, so the compiler won't complain. For example:
AtomicInteger output = new AtomicInteger();
setOfEnums.stream().forEach(testEnum -> output.set(output.get() | testEnum.getValue()));
or, since we're using AtomicInteger, we may as well make it thread-safe in case you later choose to use a parallel Stream,
AtomicInteger output = new AtomicInteger();
setOfEnums.stream().forEach(testEnum -> (output.getAndUpdate(prev -> prev | testEnum.getValue())));
Now that we've gone over an answer that most resembles what you asked about, we can talk about the superior solution of using reduction, that other answers have already recommended.
There are two kinds of reduction offered by Stream: stateless reduction (reduce()) and stateful reduction (collect()). To visualize the difference, consider a conveyor belt delivering hamburgers, where your goal is to collect all of the hamburger patties into one big hamburger. With stateful reduction, you start with a new hamburger bun, and as each hamburger arrives you collect its patty and add it to the stack of patties in the bun you set up to collect them. In stateless reduction, you start with an empty hamburger bun (called the "identity", since that empty bun is what you end up with if the conveyor belt is empty), and as each hamburger arrives on the belt, you make a copy of the previously accumulated burger, add the patty from the new arrival, and discard the previous accumulated burger.
The stateless reduction may seem like a huge waste, but there are cases when copying the accumulated value is very cheap. One such case is when accumulating primitive types -- primitive types are very cheap to copy, so stateless reduction is ideal when crunching primitives in applications such as summing, ORing, etc.
So, using stateless reduction, your example might become:
int output = setOfEnums.stream()
        .mapToInt(TestEnum::getValue) // or .mapToInt(testEnum -> testEnum.getValue())
        .reduce(0, (resultSoFar, value) -> resultSoFar | value);
Some points to ponder:
Your original for loop is probably faster than using streams, except perhaps if your set is very large and you use parallel streams. Don't use streams for the sake of using streams. Use them if they make sense.
In my first example, I showed the use of Stream.forEach(). If you ever find yourself creating a Stream and just calling forEach(), it is more efficient just to call forEach() on the collection directly.
You didn't mention what kind of Set you are using, but I hope you are using EnumSet<TestEnum>. Because it is implemented as a bit field, it performs much better (O(1)) than any other kind of Set for all operations, even copying. EnumSet.noneOf(TestEnum.class) creates an empty set, EnumSet.allOf(TestEnum.class) gives you a set of all enum values, etc.
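Putting the pieces together, here is a runnable sketch. TestEnum and its bit-flag values are my own assumptions, since the question doesn't show them:

```java
import java.util.EnumSet;

public class Main {
    // Hypothetical enum: each constant carries a distinct bit flag.
    enum TestEnum {
        A(1), B(2), C(4);
        private final int value;
        TestEnum(int value) { this.value = value; }
        int getValue() { return value; }
    }

    public static void main(String[] args) {
        EnumSet<TestEnum> setOfEnums = EnumSet.of(TestEnum.A, TestEnum.C);
        int output = setOfEnums.stream()
                .mapToInt(TestEnum::getValue)
                .reduce(0, (acc, v) -> acc | v);
        System.out.println(output); // prints: 5 (that is, 1 | 4)
    }
}
```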

how to improve this very slow and inefficient Haskell program to process binary files byte by byte?

I am trying to write a hexdump like program in Haskell. I wrote the following program, I am glad that it works and gives desired output but it is very slow and inefficient. It was adapted from the program given in this answer.
I ran the program with a sample file, and it takes about 1 minute to process that less-than-1MB file. The standard Linux hexdump program does the job in about a second. All I want to do in the program is read->process->write all individual bytes in a bytestring.
Here is the question - How to efficiently read/process/write the bytestring (byte by byte, i.e. without using any other functions like getWord32le, if that's what is needed)? I want to do arithmetical and logical operations on each individual byte not necessarily on the Word32le or a group of bytes like that. I didn't find any data type like Byte.
Anyway, here is the code I wrote, which runs successfully on ghci (version 7.4) -
module Main where

import Data.Time.Clock
import Data.Char
import qualified Data.ByteString.Lazy as BIN
import Data.ByteString.Lazy.Char8
import Data.Binary.Get
import Data.Binary.Put
import System.IO
import Numeric (showHex, showIntAtBase)

main = do
    let infile = "rose_rosebud_flower.jpg"
    let outfile = "rose_rosebud_flower.hex"
    h_in <- openFile infile ReadMode
    System.IO.putStrLn "before time: "
    t1 <- getCurrentTime >>= return . utctDayTime
    System.IO.putStrLn (show t1)
    process_file h_in outfile
    System.IO.putStrLn "after time: "
    t2 <- getCurrentTime >>= return . utctDayTime
    System.IO.putStrLn (show t2)
    hClose h_in

process_file h_in outfile = do
    eof <- hIsEOF h_in
    if eof
        then return ()
        else do
            bin1 <- BIN.hGet h_in 1
            let str = Data.ByteString.Lazy.Char8.unpack bin1
            let hexchar = getHex str
            System.IO.appendFile outfile hexchar
            process_file h_in outfile

getHex (b:[]) = tohex (ord b) ++ " "
getHex _ = "ERR "

tohex d = showHex d ""
When I run it on the ghci I get
*Main> main
before time:
23254.13701s
after time:
23313.381806s
Please provide a modified (but complete working) code as answer and not just the list of names of some functions. Also, don't provide solutions that use jpeg or other image processing libraries as I am not interested in image processing. I used the jpeg image as example non-text file. I just want to process data byte by byte. Also don't provide links to other sites (especially to the documentation (or the lack of it) on the Haskell site). I cannot understand the documentation for bytestring and for many other packages written on the Haskell site, their documentation (which is just type signatures collected on a page, in most cases) seems only meant for the experts, who already understand most of the stuff. If I could figure out the solution by reading their documentation or even the much advertised (real world haskell) RWH book, I'd not have asked this question in the first place.
Sorry for the seeming rant, but the experience with Haskell is frustrating as compared to many other languages, especially when it comes to doing even simple IO as the Haskell IO related documentation with small complete working examples is almost absent.
Your example code reads one byte at a time. That's pretty much guaranteed to be slow. Better still, it reads a 1-byte ByteString and then immediately converts it to a list, negating all the benefits of ByteString. Best of all, it writes to the output file by the slightly strange method of opening the file, appending a single character, and then closing the file again. So for every individual hex character written out, the file has to be completely opened, wound to the end, have a character appended, and then flushed to disk and closed again.
I'm not 100% sure what you're trying to achieve here (i.e., trying to learn how stuff works vs trying to make a specific program work), so I'm not sure exactly how best to answer your question.
If this is your very first foray into Haskell, starting with something I/O-centric is probably a bad idea. You would be better off learning the rest of the language before worrying about how to do high-performance I/O. That said, let me try to answer your actual question...
First, there is no type named "byte". The type you're looking for is called Word8 (if you want an unsigned 8-bit integer) or Int8 (if you want a signed 8-bit integer — which you probably don't). There are also types like Word16, Word32, Word64; you need to import Data.Word to get them. Similarly, Int16, Int32 and Int64 live in Data.Int. The Int and Integer types are automatically imported, so you don't need to do anything special for those.
A ByteString is basically an array of bytes. A [Word8], on the other hand, is a single-linked list of individual bytes which may or may not be computed yet — much less efficient, but far more flexible.
If literally all you want to do is apply a transformation to every single byte, independent of any other byte, then the ByteString package provides a map function which will do exactly that:
map :: (Word8 -> Word8) -> ByteString -> ByteString
If you just want to read from one file and write to another, you can do that using so-called "lazy I/O". This is a neat dodge where the library handles all the I/O chunking for you. It has a few nasty gotchas though; basically revolving around it being hard to know exactly when the input file will get closed. For simple cases, that doesn't matter. For more complicated cases, it does.
So how does it work? Well, the ByteString library has a function
readFile :: FilePath -> IO ByteString
It looks like it reads the entire file into a giant ByteString in memory. But it doesn't. It's a trick. Actually it just checks that the file exists, and opens it for reading. When you try to use the ByteString, in the background the file invisibly gets read into memory as you process it. That means you can do something like this:
main = do
    bin <- readFile "in_file"
    writeFile "out_file" (map my_function bin)
This will read in_file, apply my_function to every individual byte of the file, and save the result into out_file, automatically doing I/O in large enough chunks to give good performance, but never holding more than one chunk in RAM at once. (The my_function part must have type Word8 -> Word8.) So this is both very simple to write, and should be extremely fast.
Things get fun if you don't want to read the entire file, or want to access the file in random order, or anything complicated like that. I am told that the pipes library is the thing to look at, but personally I've never used it.
In the interests of a complete working example:
module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Numeric

main = do
    bin <- BIN.readFile "in_file"
    BIN.writeFile "out_file" (BIN.concatMap my_function bin)

my_function :: Word8 -> BIN.ByteString
my_function b =
    case showHex b "" of
        c1:c2:_ -> BIN.pack [fromIntegral (fromEnum c1), fromIntegral (fromEnum c2)] -- two hex digits; convert each Char to Word8
        c2:_    -> BIN.pack [fromIntegral (fromEnum '0'), fromIntegral (fromEnum c2)] -- only one digit; pad with a leading zero
Note that because one byte becomes two hex digits, I've used the ByteString version of concatMap, which allows my_function to return a whole ByteString rather than just a single byte.

Purity of Memoized Functions in D

Are there any clever ways of preserving purity when memoizing functions in D?
I want this when caching SHA1-calculations of large datasets kept in RAM.
Short answer: Pick memoization or purity. Don't try and have both.
Long answer: I don't see how it would be possible to preserve purity with memoization unless you used casts to lie to the compiler and claim that a function is pure when it isn't. In order to memoize, you have to store the arguments and the result, and that breaks purity: the number one guarantee of a pure function is that it doesn't access mutable global or static variables, and such variables are the only place you could store memoized results.
So, if you did something like
alias pure nothrow Foo function() FuncType;
auto result = (cast(FuncType)&theFunc)();
then you can treat theFunc as if it were pure when it isn't, but then it's up to you to ensure that the function acts pure from the outside - including dealing with the fact that the compiler thinks that it can change the mutability of the return type of a strongly pure function which returns a mutable type. For instance, this code will compile just fine
char[] makeString(size_t len) pure
{
    return new char[](len);
}

void main()
{
    char[] a = makeString(5);
    const(char)[] b = makeString(5);
    const(char[]) c = makeString(5);
    immutable(char)[] d = makeString(5);
    immutable(char[]) e = makeString(5);
}
even though the return type is always mutable. And that's because the compiler knows that makeString is strongly pure and returns a value which could not have been passed to it - so, it's guaranteed to be a new value every time - and therefore changing the mutability of the return type to const or immutable doesn't violate the type system.
If you were to do something inside of makeString that involved casting a function to pure when it violated the guarantee that makeString always returned a new value, then you'd have broken the type system, and you'd be risking having very buggy code depending on what you did with the values returned from makeString.
The only way that I'm aware of getting purity when you don't have it is to cast a function pointer so that it's pure, but if you do that, then you must fully understand what guarantees a pure function makes and what the compiler thinks that it can do with it so that you fully mimic that behavior. That's easier if you're returning immutable data or a value type, because then you don't have the issue of the compiler changing the mutability of the return type, but it's still very tricky business.
So, if you're thinking about casting something to pure, think again. Yes, it's possible to do some stuff that way that you couldn't otherwise, but it's very risky. Personally, I'd advise that you decide whether purity matters more to you or memoization matters more to you and that you drop the other. Anything else is highly risky.
What D allows you to express within the type system is an impure function that memoizes a pure one.
Conceptually the memoizer is also pure, but the type system is not expressive enough to allow that; you'd need to cheat somewhere.

Caching of data in Mathematica

there is a very time-consuming operation which generates a dataset in my package. I would like to save this dataset and let the package rebuild it only when I manually delete the cached file. Here is my approach as part of the package:
myDataset = Module[{fname, data},
    fname = "cached-data.mx";
    If[FileExistsQ[fname],
        Get[fname],
        data = Evaluate[timeConsumingOperation[]];
        Put[data, fname];
        data]
    ];

timeConsumingOperation[] := Module[{},
    (* lot of work here *)
    {"data"}
    ];
However, instead of writing the long dataset to the file, the Put command only writes one line: "timeConsumingOperation[]", even if I wrap it with Evaluate as above. (To be fair, this behaviour is not consistent; sometimes the dataset is written, sometimes not.)
How do you cache your data?
Another caching technique I use very often, especially when you might not want to insert the precomputed form in e.g. a package, is to memoize the expensive evaluation(s), such that it is computed on first use but then cached for subsequent evaluations. This is readily accomplished with SetDelayed and Set in concert:
f[arg1_, arg2_] := f[arg1, arg2] = someExpensiveThing[arg1, arg2]
Note that SetDelayed (:=) binds higher than Set (=), so the implied order of evaluation is the following, but you don't actually need the parens:
f[arg1_, arg2_] := ( f[arg1, arg2] = someExpensiveThing[arg1, arg2])
Thus, the first time you evaluate f[1,2], the evaluation-delayed RHS is evaluated, causing the resulting value to be computed and stored as a DownValue of f with Set.
@rcollyer is also right in that you don't need to use empty brackets if you have no arguments; you could just as easily write:
g := g = someExpensiveThing[...]
There's no harm in using them, though.
In the past, whenever I've had trouble with things evaluating it is usually when I have not correctly matched the pattern required by the function. For instance,
f[x_Integers]:= x
which won't match anything. Instead, I meant
f[x_Integer]:=x
In your case, though, you have no pattern to match: timeConsumingOperation[].
Your problem is more likely related to when timeConsumingOperation is defined relative to myDataset. In the code you've posted above, timeConsumingOperation is defined after myDataset. So, on the first run (or immediately after you've cleared the global variables) you would get exactly the result you're describing, because timeConsumingOperation is not defined when the code for myDataset is run.
Now, SetDelayed (:=) automatically causes the variable to be recalculated whenever it is used, and since you do not require any parameters to be passed, the square brackets are not necessary. The important point here is that timeConsumingOperation can be declared, as written, prior to myDataset because SetDelayed will cause it not to be executed until it is used.
All told, your caching methodology looks exactly how I would go about it.