Is there a quick-starting Haskell interpreter suitable for scripting? - performance

Does anyone know of a quick-starting Haskell interpreter that would be suitable for use in writing shell scripts? Running 'hello world' using Hugs took 400ms on my old laptop and takes 300ms on my current Thinkpad X300. That's too slow for instantaneous response. Times with GHCi are similar.
Functional languages don't have to be slow: both Objective Caml and Moscow ML run hello world in 1ms or less.
Clarification: I am a heavy user of GHC and I know how to use GHCi. I know all about compiling to get things fast. Parsing costs should be completely irrelevant: if ML and OCaml can start 300x faster than GHCi, then there is room for improvement.
I am looking for
The convenience of scripting: one source file, no binary code, same code runs on all platforms
Performance comparable to other interpreters, including fast startup and execution for a simple program like
module Main where
main = print 33
I am not looking for compiled performance for more serious programs. The whole point is to see if Haskell can be useful for scripting.

Why not create a script front-end that compiles the script if it hasn't been compiled before, or if the compiled version is out of date?
Here's the basic idea. This code could be improved a lot: search the path rather than assuming everything is in the same directory, handle other file extensions better, etc. Also, I'm pretty green at Haskell coding (ghc-compiled-script.hs):
import Control.Monad
import System.Directory
import System.Environment
import System.Exit
import System.IO
import System.Posix.Files
import System.Posix.Process
import System.Process

getMTime f = getFileStatus f >>= return . modificationTime

main = do
  scr : args <- getArgs
  let cscr = takeWhile (/= '.') scr
  scrExists  <- doesFileExist scr
  cscrExists <- doesFileExist cscr
  compile <- if scrExists && cscrExists
    then do
      scrMTime  <- getMTime scr
      cscrMTime <- getMTime cscr
      return $ cscrMTime <= scrMTime
    else
      return True
  when compile $ do
    r <- system $ "ghc --make " ++ scr
    case r of
      ExitFailure i -> do
        hPutStrLn stderr $
          "'ghc --make " ++ scr ++ "' failed: " ++ show i
        exitFailure
      ExitSuccess -> return ()
  executeFile cscr False args Nothing
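For example, the cscr computation could use System.FilePath instead of takeWhile (/= '.'), which breaks on paths that contain extra dots; a quick sketch:

import System.FilePath (dropExtension)

-- dropExtension strips only the final extension, so a script such as
-- "./my.scripts/hs-echo.hs" still maps to "./my.scripts/hs-echo",
-- whereas takeWhile (/= '.') would truncate at the first dot.
compiledName :: FilePath -> FilePath
compiledName = dropExtension

main :: IO ()
main = mapM_ (putStrLn . compiledName) ["hs-echo.hs", "./my.scripts/hs-echo.hs"]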
Now we can create scripts such as this (hs-echo.hs):
#! ghc-compiled-script
import Data.List
import System.Environment

main = do
  args <- getArgs
  putStrLn $ foldl (++) "" $ intersperse " " args
And now running it:
$ time hs-echo.hs "Hello, world\!"
[1 of 1] Compiling Main ( hs-echo.hs, hs-echo.o )
Linking hs-echo ...
Hello, world!
hs-echo.hs "Hello, world!" 0.83s user 0.21s system 97% cpu 1.062 total
$ time hs-echo.hs "Hello, world, again\!"
Hello, world, again!
hs-echo.hs "Hello, world, again!" 0.01s user 0.00s system 60% cpu 0.022 total

Using ghc -e is pretty much equivalent to invoking ghci. GHC's runhaskell (runghc) likewise runs the code through GHC's interpreter rather than compiling it to a temporary executable, which is why its startup time below is close to that of ghc -e.
$ time echo 'Hello, world!'
Hello, world!
real 0m0.021s
user 0m0.000s
sys 0m0.000s
$ time ghc -e 'putStrLn "Hello, world!"'
Hello, world!
real 0m0.401s
user 0m0.031s
sys 0m0.015s
$ echo 'main = putStrLn "Hello, world!"' > hw.hs
$ time runhaskell hw.hs
Hello, world!
real 0m0.335s
user 0m0.015s
sys 0m0.015s
$ time ghc --make hw
[1 of 1] Compiling Main ( hw.hs, hw.o )
Linking hw ...
real 0m0.855s
user 0m0.015s
sys 0m0.015s
$ time ./hw
Hello, world!
real 0m0.037s
user 0m0.015s
sys 0m0.000s
How hard is it to simply compile all your "scripts" before running them?
Edit
Ah, providing binaries for multiple architectures is a pain indeed. I've gone down that road before, and it's not much fun...
Sadly, I don't think it's possible to make any Haskell implementation's startup overhead much better. The language's declarative nature means that the entire program has to be read before anything can be typechecked, never mind executed, and then you either pay the cost of strictness analysis or suffer unnecessary laziness and thunking.
The popular 'scripting' languages (shell, Perl, Python, etc.) and the ML-based languages require only a single pass... well okay, ML requires a static typecheck pass and Perl has this amusing 5-pass scheme (with two of them running in reverse); either way, being procedural means that the compiler/interpreter has a much easier job assembling the bits of the program together.
In short, I don't think it's possible to do much better than this. I haven't tested whether Hugs or GHCi has the faster startup, but any difference between them is still far away from the non-Haskell languages.

If you are really concerned with speed, you are going to be hampered by re-parsing the code on every launch. Haskell doesn't need to be run from an interpreter; compile it with GHC and you should get excellent performance.

You have two parts to this question:
you care about performance
you want scripting
If you care about performance, the only serious option is GHC, which is very very fast: http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=all
If you want something light for Unix scripting, I'd use GHCi. It is about 30x faster than Hugs, but also supports all the libraries on hackage.
So install GHC now (and get GHCi for free).
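As a side note on the scripting-convenience point: runghc, which ships with GHC, accepts a shebang line, so a single source file can be made directly executable (chmod +x), although it still pays the interpreter startup cost discussed above. A minimal example:

#!/usr/bin/env runghc
module Main where

-- The whole "script": one source file, no separate binary to manage.
main :: IO ()
main = print 33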

What about having a ghci daemon and a feeder script that takes the script path and location, communicates with the already running ghci process to load and execute the script in the proper directory and pipes the output back to the feeder script for stdout?
Unfortunately, I have no idea how to write something like this, but it seems like it could be really fast, judging by the speed of :l in ghci. It seems that most of the cost in runhaskell is in starting up ghci, not in parsing and running the script.
Edit: After some playing around, I found the Hint package (a wrapper around the GHC API) to be perfect for this. The following code loads the passed-in module (here assumed to be in the same directory) and executes its main function. Now 'all' that's left is to make it a daemon and have it accept scripts on a pipe or socket.
import Control.Monad.IO.Class (liftIO)
import Language.Haskell.Interpreter

run = runInterpreter . test

test :: String -> Interpreter ()
test mname = do
  loadModules [mname ++ ".hs"]
  setTopLevelModules [mname]
  res <- interpret "main" (as :: IO ())
  liftIO res
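A rough, untested sketch of that daemon loop, assuming a named pipe created beforehand with mkfifo /tmp/hs-daemon (the pipe path and the one-module-name-per-line protocol are made up for illustration):

import Control.Monad (forever)
import Control.Monad.IO.Class (liftIO)
import System.IO (IOMode (ReadMode), hGetLine, withFile)
import Language.Haskell.Interpreter

main :: IO ()
main = forever $
  -- Opening the FIFO blocks until a feeder connects and writes a line.
  withFile "/tmp/hs-daemon" ReadMode $ \h -> do
    mname <- hGetLine h
    r <- runInterpreter $ do
      loadModules [mname ++ ".hs"]
      setTopLevelModules [mname]
      act <- interpret "main" (as :: IO ())
      liftIO act
    -- Report interpreter errors; otherwise loop for the next request.
    either print (const (return ())) r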
Edit2: As far as stdout/err/in go, using this specific GHC trick it looks like it would be possible to redirect the client program's standard streams into the feeder program, then into some named pipes (perhaps) that the daemon is connected to at all times, and then send the daemon's stdout back through another named pipe that the feeder program is listening to. Pseudo-example:
grep ... | feeder my_script.hs | xargs ...
              |                          ^
              v                          |
         named pipe --> daemon --> named pipe
Here the feeder would be a small compiled harness program that just passes the name and location of the script to the daemon and redirects the standard streams into and back out of it.
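A correspondingly minimal feeder sketch, which only sends the module name to the hypothetical /tmp/hs-daemon pipe; the stdin/stdout plumbing described above is left out:

import System.Environment (getArgs)
import System.IO (IOMode (WriteMode), hPutStrLn, withFile)

main :: IO ()
main = do
  -- Pass the script's module name (without .hs) as the first argument.
  (mname : _) <- getArgs
  withFile "/tmp/hs-daemon" WriteMode $ \h -> hPutStrLn h mname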

Related

How to make a change to the source code in Haskell using GHCi?

I am new to Haskell and am using GHCi to edit and run Haskell files. For some reason, I am not able to edit the source code of the file. The behavior I am getting is extremely odd.
Below is a screenshot of what is happening. I am loading the file lec3.hs and am attempting to edit this file to add the following function: myfun = \w -> not w. For some reason, this function successfully runs when I call it immediately after: myfun False. I do not need to reload the file.
It is clear that the function is not being added to the source code. When I reload the file, I am getting an error stating that myfun does not exist.
Could someone help me understand why GHCi is behaving this way, and how to fix such behaviour? I have already spent an hour trying to figure this out. I would sincerely appreciate any help.
Typing things into GHCi isn't supposed to add them to the source. But if you have loaded a file into GHCi you can edit it using the :e command, and when you close the editor it will be automatically reloaded.
You can use :e filename.hs if you are dealing with more than one file and you need to specify.
Generally it's easier to work in a separate editor and just reload into GHCi with :r, but :e is occasionally useful.
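To summarise the commands involved (descriptions as above; :e and :r abbreviate :edit and :reload):

:e            edit the file you loaded (reloads when the editor closes)
:e Foo.hs     edit a specific file
:r            reload after editing in an external editor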
To answer
Is it possible to edit a .hs file from GHCi?
Technically speaking it is possible, since Haskell, like any other language, has file-IO operations. Concretely, appendFile allows you to append content to a file.
$ cat >> Foo.hs # Creating a simple Haskell file
foo :: Int
foo = 3
$ ghci Foo.hs
GHCi, version 8.2.2: http://www.haskell.org/ghc/ :? for help
Loaded GHCi configuration from /home/sagemuej/.ghc/ghci.conf
Loaded GHCi configuration from /home/sagemuej/.ghci
[1 of 1] Compiling Main ( Foo.hs, interpreted )
Ok, one module loaded.
*Main> foo
3
*Main> appendFile "Foo.hs" $ "myfun = \\w -> not w"
*Main> :r
[1 of 1] Compiling Main ( Foo.hs, interpreted )
Ok, one module loaded.
*Main> myfun False
True
But this really is not a good way to edit files. It makes much more sense to have two windows open, one with a text editor and one with the REPL, i.e. with GHCi. It can be either two completely separate OS windows, or two sub-windows of your IDE or whatever, but in either case GHCi will be used only for evaluation and one-line prototyping, not for actually adding/editing code. (The one-line prototypes can be copy&pasted into the editor, if it helps.)

The optimal way to set a breakpoint in the Python source code while debugging CPython by GDB

I use GDB to understand how CPython executes the test.py source file, and I want to stop CPython when it starts executing the opcode I am interested in.
OS: Ubuntu 18.04.2 LTS
Debugger: GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
The first problem: many of CPython's own .py files are executed before my test.py gets its turn, so I can't just break at _PyEval_EvalFrameDefault; there are many calls to it, so I have to distinguish my file from the others.
The second problem: I can't set a condition like "when the filename is equal to test.py", because the filename is not a simple C string; it is a CPython Unicode object, so the standard GDB string functions can't be used for the comparison.
At the moment I use the following trick to break execution at the needed line of the test.py source:
For example, I have the source file:
x = ['a', 'b', 'c']
# I want to set the breakpoint at this line.
for e in x:
    print(e)
I add the binary left shift operator to the code:
x = ['a', 'b', 'c']
# Added for breakpoint
a = 12
b = 2 << a
for e in x:
    print(e)
And then, track the BINARY_LSHIFT opcode execution in the Python/ceval.c file by this GDB command:
break ceval.c:1327
I have chosen the BINARY_LSHIFT opcode because it is rarely used, so I can reach the needed part of the .py file quickly: it occurs only once across all the other .py modules executed before my test.py.
I am looking for a more straightforward way of doing the same, so
the questions:
Can I catch the moment test.py starts executing? I should mention that the test.py filename appears at different stages: parsing, compilation, execution. So it would also be good to be able to break the CPython execution at any of these stages.
Can I specify the line of test.py where I want to break? That is easy for .c files, but not for .py files.
My idea would be to use a C extension to make it possible to set C breakpoints in a Python script (similar to pdb.set_trace() or breakpoint(), available since Python 3.7); I will call it cbreakpoint.
Consider the following python-script:
#example.py
from cbreakpoint import cbreakpoint
cbreakpoint(breakpoint_id=1)
print("hello")
cbreakpoint(breakpoint_id=2)
It could be used as follows in gdb:
>>> gdb --args python example.py
[gdb] b cbreakpoint
[gdb] run
Now the debugger stops at cbreakpoint(breakpoint_id=1) and cbreakpoint(breakpoint_id=2).
Here is proof of concept, written in Cython to avoid the otherwise needed boilerplate-code:
#cbreakpoint.pyx
cdef extern from *:
    """
    long long last_breakpoint_id = -1;
    void cbreakpoint(long long breakpoint_id){
        last_breakpoint_id = breakpoint_id;
    }
    """
    void c_cbreakpoint "cbreakpoint"(long long breakpoint_id)

def cbreakpoint(breakpoint_id = 0):
    c_cbreakpoint(breakpoint_id)
which can be built in place via:
cythonize -i cbreakpoint.pyx
If Cython isn't installed, I have uploaded a version which doesn't depend on Cython (too much code for this post) on github.
It is also possible to break conditionally, given the breakpoint_id, i.e.:
>>> gdb --args python example.py
[gdb] break src/cbreakpoint.c:595 if breakpoint_id == 2
[gdb] run
will break only after hello was printed, at the cbreakpoint with id=2 (the cbreakpoint with id=1 will be skipped). Depending on the Cython version the line number can vary, but it can be found once gdb stops at cbreakpoint.
It is also possible to do something similar without any additional modules:
add breakpoint() or import pdb; pdb.set_trace() instead of cbreakpoint
gdb --args python example.py + run
When pdb interrupts the program, hit Ctrl+C in order to interrupt in gdb.
Activate breakpoints in gdb.
continue in gdb and then in pdb (i.e. c+enter twice).
A small problem is that the breakpoints might then be hit while still inside pdb, so the first method is a little more robust.

Python multiprocessing stdin input

All code was written and tested on Python 3.4, Windows 7.
I was designing a console app and needed to use stdin from the command line (Windows) to issue commands and to change the operating mode of the program. The program depends on multiprocessing to spread CPU-bound loads across multiple processors.
I am using stdout to monitor status and some basic return information, and stdin to issue commands to load different sub-processes based on the returned console information.
This is where I found a problem: I could not get the multiprocessing module to accept stdin input, but stdout was working just fine. I then found some help here on Stack Overflow and tested it, and found that with the threading module this all works great, except that all output to stdout is paused until stdin is cycled each time, because stdin blocks while holding the GIL.
I have had success with a workaround implemented with msvcrt.kbhit(). However, I can't help but wonder whether there is some sort of bug in multiprocessing that keeps stdin from reading any data. I tried numerous approaches and nothing worked with multiprocessing. I even attempted to use Queues, but I did not try pools or any other multiprocessing features.
I also did not try this on my Linux machine, since I was focusing on getting it to work.
Here is simplified test code that does not function as intended (reminder this was written in Python 3.4 - win7):
import sys
import time
from multiprocessing import Process

def function1():
    while True:
        print("Function 1")
        time.sleep(1.33)

def function2():
    while True:
        print("Function 2")
        c = sys.stdin.read(1)  # Does not appear to be waiting for read before continuing loop.
        sys.stdout.write(c)    # nothing in 'c'
        sys.stdout.write(".")  # checking to see if it works at all
        print(str(c))          # trying something else, still nothing in 'c'
        time.sleep(1.66)

if __name__ == "__main__":
    p1 = Process(target=function1)
    p2 = Process(target=function2)
    p1.start()
    p2.start()
Hopefully someone can shed light on whether this is intended functionality, if I didn't implement it correctly, or some other useful bit of information.
Thanks.
When you take a look at Python's implementation of multiprocessing.Process._bootstrap() you will see this:
if sys.stdin is not None:
    try:
        sys.stdin.close()
        sys.stdin = open(os.devnull)
    except (OSError, ValueError):
        pass
You can also confirm this by using:
>>> import sys
>>> import multiprocessing
>>> def func():
...     print(sys.stdin)
...
>>> p = multiprocessing.Process(target=func)
>>> p.start()
>>> <_io.TextIOWrapper name='/dev/null' mode='r' encoding='UTF-8'>
And reading from os.devnull immediately returns an empty result:
>>> import os
>>> f = open(os.devnull)
>>> f.read(1)
''
You can work around this by using open(0):
file is either a string or bytes object giving the pathname (absolute or relative to the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped. (If a file descriptor is given, it is closed when the returned I/O object is closed, unless closefd is set to False.)
And "0 file descriptor":
File descriptors are small integers corresponding to a file that has been opened by the current process. For example, standard input is usually file descriptor 0, standard output is 1, and standard error is 2:
>>> def func():
...     sys.stdin = open(0)
...     print(sys.stdin)
...     c = sys.stdin.read(1)
...     print('Got', c)
...
>>> multiprocessing.Process(target=func).start()
>>> <_io.TextIOWrapper name=0 mode='r' encoding='UTF-8'>
Got a

Haskell error on text file as input on Mac

I have a Main.hs file with two functions.
module Main where

import Data.List

main :: IO ()
main = interact reverse

functionThatWorks = putStrLn "Ajax"
After I set the directory and load Main.hs I have no problem with calling functionThatWorks.
Except when I want to take a text file as input like so:
Main<in.txt or ./Main<in.txt
I get an error saying 'parse error on input 'in' '
Does anyone know how I can make this work in the Terminal?
p.s. I use a Mac.
Unfortunately ghci doesn't understand input redirection like the shell does.
I would suggest running your program with runhaskell:
runhaskell Main.hs < in.txt
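If you do want to stay inside GHCi, a simple workaround is to read the file yourself rather than relying on shell redirection, for example (assuming in.txt is in the current directory):

*Main> readFile "in.txt" >>= putStr . reverse

This produces the same output as running the compiled program with the file on stdin, since interact reverse simply reverses whatever it reads.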

ipython notebook : how to parallelize external script

I'm trying to use parallel computing from the IPython parallel library. But I have little knowledge about it, and I find the docs difficult to read for someone who knows nothing about parallel computing.
Funnily enough, all the tutorials I found just reuse the example from the docs, with the same explanation, which from my point of view is useless.
Basically what I'd like to do is run a few scripts in the background so they execute at the same time. In bash it would be something like:
for my_file in $(cat list_file); do
    python pgm.py $my_file &
done
But the bash interpreter of the IPython notebook doesn't handle background mode.
It seems that the solution was to use the parallel library from IPython.
I tried :
from IPython.parallel import Client
rc = Client()
rc.block = True
dview = rc[:2] # I take only 2 engines
But then I'm stuck. I don't know how to run the same script or program twice (or more) at the same time.
Thanks.
One year later, I eventually managed to get what I wanted.
1) Create a function doing what you want to run on each CPU. Here it just calls a script from bash with the ! IPython magic command. I guess it would also work with the call() function.
def my_func(my_file):
    !python pgm.py {my_file}
Don't forget the {} when using !
Note also that the path to my_file should be absolute, since the cluster engines run from wherever you started the notebook (with jupyter notebook or ipython notebook), which is not necessarily where you are.
2) Start your ipython notebook Cluster with the number of CPU you want.
Wait 2s and execute the following cell:
from IPython import parallel
rc = parallel.Client()
view = rc.load_balanced_view()
3) Get a list of file you want to process:
files = list_of_files
4) Asynchronously map your function over all your files on the view of the engines you just created (not sure of the wording).
r = view.map_async(my_func, files)
While it's running you can do something else in the notebook (it runs in the "background"!). You can also call r.wait_interactive(), which interactively reports the number of files processed, the time spent so far, and the number of files left. This will prevent you from running other cells (but you can interrupt it).
And if you have more files than engines, no worries, they will be processed as soon as an engine finishes with 1 file.
Hope this will help others !
This tutorial might be of some help:
http://nbviewer.ipython.org/github/minrk/IPython-parallel-tutorial/blob/master/Index.ipynb
Note also that I still use IPython 2.3.1; I don't know if it has changed since Jupyter.
Edit: it still works with Jupyter; see here for differences and potential issues you may encounter.
Note that if you use external libraries in your function, you need to import them on the different engines with:
%px import numpy as np
or
%%px
import numpy as np
import pandas as pd
The same goes for variables and other functions: you need to push them to the engines' namespace:
rc[:].push(dict(
    foo=foo,
    bar=bar))
If you're trying to execute some external scripts in parallel, you don't need to use IPython's parallel functionality. Replicating bash's parallel execution can be achieved with the subprocess module as follows:
import subprocess

procs = []
for i in range(10):
    procs.append(subprocess.Popen(['ls', '/Users/shad/tmp/'], stdout=subprocess.PIPE))

results = []
for proc in procs:
    stdout, _ = proc.communicate()
    results.append(stdout)
Be wary that if your subprocess generates a lot of output, the process will block. If you print the output (results) you get:
print results
['file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n']
