Faster way to compile Factor programs

I really love the Factor language. But I find that compiling programs written in it is incredibly slow, and thus it's not feasible to create real projects with Factor.
For example, compiling the sample Calculator WebApp takes about 5 minutes on my laptop (i3 processor, 2GB RAM, running Fedora 15).
I've searched around but couldn't find a faster way to compile Factor programs than using the interpreter (the main factor binary executable).
It becomes ridiculous if you use only the interpreter for each run instead of "deploying" your program to a native binary (which doesn't even work for many programs).
It means that every time I want to run the Calculator, for example, I have to sit through a 5-minute cold start.
I'd like to know whether this is a common issue, and whether there's a good way to tackle it.

I admit that before today, I had never heard of Factor. I took the opportunity to play with it. It is looking nice (reminds me of squeak-vm and lisp at the same time). I'll cut the smalltalk (pun very much intended) and jump to your question.
Analysis
It appears that the way Factor works makes loading vocabularies slow.
I compiled Factor on my 64-bit quad-core Linux system (from git revision 60b1115452, Thu Oct 6). With everything on tmpfs, the build dir takes 641 MB, of which 2×114 MB is the factor.image and its backup (factor.image.fresh).
When strace-ing the calculator app as it loads, there is a huge list of Factor files being read:
3175 Factor files are touched.
Compiling these takes roughly 30 seconds on my box.
Memory usage maxes out at just under 500 MB (virtual), with a resident set of about 300 MB.
I strongly suspect your box is low on memory and may be swapping heavily. That would definitely explain compilation taking 5 minutes.
Can you confirm whether this is the case? (It's likely if you are running on some kind of shared host or VPS appliance.) If you are running a virtual machine, consider increasing the available system memory.
Saving Heap Images (snapshots)
I already mentioned the factor.image file (114 MB) before. It contains a 'precompiled' (bootstrapped, actually) heap image for the Factor VM. All operations in the VM (working in the UI listener, compiling Factor files) affect the heap image.
To avoid having to recompile your source files time and time again, you can save the end-result into a custom heap image:
http://docs.factorcode.org/content/article-images.html
Images
To start Factor with a custom image, use the -i=image command line switch; see Command line switches for the VM.
One reason to save a custom image is if you find yourself loading the same libraries in every Factor session; some libraries take a little while to compile, so saving an image with those libraries loaded can save you a lot of time.
For example, to save an image with the web framework loaded,
USE: furnace
save
New images can be created from scratch: Bootstrapping new images
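If you'd rather not overwrite the default image, the memory vocabulary also has a word that writes the heap to a named file; a minimal sketch (I'm going from the docs here, and the path is a placeholder of mine):
USING: memory ;
! load the vocabularies you want baked in first, e.g. USE: furnace
"/path/to/custom.image" save-image
You can then start Factor against that file with the -i=/path/to/custom.image switch mentioned above.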
Deploying applications
Saving heap images results in files that will (typically) be bigger than the original bootstrap image.
The Application deployment tool creates stripped-down images containing just enough code to run a single application.
The stand-alone application deployment tool, implemented in the tools.deploy vocabulary, compiles a vocabulary down to a native executable which runs the vocabulary's MAIN: hook. Deployed executables do not depend on Factor being installed, and do not expose any source code, and thus are suitable for delivering commercial end-user applications.
Most of the time, the words in the tools.deploy vocabulary should not be used directly; instead, use the Application deployment UI tool.
You must explicitly specify major subsystems which are required, as well as the level of reflection support needed. This is done by modifying the deployment configuration prior to deployment.
Concluding
I expect you'll benefit from (in order of quickest win):
increasing available RAM (only quick in virtual environments...)
saving a heap image with:
USE: db.sqlite     ! database backend the calculator uses
USE: furnace.alloy ! Furnace web framework glue
USE: namespaces
USE: http.server
save               ! write the loaded state back to the image
This step brought the compilation on my system down from ~30 s to 0.835 s.
deploying the calculator webapp to a stripped heap image (refer to the source for deployment hints)
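For the deployment route, the one-liner would look something like this (a sketch only: I haven't verified that the calculator webapp deploys cleanly, and you'll likely need to adjust the deployment configuration first, as the docs above note):
USING: tools.deploy ;
"webapps.calculator" deploy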
In short: thanks for bringing Factor to my attention, and I hope my findings are of some help. Cheers!

Related

Trying to understand why freading a file over a network is so much slower over Samba than NFS

We have a program building a 3D model from three files hosted on a Linux file server: basically x.bin, y.bin and z.bin. It builds the model one z level at a time, and reads each file for every "slice".
On Linux machines running this program, the first slice takes around 45 seconds, and then ~2 seconds for every "slice" after that.
On Windows, the exact same program performing the exact same operation running the exact same script and code takes 5 minutes for the first slice, and around a minute and a half each slice after that.
Reading file over network slow due to extra reads
This thread seemed to have a guy with a similar problem, but the truth is that I'm still unclear on how NFS can be faster, and on what change I could suggest to the actual developers to improve performance. The code is OS-independent; I believe it just uses C's fread, fseek, etc. to read the file information over the network.
How does NFS transfer/read data such that it can be 60x faster than Samba?
How can I get that performance with Samba?
I'm not 100% sure, as I don't know much about Samba, but my guess is that NFS supports positioned reads, so the client can fseek straight to the next slice and fetch just that data, while Samba probably doesn't, and has to fetch the whole file from the server and discard the "unused" content.
By the way, it's not the exact same program you're running; you presumably recompile it for each platform, so it gets translated into different system calls, and each platform has its own pros and cons...
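To make the access pattern concrete, here is a small C sketch of the per-slice read the question describes; the path, slice size, and slice count are placeholders of mine:

#include <stdio.h>
#include <stdlib.h>

#define SLICE_BYTES (1024L * 1024L)   /* placeholder slice size */

/* Seek to one slice and read it; over NFS this becomes a positioned
   READ of just these bytes, which is what the answer above credits
   for the speed difference. */
static int read_slice(FILE *f, long slice_index, unsigned char *out)
{
    if (fseek(f, slice_index * SLICE_BYTES, SEEK_SET) != 0)
        return -1;
    return fread(out, 1, SLICE_BYTES, f) == (size_t)SLICE_BYTES ? 0 : -1;
}

int main(void)
{
    unsigned char *buf = malloc(SLICE_BYTES);
    FILE *f = fopen("/mnt/server/x.bin", "rb");  /* placeholder path */
    long z;

    if (!f || !buf)
        return 1;
    for (z = 0; z < 100; z++)          /* placeholder slice count */
        if (read_slice(f, z, buf) != 0)
            break;
    fclose(f);
    free(buf);
    return 0;
}

Timing this loop against an NFS mount and a Samba mount of the same share would directly test the positioned-read hypothesis above.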

Downloading multiple files simultaneously with big file lists on Windows

I am looking for a program that can download multiple files simultaneously (about 100 files in parallel). The catch is that this program must be able to handle very big lists of files (around 200 MB of links), and must work on Windows.
So far I have tested aria2, but when I load my file list I get an out-of-memory exception (aria2 tries to use over 4 GB of memory!). I also tried mulk, but it simply isn't working: it has been loading my file list for about two hours now, while generating that list and writing it to disk took me about half a minute. I haven't tried wget yet, but as far as I know it cannot download in parallel, am I right?
Is there any software that could handle my requirements?
With aria2, you can use the --deferred-input option to reduce the memory footprint of list input. Setting the --max-download-result option to something low, such as 100, may reduce memory usage too. For example (flag names as per the aria2 docs, links.txt being your list file): aria2c --input-file=links.txt --deferred-input=true --max-download-result=100.

Drastic performance improvement in .NET CF after app gets moved out of the foreground. Why?

I have a large (500K lines) .NET CF (C#) program, running on CE6/.NET CF 3.5 (v3.5.10181.0). This is running on a Freescale i.MX31 (ARM) @ 400 MHz. It has 128 MB RAM, with ~80 MB available to applications. My app is the only significant one running (this is a dedicated, embedded system). Managed memory in use (as reported by GC.Collect) is about 18 MB.
To give a better idea of the app size, here are some stats culled from .NET CF Remote Performance Monitor after starting up the application:
GC:
Garbage Collections 131
Bytes Collected by GC 97,919,260
Managed Bytes in use after GC 17,774,992
Total Bytes in use after GC 24,117,424
GC Compactions 41
JIT:
Native Bytes Jitted: 10,274,820
Loader:
Classes Loaded 7,393
Methods Loaded 27,691
Recently, I have been trying to track down a performance problem. I found that, depending on which of two startup configurations I used, my benchmark would run in approximately 2 seconds (slow case) vs. 1 second (fast case). In the slow case, the time for the benchmark could change randomly from EXE run to EXE run, anywhere from 1.1 to 2 seconds, but for any given EXE run it would not change for the life of the application. In other words, you could re-run the benchmark and the time for the test stays the same until you restart the EXE, at which point a new time is established and stays consistent.
I could not explain the 1.1 to 2x slowdown via any conventional mechanism, or by narrowing the slowdown to any particular part of the benchmark code. It appeared that the overall process was just running slower, almost like a thread was spinning and taking away some of "my" CPU.
Then, I randomly discovered that just by switching away from my app (the GUI loses the foreground) to another app, my performance issue disappears. It stays gone even after returning my app to the foreground. I now have a tentative workaround where my app after startup launches an auxiliary app with a 1x1 size window that kills itself after 5ms. Thus the aux app takes the foreground, then relinquishes it.
The question is, why does this speed up my application?
I know that code gets pitched when a .NET CF app loses the foreground. I also notice that when performing a "GC Heap" capture with .NET CF Remote Performance Monitor, a Code Pitch is logged -- and this also triggers the performance improvement in my app. So I suspect that code pitching is somehow related to, or even responsible for, fixing the performance. But I'm at a loss as to how to determine whether that is really the case, or how to explain why pitching code could help in this way. Does pitching out lots of code somehow significantly improve locality of reference of code pages (which are re-JITted, presumably near each other in memory) enough to help this much? (My benchmark spans probably 3 dozen routines and hundreds of lines of code.)
Most importantly, what can I do in my app to reliably avoid this slower condition. Any pointers to relevant .NET CF / JIT / Code pitching information would be greatly appreciated.
Your app going to the background auto-triggers a GC.Collect, which collects, may compact the GC Heap and may pitch code. Have you checked to see if a manual GC.Collect without going to the background gives the same behavior? It might not be pitching that's giving the perf gain, it might be collection or compaction. If a significant number of dead roots are swept up, walking the root tree may be getting faster. Can't say I've specifically seen this issue, so this is all conjecture.
Send your app a WM_HIBERNATE before your performance loop. It will clean things up.
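A sketch of what that looks like from native code (assumptions: Windows CE, where WM_HIBERNATE is the low-memory notification, conventionally 0x3FF, so verify against your SDK headers; hwndMain is assumed to be your app's top-level window handle):

#include <windows.h>

#ifndef WM_HIBERNATE
#define WM_HIBERNATE 0x3FF   /* Windows CE low-memory message; verify locally */
#endif

/* Ask the app (and the CF runtime) to release cached resources before a
   performance-critical section; in .NET CF this also prompts a collect
   and a code pitch. */
void trim_before_benchmark(HWND hwndMain)
{
    SendMessage(hwndMain, WM_HIBERNATE, 0, 0);
}

From C#, the equivalent would be a P/Invoke of SendMessage with the same message number.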
We have a similar issue with our .NET CF application.
Over time, our application progressively slows down, eventually grinding to a halt, with what I suspect is high CPU load, or, as #wil-s says, as if a thread is spinning and consuming CPU. The only conclusion I've reached so far is that either we have a rogue thread in our code, or there's an under-the-covers issue in .NET CF, maybe with the JITter.
Closing the application and re-launching returns our application to normal expected performance.
I have yet to implement a code change to test issuing WM_HIBERNATE, or launching a dummy app that quits itself (as above) to force a code pitch, but based on the above comments I am fairly sure this will resolve our issue (so many thanks for that).
However, I'm also interested to know whether a root cause was ever found for this specific issue.
Incidentally, and seemingly off topic (but bear with me), we're using a Freescale i.MX28 processor and are experiencing unpredictable FlashDisk corruption: we see 2K blocks of 0xFFs (erased blocks) in random files located on NAND flash.
I'm mentioning this as I now believe the CPU and FlashDisk corruption issues are linked, due to this article as well as this one:
https://electronics.stackexchange.com/questions/26720/flash-memory-corruption-due-to-electricals
In the article, #jwygralak67 comments:
I recently worked through a flash corruption issue, on a WinCE system, as part of a development team. We would sporadically find 2K blocks of flash that were erased (all bytes 0xFF). For about 6 months we tested everything from ESD, to dirty power, to EMI and RFI interference; we bought brand new devices and tracked usage to make sure we weren't exceeding the erase-cycle limit and burning out blocks; we went through our (application-level) software with a fine-toothed comb.
In the end it turned out to be an obscure bug in the very low-level flash driver code, which only occurred under periods of heavy CPU load. The driver came from a 3rd party. We informed them of the issue we found, but I don't know if they ever released a patch.
Unfortunately, we're yet to make contact with him.
With all of this in mind, if we can work around the high CPU load, the corruption may potentially no longer be present. Another conjecture, admittedly!
Even on that assumption, however, we still have no firm root cause for either problem, which is what I'm desperately seeking!
Any knowledge or insight, however small, would be very gratefully received.
#ctacke - we've spoken before regarding OpenNETCF via email, so I'm pleased to see your name!

A Linux Kernel Module for Self-Optimizing Hard Drives: Advice?

I am a computer engineering student studying Linux kernel development. My 4-man team was tasked to propose a kernel development project (to be implemented in 6 weeks), and we came up with a tentative "Self-Optimizing Hard Disk Drive Linux Kernel Module". I'm not sure if that title makes sense to the pros.
We based the proposal on this project.
The goal of the project is to minimize hard disk access times. The plan is to create a special partition where the "most commonly used" files are to be placed. An LKM will profile, analyze, plan, and redirect I/O operations to the hard disk. This LKM should primarily be able to predict and redirect all file access (on files with sizes of < 10 MB) with minimal overhead, and lessen average read/write access times to the hard disk. I believe Apple's HFS has this feature.
Can anybody suggest a starting point? I recently found a way to redirect I/O operations by intercepting system calls (by hijacking all the read/write ones). However, I'm not convinced that this is the best way to go. Is there a way to write a driver that redirects these read/write operations? Can we perhaps tap into the read/write cache to achieve the same effect?
Any feedback at all is appreciated.
You may want to take a look at Unionfs. You don't even need an LKM - just a user-space daemon that subscribes to inotify events, keeps statistics, and migrates files between partitions. Unionfs will combine both partitions into a single logical filesystem.
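To make that concrete, here is a minimal C sketch of the daemon's event loop: it subscribes to inotify events for one directory (taken from the command line) and just prints them; the statistics and the migration between partitions are left out:

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[4096];
    int fd;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <dir>\n", argv[0]);
        return 1;
    }
    fd = inotify_init();
    if (fd < 0 || inotify_add_watch(fd, argv[1], IN_OPEN | IN_ACCESS) < 0) {
        perror("inotify");
        return 1;
    }
    for (;;) {                         /* event loop: count accesses here */
        ssize_t i = 0, len = read(fd, buf, sizeof buf);
        if (len <= 0)
            break;
        while (i < len) {
            struct inotify_event *ev = (struct inotify_event *)(buf + i);
            if (ev->len)
                printf("mask 0x%x on %s\n", (unsigned)ev->mask, ev->name);
            i += (ssize_t)(sizeof *ev + ev->len);
        }
    }
    return 0;
}

From here, the daemon would aggregate access counts per file and move hot files to the fast partition underneath Unionfs.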
There are many ways in which such optimizations might be useful:
accessing file A implies file B access is imminent. Example: opening an icon file for a media file by a media player
accessing any file in some group G of files means that other files in the group will be accessed shortly. Example: MySQL receives a use somedb command, which implies that all of that database's table files, indexes, etc. will be accessed.
a program which stops reading a sequential file suggests the program has stalled or exited, so predictions of future accesses associated with that file should be abandoned.
having multiple (yet transparent) copies of some frequently referenced files strategically sprinkled about can use the copy nearest the disk heads. Example: uncached directories or small, frequently accessed settings files.
There are so many possibilities that I think at least 50% of an efficient solution would be a sensible, limited specification of which features you will attempt to implement and which you won't. It might be valuable to study how Microsoft Vista's aggressive file-caching mechanism disappointed.
Another problem you might encounter with a modern Linux distribution is how well the system already does much of what you plan to improve. In fact, measuring the improvement might be a big challenge. I suggest writing a benchmark program which opens and reads a series of files and precisely times the complete sequence. Run it several times with your improvements enabled and disabled. But you'll have to reboot in between runs for valid timings, since the page cache would otherwise skew them.
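A rough C sketch of such a benchmark (the file list is a placeholder; CLOCK_MONOTONIC keeps clock adjustments out of the measurement):

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* placeholder file list; use the same list for both runs */
    const char *files[] = { "/data/a.bin", "/data/b.bin" };
    char buf[65536];
    struct timespec t0, t1;
    size_t i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < sizeof files / sizeof files[0]; i++) {
        FILE *f = fopen(files[i], "rb");
        if (!f)
            continue;
        while (fread(buf, 1, sizeof buf, f) == sizeof buf)
            ;                          /* drain the file */
        fclose(f);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("elapsed: %.3f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}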

Windows Stalls When My Program Uses Swapfile

I am running a user-mode program at normal priority. My program is searching an NP problem and, as a result, uses up a lot of memory, which eventually ends up in the swap file.
Then my mouse freezes up, and it takes forever for task manager to open up and let me end the process.
What I want to know is how I can stop my Windows operating system from completely locking up from this, even though only one of my two cores is being used.
Edit:
Thanks for the replies.
I know that making it use less memory will help, but it just doesn't make sense to me that the whole OS should lock up.
The obvious answer is "use less memory". When your app uses up all the available memory, the OS has to page the task manager (etc.) out to make room for your app. When you switch programs, the OS has to page the other programs back in (as they are needed).
Disk reads are slower than memory reads, so everything appears to be going slower.
If you want to avoid this, have your app manage its own memory, or use a better algorithm than brute force. (There are genetic algorithms, simulated annealing, etc.)
The problem is that when another program (e.g. explorer.exe) is going to execute, all of its code and memory has been swapped out. To make room for the other program Windows has to first write data that your program is using to disk, then load up the other program's memory. Every new page of code that is executed in the other program requires disk access, causing it to run slowly.
I don't know the access pattern of your program, but I'm guessing it touches all of its memory pages a lot in a random fashion, which makes the problem worse because as soon as Windows evicts a memory page from your program, suddenly you need it again and Windows has to find some other page to give the same treatment.
To give other processes more RAM to live in, you can use SetProcessWorkingSetSize to reduce the maximum amount of RAM that your program may use. Of course this will make your program run more slowly because it has to do more swapping.
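For reference, a minimal C sketch of that call (the 4 MB minimum / 32 MB maximum are illustrative values of mine, not recommendations):

#include <windows.h>

/* Cap how much physical RAM this process keeps resident; pages beyond
   the cap become candidates for eviction to the pagefile, leaving more
   RAM for everything else. */
void cap_working_set(void)
{
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             4 * 1024 * 1024,    /* minimum working set */
                             32 * 1024 * 1024);  /* maximum working set */
}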
Another alternative you could try is to add more drives to the system, and distribute the swap files over those. You may have a dual-core CPU, but you have only a single drive. Distributing the swap file over multiple drives allows Windows to balance work across them (although I don't have first-hand experience of how well it does this).
I don't think there's a programming answer to this question, aside from "restructure your app to use less memory." The swapfile problem is most likely due to the bottleneck in accessing the disk, especially if you're using an IDE HDD or a highly fragmented swapfile.
It's a bit extreme, but you could always minimise your swap file so you don't have all the disk thrashing, and your program isn't allowed to allocate much virtual memory. Under Control Panel / Advanced / Advanced tab / Performance / Virtual memory, set the page file to a custom size and enter a value of 2 MB (the smallest allowed on XP). When an allocation fails, you should get an exception and be able to exit gracefully. It doesn't quite fix your problem, it just speeds it up ;)
Another thing worth considering: if you are on a 32-bit platform, port to a 64-bit system and get a box with much more addressable RAM.
