XSLT with Xalan vs. STX with Joost - performance

Where can I find performance metrics (memory/time) for a non-trivial example of using XSLT (with Xalan) compared to using STX (with Joost)

Probably there is no universal set of benchmarks. For XSLT there is (was?) XSLTMark, but this is for comparing the XSLT engines.
There is one page with the comparison of the same transformation written in different transformation languages.
Probably the best option is to model your problem, generate test data and measure the things you are interested in.

I agree in that real answers are best obtained by writing your own benchmark.
For what it's worth, my recollection is that many developers had high hopes for STX to be much faster than XSLT processors; but found the actual performance of implementations to fall short on expectations. Part of the reason may be that XSLT processor implementations are ridiculously well optimized by now, and thus can handle simple transformations very efficiently, all things considered. As such, STX implementations would also need to spend time honing implementation to same degree, to produce significant speed improvements for common transformations.

You really should use your own benchmark to cover the things you use.
But here's one data point, ( http://www.kindle-maps.com/blog/some-performance-information-on-joost-stx.html ), a 1.3GB XML file (from OpenStreetMap data), 1,800,000ish nodes were processed with a simple STX template in 3 minutes on a low end laptop.

Related

Efficiency of Streams in scheme

As far as I learned using streams in large programs are way more efficient than using normal lisp in DrRacket.So why not the default evaluation is lazy evaluation in DrRacket?I wrote and put a timer procedure which calculates the time the work needed to be complete and in every complex program lazy eval was a lot faster.
AFAIK using streams when you are doing something like sorting is a waste of cycles since you need to be finished with the sort in order to know the first element. If you have tasks that work like a sort so that you'll need to evaluate a whole set you'll end up using more time than without streams. The reason for that is that the whole stream system has a cost as well as benefits.
The benefits of streams are the fact that you can do calculations in parallel so that the program doesn't need to do a whole loop before processing the first element. If you have n layers of processing streams you'll benefit when your program quits and all the other layers hasn't served you the whole thing yet.
DrRacket is not a language but an IDE. Racket is both a language (#!racket as first line of source) and the name of the implementation that implements it.
Racket supports #!lazy which is a lazy version of Racket. Basically everything works just like streams do, everywhere. You'll have the same benefits and cost.
None of the mentioned languages are Scheme, but #!racket was based on and was a superset of #!r5rs. Since then you have #!r6rs and the new #!r7rs. None of the official Scheme reports are lazy. The reason is that its predecessor was eager and making it lazy would completely change the language and ruin all backwards compatibility.
The innovation of Scheme in 1975 was lexical closures. The creators made lazy evaluation by need in an later report (by implementing delay and force). Other languages, like Haskell, are built to be lazy from the ground and they have a more advanced compiler to constant fold and make its code snappy.

What XSLT processor is better to use for small parallel transformations

I need to execute lots (abot 100-200) of parallel XML transformation in a cycle. Average XML size is about 5Kb. Relation of used XSL to XML to be parsed is about 0.25, that means that some XMLs are sharing the same XSL. XSL here could be cached, but unfortunately XSL is not stored in File, and generated dynamically within an application.
So what XSLT processor suits better for my case?
p.s. language - Java, transformation final result type - String
thanks in advance
I'm not sure I have fully understood the question, but I'll try.
Firstly, there seem to be two separate questions, which I think are completely orthogonal. First is the question of how to cache your stylesheets: if you are running a transformation more than once then it's a good idea to compile it once and use it repeatedly. In general this isn't a problem, unless there isn't enough memory to cache all the stylesheets in which case you need some kind of LRU strategy.
The second question is your choice of XSLT processor. I think you can use the same caching architecture whichever processor you choose, so this question won't constrain your choice. In the Java world the main free/open-source processors are Xalan (with versions from Apache and embedded in the JDK) and Saxon-HE; since Saxon-HE supports XSLT 2.0 and is usually faster there is little competition. Among paid-for processors the main contenders are IBM's Websphere processor and Saxon-EE; they should both handle this workload with ease, but you're unlikely to consider the IBM product unless you're making a substantial investment in IBM middleware.
(I won't try to hide that I'm the developer of the Saxon product...)

Examples where compiler-optimized functional code performs better than imperative code

One of the promises of side-effect free, referentially transparent functional programming is that such code can be extensively optimized.
To quote Wikipedia:
Immutability of data can, in many cases, lead to execution efficiency, by allowing the compiler to make assumptions that are unsafe in an imperative language, thus increasing opportunities for inline expansion.
I'd like to see examples where a functional language compiler outperforms an imperative one by producing a better optimized code.
Edit: I tried to give a specific scenario, but apparently it wasn't a good idea. So I'll try to explain it in a different way.
Programmers translate ideas (algorithms) into languages that machines can understand. At the same time, one of the most important aspects of the translation is that also humans can understand the resulting code. Unfortunately, in many cases there is a trade-off: A concise, readable code suffers from slow performance and needs to be manually optimized. This is error-prone, time consuming, and it makes the code less readable (up to totally unreadable).
The foundations of functional languages, such as immutability and referential transparency, allow compilers to perform extensive optimizations, which could replace manual optimization of code and free programmers from this trade-off. I'm looking for examples of ideas (algorithms) and their implementations, such that:
the (functional) implementation is close to the original idea and is easy to understand,
it is extensively optimized by the compiler of the language, and
it is hard (or impossible) to write similarly efficient code in an imperative language without manual optimizations that reduce its conciseness and readability.
I apologize if it is a bit vague, but I hope the idea is clear. I don't want to give unnecessary restrictions on the answers. I'm open to suggestions if someone knows how to express it better.
My interest isn't just theoretical. I'd like to use such examples (among other things) to motivate students to get interested in functional programming.
At first, I wasn't satisfied by a few examples suggested in the comments. On second thoughts I take my objections back, those are good examples. Please feel free to expand them to full answers so that people can comment and vote for them.
(One class of such examples will be most likely parallelized code, which can take advantage of multiple CPU cores. Often in functional languages this can be done easily without sacrificing code simplicity (like in Haskell by adding par or pseq in appropriate places). I' be interested in such examples too, but also in other, non-parallel ones.)
There are cases where the same algorithm will optimize better in a pure context. Specifically, stream fusion allows an algorithm that consists of a sequence of loops that may be of widely varying form: maps, filters, folds, unfolds, to be composed into a single loop.
The equivalent optimization in a conventional imperative setting, with mutable data in loops, would have to achieve a full effect analysis, which no one does.
So at least for the class of algorithms that are implemented as pipelines of ana- and catamorphisms on sequences, you can guarantee optimization results that are not possible in an imperative setting.
A very recent paper Haskell beats C using generalised stream fusion by Geoff Mainland, Simon Peyton Jones, Simon Marlow, Roman Leshchinskiy (submitted to ICFP 2013) describes such an example. Abstract (with the interesting part in bold):
Stream fusion [6] is a powerful technique for automatically transforming
high-level sequence-processing functions into efficient implementations.
It has been used to great effect in Haskell libraries
for manipulating byte arrays, Unicode text, and unboxed vectors.
However, some operations, like vector append, still do not perform
well within the standard stream fusion framework. Others,
like SIMD computation using the SSE and AVX instructions available
on modern x86 chips, do not seem to fit in the framework at
all.
In this paper we introduce generalized stream fusion, which
solves these issues. The key insight is to bundle together multiple
stream representations, each tuned for a particular class of stream
consumer. We also describe a stream representation suited for efficient
computation with SSE instructions. Our ideas are implemented
in modified versions of the GHC compiler and vector library.
Benchmarks show that high-level Haskell code written using
our compiler and libraries can produce code that is faster than both
compiler- and hand-vectorized C.
This is just a note, not an answer: the gcc has a pure attribute suggesting it can take account of purity; the obvious reasons are remarked on in the manual here.
I would think that 'static single assignment' imposes a form of purity -- see the links at http://lambda-the-ultimate.org/node/2860 or the wikipedia article.
make and various build systems perform better for large projects by assuming that various build steps are referentially transparent; as such, they only need to rerun steps that have had their inputs change.
For small to medium sized changes, this can be a lot faster than building from scratch.

Computationally intensive algorithms

I am planning to write a bunch of programs on computationally intensive algorithms. The programs would serve as an indicator of different compiler/hardware performance.
I would want to pick up some common set of algorithms which are used in different fields, like Bioinformatics, Gaming, Image Processing, et al. The reason I want to do this would be to learn the algorithms and have a personal mini benchmark suit that would be small | useful | easy to maintain.
Any advice on algorithm selection would be tremendously helpful.
Benchmarks are not worthy of your attention!
The very best guide is to processor performance is: http://www.agner.org/optimize/
And then someone will chuck it in a box with 3GB of RAM defeating dual-channel hopes and your beautifully tuned benchmark will give widely different results again.
If you have a piece of performance critical code and you are sure you've picked the winning algorithm you can't then go use a generic benchmark to determine the best compiler. You have to actually compile your specific piece of code with each compiler and benchmark them with that. And the results you gain, whilst useful for you, will not extrapolate to others.
Case in point: people who make compression software - like zip and 7zip and the high-end stuff like PPMs and context-mixing and things - are very careful about performance and benchmark their programs. They hang out on www.encode.ru
And the situation is this: for engineers developing the same basic algorithm - say LZ or entropy coding like arithmetic-coding and huffman - the engineers all find very different compilers are better.
That is to say, two engineers solving the same problem with the same high-level algorithm will each benchmark their implementation and have results recommending different compilers...
(I have seen the same thing repeat itself repeatedly in competition programming e.g. Al Zimmermann's Programming Contests which is an equally performance-attentive community.)
(The newer GCC 4.x series is very good all round, but that's just my data-point, others still favour ICC)
(Platform benchmarks for IO-related tasks is altogether another thing; people don't appreciate how differently Linux, Windows and FreeBSD (and the rest) perform when under stress. And benchmarks there - on the same workload, same machine, different machines or different core counts - would be very generally informative. There aren't enough benchmarks like that about sadly.)
There was some work done at Berkeley about this a few years ago. The identified 13 common application paterns for parallel programming, the "13 Dwarves". The include things like linear algebra, n-body models, FFTs etc
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
See page 10 onwards.
There are some sample .NET implementations here:
http://paralleldwarfs.codeplex.com/
The typical one is Fast Fourier Transform, perhaps you can also do something like the Lucas–Lehmer primality test.
I remember a guy who tested computational performance of machines, compiler versions by inverting Hilbert matrices.
For image processing, median filtering (used for removing noise, bad pixels) is always too slow. It might make a good test, given a large enough image say 1000x1000.

Is there a functional algorithm which is faster than an imperative one?

I'm searching for an algorithm (or an argument of such an algorithm) in functional style which is faster than an imperative one.
I like functional code because it's expressive and mostly easier to read than it's imperative pendants. But I also know that this expressiveness can cost runtime overhead. Not always due to techniques like tail recursion - but often they are slower.
While programming I don't think about runtime costs of functional code because nowadays PCs are very fast and development time is more expensive than runtime. Furthermore for me readability is more important than performance. Nevertheless my programs are fast enough so I rarely need to solve a problem in an imperative way.
There are some algorithms which in practice should be implemented in an imperative style (like sorting algorithms) otherwise in most cases they are too slow or requires lots of memory.
In contrast due to techniques like pattern matching a whole program like a parser written in an functional language may be much faster than one written in an imperative language because of the possibility of compilers to optimize the code.
But are there any algorithms which are faster in a functional style or are there possibilities to setting up arguments of such an algorithm?
A simple reasoning. I don't vouch for terminology, but it seems to make sense.
A functional program, to be executed, will need to be transformed into some set of machine instructions.
All machines (I've heard of) are imperative.
Thus, for every functional program, there's an imperative program (roughly speaking, in assembler language), equivalent to it.
So, you'll probably have to be satisfied with 'expressiveness', until we get 'functional computers'.
The short answer:
Anything that can be easily made parallel because it's free of side-effects will be quicker on a multi-core processor.
QuickSort, for example, scales up quite nicely when used with immutable collections: http://en.wikipedia.org/wiki/Quicksort#Parallelization
All else being equal, if you have two algorithms that can reasonably be described as equivalent, except that one uses pure functions on immutable data, while the second relies on in-place mutations, then the first algorithm will scale up to multiple cores with ease.
It may even be the case that your programming language can perform this optimization for you, as with the scalaCL plugin that will compile code to run on your GPU. (I'm wondering now if SIMD instructions make this a "functional" processor)
So given parallel hardware, the first algorithm will perform better, and the more cores you have, the bigger the difference will be.
FWIW there are Purely functional data structures, which benefit from functional programming.
There's also a nice book on Purely Functional Data Structures by Chris Okasaki, which presents data structures from the point of view of functional languages.
Another interesting article Announcing Intel Concurrent Collections for Haskell 0.1, about parallel programming, they note:
Well, it happens that the CnC notion
of a step is a pure function. A step
does nothing but read its inputs and
produce tags and items as output. This
design was chosen to bring CnC to that
elusive but wonderful place called
deterministic parallelism. The
decision had nothing to do with
language preferences. (And indeed, the
primary CnC implementations are for
C++ and Java.)
Yet what a great match Haskell and CnC
would make! Haskell is the only major
language where we can (1) enforce that
steps be pure, and (2) directly
recognize (and leverage!) the fact
that both steps and graph executions
are pure.
Add to that the fact that Haskell is
wonderfully extensible and thus the
CnC "library" can feel almost like a
domain-specific language.
It doesn't say about performance – they promise to discuss some of the implementation details and performance in future posts, – but Haskell with its "pureness" fits nicely into parallel programming.
One could argue that all programs boil down to machine code.
So, if I dis-assemble the machine code (of an imperative program) and tweak the assembler, I could perhaps end up with a faster program. Or I could come up with an "assembler algorithm" that exploits some specific CPU feature, and therefor it really is faster than the imperative language version.
Does this situation lead to the conclusion that we should use assembler everywhere? No, we decided to use imperative languages because they are less cumbersome. We write pieces in assembler because we really need to.
Ideally we should also use FP algorithms because they are less cumbersome to code, and use imperative code when we really need to.
Well, I guess you meant to ask if there is an implementation of an algorithm in functional programming language that is faster than another implementation of the same algorithm but in an imperative language. By "faster" I mean that it performs better in terms of execution time or memory footprint on some inputs according to some measurement that we deem trustworthy.
I do not exclude this possibility. :)
To elaborate on Yasir Arsanukaev's answer, purely functional data structures can be faster than mutable data structures in some situations becuase they share pieces of their structure. Thus in places where you might have to copy a whole array or list in an imperative language, where you can get away with a fraction of the copying because you can change (and copy) only a small part of the data structure. Lists in functional languages are like this -- multiple lists can share the same tail since nothing can be modified. (This can be done in imperative languages, but usually isn't, because within the imperative paradigm, people aren't usually used to talking about immutable data.)
Also, lazy evaluation in functional languages (particularly Haskell which is lazy by default) can also be very advantageous because it can eliminate code execution when the code's results won't actually be used. (One can be very careful not to run this code in the first place in imperative languages, however.)

Resources