Why are functional languages good for big data? - performance

I'm currently working at a bank, using Q (kdb+, K, whatever it's called). I know that it is a functional language, and I also know that a lot of organizations use functional languages to deal with large data sets.
I wonder why functional languages (functional programming) are good for big data. Is it because of the way they compile the code, or are there other reasons?
Also, if the idea is wrong, can anyone explain why it's wrong?
ps: If there are similar questions, forgive me :P

One of the reasons is that immutable variables let you execute code in parallel and scale very easily.

Zdravko is right about immutable state making concurrency easier and less prone to race-condition-style bugs. However, that helps only with multi-threaded concurrency. When you talk about big data, you are talking about horizontally scaled cluster computing, and there is not much built-in support for that in functional programming languages.
There is something about FP that has captured the imagination of developers with Big Data dreams. Maybe it has something to do with FP's stream-oriented higher-order functions, which let you think in terms of processing data streams. With FP, you solve problems using a vocabulary of operations such as union, intersection, difference, map, flatmap, and reduce.
But FP alone won't work in a distributed computing environment. At OSCON 2014, I learned about some open source projects that integrate FP languages with Hadoop. See Functional Programming and Big Data for a comparative evaluation of three such projects getting traction these days: Netflix PigPen, Cascalog, and Apache Spark.
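To make the map / flatmap / reduce vocabulary above a bit more concrete, here is a minimal sketch in plain Python (not q/kdb+, and not the API of any of the projects named). The trade records and the helper names `flat_map`, `add_trade`, `merge` and `volume_by_symbol` are made up for illustration. Because every step is a pure function that never mutates its input, partial results can be computed per batch and merged in any grouping, which is the property cluster frameworks lean on.

```python
from functools import reduce
from itertools import chain

# Hypothetical input: each batch is a tuple of (symbol, quantity) trades.
BATCHES = (
    (("AAPL", 100), ("MSFT", 50)),
    (("AAPL", 25),),
    (("GOOG", 10), ("MSFT", 5)),
)


def flat_map(fn, items):
    """Apply fn to every item and flatten the resulting iterables by one level."""
    return chain.from_iterable(map(fn, items))


def add_trade(totals, trade):
    """Pure reducer: returns a new dict instead of mutating its argument."""
    symbol, qty = trade
    return {**totals, symbol: totals.get(symbol, 0) + qty}


def merge(left, right):
    """Combine two partial results; associative, so grouping does not matter."""
    return reduce(add_trade, right.items(), left)


def volume_by_symbol(batches):
    """Total quantity per symbol across all batches."""
    return reduce(add_trade, flat_map(lambda batch: batch, batches), {})


# Sequential result over the whole data set.
sequential = volume_by_symbol(BATCHES)

# Same result computed per batch (as separate workers could) and then merged.
partials = [volume_by_symbol((batch,)) for batch in BATCHES]
combined = reduce(merge, partials, {})
```

Both `sequential` and `combined` hold the same totals; the second form is the one that could be spread over several machines.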

Related

Examples where compiler-optimized functional code performs better than imperative code

One of the promises of side-effect free, referentially transparent functional programming is that such code can be extensively optimized.
To quote Wikipedia:
Immutability of data can, in many cases, lead to execution efficiency, by allowing the compiler to make assumptions that are unsafe in an imperative language, thus increasing opportunities for inline expansion.
I'd like to see examples where a functional language compiler outperforms an imperative one by producing better-optimized code.
Edit: I tried to give a specific scenario, but apparently it wasn't a good idea. So I'll try to explain it in a different way.
Programmers translate ideas (algorithms) into languages that machines can understand. At the same time, one of the most important aspects of the translation is that humans can also understand the resulting code. Unfortunately, in many cases there is a trade-off: concise, readable code suffers from slow performance and needs to be manually optimized. This is error-prone and time-consuming, and it makes the code less readable (up to totally unreadable).
The foundations of functional languages, such as immutability and referential transparency, allow compilers to perform extensive optimizations, which could replace manual optimization of code and free programmers from this trade-off. I'm looking for examples of ideas (algorithms) and their implementations, such that:
the (functional) implementation is close to the original idea and is easy to understand,
it is extensively optimized by the compiler of the language, and
it is hard (or impossible) to write similarly efficient code in an imperative language without manual optimizations that reduce its conciseness and readability.
I apologize if it is a bit vague, but I hope the idea is clear. I don't want to give unnecessary restrictions on the answers. I'm open to suggestions if someone knows how to express it better.
My interest isn't just theoretical. I'd like to use such examples (among other things) to motivate students to get interested in functional programming.
At first, I wasn't satisfied by a few examples suggested in the comments. On second thoughts I take my objections back, those are good examples. Please feel free to expand them to full answers so that people can comment and vote for them.
(One class of such examples will most likely be parallelized code, which can take advantage of multiple CPU cores. Often in functional languages this can be done easily without sacrificing code simplicity (like in Haskell by adding par or pseq in appropriate places). I'd be interested in such examples too, but also in other, non-parallel ones.)
There are cases where the same algorithm will optimize better in a pure context. Specifically, stream fusion allows an algorithm that consists of a sequence of loops that may be of widely varying form: maps, filters, folds, unfolds, to be composed into a single loop.
The equivalent optimization in a conventional imperative setting, with mutable data in loops, would have to achieve a full effect analysis, which no one does.
So at least for the class of algorithms that are implemented as pipelines of ana- and catamorphisms on sequences, you can guarantee optimization results that are not possible in an imperative setting.
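As a rough illustration of what fusion buys (written in Python, which performs no such optimization itself): the first function below is the pipeline as one would naturally write it, with one pass and one intermediate list per stage, and the second is the single loop that a fusing compiler such as GHC could derive from it. The function names are made up; in Haskell the programmer would write only the first form.

```python
def sum_even_squares_pipeline(xs):
    """Three separate passes, each stage building or consuming a list."""
    squared = [x * x for x in xs]                 # map
    evens = [x for x in squared if x % 2 == 0]    # filter
    total = 0
    for x in evens:                               # fold
        total += x
    return total


def sum_even_squares_fused(xs):
    """The same computation as one loop with no intermediate lists."""
    total = 0
    for x in xs:
        sq = x * x
        if sq % 2 == 0:
            total += sq
    return total
```

The rewrite is only valid because the stages have no side effects. If the mapped function mutated shared state, interleaving the stages could change the result, which is why an imperative compiler would need the effect analysis mentioned above.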
A very recent paper Haskell beats C using generalised stream fusion by Geoff Mainland, Simon Peyton Jones, Simon Marlow, Roman Leshchinskiy (submitted to ICFP 2013) describes such an example. Abstract (with the interesting part in bold):
Stream fusion [6] is a powerful technique for automatically transforming
high-level sequence-processing functions into efficient implementations.
It has been used to great effect in Haskell libraries
for manipulating byte arrays, Unicode text, and unboxed vectors.
However, some operations, like vector append, still do not perform
well within the standard stream fusion framework. Others,
like SIMD computation using the SSE and AVX instructions available
on modern x86 chips, do not seem to fit in the framework at
all.
In this paper we introduce generalized stream fusion, which
solves these issues. The key insight is to bundle together multiple
stream representations, each tuned for a particular class of stream
consumer. We also describe a stream representation suited for efficient
computation with SSE instructions. Our ideas are implemented
in modified versions of the GHC compiler and vector library.
Benchmarks show that high-level Haskell code written using
our compiler and libraries can produce code that is faster than both
compiler- and hand-vectorized C.
This is just a note, not an answer: gcc has a pure attribute, which suggests it can take advantage of purity; the obvious reasons are remarked on in the manual.
I would think that 'static single assignment' imposes a form of purity -- see the links at http://lambda-the-ultimate.org/node/2860 or the wikipedia article.
make and various build systems perform better for large projects by assuming that various build steps are referentially transparent; as such, they only need to rerun steps that have had their inputs change.
For small to medium sized changes, this can be a lot faster than building from scratch.
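A minimal sketch of that idea, assuming build steps are pure functions of their inputs. Note that make itself compares file timestamps; this sketch swaps in a content hash as the cache key, and `StepCache` and `compile_unit` are hypothetical names, not part of any build tool.

```python
import hashlib


class StepCache:
    """Re-runs a build step only when its inputs have changed."""

    def __init__(self):
        self._results = {}
        self.runs = 0  # how many times a step was actually executed

    def run(self, step, *inputs):
        # The key depends only on the step and its inputs, which is safe
        # exactly because the step is assumed to be referentially transparent.
        digest = hashlib.sha256(repr(inputs).encode()).hexdigest()
        key = (step.__name__, digest)
        if key not in self._results:
            self.runs += 1
            self._results[key] = step(*inputs)
        return self._results[key]


def compile_unit(source):
    """Stand-in for an expensive step with no side effects."""
    return source.upper()


cache = StepCache()
first = cache.run(compile_unit, "int main(){}")
second = cache.run(compile_unit, "int main(){}")            # unchanged: cached
third = cache.run(compile_unit, "int main(){return 0;}")    # changed: re-run
```

After these three calls the step has executed only twice.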

most suitable language for computationally and memory expensive algorithms

Let's say you have to implement a tool to efficiently solve an NP-hard problem, with unavoidable possible explosion of memory usage (the output size in some cases exponential to the input size) and you are particularly concerned about the performances of this tool at running time. The source code has also to be readable and understandable once the underlying theory is known, and this requirement is as important as the efficiency of the tool itself.
I personally think that 3 languages could be suitable for these three requirements: c++, scala, java.
They all provide the right abstraction on data types that makes it possible to compare different structures or apply the same algorithms (which is also important) to different data types.
C++ has the advantage of being statically compiled and optimized, and with function inlining (if the data structures and algorithms are designed carefully) and other optimisation techniques it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability.
If you also put a lot of care in data representation you can optimise the cache performance, which can gain orders of magnitude in speed when the cache miss rate is low.
Java is instead JIT-compiled, which allows optimisations to be applied at runtime, and for this category of algorithms, which can behave differently from run to run, that may be a plus. I fear that such an approach could suffer from the garbage collector; however, this algorithm allocates memory continuously, Java heap allocation is reputedly faster than C/C++ allocation, and if you implement your own memory manager inside the language you could even achieve good efficiency.
This approach instead is not able to inline method invocation (which induces a huge performance penalty) and doesn't give you control over the cache performance. Among the pros there's a better and cleaner syntax than C++.
My concerns about Scala are more or less the same as for Java, plus the fact that I can't control how the language is optimised unless I have a deep knowledge of the compiler and the standard library. But well: I get a very clean syntax :)
What's your take on the subject? Have you had to deal with this already? Would you implement an algorithm with such properties and requirements in any of these languages or would you suggest something else? How would you compare them?
Usually I’d say “C++” in a heartbeat. The secret being that C++ simply produces less (memory) garbage that needs managing.
On the other hand, your observation that
however in the case of this algorithm it's common to continuously allocate memory
is a hint that Java / Scala may actually be more suited. But then you could use a small object heap in C++ as well. Boost has one that uses the standard allocator interface, if memory serves.
Another advantage of C++ is obviously the use of abstraction without penalty through templates – i.e. that you can easily create generic algorithmic components that can interact without incurring a runtime overhead due to abstraction. In fact, you noted that
it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability
– this is looking at things the wrong way: Templates allow C++ to achieve performance superior to that of C while still maintaining high abstraction.
D might be worth a look, seeing as how it tries to be a better C++.
From a superficial glance, it has better source code readability than C++ does, so that's one of your points covered.
It also has memory management, which makes playing with algorithms a bit easier.
And templates
Here is a stackoverflow discussion comparing the performance of C++ and D
The languages you noticed were my first guesses as well.
Each language has a different take on how to handle specific issues like compilation, memory management and source code, but in theory, any of them should be fitting to your problem.
It is impossible to tell which is best, and there is likely no major difference if you are familiar enough with all of them to work around their respective quirks.
And obviously, if you actually find the need to optimize (I'm not sure if that's a given), that's possible in each language. Lower level languages obviously offer more options, but are also (far) more complex to actually improve.
A single note about C++ vs Java: This is really a holy war, and if you've followed the recent development you'll probably have your own opinion. I, for one, think Java offers enough good aspects to make up for its flaws, usually.
And a final note on C++ vs C: To my knowledge, the difference usually amounts to a percentage low enough to ignore. If it doesn't make a difference for the source code, it's fine to go with C; if C++ could make for easier-to-read source code, go with C++. In any case, the choice is kind of negligible.
In the end, remember that money spent on a few hours of programming/optimizing this could as well go into slightly superior hardware to make up for missed tiny details.
It all boils down to: Any of your options is fine as long as you do it right (domain knowledge).
I would use a language which makes it very easy to work on the algorithm. Get the algorithm right and it could very easily outweigh any advantage from fine-tuning the wrong algorithm. Don't be scared to play around in a language normally thought of as slow in execution speed if that language makes it easier to express algorithmic ideas. It is usually much easier to transcribe the right algorithm into another language than it is to eke out the last dregs of speed from the wrong algorithm in the fastest executing language.
So do it in a language you are comfortable with and which is expressive. You might surprise yourself and find that what is produced is fast enough!

Getting started with massive data

I'm a mathematician and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from?
Hadoop/MapReduce is one obvious start.
Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)
I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?
Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.
Any other suggestions on things to learn would be great!
Apache Hadoop is indeed a good start, because it's free, has a large community and is easy to set up.
Hadoop is built in Java, so that can be the language of choice. But it is possible to use other languages with Hadoop as well ("pipes" and "streaming"). I know that Python is often used, for example.
You can avoid keeping your data in databases if you like. Originally, Hadoop works with data on the (distributed) file system. But as you already seem to know, there are distributed databases available for Hadoop.
Have you ever had a look at Mahout? I think that would be a hit for you ;-) Much of the work you need may already have been done!
Read the Quick Start and set up your own (pseudo-distributed?) cluster and run the word-count example.
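For orientation before running the real thing, here is a single-process imitation of the word-count job in plain Python. It uses no Hadoop API at all; `mapper`, `shuffle`, `reducer` and `word_count` are illustrative names that mirror the map, shuffle and reduce phases the framework would run across machines.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(pairs)
    return dict(reducer(word, counts) for word, counts in grouped.items())
```

For example, `word_count(["to be or", "not to be"])` counts "to" and "be" twice and the other words once.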
Let me know if you have any questions :-) A comment will remind me of this question.
I've done some large scale machine learning (3-5GB datasets), so here are some insights:
First, there are logistics issues at large scales. Can you load all your data into memory? With Java and a 64 bit JVM you can access as much RAM as you have: for example, command line parameter -Xmx8192M will give you access to 8GB (if you have that much). Matlab, being a Java application, can also benefit from this and work with fairly large datasets.
More important are the algorithms that you run on your data. Chances are that standard implementations will expect all of the data in memory. You might have to implement a working set approach yourself, where you swap data in and out to the disk, and only work on a portion of data at a time. These are sometimes referred to as chunking, batch or even incremental algorithms, depending on the context.
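A small sketch of the incremental idea, assuming the statistic you need can be updated one value at a time. It uses Welford's online update for mean and variance; `read_chunks` is a hypothetical stand-in for reading blocks from disk, so only one chunk needs to be in memory at once.

```python
def read_chunks(values, chunk_size):
    """Yield successive slices; stands in for reading blocks from disk."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def running_mean_variance(chunks):
    """Welford's online update: one pass, constant state between chunks."""
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)   # uses the already updated mean
    variance = m2 / count if count else 0.0   # population variance
    return count, mean, variance
```

Calling `running_mean_variance(read_chunks(data, 3))` gives the same answer as a computation over the full list, regardless of the chunk size chosen.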
You are right to suspect that a lot of algorithms do not practically scale, so you might have to go for an approximate solution. The good news is that for almost any algorithm you can find research papers that deal with approximation and/or discuss large scale solutions. The bad news is that you'll most likely have to implement those approaches yourself.
Hadoop is great, but can be a pain in the ass to set up. This is by far the best article I've read on Hadoop setup. I strongly recommend it:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Clojure is built on top of Java so it's unlikely that it's going to be any faster than Java. However, it is one of the few languages that does shared memory well, which may or may not be helpful. I'm not a math guy but it seems most math calculations are very parallelizable, with little need of threads sharing memory. Either way, you might want to check out Incanter, which is Clojure's statistical computing library, and clojure-hadoop, which makes writing Hadoop jobs a lot less painful.
In terms of languages, I find that the differences in performance end up being constant factors. It's far better to just find a language you enjoy and focus on improving your algorithms. However, according to some shootout cited by Peter Norvig (scroll down to the colorful table), you may want to shy away from Python and Perl due to their crappiness with arrays.
In a nutshell, NoSQL is great for unstructured/arbitrarily structured data while SQL/RDBMS is great (or at least tolerable) for structured data. Changing/adding fields is expensive in RDBMS so if that's going to happen a lot, you might want to shy away from them.
However, in your case, it seems like you're going to be batch processing a ton of data and then getting back an answer as opposed to having data around that you will periodically ask questions about? You could probably just process CSVs/text files in Hadoop. Unless you need a performant way of accessing arbitrary information about your data on the fly, I'm not sure either SQL or NoSQL would be useful.

OOP vs PP for algorithms

Which paradigm is better for design and analysis of algorithms?
Which is faster? I ask because I have a subject called Design and Analysis of Algorithms at university, and there is a time limit for programs. Is OOP slower than procedural programming? Or is the time difference not big?
Object-Oriented programming isn't particularly relevant to algorithms. Procedural programming you will need, but as far as algorithms are concerned, object-oriented programming is just another way to package up procedural programming. You have methods instead of functions and classes instead of records/structs, but the only relevant difference is run-time dispatch, and that's just a declarative way to handle a run-time decision that could have been handled some other way.
Object-Oriented programming is more relevant to the larger scale - design patterns etc - whereas algorithms are more relevant to the smaller scale involving a small number (often just one) of procedures.
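To illustrate the point about run-time dispatch, here is the same small computation written both ways in Python. The shapes example and all names in it are made up; the only difference between the two versions is whether the decision is an explicit branch on a tag or a method lookup on the object's class.

```python
import math


# Procedural style: plain records plus a function that branches on a tag.
def area_procedural(shape):
    if shape["kind"] == "circle":
        return math.pi * shape["radius"] ** 2
    if shape["kind"] == "rect":
        return shape["width"] * shape["height"]
    raise ValueError(f"unknown shape kind: {shape['kind']}")


# Object-oriented style: the same decision made by dispatch on the class.
class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


class Rect:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


records = [
    {"kind": "circle", "radius": 1.0},
    {"kind": "rect", "width": 2.0, "height": 3.0},
]
objects = [Circle(1.0), Rect(2.0, 3.0)]

total_procedural = sum(area_procedural(s) for s in records)
total_oo = sum(s.area() for s in objects)
```

Both totals are identical, and the algorithm (sum the areas) is the same in either packaging.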
IMO algorithms exist separately from the OO or PP issue.
Neither OO nor PP is 'slow', in either design time or program performance; they are different approaches.
I would think that Functional Programming would produce cleaner implementation of algorithms.
Having said that, you shouldn't see much of a difference whatever approach you take. An algorithm can be expressed in any language or development paradigm.
Update: (following comments)
Apparently functional programming does not lend itself to implementing algorithms as well as I thought it might. It has other strengths and I mostly mentioned it for completeness' sake, as the question only mentioned OOP (object oriented programming) and PP (procedural programming).
The weak link is likely to be your knowledge: which language and paradigm are you most comfortable with? Use that.
For design, analysis and development: definitely OOP. It was invented solely for the benefit of designers and developers.
For program runtime execution: sometimes PP is more efficient, but often OOP gets reduced to plain PP by the compiler, making them equivalent.
The difference (in execution time) is marginal at best.
Note that there is a more important factor than sheer performance: OOP provides the programmer with better means to organize code, which results in programs that are well structured, understandable, and more reliable (fewer bugs).
Object-oriented programming abstracts many low-level details from the programmer. It is designed with these goals:
to make it easier to write and read (and understand) programs,
to make programs look closer to the real world (and hence easier to understand).
Procedural programming does not have many abstractions like objects, methods, virtual functions etc.
So, talking about speed: a seasoned expert who knows the internals of how an object oriented system will work can write a program that runs just as fast.
That being said, the speed advantage achieved by using PP over OOP will be very marginal. It boils down to which way you can write programs comfortably.
EDIT:
An interesting anecdote comes to my mind: in the Microsoft Foundation Classes, message passing from one object to the other was implemented using macros that looked like BEGIN_MESSAGE_MAP() and END_MESSAGE_MAP(), and the reason was that it was faster than using virtual functions.
This is one case where the library developers have used OOP, but have knowingly sidestepped a performance bottleneck.
My guess is that the difference is not big enough to worry about, and the time limit should allow using a slower language, since the algorithm used would be what's important.
The purpose of the time limit should IMO be to get you to avoid using, for example, an O(n^3) algorithm when there is an O(n log n) one.
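One textbook problem with exactly that gap is the maximum-subarray sum. The sketch below, in Python and assuming a non-empty list of numbers, shows the brute-force triple loop next to the divide-and-conquer version; the function names are mine. Under a time limit, the choice between these two matters far more than OOP versus procedural style.

```python
def max_subarray_cubic(a):
    """Brute force: try every (i, j) and re-sum the slice, O(n^3)."""
    best = a[0]
    for i in range(len(a)):
        for j in range(i, len(a)):
            total = 0
            for k in range(i, j + 1):
                total += a[k]
            best = max(best, total)
    return best


def max_subarray_dc(a, lo=0, hi=None):
    """Divide and conquer, O(n log n): best of left, right, or crossing run."""
    if hi is None:
        hi = len(a) - 1
    if lo == hi:
        return a[lo]
    mid = (lo + hi) // 2

    # Best sum of a run that ends at mid and extends to the left.
    left_best, total = a[mid], 0
    for k in range(mid, lo - 1, -1):
        total += a[k]
        left_best = max(left_best, total)

    # Best sum of a run that starts at mid + 1 and extends to the right.
    right_best, total = a[mid + 1], 0
    for k in range(mid + 1, hi + 1):
        total += a[k]
        right_best = max(right_best, total)

    return max(max_subarray_dc(a, lo, mid),
               max_subarray_dc(a, mid + 1, hi),
               left_best + right_best)
```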
To make writing code easy and less error prone, you need a language that supports Generics - such as C++ with STL or Java with the Java Collections Framework. If you are implementing an algorithm against a deadline, you may be able to save time by not providing your algorithm with a nice O-O or Generic interface, so making the code you write yourself entirely procedural.
For run time efficiency, you would probably be best writing everything in procedural C - see e.g. the examples in "The Practice Of Programming" - but it will take a lot longer to write, and you are more likely to make mistakes. This also assumes that all the building blocks you need are available in their most up to date and efficient form in procedural C as well, which is quite an assumption these days. Most likely making use of the STL or the JCF will in practice save you cpu time as well as development time.
As for functional languages, I remember hearing functional programming enthusiasts point out how much easier to use their languages were than the competition, and then observing that those members of the class who chose a functional language were still struggling when those who wrote in Fortran 77 had finished and gone on to draw graphs of the performance of their program. I see that the claims of the functional programming community have not changed. I do not know if the underlying reality has.
Steve314 said it well. OOP is more about the design patterns and organization of large applications. It also lets you deal with unknowns better, which is great for user apps. However, for analyzing algorithms, most likely you are going to be thinking functionally about what you want to do. In that case, I'd stick to more simple PP and not try to create a fully OO design, when you care about the algorithm. I'd want to work with C or Matlab (depending on how math intensive the algorithm is). Just my opinion on it.
I once adapted the Knuth-Morris-Pratt string search algorithm so that I could have an object that would take a character at a time and return a match/no-match status. It wasn't a straight-forward translation.
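For anyone curious what such an object might look like, here is a minimal Python sketch of a character-at-a-time KMP matcher. It is my own reconstruction under the assumption of a non-empty pattern, not the code described above; `StreamMatcher` and `feed` are made-up names. The loop over the text disappears and its position in the pattern becomes state stored on the object.

```python
class StreamMatcher:
    """Knuth-Morris-Pratt matcher that consumes one character per call."""

    def __init__(self, pattern):
        if not pattern:
            raise ValueError("pattern must be non-empty")
        self.pattern = pattern
        self.failure = self._build_failure(pattern)
        self.state = 0  # number of pattern characters currently matched

    @staticmethod
    def _build_failure(pattern):
        """failure[i] is the length of the longest proper prefix of
        pattern[:i + 1] that is also a suffix of it."""
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k
        return failure

    def feed(self, ch):
        """Consume one character; return True if a match ends at it."""
        while self.state > 0 and ch != self.pattern[self.state]:
            self.state = self.failure[self.state - 1]
        if ch == self.pattern[self.state]:
            self.state += 1
        if self.state == len(self.pattern):
            # Fall back so that overlapping matches are still found.
            self.state = self.failure[self.state - 1]
            return True
        return False
```

Feeding the text "ababa" to `StreamMatcher("aba")` one character at a time reports a match at the third and at the fifth character.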

What are algorithms and data structures in layman’s terms?

I currently work with PHP and Ruby on Rails as a web developer. My question is why would I need to know algorithms and data structures? Do I need to learn C, C++ or Java first? What are the practical benefits of knowing algorithms and data structures? What are algorithms and data structures in layman’s terms? (As you can tell unfortunately I have not done a CS course.)
Please provide as much information as possible and thank you in advance ;-)
Data structures are ways of storing stuff, just like you can put stuff in stacks, queues, heaps and buckets - you can do the same thing with data.
Algorithms are recipes or instructions, the quick start manual for your coffee maker is an algorithm to make coffee.
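As a quick illustration of three of those containers, here is a short Python sketch using only the standard library (a list as a stack, `collections.deque` as a queue, and `heapq` for a heap). The values are arbitrary; PHP and Ruby have comparable building blocks.

```python
from collections import deque
import heapq

# Stack: last in, first out.
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()                # the most recently added item

# Queue: first in, first out.
queue = deque()
queue.append("a")
queue.append("b")
front = queue.popleft()          # the oldest item

# Heap (priority queue): the smallest item comes out first.
heap = []
for priority in (5, 1, 3):
    heapq.heappush(heap, priority)
smallest = heapq.heappop(heap)   # the minimum of the pushed values
```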
Algorithms are, quite simply, the steps by which you do something. For instance the Coffee Maker Algorithm would run something like
Turn on Coffee Maker
Grind Coffee Beans
Put in filter and place coffee in filter
Add Water
Start brewing process
Drink coffee
A data structure is a means by which we store information in an organized fashion. For further info, check out the Wikipedia Article.
An algorithm is a list of instructions and data structures are ways to represent information. If you're writing computer programs then you're already using algorithms and data structures even if you don't know what the words mean.
I think the biggest advantages in knowing standard algorithms and data structures are:
You can communicate with other programmers using a common language.
Other people will be able to understand your code once you've left.
You will also learn better methods for solving common problems. You could probably solve these problems eventually anyway even without knowing the standard way to do it, but you will spend a lot of time reinventing the wheel and it's unlikely your solutions will be as good as those that thousands of experts have worked on and improved over the years.
An algorithm is a sequence of well defined steps leading to the solution of a type of problem.
A data structure is a way to store and organize data to facilitate access and modifications.
The benefit of knowing standard algorithms and data structures is they are mostly better than you yourself could develop. They are the result of months or even years of work by people who are far more intelligent than the majority of programmers. Knowing a range of data structures and algorithms allows you to fit a problem roughly to a data structure or/and algorithm and tweak as required.
In the classic "cooking/baking equivalent", algorithms are recipes and data structures are your measuring cups, your baking sheets, your cookie cutters, mixing bowls and essentially any other tool you would be using (your cooker is your compiler/interpreter, though).
This book is the bible on algorithms. In general, data structures relate to how to organize your data to access it in memory, and algorithms are methods / small programs to resolve problems (ex: sorting a list).
The reason you should care is first to understand what can go wrong in your code; poorly implemented algorithms can perform very badly compared to "proven" ones. Knowing classic algorithms and what performance to expect from them helps in knowing how good your code can be, and whether you can/should improve it.
Then there is no need to reinvent the wheel, and rewrite a buggy or sub-optimal implementation of a well-known structure or algorithm.
An algorithm is a representation of the process involved in a computation.
If you wanted to add two numbers then the algorithm might go:
Get first number;
Get second number;
Add first number to second number;
Return result.
At its simplest, an algorithm is just a structured list of things to do - its use in computing is that it allows people to see the intent behind the code and makes logical (as opposed to syntactical) errors easier to spot.
e.g. if step three above said multiply instead of add then someone would be able to point out the error in the logic without having to debug code.
A data structure is a representation of how a system's data should be referenced. It might match a table structure exactly or may be de-normalised to make data access easier. At its simplest it should show how the entities in a system are related.
It is too large a topic to go into in detail but there are plenty of resources on the web.
Data structures are critical the second your software has more than a handful of users. Algorithms is a broad topic, and you'll want to study it if a good knowledge of data structures doesn't fix your performance problems.
You probably don't need a new programming language to benefit from data structures knowledge, though PHP (and other high level languages) will make a lot of it invisible to you, unless you know where to look. Java is my personal favorite learning language for stuff like this, but that's pretty subjective.
My question is why would I need to know algorithms and data structures?
If you are doing any non-trivial programming, it is a good idea to understand the classic data structures and algorithms and their uses in order to avoid reinventing the wheel. For example, if you need to put an array of things in order, you need to understand the various ways of sorting, so that you can choose the most appropriate one for the task in hand. If you choose the wrong approach, you can end up with a program that is grossly inefficient in some circumstances.
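To show what "grossly inefficient in some circumstances" can mean, here is a sketch in Python that counts comparisons for two sorting approaches. Both functions are illustrative implementations written for this comparison; in real code you would normally call the built-in sort. On input that is already in reverse order, insertion sort needs n(n-1)/2 comparisons, while merge sort stays proportional to n log n.

```python
def insertion_sort(items):
    """Return (sorted copy, comparisons). Quadratic in the worst case."""
    a = list(items)
    comparisons = 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1
            if a[j - 1] <= a[j]:
                break
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return a, comparisons


def merge_sort(items):
    """Return (sorted copy, comparisons). O(n log n) in every case."""
    a = list(items)
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, left_cmp = merge_sort(a[:mid])
    right, right_cmp = merge_sort(a[mid:])
    merged, comparisons = [], left_cmp + right_cmp
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons
```

On nearly sorted input the picture reverses and insertion sort does very little work, which is the sense in which the most appropriate choice depends on the task.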
Do I need to learn C, C++ or Java first?
You need to know how to program in some language in order to understand what the algorithms and data structures do.
What are the practical benefits of knowing algorithms and data structures?
The main practical benefits are:
to avoid having to reinvent the wheel all of the time,
to avoid the problem of square wheels.

Resources