I'm a mathematician and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know, and what are some good resources to learn from?
Hadoop/MapReduce is one obvious start.
Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)
I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?
Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.
Any other suggestions on things to learn would be great!
Apache Hadoop is indeed a good start, because it's free, has a large community and is easy to set up.
Hadoop is built in Java, so Java can be the language of choice. But it is possible to use other languages with Hadoop as well ("pipes" and "streaming"). I know that Python, for example, is often used this way.
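As a minimal sketch of the streaming approach (assuming Hadoop Streaming; the file name and invocation below are illustrative, not a drop-in config), here is the classic word count written in Python so one script can serve as both mapper and reducer:

```python
#!/usr/bin/env python
"""wordcount.py -- usable as both mapper and reducer with Hadoop Streaming.

Illustrative invocation (flags/paths are placeholders):
  hadoop ... -mapper 'wordcount.py map' -reducer 'wordcount.py reduce'
"""
import sys


def mapper():
    # Emit one "word<TAB>1" line per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)


def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive in runs.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You can test the pipeline locally without a cluster: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`.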
You can avoid having your data in databases if you like. Natively, Hadoop works with data on the (distributed) file system. But as you already seem to know, there are distributed databases available for Hadoop.
Have you ever had a look at Mahout? I think that would be a hit for you ;-) Much of the work you need may already have been done!
Read the Quick Start and set up your own (pseudo-distributed?) cluster and run the word-count example.
Let me know if you have any questions :-) A comment will remind me of this question.
I've done some large scale machine learning (3-5GB datasets), so here are some insights:
First, there are logistical issues at large scales. Can you load all your data into memory? With Java and a 64-bit JVM you can access as much RAM as you have: for example, the command-line parameter -Xmx8192M will give you access to 8GB (if you have that much). Matlab, being a Java application, can also benefit from this and work with fairly large datasets.
More importantly, think about the algorithms that you run on your data. Chances are that standard implementations will expect all of the data in memory. You might have to implement a working-set approach yourself, where you swap data in and out to disk and only work on a portion of the data at a time. These are sometimes referred to as chunking, batch, or incremental algorithms, depending on the context.
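As a toy sketch of the chunking idea (the file layout here, one number per line, is an assumption for illustration): computing the mean of a dataset far larger than RAM by streaming fixed-size chunks. Real incremental learners (e.g. the `partial_fit` methods in scikit-learn) follow the same pattern.

```python
from itertools import islice


def chunked_mean(path, chunk_rows=100_000):
    # Stream the file in fixed-size chunks so memory use stays bounded;
    # only `chunk_rows` lines are ever held in memory at once.
    total, count = 0.0, 0
    with open(path) as f:
        while True:
            chunk = list(islice(f, chunk_rows))
            if not chunk:
                break
            total += sum(float(line) for line in chunk)
            count += len(chunk)
    return total / count
```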
You are right to suspect that a lot of algorithms do not practically scale, so you might have to go for an approximate solution. The good news is that for almost any algorithm you can find research papers that deal with approximation and/or discuss large scale solutions. The bad news is that you'll most likely have to implement those approaches yourself.
Hadoop is great, but can be a pain in the ass to set up. This is by far the best article I've read on Hadoop setup. I strongly recommend it:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Clojure is built on top of Java so it's unlikely that it's going to be any faster than Java. However, it is one of the few languages that does shared memory well, which may or may not be helpful. I'm not a math guy but it seems most math calculations are very parallelizable, with little need of threads sharing memory. Either way, you might want to check out Incanter, which is Clojure's statistical computing library, and clojure-hadoop, which makes writing Hadoop jobs a lot less painful.
In terms of languages, I find that the differences in performance end up being constant factors. It's far better to just find a language you enjoy and focus on improving your algorithms. However, according to a language shootout cited by Peter Norvig (scroll down to the colorful table), you may want to shy away from Python and Perl due to their crappiness with arrays.
In a nutshell, NoSQL is great for unstructured/arbitrarily structured data, while SQL/RDBMS is great (or at least tolerable) for structured data. Changing/adding fields is expensive in an RDBMS, so if that's going to happen a lot, you might want to shy away from them.
However, in your case, it seems like you're going to be batch processing a ton of data and then getting back an answer as opposed to having data around that you will periodically ask questions about? You could probably just process CSVs/text files in Hadoop. Unless you need a performant way of accessing arbitrary information about your data on the fly, I'm not sure either SQL or NoSQL would be useful.
I'm currently working at a bank, working with Q (kdb+, or K, whatever it's called). I know that this is a functional language, and I also know that a lot of organizations use functional languages to deal with large data sets.
I wonder why functional languages are good for big data. Is it because of the way they compile code, or for some other reason?
Also, if the idea is wrong, can anyone explain why it's wrong?
PS: If there are similar questions, forgive me :P
One of the reasons is that having immutable variables lets you execute code in parallel and scale very easily.
Zdravko is right about immutable state making concurrency easier and less prone to race condition style bugs. However, that helps only with multi-threaded concurrency. When you talk about big data, you are talking about horizontally scaled cluster computing. Not much support for that in Functional Programming languages.
There is something about FP that has captured the imagination of developers with Big Data dreams. Maybe it has something to do with FP's stream-oriented higher-order functions, which let you think in terms of processing data streams. With FP, you solve problems with operations such as union, intersection, difference, map, flatmap, and reduce.
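For a flavor of that stream-oriented style (a Python sketch, not tied to any particular big-data framework), here is a word count expressed purely through map/flatmap/reduce-style operations over immutable intermediate values:

```python
from functools import reduce

lines = ["to be or not to be", "that is the question"]

# map each line to (word, 1) pairs; the nested generator is the flatmap.
pairs = (pair for line in lines for pair in ((w, 1) for w in line.split()))

# reduce: fold the pairs into a dictionary of counts. Note the FP style:
# each step builds a new dict rather than mutating the accumulator.
counts = reduce(lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
                pairs, {})
print(counts)  # {'to': 2, 'be': 2, 'or': 1, ...}
```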
But FP alone won't work in a distributed computing environment. At OSCON 2014, I learned about some open source projects that integrate FP languages with Hadoop. See Functional Programming and Big Data for a comparative evaluation of three such projects getting traction these days: Netflix's PigPen, Cascalog, and Apache Spark.
I work with many people who program video games for a living. I have quite a bit of knowledge of C++, and I know a number of general performance strategies to use in day-to-day programming, like preferring prefix ++/-- over postfix.
My problem is that people often come to me for tips on general optimizations they can apply on a regular basis, but these people program in all sorts of languages. Some use C++, C#, Java, ActionScript, etc.
I am wondering: are there any general performance tips that can be applied on a day-to-day basis? For example, I would suggest prefix ++/-- over postfix to people programming in another language, but I am just not sure whether that holds there.
My guess is that it is language specific and the best way to go about general optimizations is to make sure you are not using majorly bloated algorithms, but maybe someone has some advice.
Without going into language specifics, or even knowing whether this is embedded, web, CAD, game, or iPhone programming, there isn't much that can be said. All we know is that there are multiple languages involved, and that for some unknown reason performance is always slower than desirable.
First, check your algorithms. A slow algorithm can cause horrible performance. Read up on algorithms and their complexity.
Second, note if there are any really slow operations, such as hitting a database or transmitting information or moving a robot arm. See if the program is doing more of those than it should.
Third, profile. If there's a section of code that's taking 5% of the time, no optimization will make your program more than 5% faster. If a section of code is taking a lot of the time, it's worth looking at.
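To make the profiling step concrete, here is what a first pass looks like in Python (one language's syntax for a language-agnostic habit; the slow function is deliberately naive):

```python
import cProfile
import pstats


def slow_join(n):
    # Deliberately naive: repeated string concatenation dominates the time.
    s = ""
    for i in range(n):
        s += str(i)
    return len(s)


# Run under the profiler, then print the five most expensive call sites.
cProfile.run("slow_join(100_000)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```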
Fourth, get somebody who knows what they're doing to make any specific optimizations. Test them when they're done to make sure they actually speed up performance. When performance was an issue, I've improved it with some counterintuitive measures, like rolling up loops.
I don't think you can generalize optimization as such. To optimize execution time, you need to dig deep into the language and understand how things work in detail. Just guessing or making assumptions based on experience with other languages won't work! For example, writing x = x << 1 instead of x = x*2 might be a big benefit in C++. In JavaScript it will slow you down.
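The cheap way to settle such claims is to measure them in the language at hand. A quick check of the same shift-versus-multiply question in Python (the timings you get will vary by interpreter and version):

```python
import timeit

# The same "optimization" can help, hurt, or do nothing depending on the
# language implementation -- measure instead of assuming.
print("shift:", timeit.timeit("x << 1", setup="x = 12345"))
print("mult :", timeit.timeit("x * 2", setup="x = 12345"))
```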
With all the differences between languages, it's hard to find generic optimization tips. Maybe for some languages which are similar (e.g. C# and Java). But if you add both JavaScript and Python to that list, I'm pretty sure not many common optimization techniques will be left over.
Also keep in mind that premature optimization is often considered bad practice. Developer-hours are much more expensive than buying additional hardware.
However, there is one thing which comes to mind. Over the past decade or so, Object-Relational Mappers have become quite popular, and hence they have emerged in pretty much every popular language. But you have to be careful with those: if not properly configured, it's easy to load tons of data into memory that you will never use in your code. Keep that in mind. Lazy loading might be of some help here. But your mileage will vary.
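As a sketch of what "properly configured" can mean in practice, assuming SQLAlchemy and a made-up customer/order model: the `lazy` setting on a relationship decides whether related rows are fetched eagerly or only on first access.

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # lazy="select" (the default) defers loading orders until accessed;
    # lazy="joined" would pull every order in with each customer query.
    orders = relationship("Order", lazy="select")


class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"))
    total = Column(Integer)
```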
Optimization depends on so many things that answering such a generic question would make this post explode into a full-fledged paper. In my opinion, optimization should be considered on a project-by-project basis, not only a language-by-language basis.
I think you need to split this into two separate questions:
1) Are there language-agnostic ways to find performance problems? YES. Profile, but avoid the myths around that subject.
2) Are there language-agnostic ways to fix performance problems? IT DEPENDS.
A general language-agnostic principle is: do (1) before you do (2).
In other words, Ready-Aim-Fire, not Ready-Fire-Aim.
Here's an example of performance tuning, in C, but it could be any language.
A few things I have learned since asking this:
I/O operations are usually the most expensive for performance. This holds especially true for disk or network I/O; network I/O is usually the most expensive of all, because on top of waiting for the response itself you are also waiting for all the processing and I/O operations the remote host performs. Only do these operations when absolutely necessary, and consider using a cache where possible.
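A minimal illustration of the caching idea in Python (the endpoint below is hypothetical): the expensive network round trip happens at most once per distinct argument.

```python
from functools import lru_cache
import urllib.request


@lru_cache(maxsize=128)
def fetch_exchange_rate(currency):
    # Hypothetical endpoint -- the network round trip only happens the
    # first time each currency is requested; repeats hit the cache.
    url = "https://example.com/rates/%s" % currency
    with urllib.request.urlopen(url) as resp:
        return float(resp.read())
```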
Database operations can be very expensive because of network/disk I/O and the translation time to and from SQL. Using in-memory DB or cache can help reduce I/O issues and some (not all) NoSQL databases can reduce SQL translation time.
Only log important information. Using logging libraries like log4j can help, because you can put logging in your application to your heart's desire while setting each message to a certain log level. Whichever log level you set the application to, it will only log messages at that level or higher. This way, if you need to troubleshoot functionality, you only have to change a quick config value and restart your application to get additional messages. Then, when you are done, just turn your application back to the default level so that you do not log too much.
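log4j is the Java classic; Python's standard logging module works the same way. A minimal sketch:

```python
import logging

# One config line decides how chatty the application is; the DEBUG call
# below costs almost nothing once the level is raised to WARNING.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("app")

log.debug("raw payload: %s", {"id": 42})  # suppressed at WARNING level
log.warning("retrying request")           # emitted
```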
Only include functionality that is needed. Additional functionality may be nice to have but can increase processing time, provide additional locations for the application to fail, and costs your team development time that could be spent on more important tasks.
Use and configure your memory manager correctly. Garbage collection routines can kill performance if they are not configured correctly. If your application freezes for a second or two every minute for garbage collection, your customers probably will not be happy.
Profile only after you have discovered a performance issue. Profilers will make the application's performance look worse than it is, because your application and the profiler are running on the same host, consuming the same hardware resources.
Do not do performance tuning prematurely. There are general practices you can adopt that should be better for performance in each language, but starting performance tuning in the middle of application development can cost you a lot of development time, because there is still functionality to be added.
This will not necessarily help performance, but keep class dependencies to a minimum. When you get into performance tuning, there is a good chance you will have to rewrite whole portions of code; the more dependencies there are on the section you are tuning, the greater the chance you will break the code. It can often be a domino effect: after fixing the performance issue, you have to fix all the dependencies, and possibly the dependencies of those dependencies. A performance tuning exercise estimated at a few hours can quickly turn into months for an application that has a lot of dependencies.
If performance is a concern do not use interpreted languages (scripting languages).
Only use the hardware you need. Having a system with a 64-core processor may seem cool, but if you only have two or three threads running in your application, then you are getting little benefit from having 64 cores. In fact, in rare instances, having overly excessive hardware can hurt performance, because your application may spend more time being switched between cores or processors than actually being processed.
Make any timing metrics you report as granular as possible. Currently you may only need to worry about the number of milliseconds a process takes, but in the future, as you make your application faster and faster, you may need more granular timings. If version A reports milliseconds and version B reports microseconds, how can you compare performance when version B takes about the same number of milliseconds? Version B may be better, but you just can't tell, because version A did not use granular enough metrics.
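In Python, for instance, recording at nanosecond granularity costs nothing extra (the measured work here is a stand-in):

```python
import time

start = time.perf_counter_ns()
sum(range(1_000_000))  # stand-in for the work being measured
elapsed_ns = time.perf_counter_ns() - start

# Record at the finest granularity available; round only when reporting.
print("%d ns (%.3f ms)" % (elapsed_ns, elapsed_ns / 1e6))
```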
I currently work with PHP and Ruby on Rails as a web developer. My question is why would I need to know algorithms and data structures? Do I need to learn C, C++ or Java first? What are the practical benefits of knowing algorithms and data structures? What are algorithms and data structures in layman’s terms? (As you can tell unfortunately I have not done a CS course.)
Please provide as much information as possible and thank you in advance ;-)
Data structures are ways of storing stuff, just like you can put stuff in stacks, queues, heaps and buckets - you can do the same thing with data.
Algorithms are recipes or instructions, the quick start manual for your coffee maker is an algorithm to make coffee.
Algorithms are, quite simply, the steps by which you do something. For instance the Coffee Maker Algorithm would run something like
Turn on Coffee Maker
Grind Coffee Beans
Put in filter and place coffee in filter
Add Water
Start brewing process
Drink coffee
A data structure is a means by which we store information in an organized fashion. For further info, check out the Wikipedia article.
An algorithm is a list of instructions and data structures are ways to represent information. If you're writing computer programs then you're already using algorithms and data structures even if you don't know what the words mean.
I think the biggest advantages in knowing standard algorithms and data structures are:
You can communicate with other programmers using a common language.
Other people will be able to understand your code once you've left.
You will also learn better methods for solving common problems. You could probably solve these problems eventually anyway even without knowing the standard way to do it, but you will spend a lot of time reinventing the wheel and it's unlikely your solutions will be as good as those that thousands of experts have worked on and improved over the years.
An algorithm is a sequence of well defined steps leading to the solution of a type of problem.
A data structure is a way to store and organize data to facilitate access and modifications.
The benefit of knowing standard algorithms and data structures is that they are mostly better than anything you could develop yourself. They are the result of months or even years of work by people who are far more intelligent than the majority of programmers. Knowing a range of data structures and algorithms allows you to fit a problem roughly to a data structure and/or algorithm and tweak as required.
In the classic "cooking/baking equivalent", algorithms are recipes and data structures are your measuring cups, your baking sheets, your cookie cutters, mixing bowls and essentially any other tool you would be using (your cooker is your compiler/interpreter, though).
[book cover image] (source: mit.edu)
This book is the bible on algorithms. In general, data structures relate to how to organize your data to access it in memory, and algorithms are methods / small programs to resolve problems (ex: sorting a list).
The reason you should care is first to understand what can go wrong in your code; poorly implemented algorithms can perform very badly compared to "proven" ones. Knowing classic algorithms and what performance to expect from them helps in knowing how good your code can be, and whether you can/should improve it.
Then there is no need to reinvent the wheel and rewrite a buggy or sub-optimal implementation of a well-known structure or algorithm.
An algorithm is a representation of the process involved in a computation.
If you wanted to add two numbers then the algorithm might go:
Get first number;
Get second number;
Add first number to second number;
Return result.
At its simplest, an algorithm is just a structured list of things to do - its use in computing is that it allows people to see the intent behind the code and makes logical (as opposed to syntactical) errors easier to spot.
e.g. if step three above said multiply instead of add then someone would be able to point out the error in the logic without having to debug code.
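The same four steps written as a (deliberately tiny) Python function; the point is that the logic stays inspectable:

```python
def add(first, second):
    # If this line said `first * second`, a reader could spot the logical
    # error from the algorithm alone, without running or debugging anything.
    result = first + second
    return result


print(add(2, 3))  # 5
```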
A data structure is a representation of how a system's data should be referenced. It might match a table structure exactly or may be de-normalised to make data access easier. At its simplest it should show how the entities in a system are related.
It is too large a topic to go into in detail but there are plenty of resources on the web.
Data structures are critical the second your software has more than a handful of users. Algorithms are a broad topic, and you'll want to study them if a good knowledge of data structures doesn't fix your performance problems.
You probably don't need a new programming language to benefit from data structures knowledge, though PHP (and other high level languages) will make a lot of it invisible to you, unless you know where to look. Java is my personal favorite learning language for stuff like this, but that's pretty subjective.
My question is why would I need to know algorithms and data structures?
If you are doing any non-trivial programming, it is a good idea to understand the classic data structures and algorithms and their uses, in order to avoid reinventing the wheel. For example, if you need to put an array of things in order, you need to understand the various ways of sorting so that you can choose the most appropriate one for the task at hand. If you choose the wrong approach, you can end up with a program that is grossly inefficient in some circumstances.
Do I need to learn C, C++ or Java first?
You need to know how to program in some language in order to understand what the algorithms and data structures do.
What are the practical benefits of knowing algorithms and data structures?
The main practical benefits are:
to avoid having to reinvent the wheel all of the time,
to avoid the problem of square wheels.
Having been a hobbyist programmer for 3 years (mainly Python and C) and never having written an application longer than 500 lines of code, I find myself faced with two choices :
(1) Learn the essentials of data structures and algorithm design so I can become a l33t computer scientist.
(2) Learn Qt, which would help me build projects I have been itching to build for a long time.
For learning (1), everyone seems to recommend reading CLRS. Unfortunately, reading CLRS would take me at least a year of study (or more, I'm not Peter Krumins). I also understand that to accomplish any moderately complex task using (2), I will need to understand at least the fundamentals of (1), which brings me to my question: assuming I use C++ as the programming language of choice, which parts of CLRS would give me sufficient knowledge of algorithms and data structures to work on large projects using (2)?
In other words, I need a list of theoretical CompSci topics absolutely essential for everyday application programming tasks. Also, I want to use CLRS as a handy reference, so I don't want to skip any material critical to understanding the later sections of the book.
Don't get me wrong here. Discrete math and the theoretical underpinnings of CompSci have been on my "TODO: URGENT" list for about 6 months now, but I just don't have enough time owing to college work. After a long time, I have 15 days off to do whatever the hell I like, and I want to spend these 15 days building applications I really want to build rather than sitting at my desk, pen and paper in hand, trying to write down the solution to a textbook problem.
(BTW, a less-math-more-code resource on algorithms will be highly appreciated. I'm just out of high school and my math is not at the level it should be.)
Thanks :)
This could be considered heresy, but the vast majority of application code does not require much understanding of algorithms and data structures. Most languages provide libraries which contain collection classes, searching and sorting algorithms, etc. You generally don't need to understand the theory behind how these work, just use them!
However, if you've never written anything longer than 500 lines, then there are a lot of things you DO need to learn, such as how to write your application's code so that it's flexible, maintainable, etc.
For a less-math, more code resource on algorithms than CLRS, check out Algorithms in a Nutshell. If you're going to be writing desktop applications, I don't consider CLRS to be required reading. If you're using C++ I think Sedgewick is a more appropriate choice.
Try some online comp sci courses. Berkeley has some, as does MIT. Software engineering radio is a great podcast also.
See these questions as well:
What are some good computer science resources for a blind programmer?
https://stackoverflow.com/questions/360542/plumber-programmers-vs-computer-scientists#360554
Heed the wisdom of Don and just do it. Can you define the features that you want your application to have? Can you break those features down into smaller tasks? Can you organize the code produced by those tasks into a coherent structure?
Of course you can. Identify any 'risky' areas (areas that you do not understand, e.g. something that requires more math than you know, or special algorithms you would have to research) and either find another solution, prototype a solution, or come back to SO and ask specific questions.
Moving from 500 LOC to a real (even if small) application is not that easy.
As Don pointed out, you'll need to learn a lot of things about code (flexibility, reuse, etc.), and you need to learn some basic configuration management as well (Visual SourceSafe, SVN?).
But the main issue is that you need a way to avoid being overwhelmed by your growing functionality and code. That is not easy. What I can suggest is to put in place something to 'automatically' test your code (even in a very basic way) via some regression tests. Otherwise it's going to be hard.
As you can see, I think it's not related at all to data structures, algorithms or whatever.
Good luck, and let us know.
I must say that sitting down with a dry old textbook and reading it through is not the way to learn how to do anything effectively, even if you are making notes. Doing it is the best way to learn, using the textbooks as a reference. Indeed, using sites like this as a reference.
As for data structures - learn which one is good for whatever situation you envision: Sets (sorted and unsorted), Lists (ArrayList, LinkedList), Maps (HashMap, TreeMap). Complexity of doing basic operations - adding, removing, searching, sorting, etc. That will help you to select an appropriate library data structure to use in your application.
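A quick Python illustration of why that complexity knowledge pays off (sizes are arbitrary): membership tests are O(n) on a list but roughly O(1) on a set, and the gap is dramatic even at modest sizes.

```python
import timeit

# Same data, two structures: searching a list scans it; a set hashes.
setup = "items = list(range(100_000)); s = set(items)"
print("list:", timeit.timeit("99_999 in items", setup=setup, number=1_000))
print("set :", timeit.timeit("99_999 in s", setup=setup, number=1_000))
```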
And also make sure you're reasonably comfortable with MVC - i.e., ensure your model is separate from your view (the Qt front-end) as much as possible. Best would be to have the model and algorithms working on their own, and then put the GUI on top. Or a unit test on top. Etc...
Good luck!
It's like saying you want to move to France, so should you learn french from a book, and what are the essential words - or should you just go to France and find out which words you need to know from experience and from copying the locals.
Writing code is part of learning computer science. I was writing code long before I'd even heard of the term, and lots of people were writing code before the term was invented.
Besides, you say you're itching to write certain applications. That can't be taught, so just go ahead and do it. Some things you only learn by doing.
(The theoretical foundations will just give you a deeper understanding of what you wind up doing anyway, which will mainly be copying other people's approaches. The only caveat is that in some cases the theoretical stuff will tell you what's futile to attempt - e.g. if one of your itches is to solve an NP complete problem, you probably won't succeed :-)
I would say the practical aspects of coding are more important. In particular, source control is vital if you don't use that already. I like bzr as an easy to set up and use system, though GUI support isn't as mature as it could be.
I'd then move on to one or both of the classics about the craft of coding, namely
The Pragmatic Programmer
Code Complete 2
You could also check out the list of recommended books on Stack Overflow.
I'd like to do some work with the nitty-gritty of computer imaging. I'm looking for a way to read single pixels of data, analyze them programmatically, and change them. What is the best language to use for this (Python, C++, Java...)? What is the best file format?
I don't want any super fancy software/APIs... I'm looking for the bare basics.
If you need speed (you'll probably always want speed with image processing) you definitely have to work with raw pixel data.
Java has some real disadvantages, as you cannot access memory directly, which makes pixel access quite slow compared to languages that can.
C++ is definitely the language of choice for production-use image processing. But you can, for example, also use C#, as it allows unsafe code in specific areas. (Take a look at the Scan0 pointer property of the BitmapData class.)
I've used C# successfully for image processing applications, and they are definitely much faster than their Java counterparts.
I would not use any scripting language or java for such a purpose.
It's very easy to manipulate the large multi-dimensional or complex arrays of pixel information that make up pictures using high-level languages such as Python. There's a library called PIL (the Python Imaging Library) that is quite useful and will let you do general filters and transformations (change the brightness, soften, desaturate, crop, etc.) as well as manipulate the raw pixel data.
It is the easiest and simplest image library I've used to date and can be extended to do whatever it is you're interested in (edge detection in very little code, for example).
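A minimal sketch of that raw pixel access with PIL (distributed today as Pillow; the file names are placeholders):

```python
from PIL import Image

img = Image.open("photo.png").convert("RGB")
pixels = img.load()  # direct pixel-access object

# Read, analyze, and change single pixels: here, a naive color inversion.
for x in range(img.width):
    for y in range(img.height):
        r, g, b = pixels[x, y]
        pixels[x, y] = (255 - r, 255 - g, 255 - b)

img.save("photo_inverted.png")
```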
I studied Artificial Intelligence and Computer Vision, thus I know pretty well the kind of tools that are used in this field.
Basically: you can use whatever you want as long as you know how it works behind the scene.
Now depending on what you want to achieve, you can either use:
The C language: you will lose a lot of time in bug hunting and memory management when implementing your algorithms. Theoretically, this is the fastest language for this kind of job, but if your algorithms are not computationally efficient (in terms of complexity), or if you lose too much time hunting bugs, it is clearly not worth it. So I would advise implementing your application in another language first; later you can always optimize small parts of your code with C bindings.
Octave/MatLab: a very efficient language, almost as efficient as C, in which you can write very elegant and succinct algorithms. If you are into vectorization and matrix and linear operations, you should go with it. However, you won't be able to develop a whole application with this language; it's more focused on algorithms, but you can always develop an interface using another language later.
Python: an all-in-one, elegant, and accessible language, used in gigantically large-scale applications such as Google and Facebook. You can do pretty much anything with Python, any kind of application. It will be perfectly suited if you want to make a full application (with client interaction and all, not only algorithms), or if you want to quickly draft a prototype using existing libraries, since Python has a very large set of high-quality libraries, like OpenCV. However, if you only want to write algorithms, you may be better off with Octave/MatLab.
The answer that was selected as the solution is very biased, and you should be careful about this kind of archaic advice.
Nowadays, hardware is cheaper than wetware (humans), and thus, you should use languages where you will be able to produce results faster, even if it's at the cost of a few CPU cycles or memory space.
Also, a lot of people tend to think that as long as you implement your software in C/C++, you have found the Holy Grail of speed: this is just not true. First, algorithmic complexity matters a lot more than the language you are using (a bad algorithm will never beat a better algorithm, even if implemented in the slowest language in the universe), and second, high-level languages nowadays do a lot of caching and speed optimization for you, which can make your program run even faster than in C/C++.
Of course, you can always do everything of the above in C/C++, but how much of your time are you willing to waste to reinvent the wheel?
Not only will C/C++ be faster, but most of the image processing sample code you find out there will be in C as well, so it will be easier to incorporate things you find.
If you are looking to do numerical work on your images (think matrices) and you are into Python, check out http://www.scipy.org/PyLab - it basically gives you the ability to do MATLAB-style work in Python. A buddy of mine swears by it.
(This might not apply for the OP who only wanted the bare basics -- but now that the speed issue was brought up, I do need to write this, just for the record.)
If you really need speed, it's better to forget about working on the pixel-by-pixel level, and rather see whether the operations that you need to perform could be vectorized. For example, for your C/C++ code you could use the excellent Intel IPP library (no, I don't work for Intel).
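The same vectorization idea in Python/NumPy terms (a sketch of the principle, not of IPP itself): one whole-array operation replaces millions of per-pixel loop iterations.

```python
import numpy as np

img = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)


def brighten_slow(img, delta=10):
    # Pixel-by-pixel: millions of interpreted-loop iterations.
    out = img.copy()
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.minimum(out[y, x].astype(int) + delta, 255)
    return out


def brighten_fast(img, delta=10):
    # Vectorized: one call, the loop runs in optimized native code.
    return np.minimum(img.astype(int) + delta, 255).astype(np.uint8)
```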
It depends a little on what you're trying to do.
If runtime speed is your issue, then C++ is the best way to go.
If speed of development is an issue, though, I would suggest looking at Java. You said that you wanted low-level manipulation of pixels, which Java will do for you. But the other thing that might be an issue is the handling of the various file formats. Java does have some very nice APIs for reading and writing various image formats to file (in particular the Java2D library; you can choose to ignore the higher levels of the API).
If you do go for the C++ option (or Python, come to think of it), I would again suggest the use of a library to get you past the startup issues of reading and writing files. I've previously had success with libgd.
What language do you know the best? To me, this is the real question.
If you're going to spend months and months learning one particular language, then there's no real advantage in using Python or Java just for their (to be proven) development speed.
I'm particularly proficient in C++ and I think that for this particular task I can be as speedy as a Java programmer, for example. With the aid of some good library (OpenCV comes to mind) you can create anything you need in a matter of a couple of lines of C++ code, really.
Short answer: C++ and OpenCV
Short answer? I'd say C++, you have far more flexibility in manipulating raw chunks of memory than Python or Java.