Best separator character for Hadoop files - hadoop

If I'm writing CSV-style files out of a system to be consumed by Hadoop, what is the best column separator to use within the file? I have tried Ctrl-A, but it's a pain IMO because other programs don't necessarily show it; e.g. I might view the file using vi, Notepad, a web browser, or Excel. Comma is a pain because the data might also contain commas. I was thinking of standardising on tab. Is there a best practice for this in regard to Hadoop, or does it not matter? I have done a fair bit of searching and can't find much on this fairly basic question.

There are certainly tradeoffs to each. It really depends on what you care about most.
Commas: if you care about interoperability. Every tool works with CSV. Commas in the data are a pain only if the writing system doesn't escape properly or the reading system doesn't respect the escaping. Hive handles escaping correctly, as far as I know.
Tabs: if you care about interoperability and expect commas in the data but no tabs. You're slightly less likely to have tabs in the data, but it's also slightly less likely that any given tool supports TSV.
Ctrl-A: if you care only about Hadoop-ecosystem functionality. This has definitely become the de facto Hadoop standard, but Hadoop also easily supports commas and tabs. The upside is that you usually don't have to care about escaping.
In the end, I think it's usually a toss-up, assuming you're escaping correctly (and you should be!). There's no best practice. If you find yourself worrying a lot about this kind of thing, you might also want to step up to a more serious serialization format, like Avro, which is very well supported in the Hadoop world.
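To make the Ctrl-A case concrete, here is a minimal, hedged sketch (the record contents are invented) of how a consumer might split one Ctrl-A (\u0001) delimited record in Java; note that the embedded comma needs no escaping at all:

    public class CtrlASplit {
        public static void main(String[] args) {
            // One record, three fields; the middle field contains a comma.
            String record = "alice\u0001new york, ny\u000142";
            // Split on the Ctrl-A character; limit -1 keeps trailing empty fields.
            String[] fields = record.split("\u0001", -1);
            for (String f : fields) {
                System.out.println(f);
            }
        }
    }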

Related

will obfuscation make my program more optimized

I am implementing a DES encryption algorithm in C++, and I benchmark it on a very large (1.1 MB) plain-text document.
I have now reached about 1.1 sec for encryption, and I need to squeeze more performance out of it.
I was thinking of obfuscation. Will that help in optimizing my code?
I think optimizing your code is the best way to optimize it:
Fix redundant code
Rethink the logic
Remove unused or trivial variables
Store commonly used values in variables to reduce redundant computation (see the sketch after this list)
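As a hedged illustration of that last point (all names here are hypothetical), hoisting a computation that depends only on the key out of the hot loop is the kind of manual optimization that actually pays off:

    public class HoistExample {
        // Placeholder for a costly derivation that depends only on the key.
        static int expensiveKeySchedule(int key) {
            int k = key;
            for (int i = 0; i < 16; i++) {
                k = Integer.rotateLeft(k, 1) ^ 0x9E3779B9;
            }
            return k;
        }

        static int[] encryptAll(int[] blocks, int key) {
            // Computed once here instead of once per block inside the loop.
            int roundKey = expensiveKeySchedule(key);
            int[] out = new int[blocks.length];
            for (int i = 0; i < blocks.length; i++) {
                out[i] = blocks[i] ^ roundKey; // stand-in for the real round function
            }
            return out;
        }
    }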
Obfuscation makes code harder to read by:
Replacing variable names with underscores or single letters (compilers don't use variable names)
Removing whitespace to create a neutron star of unreadable text (compilers do this internally)
Removing comments (compilers don't read comments)
Sometimes adding useless code to further hinder readability (making your program run slower)
Well, you did not write what kind of obfuscation you have in mind (at the source-code level?), but generally: no, it won't. In a language like JavaScript (or very old interpreted BASIC dialects), obfuscation and optimization sometimes go hand in hand (shortening variable names, deleting unnecessary whitespace/indentation, etc.), but not in a compiled language like C++.
Of course, sometimes some kind of misguided optimization will lead to obfuscated code, but that is a different thing.
C++ compilers nowadays are really REALLY smart. Major optimizations come at a macroscopic level. Even Blender's example, removing unused variables, is not needed, since the optimizer will remove them anyway.
Obfuscation doesn't make your code smarter, it doesn't change algorithms, it doesn't introduce dynamic programming, or anything of the sort.
I don't see why you would want that though. With compiled languages, you don't have to ship the source code, you can, if needed, ship headers and libraries, but those don't give away implementation details.

good programming practice

I'm reading Randall Hyde's Write great code (volume 2) and I found this:
[...] it’s not good programming practice to create monolithic applications, where all the source code appears in one source file (or is processed by a single compilation) [...]
I was wondering, why is this so bad?
Thanks everyone for your answers. I really wish I could accept more of them, but I've chosen the most concise, so that whoever reads this question finds the essentials immediately.
Thanks guys ;)
The main reason, more important in my opinion than the sheer size of the file in question, is the lack of modularity. Software becomes easy to maintain if you have it broken down into small pieces whose interactions with each other are through a few well defined places, such as a few public methods or a published API. If you put everything in one big file, the tendency is to have everything dependent on the internals of everything else, which leads to maintenance nightmares.
Early in my career, I worked on a Geographic Information System that consisted of over a million lines of C. The only thing that made it maintainable, the only thing that made it work in the first place, is that we had a sharp dividing line between everything "above" and everything "below". The "above" code implemented user interfaces, application-specific processing, etc., and everything "below" implemented the spatial database. And the dividing line was a published API. If you were working "above", you didn't need to know how the "below" code worked, as long as it followed the published API. If you were working "below", you didn't care how your code was being used, as long as you implemented the published API. At one point, we even replaced a huge chunk of the "below" side that had stored stuff in proprietary files with tables in SQL databases, and the "above" code didn't have to know or care.
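As a hedged sketch of that dividing line (all names hypothetical), the published API might be nothing more than an interface that the "above" code compiles against, so the "below" implementation can be swapped without anyone noticing:

    import java.util.HashMap;
    import java.util.Map;

    // The published API: "above" code depends only on this.
    interface SpatialStore {
        String findById(long id);
        void save(long id, String feature);
    }

    // One "below" implementation; a file-backed or SQL-backed version
    // could replace it without touching any "above" code.
    class InMemorySpatialStore implements SpatialStore {
        private final Map<Long, String> data = new HashMap<>();
        public String findById(long id) { return data.get(id); }
        public void save(long id, String feature) { data.put(id, feature); }
    }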
Because everything gets so crammed together.
If you have separate files for separate things, then you'll be able to find and edit them much faster.
Also, if there's an error, you can locate it more easily.
Because you quickly get (tens of) thousands of lines of unmaintainable code.
Impossible to read, impossible to maintain, impossible to extend.
Also, splitting things up is good for teamwork, since a lot of small files can be edited in parallel by different developers.
Some reasons:
It's difficult to understand something that's all jammed together like that. Breaking the logic into functions makes it easier to follow.
Factoring the logic into discrete components allows you to re-use those components elsewhere. Sometimes "elsewhere" means "elsewhere in the same program."
You cannot possibly unit-test something that's built in a monolithic fashion. Breaking a huge function (or program) into pieces allows you to test those pieces individually (see the sketch after this list).
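A hedged sketch of that last point (JUnit 4 style, names invented): once a piece of logic is factored out of the monolith, testing it in isolation is trivial.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class PriceParserTest {
        // The factored-out piece, previously buried in a huge function.
        static int parseCents(String price) {
            String[] parts = price.split("\\.");
            return Integer.parseInt(parts[0]) * 100 + Integer.parseInt(parts[1]);
        }

        @Test
        public void parsesDollarsAndCents() {
            assertEquals(1234, parseCents("12.34"));
        }
    }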
While I understand that HTML and CSS aren't 'programming languages,' I suppose you could look at it in terms of having all of the .css, .js and .html all on one page: It makes it difficult to recycle or debug code.
Including all the source code in a single file makes the code hard to manage. A better approach is to divide your program into separate modules. Each module should have its own source file, and all the modules are linked at the end to create the executable. This helps a lot in maintaining code and makes it easy to find module-specific bugs.
Moreover, if you have 2 or more people working simultaneously on the same project, there is a high probability that you will be wasting lots of your time in resolving code conflicts.
Also if you have to test individual components (or modules), you can do it easily without bothering about the other independent modules (if any).
Because decomposition is a computer science fundamental: Solve large problems by decomposing them into smaller pieces. Not only are the smaller problems more manageable, but they make it easier to keep a picture of the entire problem in your head.
When you talk about a monolithic program in a single file, all I can think of is the huge spaghetti code that found its way into COBOL and FORTRAN back in the day. It encourages a cut-and-paste mentality that only gets worse with time.
Object-oriented languages are an attempt to help with this. OO languages decompose problems into software components, with data and functions encapsulated together, that map to our mental models of how the world works.
Think about it in terms of the Single Responsibility Principle. If that one large file, or one monolithic application, or one "big something" is responsible for everything concerning that system, it's a catch-all bucket of functionality. That functionality should be separated into its discrete components for maintainability, re-usability, etc.
Programmers get thousands of lines of code right when things are kept in separate files: you can easily manage and locate your files.
Apart from all the reasons mentioned above, typically all the text from a compilation unit is included in the final program. However, if you split things up and it turns out the code in a particular file is not being used, it will not be linked in. Even if it is getting linked, it might be easy to turn it into a DLL should you decide to use the functionality at run time. This facilitates dependency management and reduces build times by compiling only the modified source files, leading to greater maintainability and productivity.
From some really bad experience I can tell you that it's unreadable. Function after function of code, you can never understand what the programmer meant or how to find your way out of it.
You cannot even scroll in the file because a little movement of the scroll bar scrolls two pages of code.
Right now I'm rewriting an entire application because the original programmers thought it a great idea to create 3000-line code files. It just cannot be maintained. And of course it's totally untestable.
I think you have never worked at a big company that has 6,000 lines of code in one file... and every couple of thousand lines you see comments like /** Do not modify this block, it's critical */ and you think the bug assigned to you is coming from that block. And your manager says, "Yeah, look into that file, it's somewhere in it."
And the code breaks all those beautiful OOP concepts and is full of dirty switches. And has like fifty contributors.
Lucky you.

Getting started with massive data

I'm a mathematician and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from?
Hadoop/MapReduce is one obvious start.
Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)
I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?
Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.
Any other suggestions on things to learn would be great!
Apache Hadoop is indeed a good start, because it's free, has a large community and is easy to set up.
Hadoop is built in Java, so Java can be the language of choice. But it is possible to use other languages with Hadoop as well ("pipes" and "streams"); I know that Python, for example, is often used.
You can avoid having your data in databases if you like. Originally, Hadoop works with data on the (distributed) file system. But as you already seem to know, there are distributed databases available for Hadoop.
Have you ever had a look at Mahout? I think that would be a good fit for you ;-) Much of the work you need may already have been done!
Read the Quick Start, set up your own (pseudo-distributed?) cluster, and run the word-count example.
Let me know if you have any questions :-) A comment will remind me of this question.
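For orientation, the heart of the word-count example looks roughly like this (condensed from the standard Hadoop MapReduce tutorial; the job-driver boilerplate is omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit (word, 1) for every token in the input line.
    class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }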
I've done some large scale machine learning (3-5GB datasets), so here are some insights:
First, there are logistics issues at large scales. Can you load all your data into memory? With Java and a 64-bit JVM you can access as much RAM as you have: for example, the command-line parameter -Xmx8192M will give you access to 8 GB (if you have that much). MATLAB, being a Java application, can also benefit from this and work with fairly large datasets.
More importantly, the algorithms that you run on your data. Chances are that standard implementations will expect all of the data in memory. You might have to implement a working set approach yourself, where you swap data in and out to the disk, and only work on a portion of data at a time. These are sometimes referred to as chunking, batch or even incremental algorithms, depending on the context.
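As a hedged sketch of that incremental idea (the input format, one number per line, is hypothetical): stream the data from disk and keep only running statistics in memory, rather than loading everything at once.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class RunningMean {
        public static void main(String[] args) throws IOException {
            long n = 0;
            double mean = 0.0;
            try (BufferedReader r = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = r.readLine()) != null) {
                    double x = Double.parseDouble(line.trim());
                    n++;
                    mean += (x - mean) / n; // incremental (Welford-style) update
                }
            }
            System.out.printf("n=%d mean=%.6f%n", n, mean);
        }
    }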
You are right to suspect that a lot of algorithms do not practically scale, so you might have to go for an approximate solution. The good news is that for almost any algorithm you can find research papers that deal with approximation and/or discuss large scale solutions. The bad news is that you'll most likely have to implement those approaches yourself.
Hadoop is great, but can be a pain in the ass to set up. This is by far the best article I've read on Hadoop setup. I strongly recommend it:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Clojure is built on top of Java so it's unlikely that it's going to be any faster than Java. However, it is one of the few languages that does shared memory well, which may or may not be helpful. I'm not a math guy but it seems most math calculations are very parallelizable, with little need of threads sharing memory. Either way, you might want to check out Incanter, which is Clojure's statistical computing library, and clojure-hadoop, which makes writing Hadoop jobs a lot less painful.
In terms of languages, I find that the differences in performance end up being constant factors. It's far better to just find a language you enjoy and focus on improving your algorithms. However, according to a shootout cited by Peter Norvig (scroll down to the colorful table), you may want to shy away from Python and Perl due to their crappiness with arrays.
In a nutshell, NoSQL is great for unstructured/arbitrarily structured data, while SQL/RDBMSs are great (or at least tolerable) for structured data. Changing/adding fields is expensive in an RDBMS, so if that's going to happen a lot, you might want to shy away from them.
However, in your case, it seems like you're going to be batch processing a ton of data and then getting back an answer, as opposed to having data around that you will periodically ask questions about. You could probably just process CSVs/text files in Hadoop. Unless you need a performant way of accessing arbitrary information about your data on the fly, I'm not sure either SQL or NoSQL would be useful.

Is using a finite state machine a good design for general text parsing?

I am reading a file that is filled with hex numbers. I have to identify a particular pattern, say "aaad" (without quotes), from it. Every time I see the pattern, I generate some data to some other file.
This would be a very common case in designing programs - parsing and looking for a particular pattern.
I have designed it as a finite state machine and structured it in C using a switch-case to change states. This was the first implementation that occurred to me.
DESIGN: Are there some better designs possible?
IMPLEMENTATION: Do you see some problems with using a switch case as I mentioned?
A hand-rolled FSM can work well for simple situations, but they tend to get unwieldy as the number of states and inputs grows.
There is probably no reason to change what you have already designed/implemented, but if you are interested in general-purpose text parsing techniques, you should probably look at things like regular expressions, Flex, Bison, and ANTLR.
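For concreteness, here is a hedged sketch (in Java rather than the question's C, with invented state names) of the switch-based FSM the question describes, counting occurrences of "aaad" in a character stream:

    public class PatternFsm {
        enum State { START, A1, A2, A3 } // A1..A3 = number of 'a's matched so far

        static int countMatches(String input) {
            State state = State.START;
            int matches = 0;
            for (char c : input.toCharArray()) {
                switch (state) {
                    case START: state = (c == 'a') ? State.A1 : State.START; break;
                    case A1:    state = (c == 'a') ? State.A2 : State.START; break;
                    case A2:    state = (c == 'a') ? State.A3 : State.START; break;
                    case A3:
                        if (c == 'd') { matches++; state = State.START; }
                        else if (c != 'a') state = State.START;
                        // on 'a', stay in A3: "aaaa" still ends with "aaa"
                        break;
                }
            }
            return matches;
        }

        public static void main(String[] args) {
            System.out.println(countMatches("0aaaadaaadxx")); // prints 2
        }
    }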
For embarrassingly simple cases, a couple of ifs or switches are sufficient.
For parsing a string on POSIX systems, see man regex(3). For fully featured parsing of whole files (e.g. complex configs), use Lex/Flex and Yacc/Bison.
When writing in C++, look at Boost Regex for the simpler cases and Boost Spirit for the more complex ones. Flex & Bison work with C++ too.

Word wrap algorithms for Japanese

In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.
Googling around hasn't really helped me understand the problem any better. It seems to me like one would need a dictionary of unbreakable patterns, and assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to really know all the words, which I understand from some of my searching, are quite complicated.
How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?
Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.
I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
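To illustrate how simple the core rule is, here is a hedged toy sketch: wrap at a fixed column, but pull characters that may not begin a line back onto the previous line (the character set below is a small illustrative subset, not the full kinsoku table).

    import java.util.ArrayList;
    import java.util.List;

    public class KinsokuWrap {
        // Illustrative subset of characters forbidden at line start.
        private static final String NO_LINE_START = "、。」』）！？";

        public static List<String> wrap(String text, int width) {
            List<String> lines = new ArrayList<>();
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                line.append(text.charAt(i));
                boolean nextForbidden = i + 1 < text.length()
                        && NO_LINE_START.indexOf(text.charAt(i + 1)) >= 0;
                if (line.length() >= width && !nextForbidden) {
                    lines.add(line.toString()); // safe to break here
                    line.setLength(0);
                }
            }
            if (line.length() > 0) lines.add(line.toString());
            return lines;
        }
    }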
The projects listed below are useful for handling Japanese word wrap (or word break, from another point of view).
budou (Python): https://github.com/google/budou
mikan (JS): https://github.com/trkbt10/mikan.js
mikan.sharp (C#): https://github.com/YoungjaeKim/mikan.sharp
mikan takes a regex-based approach, while budou uses natural-language processing.
