How to Get Pig to Work with lzo Files? - hadoop

So, I've seen a couple of tutorials for this online, but each seems to say to do something different. Also, each of them doesn't seem to specify whether you're trying to get things to work on a remote cluster, or to locally interact with a remote cluster, etc...
That said, my goal is just to get my local computer (a mac) to make pig work with lzo compressed files that exist on a Hadoop cluster that's already been setup to work with lzo files. I already have Hadoop installed locally and can get files from the cluster with hadoop fs -[command].
I also already have pig installed locally and communicating with the hadoop cluster when I run scripts or when I just run stuff through grunt. I can load and play around with non-lzo files just fine. My problem is only in terms of figuring out a way to load lzo files. Maybe I can just process them through the cluster's instance of ElephantBird? I have no idea, and have only found minimal information online.
So, any sort of short tutorial or answer for this would be awesome, and would hopefully help more people than just me.

I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get PIG to work with lzos. Hope this helps someone!
NOTE: This is written with a Mac in mind. The steps will be almost identical for other OS', and this should definitely give you what you need to know to configure on Windows or Linux, but you will need to extrapolate a bit (obviously, change Mac-centric folders to whatever OS you're using, etc...).
Hooking PIG up to be able to work with LZOs
This was by far the most annoying and time-consuming part for me-- not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:
Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.
Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile
this on a 64bit machine.
Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.
Copy the java jar to $HADOOP_HOME/lib and $PIG_HOME/lib
Then configure hadoop and pig to have the property java.library.path
point to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:
<property>
<name>mapred.child.env</name>
<value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
</property>
Now try out grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed up something in mapred-site.xml and you should double check it.
Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).
Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above, and might change, so I won't specify them here. What I will mention is that the versions on these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or macports as you'll likely get a newer version. Instead, just download tarballs and build for each.
command: ant in the elephant-bird folder in order to create a jar.
For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.
Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.

Related

Validation test failure for build of jq-1.5 on FreeBSD 9.x... Now what?

A note in the README file said to ask questions here, so I am doing so.
The RIPEstat service has just shut off their own port 43 plain text service and now is forcing everyone to access their data using jq. I have zero experience with or knowledge of jq, but I am forced to give it a try. I have just built the thing successfully from sources (jq-1.5) on my crusty old FreeBSD 9.x system and the build completed OK, but one of the post-build verification tests (tests/onigtest) failed. I am looking at the test-suite.log file but none of what's in there means anything to me. (Unfortunately, I am new to stackoverflow also, and thus, I have no idea how to even upload a copy of that here so that the maintainer can peruse it.)
So, my questions:
1) Should I even worry about the failure of tests/onigtest?
2) If I should, then what should I do about this failure?
3) What is the best and/or most proper way for me to get a copy of the test-suite.log file to the maintainer(s)?
Should I even worry about the failure of tests/onigtest?
If the only failures are related to onigtest, then most likely only the regex filters will be affected.
what should I do about this failure?
According to the jq download page, there is a pre-built binary for FreeBSD, so you might try that.
From your brief description, it's not clear to me what exactly you did, but if you haven't already done so, you might also consider building an executable from a git clone of "master" as per the guidelines on the download page; see also https://github.com/stedolan/jq/wiki/Installation#or-build-from-source
What is the best and/or most proper way for me to get a copy of the test-suite.log file to the maintainer(s)?
You could create a ticket at https://github.com/stedolan/jq/issues

Find files created or modified by an installer

To track changes in OSX filesystem while an installer runs I'm trying to use the fs_usage.
Can somebody guide me with a simple example on how to interpret the result. There a lot of terms I don't understand when I examine it.
fs_usage's output tends to be full of irrelevant chatter, and hard to interpret. I'd recommend using fseventer, which just shows changed files without all the nonsense. If you're an Apple developer, you can also use PackageMaker's snapshot package feature (which records everything that happens, and offers to make an installer package that does the same things).

Hadoop can not setTime to a directory, why?

I'm running into a Hadoop issue.When I run my Hadoop testing program for changing the access time and modify time of a directory which on the hadoop file system,some errors occured. And I have no idea about it. So,hope for anyone's any useful advice.
In most versions of Hadoop it is indeed not possible to set the times of a directory. See this Hadoop ticket for the details HDFS-2436. The ticket will tell you what version you need to do that.
Note however that Hadoop does not support access times for directories at all, as far as I know.

create project with compass/sass

I'm having a problem getting started with compass/sass. I eventually managed to install compass, although I had to google around because the instructions on the compass website didn't work for me.
Next step was to create a project. I thought this would be simple enough by typing:
$ compass create path/to/project --using blueprint/basic --sass-dir=sass --css-dir=css
Unfortunately, this didn't work. The first thing to fail was that it told me that --using was not a recognised command (even though that is exactly what it tells you to type in the compass installation instructions). So, I tried creating the project taking away all three of the additional options.
This did create a project, although not in the place I specified. Rather than placing it in path/to/project it created the files and directories straight into my home folder ie /Users/me/
I must be doing something wrong, I can't believe that a tool designed to save time and make life easier could be so difficult to get up and running. I'm not great at using the command line, but I am able to follow instructions!
Any pointers would be appreciated!
It sounds like your running compass v0.8, please upgrade to v0.10 and that command will work.

Can I install postgresql8.2 via command prompt or running any batch or registry file?

Is it possible to install the entire database(postgresql8.2) via command prompt or batch file or registry file bypassing the trivial procedure for installation. But then to a question comes that, how can we supply default parameters such as name,password,language,default location of database? Currently I'm working on 'Windows XP' platform.
Thank you.
For 8.3 and lower the obvious answer is: http://pginstaller.projects.pgfoundry.org/ which supports or supported silent installations. For more recent versions, please read: http://forums.enterprisedb.com/posts/list/2135.page
Use of existing installers would simplify your life and be where I would start.
This being said there is no reason you can't generate a script to register dll's properly run initdb, etc. This will take some extra knowledge of both PostgreSQL and Windows, and will be mostly suitable for custom solutions (i.e. not cases where you merely are packaging software that runs with PostgreSQL). I don't think an complete answer can be given here because once you need such a solution you need to design your installation around if. Books could be written on that topic. The docs http://www.postgresql.org/docs/9.0/static/install-windows.html should get you started however since the only difference really between installing from source and installing from the precompiled source is just that you need to compile the source files first.
Failing that you could take a look at the binary zip packages. Typically these can be extracted and PostgreSQL can be run from inside them.

Resources