Find out Hadoop vendor

I have inherited a Hadoop installation and would like to know how the previous admin installed it and where it came from. I am new to Hadoop, but it appears that the previous admin simply installed Apache Hadoop from source (rather than using Cloudera, Hortonworks, etc.).
How can I validate this? The LICENSE.txt file says nothing about Cloudera, Hortonworks, etc, but an absence of something is not validation. If it had come from a commercial vendor, can I be sure that the LICENSE.txt file would have mentioned them by name?

If you run hadoop version, it should tell you what you need to know: the version, where it's installed, etc.
If not, then try which hadoop.
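As a rough sketch of how to read that output: vendor builds usually embed a marker in the version string (for example cdh for Cloudera, or an extra build suffix such as 2.4.0.0-169 for Hortonworks), while a plain Apache release reports just the upstream version. The helper name guess_hadoop_vendor and the patterns below are assumptions for illustration, not an exhaustive or official check.

```shell
# Guess the distribution from a Hadoop version string.
# The patterns are heuristics (assumptions), not an official list.
guess_hadoop_vendor() {
    case "$1" in
        *cdh*)                 echo "cloudera" ;;
        *mapr*)                echo "mapr" ;;
        # upstream version followed by a vendor build, e.g. 2.7.1.2.4.0.0-169
        *.*.*.*.*.*.*-[0-9]*)  echo "hortonworks" ;;
        *)                     echo "apache-or-unknown" ;;
    esac
}

# Usage (requires hadoop on PATH; the first output line is "Hadoop <version>"):
#   guess_hadoop_vendor "$(hadoop version | head -n 1 | awk '{print $2}')"
```

A result of apache-or-unknown is only a hint; the rest of the hadoop version output (source URL, who compiled it, the jar it ran from) is worth reading as well.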

Related

Hortonworks Data Platform 2.4 (Sandbox): Directory structure

I downloaded and started an instance of HDP 2.4 as a VMware sandbox. I had a look at the directory structure and found that there are the following two folders:
/usr/hdp/2.4.0.0-169
/usr/hdp/current
Why do I have two folders here when I haven't made any updates yet? I guess that the current folder is used for the platform tools like Spark etc. So if I need to update one of them (e.g. Spark), do I have to make the "changes" to that current folder or to both? Or does an update work completely differently? Thank you!
/usr/hdp/current contains symlinks to versioned directories, i.e. the entries under /usr/hdp/current point into /usr/hdp/2.4.0.0-169.
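To see this for yourself, you can list each symlink next to its target. This is a minimal sketch; the function name show_links is made up for illustration.

```shell
# Print each symlink in a directory together with the path it points to.
show_links() {
    dir=$1
    for entry in "$dir"/*; do
        # -L is true only for symbolic links
        if [ -L "$entry" ]; then
            printf '%s -> %s\n' "$(basename "$entry")" "$(readlink "$entry")"
        fi
    done
}

# Usage on the sandbox:
#   show_links /usr/hdp/current
```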

Installing cloudera Hadoop without internet connection

I am trying to install a Cloudera Hadoop cluster on a few CentOS VMs, but the project runs in a secure environment where I can't use the internet. I tried various tutorials, but every one of them needed an internet connection at some point. I downloaded a few things by hand instead of using the wget command, but I still couldn't make it work.
Can anyone share how I can do this, either with Cloudera Manager or manually (without any internet connection)?
You can do that by following Path B (manual installation) in the Cloudera documentation, which gives you the option of downloading the parcels online or pointing to them in a local repository.
OR
You can install the packages individually by following Path C, which is also explained in the Cloudera documentation.

How do I update my hadoop instance after I have changed the source code?

I am using hadoop v1.2.1 and have made a source code change for the project I am working on. The change was to the TaskReport and TaskInProgress classes so additional information would come back in the TaskReport object. I compiled the changes and re-packaged the hadoop-core-1.2.1.jar file and replaced the existing hadoop-core-1.2.1.jar file in the folder where I had unpackaged my hadoop installation.
The map reduce program that I submit to hadoop sees the new properties I added, but the JobTracker doesn't seem to be populating the properties with any data when it creates the TaskReport objects. Do I need to do anything special to get the JobTracker to see these changes, or am I updating hadoop in an incorrect way?
I figured this out - I needed to restart the hadoop services. From a terminal within the hadoop install folder:
bin/stop-all.sh
bin/start-all.sh
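Before restarting, it can help to confirm that the rebuilt jar really is the one sitting in the install folder. This is a small sketch; the function name jars_match and the build path in the usage comment are hypothetical.

```shell
# Succeed only if the two files have byte-identical contents.
jars_match() {
    cmp -s "$1" "$2"
}

# Usage, from the hadoop install folder (build/ is a hypothetical output path):
#   jars_match build/hadoop-core-1.2.1.jar hadoop-core-1.2.1.jar \
#       && bin/stop-all.sh && bin/start-all.sh
```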

Hadoop can not setTime to a directory, why?

I'm running into a Hadoop issue. When I run my Hadoop test program, which changes the access time and modification time of a directory on the Hadoop file system, some errors occur, and I have no idea why. I'd appreciate any useful advice.
In most versions of Hadoop it is indeed not possible to set the times of a directory. See Hadoop ticket HDFS-2436 for the details; the ticket will tell you which version you need in order to do that.
Note however that Hadoop does not support access times for directories at all, as far as I know.

How to Get Pig to Work with lzo Files?

So, I've seen a couple of tutorials for this online, but each seems to say to do something different. Also, none of them seems to specify whether you're trying to get things to work on a remote cluster, or to locally interact with a remote cluster, etc...
That said, my goal is just to get my local computer (a Mac) to make Pig work with lzo-compressed files that exist on a Hadoop cluster that's already been set up to work with lzo files. I already have Hadoop installed locally and can get files from the cluster with hadoop fs -[command].
I also already have pig installed locally and communicating with the hadoop cluster when I run scripts or when I just run stuff through grunt. I can load and play around with non-lzo files just fine. My problem is only in terms of figuring out a way to load lzo files. Maybe I can just process them through the cluster's instance of ElephantBird? I have no idea, and have only found minimal information online.
So, any sort of short tutorial or answer for this would be awesome, and would hopefully help more people than just me.
I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get PIG to work with lzos. Hope this helps someone!
NOTE: This is written with a Mac in mind. The steps will be almost identical for other OSes, and this should definitely give you what you need to know to configure on Windows or Linux, but you will need to extrapolate a bit (obviously, change Mac-centric folders to whatever OS you're using, etc...).
Hooking PIG up to be able to work with LZOs
This was by far the most annoying and time-consuming part for me-- not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:
Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.
Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile this on a 64bit machine.
Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.
Copy the java jar to $HADOOP_HOME/lib and $PIG_HOME/lib
Then configure hadoop and pig to have the property java.library.path point to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:
<property>
<name>mapred.child.env</name>
<value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
</property>
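The two copy steps above (native libs, then the jar) can be sketched as a small shell helper. This assumes HADOOP_HOME and PIG_HOME are already set; the function name install_lzo and its arguments are made up for illustration, and the paths you pass in depend on where your hadoop-lzo build put its outputs.

```shell
# Copy hadoop-lzo build outputs to where Hadoop and Pig look for them.
# Arguments: <hadoop-lzo jar> <directory holding the built native libs>
# Assumes HADOOP_HOME and PIG_HOME are set in the environment.
install_lzo() {
    jar=$1
    native_dir=$2
    native_dest="$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64"

    mkdir -p "$native_dest" "$HADOOP_HOME/lib" "$PIG_HOME/lib"

    # Native libraries go under Hadoop's platform-specific native folder.
    cp -R "$native_dir"/. "$native_dest"/

    # The jar is needed by both Hadoop and Pig.
    cp "$jar" "$HADOOP_HOME/lib/"
    cp "$jar" "$PIG_HOME/lib/"
}
```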
Now try out grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed up something in mapred-site.xml and you should double check it.
Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).
Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above, and might change, so I won't specify them here. What I will mention is that the versions on these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or macports as you'll likely get a newer version. Instead, just download tarballs and build for each.
Once the pre-reqs are in place, run the command ant in the elephant-bird folder to create a jar.
For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.
Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.
