Hadoop cannot setTimes on a directory, why? - hadoop

I'm running into a Hadoop issue. When I run my test program to change the access time and modification time of a directory on the Hadoop file system, I get errors and I have no idea why. I'm hoping someone can offer useful advice.

In most versions of Hadoop it is indeed not possible to set the times of a directory. See the Hadoop ticket HDFS-2436 for the details; it will tell you which version you need for that to work.
Note, however, that as far as I know Hadoop does not support access times for directories at all.
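For reference, you can check what your release does from the command line before digging into the programmatic API (the directory path below is just a placeholder):
# reading the modification time of a directory works on any release
hadoop fs -stat "%y" /user/hduser/somedir
# check which release you are on before expecting the HDFS-2436 behaviour
hadoop version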

Related

Find out hadoop vendor

I have inherited a Hadoop installation and I am interested to know how the previous admin installed it and where it came from. I am new to Hadoop, but it appears that the previous admin simply installed Apache Hadoop from source (rather than using Cloudera, Hortonworks, etc).
How can I validate this? The LICENSE.txt file says nothing about Cloudera, Hortonworks, etc, but an absence of something is not validation. If it had come from a commercial vendor, can I be sure that the LICENSE.txt file would have mentioned them by name?
If you run hadoop version it should tell you what you need to know: the version, where it's installed, etc.
If not, then try which hadoop.
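If that isn't conclusive, a rough way to trace the binary back to a package (the package-manager queries assume an RPM- or deb-based box, which may not match your setup):
hadoop version                               # prints the version and the jar it was run from
readlink -f "$(which hadoop)"                # resolve symlinks to the real install location
rpm -qf "$(readlink -f "$(which hadoop)")"   # RPM systems: the owning package (a CDH/HDP package name gives away the vendor)
dpkg -S "$(readlink -f "$(which hadoop)")"   # Debian/Ubuntu systems
If neither package manager knows the file, it was most likely installed from an Apache tarball or from source.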

How to change the HADOOP log files location

I'm running a Hadoop job that takes a couple of hours and a lot of space, and it stops because it runs out of space. The Hadoop tmp folder has plenty of space remaining, so I think the problem is with the Hadoop log directory; I've checked, and there is not much space left there. Could anyone please advise how to change the Hadoop log file location to somewhere other than /home/hduser/hadoop/logs, without having to move the whole Hadoop setup? I'd be very thankful for any assistance.
I found a property in hadoop-env.sh:
# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
I uncommented it and changed it to export HADOOP_LOG_DIR=<new location>/logs, and this solved the problem :)
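For example, the edit in $HADOOP_HOME/conf/hadoop-env.sh looks roughly like this (the /data path is only an example; use any directory on a volume with free space that the hadoop user can write to):
# Where log files are stored. $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/data/hadoop/logs
Restart the Hadoop daemons afterwards so they pick up the new location.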

How do I update my hadoop instance after I have changed the source code?

I am using hadoop v1.2.1 and have made a source code change for the project I am working on. The change was to the TaskReport and TaskInProgress classes so additional information would come back in the TaskReport object. I compiled the changes and re-packaged the hadoop-core-1.2.1.jar file and replaced the existing hadoop-core-1.2.1.jar file in the folder where I had unpackaged my hadoop installation.
The MapReduce program that I submit to Hadoop sees the new properties I added, but the JobTracker doesn't seem to populate them with any data when it creates the TaskReport objects. Do I need to do anything special to get the JobTracker to see these changes, or am I updating Hadoop in an incorrect way?
I figured this out - I needed to restart the hadoop services. From a terminal within the hadoop install folder:
bin/stop-all.sh
bin/start-all.sh
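Putting the whole update together, it's roughly this (the jar path is illustrative -- in Hadoop 1.x the core jar sits in the top of the install folder -- and on a multi-node cluster the jar has to be replaced on every node, not just the one you submit from):
# overwrite the jar that the daemons load with the rebuilt one
cp /path/to/rebuilt/hadoop-core-1.2.1.jar $HADOOP_HOME/hadoop-core-1.2.1.jar
# restart the daemons so the JobTracker and TaskTrackers pick up the new classes
cd $HADOOP_HOME
bin/stop-all.sh
bin/start-all.sh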

How to delete all the data and metadata in HBase WITHOUT uninstalling and reinstalling?

Running HBase in pseudo-distributed mode on my dev box. Cloudera CDH4. CentOS.
Somehow, my HBase installation has gotten totally corrupted. I ran this command:
./bin/hbase hbck -repairHoles
and the readout ended with this:
Summary:
-ROOT- is okay.
Number of regions: 1
Deployed on: localhost.localdomain,60020,1340917622717
.META. is okay.
Number of regions: 1
Deployed on: localhost.localdomain,60020,1340917622717
5 inconsistencies detected.
Looking at the documentation here:
http://hbase.apache.org/book/apbs03.html
it says this:
If inconsistencies still remain after these steps, you most likely have table integrity problems related to orphaned or overlapping regions.
Basically, I have no interest in digging in and trying to fix this. I want to completely nuke my HBase installation and start over fresh and clean. HOWEVER, I do not want to do an uninstall/reinstall, because we use Cloudera, and I don't want to mess with their whole weird configuration and setup.
Is there a way to delete all the data and metadata in HBase WITHOUT uninstalling and reinstalling?
I do not recommend this unless you are at the point of no return.
I do not know if this is the correct way to nuke the HBase data, but when I run into such inconsistencies I usually delete all the contents of the directory that holds the HBase data.
To find that directory, look for the following property in hbase-site.xml:
hbase.rootdir
I have not needed this approach since the system became stable on my local dev machine. Usually, if I shut down the cluster properly before shutting down the system, I do not run into such problems.
The answer above isn't the whole story; I ran into this with my HBase today.
If you are running with ZooKeeper, you also need to delete the data kept by ZooKeeper,
as I've posted in this answer:
https://stackoverflow.com/a/51857841/8428146
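A rough sketch of the full wipe for a pseudo-distributed setup (the /hbase paths below are common defaults -- check hbase.rootdir and zookeeper.znode.parent in your hbase-site.xml -- and use your distribution's own start/stop commands if they differ):
# 1. stop HBase (CDH service scripts, or bin/stop-hbase.sh on a plain Apache install)
# 2. delete everything under hbase.rootdir on HDFS
hadoop fs -rm -r /hbase
# 3. delete HBase's znodes from ZooKeeper (on newer ZooKeeper CLIs the command is deleteall)
hbase zkcli
  rmr /hbase
  quit
# 4. start HBase again; it will recreate its system tables from scratch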

How to Get Pig to Work with lzo Files?

So, I've seen a couple of tutorials for this online, but each seems to say to do something different. Also, none of them seems to specify whether you're trying to get things working on a remote cluster, interact locally with a remote cluster, etc...
That said, my goal is just to get my local computer (a mac) to make pig work with lzo compressed files that exist on a Hadoop cluster that's already been setup to work with lzo files. I already have Hadoop installed locally and can get files from the cluster with hadoop fs -[command].
I also already have pig installed locally and communicating with the hadoop cluster when I run scripts or when I just run stuff through grunt. I can load and play around with non-lzo files just fine. My problem is only in terms of figuring out a way to load lzo files. Maybe I can just process them through the cluster's instance of ElephantBird? I have no idea, and have only found minimal information online.
So, any sort of short tutorial or answer for this would be awesome, and would hopefully help more people than just me.
I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get PIG to work with lzos. Hope this helps someone!
NOTE: This is written with a Mac in mind. The steps will be almost identical for other OSes, and this should definitely give you what you need to know to configure things on Windows or Linux, but you will need to extrapolate a bit (obviously, change Mac-centric folders to whatever OS you're using, etc...).
Hooking PIG up to be able to work with LZOs
This was by far the most annoying and time-consuming part for me-- not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:
Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.
Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile this on a 64-bit machine.
Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.
Copy the java jar to $HADOOP_HOME/lib and $PIG_HOME/lib
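In shell terms, those first steps look roughly like this -- the ant target and the build output paths are from memory of the hadoop-lzo README, so double-check them against the repo:
git clone https://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
# build the jar and the native libraries (check the project README for the exact target)
ant compile-native tar
# copy the native libs and the jar to where hadoop and pig will look for them
cp build/native/Mac_OS_X-x86_64-64/lib/* $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/
cp build/hadoop-lzo-*.jar $HADOOP_HOME/lib/
cp build/hadoop-lzo-*.jar $PIG_HOME/lib/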
Then configure hadoop and pig to have the property java.library.path
point to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:
<property>
<name>mapred.child.env</name>
<value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
</property>
Now try out the grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed something up in mapred-site.xml and should double-check it.
Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).
Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above and might change, so I won't specify them here. What I will mention is that the versions of these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or MacPorts, as you'll likely get a newer version. Instead, just download the tarballs and build each one.
Run ant in the elephant-bird folder in order to create a jar.
For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.
Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.
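As a concrete smoke test, something like the following works for me (the jar versions and the HDFS path are placeholders, and the LzoTextLoader class name comes from elephant-bird -- verify it against the version you built):
pig <<'EOF'
REGISTER /usr/local/lib/hadoop/hadoop-lzo-x.x.x.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-x.x.x.jar;
-- load an lzo-compressed text file from the cluster, one line per record
raw = LOAD '/path/on/hdfs/somefile.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
-- keep the output manageable before dumping it
few = LIMIT raw 10;
DUMP few;
EOF
If that dumps ten lines, the lzo plumbing is working end to end.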
