How to change the HADOOP log files location - hadoop

I'm running a Hadoop process that takes a couple of hours and a lot of disk space, and the process stops because it runs out of space. The Hadoop tmp folder still has plenty of space remaining, so I think the problem is the Hadoop log directory; I've checked, and there is not much space left there. Could anyone please advise how to change the Hadoop log file location to somewhere other than /home/hduser/hadoop/logs, without having to move the whole Hadoop setup? I'd be very thankful for any assistance.

I found a property in hadoop-env.sh:
# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
I uncommented it and changed it to export HADOOP_LOG_DIR=/newlocation/logs, and this solved the problem :)
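For example, assuming the new directory /newlocation/logs already exists and is writable by the Hadoop user (a hypothetical path), the line in conf/hadoop-env.sh would become:
export HADOOP_LOG_DIR=/newlocation/logs
Remember to restart the Hadoop daemons afterwards (bin/stop-all.sh, then bin/start-all.sh) so they start writing to the new directory.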

Related

Distributing data on cluster (using torrents?)

I hope this is a good place to ask this; otherwise, please redirect me to the correct forum.
I have a large amount of data (~400GB) that I need to distribute to all the nodes in a cluster (~100 nodes). Any help with how to do this would be appreciated; what follows is what I've tried.
I was thinking of doing this with torrents, but I'm running into a bunch of issues. These are the steps I tried:
I downloaded ctorrent to create the torrent, seed it, and download it. I ran into a problem because I didn't have a tracker.
I found that qbittorrent-nox has an embedded tracker, so I downloaded that onto one of my nodes and set the tracker up.
I then created the torrent using the tracker I had set up and copied it to my nodes.
When I run the torrent with ctorrent on the node with the actual data on it to seed the data I get:
Seed for others 72 hours
- 0/0/1 [1/1/1] 0MB,0MB | 0,0K/s | 0,0K E:0,1 Connecting
When I run on one of the nodes to download the data I get:
- 0/0/1 [0/1/0] 0MB,0MB | 0,0K/s | 0,0K E:0,1
So it seems they aren't connecting to the tracker properly, but I don't know why.
I am probably doing something very wrong, but I can't figure it out.
If anyone can help me with what I am doing, or has any other way of distributing the data efficiently, even without torrents, I would be very happy to hear it.
Thanks in advance for any help available.
but the node that's supposed to be seeding thinks it has 0% of the file, and so it doesn't seed.
If you create a metadata file (.torrent) with tool A and then want to seed it with tool B, you need to point B to both the metadata file and the data (the content files) itself.
I know it is a different issue now, and might require a different topic, but I'm hoping you might have ideas.
You should create a new question which will have more room for you to provide details.
So this is embarrassing: I might have had it working for a while now, but I did change my implementation since I started. I just re-checked, and the files I was transferring were corrupted in one of my earlier tries, and I have been using them ever since.
So, to sum up, this is what worked for me, in case anybody else ends up needing the same setup:
I create torrents using "transmission-create /path/to/file/or/directory/to/be/torrented -o /path/to/output/directory/output_file_name.torrent" (this is because qbittorrent-nox doesn't provide a tool, as far as I could find, to create torrents).
I run the torrent on the computer with the actual files, so it will seed, using "qbittorrent-nox ~/path/to/torrent/file/name_of_file.torrent".
I copy the .torrent file to all nodes and run "qbittorrent-nox ~/path/to/torrent/file/name_of_file.torrent" to start downloading.
qbittorrent settings I needed to configure:
In "Downloads" change "Save files to location" to the location of the data in the node that is going to be seeding #otherwise that node wont know it has the files specified in the torrent and wont seed them.
To avoid issues with the torrents sometimes starting as queued and requiring a "force resume". This doesn't appear to have fixed the problem 100% though
In "Speed" tab uncheck "Enable bandwidth management (uTP)"
uncheck "Apply rate limit to uTP connections"
In "BitTorrent" tab uncheck "Torrent Queueing"
Thanks for all the help, and I'm sorry I hassled people for no reason from some point onwards.

How do I update my hadoop instance after I have changed the source code?

I am using Hadoop v1.2.1 and have made a source code change for the project I am working on. The change was to the TaskReport and TaskInProgress classes, so that additional information would come back in the TaskReport object. I compiled the changes, re-packaged the hadoop-core-1.2.1.jar file, and replaced the existing hadoop-core-1.2.1.jar file in the folder where I had unpacked my Hadoop installation.
The map reduce program that I submit to hadoop sees the new properties I added, but the JobTracker doesn't seem to be populating the properties with any data when it creates the TaskReport objects. Do I need to do anything special to get the JobTracker to see these changes, or am I updating hadoop in an incorrect way?
I figured this out - I needed to restart the hadoop services. From a terminal within the hadoop install folder:
bin/stop-all.sh
bin/start-all.sh
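For completeness, the whole cycle after a source change would look roughly like this (paths are hypothetical; adjust them to your own build output and install folder):
# copy the rebuilt jar over the one the daemons load
cp ~/hadoop-src/build/hadoop-core-1.2.1.jar /usr/local/hadoop-1.2.1/hadoop-core-1.2.1.jar
cd /usr/local/hadoop-1.2.1
bin/stop-all.sh    # stop the JobTracker, TaskTrackers, NameNode and DataNodes
bin/start-all.sh   # restart so they pick up the patched TaskReport/TaskInProgress classes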

How to delete all the data and metadata in HBase WITHOUT uninstalling and reinstalling?

Running HBase in pseudo-distributed mode on my dev box. Cloudera CDH4. CentOS.
Somehow, my HBase installation has gotten totally corrupted. I ran this command:
./bin/hbase hbck -repairHoles
and the readout ended with this:
Summary:
-ROOT- is okay.
Number of regions: 1
Deployed on: localhost.localdomain,60020,1340917622717
.META. is okay.
Number of regions: 1
Deployed on: localhost.localdomain,60020,1340917622717
5 inconsistencies detected.
Looking at the documentation here:
http://hbase.apache.org/book/apbs03.html
it says this:
If inconsistencies still remain after these steps, you most likely have table integrity problems related to orphaned or overlapping regions.
Basically, I have no interest in digging in and trying to fix this. I want to completely nuke my HBase installation and start over fresh and clean. HOWEVER, I do not want to do an uninstall/reinstall, because we use Cloudera, and I don't want to mess with their whole weird configuration and setup.
Is there a way to delete all the data and metadata in HBase WITHOUT uninstalling and reinstalling?
I do not recommend this unless you are at the point of no return.
I do not know if this is the correct way to nuke the HBase data, but when I run into such inconsistencies I usually delete all the contents of the directory that holds the HBase data.
To find that directory, look up the following property in hbase-site.xml:
hbase.rootdir
I have not needed this approach since the system got stable on my local dev machine. Usually, if I shut down the cluster properly before shutting down the system, I do not run into such problems.
The answer above isn't the whole story; I ran into this with my HBase today.
If you are running with ZooKeeper, you also need to delete the data kept by ZooKeeper,
as I've posted in this answer:
https://stackoverflow.com/a/51857841/8428146
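A destructive sketch of that full wipe, assuming a pseudo-distributed setup where hbase.rootdir points at /hbase in HDFS and HBase keeps its state under the /hbase znode in ZooKeeper (both are assumptions; check your own hbase-site.xml and ZooKeeper setup before running anything like this):
bin/stop-hbase.sh                # stop HBase before touching its data
hadoop fs -rm -r /hbase          # wipe everything under hbase.rootdir (irreversible)
zkCli.sh -server localhost:2181  # then, inside the ZooKeeper shell:
  rmr /hbase                     #   remove the znode HBase uses for its state
bin/start-hbase.sh               # HBase recreates its catalog tables on a clean start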

Hadoop can not setTime to a directory, why?

I'm running into a Hadoop issue. When I run my Hadoop test program, which changes the access time and modification time of a directory on the Hadoop file system, some errors occur, and I have no idea why. I'm hoping for any useful advice.
In most versions of Hadoop it is indeed not possible to set the times of a directory. See this Hadoop ticket for the details: HDFS-2436. The ticket will tell you which version you need in order to do that.
Note, however, that Hadoop does not support access times for directories at all, as far as I know.

How come the unix locate command still shows files/folders that aren't there any more?

I recently moved my whole local web development setup over to MacPorts, rather than using MAMP on my Mac. I've been getting into Python/Django and didn't really need MAMP any more.
Thing is, I have uninstalled MAMP from the Applications folder, along with its preferences file, so how come, when I run the 'locate MAMP' command in the Terminal, it still shows all my /Applications/MAMP/ stuff as if it were all still there? And when I 'cd' into /Applications/MAMP/, it doesn't exist.
Is it something to do with locate being a kind of index-based search, so these old file paths are still cached? Please explain why, and how to fix it so they don't show up anymore.
You've got the right idea: locate uses a database called 'locatedb'. It's normally updated by system cron jobs (I'm not sure which one on OS X); you can force an update with the updatedb command. See http://linux-sxs.org/utilities/updatedb.html among others.
Also, if you don't find files which you expect to, note this important caveat from the BUGS section of OSX' locate(1) man-page:
The locate database is typically built by user ''nobody'' and the
locate.updatedb(8) utility skips directories which are not readable
for user ''nobody'', group ''nobody'', or world. For example, if your
HOME directory is not world-readable, none of your files are in the database.
The other answers are correct about needing to update the locate database. I've got this alias to update my locate DB:
alias update_locate='sudo /usr/libexec/locate.updatedb'
I actually don't use locate all that much anymore now that I've found mdfind. It uses the Spotlight file index, which OS X is much better at keeping up to date compared to the locatedb. It also has quite a bit more power in what it can search for from the command line.
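A minimal sketch combining the two approaches (assuming a stock OS X install; the rebuild can take a while on a large disk):
sudo /usr/libexec/locate.updatedb   # rebuild the locate database so stale entries disappear
locate MAMP                         # should now come back empty for the removed files
mdfind -name MAMP                   # Spotlight-based alternative that stays current on its own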
Indeed, the locate command searches through an index; that's why it's pretty fast.
The index is generated by the updatedb command, which is usually run as a nightly or weekly job.
So to update it manually, just run updatedb.
According to the man page, its database is updated once a week:
NAME
locate.updatedb -- update locate database
SYNOPSIS
/usr/libexec/locate.updatedb
DESCRIPTION
The locate.updatedb utility updates the database used by locate(1). It is typically run once a week by
the /etc/periodic/weekly/310.locate script.
Take a look at the locate man page:
http://unixhelp.ed.ac.uk/CGI/man-cgi?locate+1
You'll see that locate searches a database, not your actual filesystem.
You can update that database by using the updatedb command.
Also, since it's a database, unless you update it regularly, locate won't find files that are in your filesystem but aren't in the database.
