How can I tell if a folder is cached or not in Alluxio? And if a folder is cached, how can I uncache it?
You can run bin/alluxio fs ls /path to check what percentage of the data is cached in Alluxio.
To free the file (or directory), run bin/alluxio fs free /path.
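A minimal sketch of both steps, run from the Alluxio installation directory; the ls output reports how much of each entry is cached, and free only evicts the Alluxio copy (data persisted in the under store, e.g. HDFS, is not touched):
# Check the cache status of /path (reports the percentage cached in Alluxio)
bin/alluxio fs ls /path
# Evict the cached copies from Alluxio worker storage
bin/alluxio fs free /path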
I have a bunch of large zipped (.bz2) files in a hadoop/hdfs location and I don't have enough space to bring them to my local machine to count them. I'm looking for a command that gives the line count of those zipped files directly in HDFS, like we do locally on Linux with wc -l *.txt for all files matching a pattern.
My /home directory has very little free space, but some of my programs running in production create dynamic files in the /home directory.
The problem is that if usage reaches 100%, my program stops working, so I have to manually go in and delete or copy the files away.
Rather than doing that, I want to redirect the files from /home to the /tmp directory in Unix by default.
Please give me some thoughts.
You have at least two ways to do this:
If you can configure your program to export files to another directory, do that.
If you cannot change anything in the program, you can create a cron job that removes or copies those files automatically (see the sketch below).
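For the cron option, a minimal sketch of a crontab entry, assuming the program writes under /home/youruser/output and that files older than a day are safe to move; the path, age, and schedule are placeholders to adapt:
# Hypothetical crontab entry: once an hour, move files older than one day
# from /home/youruser/output to /tmp (adjust path, age, and schedule)
0 * * * * find /home/youruser/output -type f -mtime +1 -exec mv {} /tmp/ \;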
If the program creates files under its own directory, you can create a symlink:
# Create directory in /tmp
mkdir /tmp/myprog
# Set permissions
chown "${USER}:${USER}" /tmp/myprog
chmod -R o-x /tmp/myprog
# Create symlink at /home/myprog
ln -s /tmp/myprog "${HOME}/myprog"
I can ssh to our box and do a hadoop fs -ls /theFolder and browse in for the files, etc., but that's all I know. My goal is to copy one of those files - they are Avro - to my local home folder.
How can I do this? I also found a get command, but I'm not sure how to use that either.
First, use hadoop fs -get /theFolder to copy it into the current directory you are ssh'ed into on your box.
Then you can use either scp or (my preference) rsync to copy the files between your box and your local system. Here's how I'd use rsync after the -get, still in the same directory:
rsync -av ./theFolder username@yourlocalmachine:/home/username
This will copy theFolder from the local fs on your box into your home folder on your machine's fs. Be sure to replace username with your actual username in both cases, and yourlocalmachine with your machine's hostname or IP address.
Using hadoop's get you can copy the files from HDFS to your box's file system. Read more about using get here.
Then, using scp (which works over ssh) you can copy those files to your local system. Read more about using scp here.
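A hedged sketch of that two-step approach, assuming your working directory on the box is /home/username; the user name, host name, and the part-00000.avro file name are placeholders:
# Step 1, on the remote box: copy the Avro file out of HDFS onto the box's local filesystem
hadoop fs -get /theFolder/part-00000.avro .
# Step 2, from your own machine: pull the file down over ssh (placeholder user/host)
scp username@remotebox:/home/username/part-00000.avro ~/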
hadoop fs -get theFolder
works fine, just like the previous answer says.
For syncing with your local machine, I think you can also set up git. That's easy as well.
I currently have an issue adding a folder's contents to Hive's distributed cache. I can successfully add multiple files to the distributed cache in Hive using:
ADD FILE /folder/file1.ext;
ADD FILE /folder/file2.ext;
ADD FILE /folder/file3.ext;
etc.
I also see that there is an ADD FILES (plural) option, which to my mind means you could specify a directory, like ADD FILES /folder/;, and everything in the folder would get included (this works with the Hadoop Streaming -files option). But this does not work with Hive; right now I have to explicitly add each file.
Am I doing this wrong? Is there a way to add a whole folder's contents to the distributed cache?
P.S. I tried wildcards, ADD FILE /folder/* and ADD FILES /folder/*, but those fail too.
Edit:
As of Hive 0.11 this is now supported, so:
ADD FILE /folder
now works.
What I do is pass the folder location to the Hive script as a parameter:
$ hive -f my-query.hql -hiveconf folder=/folder
and in the my-query.hql file:
ADD FILE ${hiveconf:folder}
Nice and tidy now!
ADD FILE doesn't support directories, but as a workaround you can zip the files and then add the archive to the distributed cache (ADD ARCHIVE my.zip). When the job is running, the content of the archive is unpacked into the local job directory on the slave nodes (see the mapred.job.classpath.archives property).
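A rough sketch of that workaround, using placeholder paths; my.zip matches the name used above:
# Zip the folder contents locally (placeholder paths)
cd /folder
zip -r /tmp/my.zip .
# Then, inside the Hive session, register the archive:
#   ADD ARCHIVE /tmp/my.zip;
# Its contents are unpacked on the slave nodes when the job runs.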
If the number of files you want to pass is relatively small and you don't want to deal with archives, you can also write a small script which prepares the ADD FILE commands for all the files you have in a given directory, e.g.:
#!/bin/bash
# list.sh: print an ADD FILE statement for every file in the given directory
if [ ! "$1" ]
then
  echo "Directory is missing!"
  exit 1
fi
ls -d "$1"/* | while read -r f; do echo "ADD FILE $f;"; done
Then invoke it from the Hive shell and execute the generated output:
!/home/user/list.sh /path/to/files
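For illustration, with hypothetical files file1.ext and file2.ext under /path/to/files, the script prints statements like these, which you then execute in the Hive shell:
ADD FILE /path/to/files/file1.ext;
ADD FILE /path/to/files/file2.ext;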
Well, in my case, I had to move a folder with child folders and files in it.
I used ADD ARCHIVE xxx.gz, which added the archive but did not explode (unzip) it on the slave machines.
Instead, ADD FILE <folder_name_without_trailing_slash> actually copies the whole folder recursively to the slaves.
Courtesy: the comments helped with debugging.
Hope this helps!
I am trying to export a folder from my local file system to HDFS. I am running the code through R. How can I do this?
Hoping for suggestions.
You should use the system command to do that easily:
system("hadoop fs -put /path/to/file /path/in/hdfs")
You can also use the rhdfs project, particularly the functions hdfs.write or hdfs.copy, which should do the same.
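If it helps, a minimal sketch of the shell commands that system() call wraps, with placeholder paths; hadoop fs -put also accepts a directory, which covers exporting a whole folder:
# -put copies the whole local folder recursively into HDFS
hadoop fs -put /path/to/folder /path/in/hdfs
# Verify the copy (assuming /path/in/hdfs already existed, the folder lands inside it)
hadoop fs -ls /path/in/hdfs/folder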