I am trying to create a bash script to upload files from the local edge node filesystem to HDFS, and I would like to include a timestamp in each filename. I'm having some problems getting the timestamp to work.
#!/bin/bash
echo Running upload script to hdfs...
timestamp() { date +"%T"; }
hdfs dfs -put /home/myname/folder1/* /user/myname/example_1_$(timestamp).txt
hdfs dfs -put /home/myname/folder2/* /user/myname/example_2_$(timestamp).txt
Using date +%T is not possible here because the result contains : characters, like 11:12:45, and creating filenames containing : is not possible in HDFS. See HADOOP-3275.
Try this command in the script,
hdfs dfs -put /home/myname/folder1/* /user/myname/example_1_$(date +%H%M%S).txt
This will create a filename like /user/myname/example_1_111245.txt.
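Putting it together, a minimal working version of the full script might look like this (a sketch: it assumes each source folder holds a single file, since -put with multiple source files requires the destination to be a directory):
#!/bin/bash
echo "Running upload script to hdfs..."
# HDFS-safe timestamp: no ':' characters (see HADOOP-3275)
timestamp() { date +"%H%M%S"; }
# Capture it once so both uploads get the same timestamp
ts=$(timestamp)
hdfs dfs -put /home/myname/folder1/* "/user/myname/example_1_${ts}.txt"
hdfs dfs -put /home/myname/folder2/* "/user/myname/example_2_${ts}.txt"
Capturing the timestamp in a variable also avoids the two uploads landing on different seconds.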
Related
I'm trying to read the contents of a few files, use grep to find the lines matching my search query, and then output the results into a file in another directory. I get the error "No such file or directory". I have already created the folder structure and the text file.
hadoop fs -cat /Final_Dataset/c*.txt | grep 2015-01-* > /energydata/2015/01/01.txt
ERROR:
-bash: /energydata/2015/01/01.txt: No such file or directory
> /energydata/2015/01/01.txt means that the output is being redirected to a local file. hadoop fs -cat sends output to your local machine, and at that point you're no longer operating within Hadoop. grep simply acts on a stream of data; it doesn't care (or know) where it came from.
You need to make sure that /energydata/2015/01/ exists locally before you run this command. You can create it with mkdir -p /energydata/2015/01/.
If you're looking to pull certain records from a file on HDFS and then write the new file back to HDFS, I'd suggest not cat-ing the file and instead keeping the processing entirely on the cluster, using something like Spark or Hive to transform the data efficiently. Failing that, just do a hadoop dfs -put <local_path> /energydata/2015/01/01.txt.
The following CLI command worked:
hadoop fs -cat /FinalDataset/c*.txt | grep '2015-01-*' | hadoop fs -put - /energydata/2015/01/output.txt
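Here the - argument tells hadoop fs -put to read from standard input, so the filtered stream is written straight into HDFS without ever touching the local filesystem. Quoting the grep pattern also prevents the local shell from expanding it as a glob before grep sees it.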
Is there a way to read any file format from HDFS directly by using the HDFS path, instead of having to pull the file locally from HDFS and then read it?
You can use the cat command on HDFS to read regular text files.
hdfs dfs -cat /path/to/file.csv
To read compressed files such as gz, bz2, etc., you can use:
hdfs dfs -text /path/to/file.gz
These are the two read methods that Hadoop supports natively via FsShell commands. For other, more complex file types, you will need something more involved, such as a Java program or something along those lines.
hdfs dfs -cat /path or hadoop fs -cat /path
You have to pull the entire file either way. Whether you use the cat or text commands, the entire file is still streamed to your shell; there's just no local copy of the file left when the command ends. So if you plan on inspecting the file a few times, it's better to get it first.
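For example (the local destination path here is just an illustration):
hdfs dfs -get /path/to/file.csv /tmp/file.csv
After that, repeated inspection is an ordinary local file read.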
As an HDFS client, you must contact the NameNode to acquire all of the block locations for a particular file.
You can try with hdfs dfs -cat
Usage: hdfs dfs -cat [-ignoreCrc] URI [URI ...]
hdfs dfs -cat /your/path
My aim is to read all the files that start with "trans" in a directory, merge them into a single file, and load that single file into an HDFS location.
My source directory is /user/cloudera/inputfiles/.
Assume that inside the above directory there are lots of files, but I need only the ones that start with "trans".
My destination directory is /user/cloudera/transfiles/.
So I tried the command below:
hadoop dfs -getmerge /user/cloudera/inputfiles/trans* /user/cloudera/transfiles/records.txt
but the above command does not work.
If I try the command below, then it works:
hadoop dfs -getmerge /user/cloudera/inputfiles /user/cloudera/transfiles/records.txt
Any suggestions on how I can merge selected files from an HDFS location and store the merged single file in another HDFS location?
Below is the usage of the getmerge command:
Usage: hdfs dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and
concatenates files in src into the destination local file.
Optionally addnl can be set to enable adding a newline character at the
end of each file.
It expects a directory as its first parameter. Also note that getmerge writes to a local destination file, not an HDFS path, so it can't write the merged result directly to /user/cloudera/transfiles/.
You can try the cat command like this:
hadoop dfs -cat /user/cloudera/inputfiles/trans* > /<local_fs_dir>/records.txt
hadoop dfs -copyFromLocal /<local_fs_dir>/records.txt /user/cloudera/transfiles/records.txt
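Alternatively, following the same stdin pattern as the grep example earlier, you can skip the intermediate local file entirely (a sketch; it assumes records.txt does not already exist at the destination, since -put will not overwrite):
hadoop fs -cat /user/cloudera/inputfiles/trans* | hadoop fs -put - /user/cloudera/transfiles/records.txt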
This may have been answered somewhere but I haven't found it yet.
I have a simple shell script that I'd like to use to move log files into my Hadoop cluster. The script will be called by Logrotate on a daily basis.
It fails with the following error: "/user/qradar: cannot open `/user/qradar' (No such file or directory)".
#!/bin/bash
#use today's date and time
day=$(date +%Y-%m-%d)
#change to log directory
cd /var/log/qradar
#move and add time date to file name
mv qradar.log qradar$day.log
#load file into variable
#copy file from local to hdfs cluster
if [ -f qradar$day.log ]
then
    file=qradar$day.log
    hadoop dfs -put /var/log/qradar/&file /user/qradar
else
    echo "failed to rename and move the file into the cluster" >> /var/log/messages
fi
The directory /user/qradar does exist and can be listed with the Hadoop file commands.
I can also manually move the file into the correct directory using the Hadoop file commands. Can I move files into the cluster in this manner? Is there a better way?
Any thoughts and comments are welcome.
Thanks
Is the &file a typo on the hadoop dfs -put line?
If not, then this is likely your problem: you're running the command hadoop dfs -put /var/log/qradar/ in the background (the ampersand sends everything before it to the background), followed by a second command, file /user/qradar. That second command runs the local file utility against /user/qradar, which the shell looks for on the local path, and its output is exactly the "cannot open" error you're seeing.
My guess is you meant the following (dollar rather than ampersand):
hadoop dfs -put /var/log/qradar/$file /user/qradar
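For reference, the corrected copy step would look like this (with quoting added as a precaution in case the expanded filename ever contains unexpected characters):
if [ -f "qradar$day.log" ]
then
    file="qradar$day.log"
    hadoop dfs -put "/var/log/qradar/$file" /user/qradar
else
    echo "failed to rename and move the file into the cluster" >> /var/log/messages
fi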
I have data in multiple local folders, i.e. /usr/bigboss/data1, /usr/bigboss/data2, and many more. I want to use all of these folders as the input source for my MapReduce job and store the result in HDFS. I cannot find a working command to do this with the Hadoop Grep example.
The data will need to reside in HDFS for you to process it with the grep example. You can upload the folders to HDFS using the -put FsShell command:
hadoop fs -mkdir bigboss
hadoop fs -put /usr/bigboss/data* bigboss
This will create a folder in the current user's HDFS home directory and upload each of the data directories to it.
Now you should be able to run the grep example over the data.
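For example, the grep example takes an input directory, an output directory, and a regular expression (a sketch: the examples jar name and location vary by Hadoop version and distribution, and the output directory must not already exist):
hadoop jar hadoop-mapreduce-examples.jar grep bigboss grep_output 'your_regex_here'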