Use part of filename to add as a field/column - bash

I get 5 files every day (via wget), saved to /tmp, to be loaded into HDFS by a bash script:
donaldDuck-2013-07-20.zip
mickeyMouse-2013-07-20.zip
goofyGoof-2013-07-20.zip
plutoStar-2013-07-20.zip
bigBadWolf-2013-07-20.zip
The date part of the filename is dynamic.
How do I then tell Hadoop to load each of the 5 files? I heard something about a loop:
for file in /tmp/*; do
echo "Running ${file##*/} ...."
done
Do I replace the echo line with the "hadoop fs -put..." statement? What would it look like?

You can do something like:
#!/bin/bash
when=$(date "+%Y-%m-%d") #output like 2013-07-23
names=(donaldDuck mickeyMouse goofyGoof plutoStar bigBadWolf)
for file in "${names[#]}"
do
ls -l $file-$when.zip #output like donaldDuck-2013-07-23.zip
done
Explanation
The names are stored in the array $names, so we can loop through it with for file in "${names[@]}". In parallel, the date is stored in $when, so each filename is built as $file-$when.zip.
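To answer the question directly: yes, replace the echo/ls line with hadoop fs -put. A minimal sketch combining the two pieces, assuming an HDFS target directory of /path/to/hdfs/output/dir (adjust to your own path):
#!/bin/bash
hdfsdir=/path/to/hdfs/output/dir   # assumed HDFS destination
when=$(date "+%Y-%m-%d")           # output like 2013-07-23
names=(donaldDuck mickeyMouse goofyGoof plutoStar bigBadWolf)
for file in "${names[@]}"; do
    hadoop fs -put "/tmp/$file-$when.zip" "$hdfsdir"
done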

Here is what I would do:
hdfsdir=/path/to/hdfs/output/dir
datethru=$(date "+%Y-%m-%d" --date="3 days ago") # replace with however many days ago you want
for i in /tmp/*-"$datethru".zip; do
    hadoop fs -put "$i" "$hdfsdir"
done
This will essentially grab all the files in your directory that contain a specific date and end in .zip, and upload each of these files to a specific directory in hdfs.
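If you want to confirm the transfer afterwards, you can simply list the HDFS target directory:
hadoop fs -ls $hdfsdir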

Related

How to separate month's worth of timestamped data by day

I have a .log file which restarts at the beginning of each month, each message beginning with the following timestamp format: 01-07-2016 00:00:00:868|
There are thousands of messages per day and I'd like to create a short script which can figure out when the date increments and output each date to a new file with just that day's data. I'm not proficient in bash but I'd like to use sed or awk, as it's very useful for automating processes at my job and creating reports.
The script below will split the input log file into multiple files, with the date added as a suffix to the input file name:
split_logfile_by_date
#!/bin/bash
exec < "$1"
while read -r line
do
    date=$(echo "$line" | cut -d" " -f1)
    echo "$line" >> "$1.$date"
done
Example:
$ ls
log
$ split_logfile_by_date log
$ ls
log log.01-07-2016 log.02-07-2016 log.03-07-2016
Alternatively, with awk (using out rather than log as the variable name, since log is a built-in awk function):
awk '{out = FILENAME "." $1; print > out}' logfile
This will write all the 01-07-2016 records to the file logfile.01-07-2016
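One caveat: if the log covers many distinct days, some awk implementations run into a limit on simultaneously open output files. A variant that closes each day's file when the date changes (a sketch, assuming each day's records are contiguous in the log):
awk '$1 != prev {if (out) close(out); out = FILENAME "." $1; prev = $1} {print > out}' logfile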

Unix Shell script archive previous month file

I have files that has following format in directory:
SLS20160112.001 (20160112 stands for YYYYMMDD)
I wish to archive all of the previous month's files. For example, given these files:
SLS20160201.001
SLS20150201.001
SLS20160107.001
SLS20160130.001
Of the files listed above, I would archive SLS20160107.001 and SLS20160130.001, because the filenames stamp them as January.
SLS20160201.001 would remain, since I only want to archive the previous month's files. I can only extract the date from the filename, not the modification or access date.
My current logic is to loop through all the files, pick out the previous month's files, and pipe the filenames out to tar. But I'm not sure how to do that part:
for file in SLS*; do
f="${file%.*}"
GET PREVIOUS MONTH FILES AND THEN ECHO
done | tar cvzf SlSBackup_<PREVIOUS_MONTH>.TAR.GZ -T-
It looks like you want to solve the problem with a shell script. I do a lot of work on a Mac, so I use csh/tcsh (the default shell on OS X), and my answer will be a csh/tcsh script. You can either translate it to bash (your shell) or easily spawn a new shell by just typing $ tcsh.
You can write a small shell script which can filter filelist for your desired month.
#!/bin/csh -f
set mon_wanted = '01'
foreach file (`ls -1 SLS*.*`)
set mon = `echo $file | awk '{print substr($0, 8, 2)}'`
if ($mon != $mon_wanted) continue
echo -n $file ' '
end
Let's say the filename is foo.csh. Make it executable by
$ chmod 755 foo.csh
Then,
$ tar cvzf out.tar.gz `./foo.csh`
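If you would rather stay in bash (the asker's shell), a rough equivalent of the csh filter might look like the sketch below; adjust mon_wanted and the filename layout to your data (filter.sh is just an assumed name):
#!/bin/bash
mon_wanted=01
for file in SLS*.*; do
    mon=${file:7:2}   # characters 8-9 of SLSYYYYMMDD.ext hold the month
    [ "$mon" = "$mon_wanted" ] && printf '%s ' "$file"
done
It is used the same way: tar cvzf out.tar.gz `./filter.sh`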

Bash, issue on for loop

I want to list specific files (files uploaded yesterday) from an Amazon S3 bucket.
Then I want to loop over this list, and for every element of the list, unzip the file.
My code is:
for file in s3cmd ls s3://my-bucket/`date +%Y%m%d -d "1 day ago"`*
do s3cmd get $file
arrIN=(${file//my-bucket\//})
gunzip ${arrIN[1]}
done
so basically arrIN=(${file//my-bucket\//}) splits my string and allows me to retrieve the name of the file I want to unzip.
The thing is, files are downloading but nothing is being unzipped, so I tried:
for file in s3cmd ls s3://my-bucket/`date +%Y%m%d -d "1 day ago"`*
do s3cmd get $file
echo test1
done
Files are being downloaded but nothing is being echoed. The loop only works for the first line...
You need to use command substitution to iterate over the result of the desired s3cmd ls command.
for file in $(s3cmd ls s3://my-bucket/$(date +%Y%m%d -d "1 day ago")*); do
However, this isn't the preferred way to iterate over the output of a command, since in theory the results could contain whitespace. See Bash FAQ 001 for the proper method.
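For reference, a whitespace-safe variant along the lines of Bash FAQ 001 might look like the sketch below; it assumes s3cmd ls prints date, time, size and then the object URI on each line:
s3cmd ls s3://my-bucket/$(date +%Y%m%d -d "1 day ago")* | while read -r mod_date mod_time size uri; do
    s3cmd get "$uri"
    gunzip "$(basename "$uri")"
done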

How to find if there are new files in a directory on HDFS (Hadoop) every 4 min using shell script

I have a directory on HDFS, e.g. /user/customers, into which a customer data file is dumped every 3 minutes. I want to write a shell script that will check this folder and, if a new file is available, put that file's data into HBase. I have already figured out how I will put the data into HBase. But I am very new to shell scripting, and I want to know how I can get the new file name.
My hadoop command to put the data of file in HBASE is as follows:
hadoop jar /opt/mapr/hbase/hbase-0.94.12/hbase-0.94.12-mapr-1310.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cust:phno,cust:name,cust:memebershiptype /user/tablename customer.csv
Now the idea is to replace the customer.csv file name with the name of the file most recently dumped into the folder, and then run this command.
So if I am not wrong, I will need a cron job to do the scheduling part. But first I need the logic for getting the new file name into the command above; learning crontab to schedule it every 4 minutes can come later.
Please guide me, experts.
Try this script; it should give you the idea. First I list the files on HDFS and store them in customer_all_files.txt. The diff command finds the new files (those not yet in customer_processed_files.txt) and stores them in need_to_process.txt. The for loop then processes each new file and appends its name to the processed list. It's very simple; go through it.
# list the current files on HDFS (basenames only)
hadoop fs -ls hdfs://IPNamenode/user/customers/ | sed '1d;s/  */ /g' | cut -d' ' -f8 | xargs -n 1 basename > /home/givepath/customer_all_files.txt
# new files show up as "<" lines in the diff output
diff /home/givepath/customer_all_files.txt /home/givepath/customer_processed_files.txt > /home/givepath/need_to_process.txt
for line in $(awk '/^</ { print $2 }' /home/givepath/need_to_process.txt)
do
    echo "$line"
    hadoop jar /opt/mapr/hbase/hbase-0.94.12/hbase-0.94.12-mapr-1310.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cust:phno,cust:name,cust:memebershiptype /user/tablename "$line"
    echo "$line" >> /home/givepath/customer_processed_files.txt
done
Renaming part:
Do all your csv files have the same name, customer.csv? If yes, you need to rename them while uploading each file into HDFS.
Crontab part:
You can run your shell script every 4 minutes by using:
*/4 * * * * /your/shell/script/path
Add this line by typing crontab -e in terminal.
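For example, a complete (hypothetical) entry, assuming your script is saved as /home/givepath/load_new_files.sh and made executable, could be:
*/4 * * * * /home/givepath/load_new_files.sh >> /home/givepath/load_new_files.log 2>&1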

Removing Columns From ALL CSV Files in Specific Folder Then Outputting File With Date

I'm trying to automate some processes by using the Hazel app to move a file to a specific folder, execute a shell script on any CSV in that folder, and then move it to another folder. Right now the part I'm working on is the shell script. I have been testing out the cut command in Terminal on CSVs, but I'm not sure if that's the same thing as a shell script, since it doesn't seem to be working. What I have is:
cut -d',' -f2,12 test.csv > campaigns-12-31-13.csv
It works on test.csv, but I would like it to work with any CSV. It also exports with the date 12-31-13 hard-coded, whereas I'm trying to get it to export with whatever yesterday's date was.
How do I convert this into a shell script that runs on any CSV in the folder and adds yesterday's date to the end of the filename?
You can try the following script:
#! /bin/bash
saveDir="saveCsv"
dd=$(date +%Y-%m-%d -d "yesterday")
for file in *.csv ; do
bname=$(basename "$file" .csv)
saveName="${saveDir}/${bname}-${dd}.csv"
cut -d',' -f2,12 "$file" > "$saveName"
done
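Note that -d "yesterday" is GNU date syntax. Since Hazel implies macOS, the stock BSD date there does not support -d; the rough equivalent on macOS would be:
dd=$(date -v-1d +%Y-%m-%d)   # BSD date: go back one day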
