Pig overwrite data in hive using LOAD - hadoop

I am new to Pig and Hive. I need to load data from a CSV file stored on HDFS into a Hive table using Pig LOAD/STORE, for which I am using:
load_resource_csv = LOAD '/user/hadoop/emp.csv' USING PigStorage(',')
AS
(dates:chararray,
shipnode_key:chararray,
delivery_method:chararray
);
STORE load_resource_csv
INTO 'employee'
USING org.apache.hive.hcatalog.pig.HCatStorer();
I need to overwrite the data in the Hive table every time I run the Pig script. How can I do that?

Use the fs shell command inside the Pig script to remove the target directory before storing, e.g. fs -rm -r -f /path/to/dir:
load_resource_csv = LOAD '/user/cloudera/newfile' USING PigStorage(',')
AS
(name:chararray,
skill:chararray
);
fs -rm -r -f /user/hive/warehouse/stack/
STORE load_resource_csv INTO '/user/hive/warehouse/stack' USING PigStorage(',');
-------------- BEFORE ---------------------------
$ hadoop fs -ls /user/hive/warehouse/stack/
-rwxrwxrwx 1 cloudera supergroup 22 2016-08-05 18:31 /user/hive/warehouse/stack/000000_0
hive> select * from stack;
OK
bigDataLearner hadoop
$ hadoop fs -cat /user/cloudera/newfile
bigDataLearner,spark
-------------- AFTER -------------------
$ hadoop fs -ls /user/hive/warehouse/stack
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2016-08-05 18:56 /user/hive/warehouse/stack/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 21 2016-08-05 18:56 /user/hive/warehouse/stack/part-m-00000
$ hadoop fs -cat /user/hive/warehouse/stack/*
bigDataLearner,spark
hive> select * from stack;
OK
bigDataLearner spark
Time taken: 0.183 seconds, Fetched: 1 row(s)
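If the target is a Hive table written with HCatStorer, as in the question above, another option is to empty the table from a wrapper shell script before running Pig. This is a minimal sketch, assuming employee is a managed Hive table and that the Pig script is saved as a hypothetical load_emp.pig:
#!/bin/bash
# Empty the Hive table so this run's STORE effectively overwrites it
hive -e "TRUNCATE TABLE employee"
# Re-run the Pig script that LOADs the CSV and STOREs via HCatStorer
pig -useHCatalog load_emp.pig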

Related

is it safe to remove the /tmp/hive/hive folder?

Is it safe to remove the /tmp/hive/hive folder from HDFS, as the hdfs user, with:
hdfs dfs -rm -r /tmp/hive/hive
The reason we ask is that under /tmp/hive/hive we have thousands of files and we can't delete them.
hdfs dfs -ls /tmp/hive/
Found 7 items
drwx------ - admin hdfs 0 2019-03-05 12:00 /tmp/hive/admin
drwx------ - drt hdfs 0 2019-06-16 14:02 /tmp/hive/drt
drwx------ - ambari-qa hdfs 0 2019-06-16 15:11 /tmp/hive/ambari-qa
drwx------ - anonymous hdfs 0 2019-06-16 08:57 /tmp/hive/anonymous
drwx------ - hdfs hdfs 0 2019-06-13 08:42 /tmp/hive/hdfs
drwx------ - hive hdfs 0 2019-06-13 10:58 /tmp/hive/hive
drwx------ - root hdfs 0 2018-07-17 23:37 /tmp/hive/root
What we have done so far, shown below, is to remove the files that are older than 10 days, but because there are so many files, nothing gets deleted at all:
hdfs dfs -ls /tmp/hive/hive | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'
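A more readable sketch of the same age-based cleanup, assuming GNU date on the client and that skipping the HDFS trash is acceptable; the 10-day cutoff matches the one-liner above:
#!/bin/bash
# Delete entries under /tmp/hive/hive whose modification time is older than 10 days
cutoff=$(date -d '10 days ago' +%s)
hdfs dfs -ls /tmp/hive/hive | awk '$6 ~ /^[0-9]/ {print $6, $7, $8}' | \
while read -r day time path; do
  when=$(date -d "$day $time" +%s)
  if [ "$when" -lt "$cutoff" ]; then
    echo "Deleting: $path"
    hdfs dfs -rm -r -skipTrash "$path"
  fi
done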

Run a bash command and split the output by line not by space

I want to run the following command and get the output into a variable, split into an array by line, not by space:
files=$( hdfs dfs -ls -R $hdfsDir)
So when I echo $files, the output I get is the following:
drwxr-xr-x - pepeuser supergroup 0 2016-05-27 15:03 /user/some/kpi/2015/01/02 -rw-r--r-- 3 pepeuser supergroup 55107934 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00000902148 -rw-r--r-- 3 pepeuser supergroup 49225279 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00001902148
When I do a for loop over $files, instead of getting the full line on each iteration, I get each column separately. It prints like the following:
drwxr-xr-x
-
pepeuser
supergroup
What I need is for the loop to print each full line, like this:
drwxr-xr-x - pepeuser supergroup 0 2016-05-27 15:03 /user/some/kpi/2015/01/02
-rw-r--r-- 3 pepeuser supergroup 55107934 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00000902148
-rw-r--r-- 3 pepeuser supergroup 49225279 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00001902148
If you have bash 4, you can use readarray:
readarray -t files < <(hdfs dfs -ls -R "$hdfsDir")
Otherwise, use read -a to read into an array. IFS=$'\n' sets the field separator to newlines and -d '' tells it to keep reading until it hits a NUL character: effectively, that means it'll read to EOF.
IFS=$'\n' read -d '' -r -a files < <(hdfs dfs -ls -R "$hdfsDir")
You can verify that the array is populated correctly with something like:
printf '[%s]\n' "${files[@]}"
And can loop over the array with:
for file in "${files[@]}"; do
echo "$file"
done
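Once the array holds one ls line per element, you can still pull individual columns out of each line where needed. A small sketch, assuming the path is the 8th whitespace-separated field of hdfs dfs -ls output:
for file in "${files[@]}"; do
  # full line, e.g. "-rw-r--r-- 3 pepeuser supergroup 55107934 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00000902148"
  echo "$file"
  # just the path column
  path=$(echo "$file" | awk '{print $8}')
  echo "$path"
done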

Merge multiple files recursively in HDFS

My folder path structure in HDFS is something like this:
/data/topicname/year=2017/month=02/day=28/hour=00
/data/topicname/year=2017/month=02/day=28/hour=01
/data/topicname/year=2017/month=02/day=28/hour=02
/data/topicname/year=2017/month=02/day=28/hour=03
Inside these paths I have many small JSON files. I am writing a shell script that can merge all the files inside each of these directories into a single file whose name depends on the path.
Example:
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=00 into one merged file full_2017_02_28_00.json
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=01 into one merged file full_2017_02_28_01.json
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=02 into one merged file full_2017_02_28_02.json and so on.
Keeping the file names in the pattern above is a secondary goal which I will try to achieve; for now I can hardcode the filenames.
But recursive concatenation across the directory structure is not working.
So far, I have tried below:
hadoop fs -cat /data/topicname/year=2017/* | hadoop fs -put - /merged/test1.json
Error:-
cat: `/data/topicname/year=2017/month=02/day=28/hour=00': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=01': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=02': Is a directory
Recursive cat does not happen in the attempt above.
hadoop fs -ls /data/topicname/year=2017/month=02 | find /data/topicname/year=2017/month=02/day=28 -name '*.json' -exec cat {} \; > output.json
Error:-
find: ‘/data/topicname/year=2017/month=02/day=28’: No such file or directory
This attempt runs find on the local filesystem instead of HDFS.
for i in `hadoop fs -ls -R /data/topicname/year=2017/ | cut -d' ' -f19` ;do `hadoop fs -cat $i/* |hadoop fs -put - /merged/output.json`; done
Error:-
The message "cannot write output to stream" is repeated multiple times
The file name /merged/output.json is repeated a few times
How is this achievable? I do not want to use Spark.
Use -appendToFile:
for file in `hdfs dfs -ls -R /src_folder | awk '$2!="-" {print $8}'`; do hdfs dfs -cat $file | hdfs dfs -appendToFile - /target_folder/filename;done
Time taken will be dependent on the number and size of files as the process is sequential.
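To get one merged file per hour directory with the full_YYYY_MM_DD_HH.json naming from the question, the same idea can be applied per directory. A sketch, assuming the year=/month=/day=/hour= layout shown above and a hypothetical /merged target directory:
#!/bin/bash
# Walk every hour=XX directory and merge its files into one output per directory
for dir in $(hdfs dfs -ls -R /data/topicname | awk '$1 ~ /^d/ && $8 ~ /hour=/ {print $8}'); do
  # Turn .../year=2017/month=02/day=28/hour=00 into full_2017_02_28_00.json
  name=$(echo "$dir" | sed 's|.*year=\([0-9]*\)/month=\([0-9]*\)/day=\([0-9]*\)/hour=\([0-9]*\)|full_\1_\2_\3_\4.json|')
  hdfs dfs -cat "$dir"/* | hdfs dfs -appendToFile - "/merged/$name"
done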
I was able to achieve my goal with the script below:
#!/bin/bash
for k in 01 02 03 04 05 06 07 08 09 10 11 12
do
  for j in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  do
    for i in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    do
      hadoop fs -cat /data/topicname/year=2017/month=$k/day=$j/hour=$i/* | hadoop fs -put - /merged/TEST1/2017"_"$k"_"$j"_"$i.json
      hadoop fs -du -s /merged/TEST1/2017"_"$k"_"$j"_"$i.json > /home/test/sizetest.txt
      x=`awk '{ print $1 }' /home/test/sizetest.txt`
      echo $x
      if [ $x -eq 0 ]
      then
        hadoop fs -rm /merged/TEST1/2017"_"$k"_"$j"_"$i.json
      else
        echo "MERGE DONE!!! All files generated at hour $i of $j-$k-2017 merged into one"
        echo "DELETED 0 SIZED FILES!!!!"
      fi
    done
  done
done
rm -f /home/test/sizetest.txt
hadoop fs -rm -r /data/topicname

how to properly import csv data set using kite-dataset partitioned schema?

I'm working with the publicly available CSV dataset from MovieLens.
I have created a partitioned dataset for the ratings.csv:
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json --format parquet
Here is my year-month.json:
[ {
"name" : "year",
"source" : "timestamp",
"type" : "year"
}, {
"name" : "month",
"source" : "timestamp",
"type" : "month"
} ]
Here is my csv import command:
kite-dataset csv-import ratings.csv ratings
After the import finished, I ran this command to see whether the year and month partitions were in fact created:
hadoop fs -ls /user/hive/warehouse/ratings/
What I noticed is that only a single year partition was created, and inside it only a single month partition:
[cloudera#quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/
Found 3 items
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:49 /user/hive/warehouse/ratings/.metadata
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/.signals
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970
[cloudera#quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/year=1970/
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970/month=01
What is the proper way to conduct such a partitioned import, which would result in all year and all month partitions being created?
Add three zeros at the end of each timestamp: the source timestamps are in seconds, while the partitioner expects milliseconds, which is why everything landed in year=1970.
Use the shell script below to do it:
#!/bin/bash
# add the CSV header to both files
head -n 1 ratings.csv > ratings_1.csv
head -n 1 ratings.csv > ratings_2.csv
# output the first 10,000,000 rows to ratings_1.csv
# this includes the header, and uses tail to remove it
head -n 10000001 ratings.csv | tail -n +2 | awk -F',' -v OFS=',' '{ $NF = $NF "000"; print }' >> ratings_1.csv
# output the rest of the file to ratings_2.csv
# this starts at the line after the ratings_1 file stopped
tail -n +10000002 ratings.csv | awk -F',' -v OFS=',' '{ $NF = $NF "000"; print }' >> ratings_2.csv
I had this problem too, and it was resolved after adding the three zeros.
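If splitting the file into two parts is not needed, a single pass over the CSV does the same conversion. A minimal sketch, assuming the timestamp is the last column of ratings.csv and using a hypothetical output file ratings_ms.csv:
# keep the header, append three zeros to every timestamp (seconds -> milliseconds)
awk -F',' -v OFS=',' 'NR==1 {print; next} { $NF = $NF "000"; print }' ratings.csv > ratings_ms.csv
# then import the converted file
kite-dataset csv-import ratings_ms.csv ratings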

specify time range using unix grep

Hi, I have a few files in HDFS, and I need to extract the files in a specific time range. How can I do that using a Unix grep-style command?
My HDFS listing looks like this:
-rw-rw-r-- 3 pscore hdpdevs 94461 2014-12-10 02:08 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-10_02-07-12-0
-rw-rw-r-- 3 pscore hdpdevs 974422 2014-12-11 02:08 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-11_02-07-10-0
-rw-rw-r-- 3 pscore hdpdevs 32854 2014-12-11 02:08 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-11_02-07-16-0
-rw-rw-r-- 3 pscore hdpdevs 1936753 2014-12-12 02:07 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-12_02-06-04-0
-rw-rw-r-- 3 pscore hdpdevs 79365 2014-12-12 02:07 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-12_02-06-11-0
I want to extract the files from 2014-12-11 09:00 to 2014-12-12 09:00.
I tried using hadoop fs -ls /dabc | sed -n '/2014-12-11 09:00/ , /2014-12-12 09:00/p' but that doesn't work. Any help? I want to use a grep command for this.
awk '$6FS$7 >= "2014-12-11 09:00" && $6FS$7 <= "2014-12-12 09:00"'
Can I do string comparison in awk?
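Yes: awk compares operands as strings when either side is a string, and because the date and time columns are zero-padded, lexicographic order matches chronological order here. A sketch of the full pipeline, printing only the paths in the range (the directory /dabc is taken from the question):
hadoop fs -ls /dabc | awk '$6FS$7 >= "2014-12-11 09:00" && $6FS$7 <= "2014-12-12 09:00" {print $8}'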
