How to properly import a CSV data set using a kite-dataset partitioned schema?

I'm working with the publicly available CSV dataset from MovieLens.
I have created a partitioned dataset for ratings.csv:
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json --format parquet
Here is my year-month.json:
[ {
"name" : "year",
"source" : "timestamp",
"type" : "year"
}, {
"name" : "month",
"source" : "timestamp",
"type" : "month"
} ]
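The question references rating.avsc without showing it. For the year/month strategy above to work, the schema needs a timestamp field (normally a long holding milliseconds since the epoch). A rough sketch of what it might look like, with field names and types assumed from the MovieLens ratings.csv columns:
{
  "type" : "record",
  "name" : "Rating",
  "fields" : [
    { "name" : "userId", "type" : "int" },
    { "name" : "movieId", "type" : "int" },
    { "name" : "rating", "type" : "double" },
    { "name" : "timestamp", "type" : "long" }
  ]
}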
Here is my CSV import command:
kite-dataset csv-import ratings.csv ratings
After the import finished, I ran this command to see whether year and month partitions were in fact created:
hadoop fs -ls /user/hive/warehouse/ratings/
What I've noticed is that only a single year partition was created, and inside of it only a single month partition was created:
[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/
Found 3 items
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:49 /user/hive/warehouse/ratings/.metadata
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/.signals
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970
[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/year=1970/
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970/month=01
What is the proper way to conduct such a partitioned import, which would result in all year and month partitions being created?

Add three zeros at the end of each timestamp. The timestamps in ratings.csv are in seconds since the epoch, but the partitioner treats them as milliseconds, which is why every row landed in year=1970/month=01; appending three zeros converts seconds to milliseconds.
Use the shell script below to do it:
#!/bin/bash
# add the CSV header to both files
head -n 1 ratings.csv > ratings_1.csv
head -n 1 ratings.csv > ratings_2.csv
# output the first 10,000,000 rows to ratings_1.csv
# this includes the header, and uses tail to remove it
head -n 10000001 ratings.csv | tail -n +2 | awk -F, 'BEGIN{OFS=","} {$4=$4"000"; print}' >> ratings_1.csv
# output the rest of the file to ratings_2.csv
# this starts at the line after the ratings_1 file stopped
tail -n +10000002 ratings.csv | awk -F, 'BEGIN{OFS=","} {$4=$4"000"; print}' >> ratings_2.csv
I had this problem as well, and it was resolved after appending the three zeros.
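If you do not need to split the file, a single awk pass that appends the three zeros to the timestamp column (the fourth field of ratings.csv) should also work; this is just a sketch, and ratings_ms.csv is an arbitrary output name:
# keep the header unchanged, append "000" to the timestamp of every data row
awk -F, 'BEGIN{OFS=","} NR==1{print; next} {$4=$4"000"; print}' ratings.csv > ratings_ms.csv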

Related

Is it safe to remove the /tmp/hive/hive folder?

Is it safe to remove the /tmp/hive/hive folder from HDFS, as the hdfs user?
hdfs dfs -rm -r /tmp/hive/hive
The reason for asking is that under /tmp/hive/hive we have thousands of files and we can't delete them.
hdfs dfs -ls /tmp/hive/
Found 7 items
drwx------ - admin hdfs 0 2019-03-05 12:00 /tmp/hive/admin
drwx------ - drt hdfs 0 2019-06-16 14:02 /tmp/hive/drt
drwx------ - ambari-qa hdfs 0 2019-06-16 15:11 /tmp/hive/ambari-qa
drwx------ - anonymous hdfs 0 2019-06-16 08:57 /tmp/hive/anonymous
drwx------ - hdfs hdfs 0 2019-06-13 08:42 /tmp/hive/hdfs
drwx------ - hive hdfs 0 2019-06-13 10:58 /tmp/hive/hive
drwx------ - root hdfs 0 2018-07-17 23:37 /tmp/hive/root
What we did until now, as shown below, is to remove the files that are older than 10 days,
but because there are so many files, the files are not deleted at all:
hdfs dfs -ls /tmp/hive/hive | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'
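For readability, the same idea can be written as a short script instead of a one-liner. This is only a sketch, assuming GNU date and the usual eight-column hdfs dfs -ls output (date and time in columns 6-7, path in column 8):
#!/bin/bash
# delete entries under /tmp/hive/hive that are older than 10 days
NOW=$(date +%s)
CUTOFF=$((10 * 24 * 60 * 60))   # 10 days in seconds
hdfs dfs -ls /tmp/hive/hive | grep '^[-d]' | while read -r line; do
    mod_date=$(echo "$line" | awk '{print $6" "$7}')
    path=$(echo "$line" | awk '{print $8}')
    mod_epoch=$(date -d "$mod_date" +%s) || continue
    if (( NOW - mod_epoch > CUTOFF )); then
        echo "Deleting: $path"
        hdfs dfs -rm -r -skipTrash "$path"
    fi
done
Drop -skipTrash if you prefer the deleted paths to go to the HDFS trash instead of being removed immediately.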

Unix script for checking logs for last 10 days

I have a log table which is maintained for a single day and the data from the table is only present for one day. However, the logs for it are present in the Unix directory.
My requirement is to check the logs for the last 10 days and find the count of records that got loaded.
In the log file the pattern is something like this (FastLoad log of Teradata):
**** 13:16:49 END LOADING COMPLETE
Total Records Read = 443303
Total Error Table 1 = 0 ---- Table has been dropped
Total Error Table 2 = 0 ---- Table has been dropped
Total Inserts Applied = 443303
Total Duplicate Rows = 0
I want the script to be parameterized (the parameter will be the stage table name) and to find the records inserted into the table and the error tables for the last 10 days.
Is this possible? Can anyone help me build the Unix script for this?
There are many logs in the logs directory. What if I want to check only the ones below:
bash-3.2$ ls -ltr 2018041*S_EVT_ACT_FLD*
-rw-rw----+ 1 edwops abgrp 52610 Apr 10 17:37 20180410173658_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52576 Apr 11 18:12 20180411181205_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52646 Apr 13 18:04 20180413180422_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52539 Apr 14 16:16 20180414161603_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52538 Apr 15 14:15 20180415141523_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52576 Apr 16 15:38 20180416153808_S_EVT_ACT_FLD.log
Thanks.
find . -ctime -10 -type f -print | xargs awk -F= '/Total Records Read/ {print $2}' | paste -sd+ | bc
find . -ctime -10 -type f -print gets the filenames of files 10 days old or younger in the current working directory. To run on a different directory, replace . with the path.
awk -F= '/Total Records Read/ {print $2}' using = as the field separator, prints the part after the = on any line containing the key phrase
Total Records Read
paste -sd+ joins the numbers into a single line separated by plus signs
bc evaluates the resulting expression into a single total
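To make this parameterized as the question asks, the same pipeline can be wrapped in a small script that takes the stage table name (and optionally the log directory) as arguments. This is only a sketch; the script name, arguments and the *_<TABLE>.log naming are assumptions based on the listing above:
#!/bin/bash
# usage: ./count_loads.sh S_EVT_ACT_FLD [/path/to/logs]
TABLE="$1"
LOG_DIR="${2:-.}"
# sum "Total Records Read" from this table's logs modified in the last 10 days
find "$LOG_DIR" -name "*_${TABLE}.log" -mtime -10 -type f -print |
    xargs awk -F= '/Total Records Read/ {print $2}' | paste -sd+ | bc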
I could not use find because the system is Solaris and its find doesn't have the -maxdepth feature. I use a case statement to create FILTER2 and use it with
ls -l --time-style=long-iso FOLDER | grep -E $FILTER2
but I know it's not a good way.
LOCAL_DAY=`date "+%d"`
LOCAL_MONTH=`date "+%Y-%m"`
LASTTENDAY_MONTH=`date --date='10 days ago' "+%Y-%m"`
case $LOCAL_DAY in
0*)
FILTER2="$LASTTENDAY_MONTH-[2-3][0-9]|$LOCAL_MONTH";;
1*)
FILTER2="$LOCAL_MONTH-0[0-9]|$LOCAL_MONTH-1[0-9]";;
2*)
FILTER2="$LOCAL_MONTH-1[0-9]|$LOCAL_MONTH-2[0-9]";;
3*)
FILTER2="$LOCAL_MONTH-2[0-9]|$LOCAL_MONTH-3[0-9]";;
esac

Run a bash command and split the output by line not by space

I want to run the following command and get the output in the variable split into an array by line, not by space:
files=$( hdfs dfs -ls -R $hdfsDir)
So when I run echo $files, the output I get is the following:
drwxr-xr-x - pepeuser supergroup 0 2016-05-27 15:03 /user/some/kpi/2015/01/02 -rw-r--r-- 3 pepeuser supergroup 55107934 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00000902148 -rw-r--r-- 3 pepeuser supergroup 49225279 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00001902148
When I do a for loop over $files, instead of getting the full line on each iteration, I get each column instead of the line. It prints like the following:
drwxr-xr-x
-
pepeuser
supergroup
and what I need on the for to print like this:
drwxr-xr-x - pepeuser supergroup 0 2016-05-27 15:03 /user/some/kpi/2015/01/02
-rw-r--r-- 3 pepeuser supergroup 55107934 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00000902148
-rw-r--r-- 3 pepeuser supergroup 49225279 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00001902148
If you have bash 4, you can use readarray:
readarray -t files < <(hdfs dfs -ls -R "$hdfsDir")
Otherwise, use read -a to read into an array. IFS=$'\n' sets the field separator to newlines and -d '' tells it to keep reading until it hits a NUL character: effectively, that means it'll read to EOF.
IFS=$'\n' read -d '' -r -a files < <(hdfs dfs -ls -R "$hdfsDir")
You can verify that the array is populated correctly with something like:
printf '[%s]\n' "${files[@]}"
And can loop over the array with:
for file in "${files[@]}"; do
echo "$file"
done
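If the lines only need to be processed once, you can also skip the array entirely and read the command's output line by line; a minimal sketch:
while IFS= read -r line; do
    echo "$line"
done < <(hdfs dfs -ls -R "$hdfsDir")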

Pig overwrite data in hive using LOAD

I am new to Pig and Hive. I need to load the data from a CSV file stored on HDFS into a Hive table using Pig load/store.
For this I am using:
load_resource_csv = LOAD '/user/hadoop/emp.csv' USING PigStorage(',')
AS
(dates:chararray,
shipnode_key:chararray,
delivery_method:chararray
);
STORE load_resource_csv
INTO 'employee'
USING org.apache.hive.hcatalog.pig.HCatStorer();
I need to overwrite the data in the Hive table every time I run the Pig script. How can I do that?
Use the fs shell command fs -rm -f -r /path/to/dir from within the Pig script:
load_resource_csv = LOAD '/user/cloudera/newfile' USING PigStorage(',')
AS
(name:chararray,
skill:chararray
);
fs -rm -r -f /user/hive/warehouse/stack/
STORE load_resource_csv INTO '/user/hive/warehouse/stack' USING PigStorage(',');
-------------- BEFORE ---------------------------
$ hadoop fs -ls /user/hive/warehouse/stack/
-rwxrwxrwx 1 cloudera supergroup 22 2016-08-05 18:31 /user/hive/warehouse/stack/000000_0
hive> select * from stack;
OK
bigDataLearner hadoop
$ hadoop fs -cat /user/cloudera/newfile
bigDataLearner,spark
-------------- AFTER -------------------
$ hadoop fs -ls /user/hive/warehouse/stack
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2016-08-05 18:56 /user/hive/warehouse/stack/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 21 2016-08-05 18:56 /user/hive/warehouse/stack/part-m-00000
$ hadoop fs -cat /user/hive/warehouse/stack/*
bigDataLearner,spark
hive> select * from stack;
OK
bigDataLearner spark
Time taken: 0.183 seconds, Fetched: 1 row(s)
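If you prefer to keep the cleanup outside Pig, a small shell wrapper can do the delete and then run the script. This is only a sketch; load_store.pig is a hypothetical name for a file containing the LOAD/STORE statements shown above:
#!/bin/bash
TARGET_DIR=/user/hive/warehouse/stack
# remove the previous output so each run overwrites it
hdfs dfs -rm -r -f "$TARGET_DIR"
# run the Pig script containing the LOAD ... STORE ... statements
pig -f load_store.pig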

Hadoop fs -du -h sorting by size for M, G, T, P, E, Z, Y

I am running this command --
sudo -u hdfs hadoop fs -du -h /user | sort -nr
and the output is not sorted correctly when the units are megabytes, gigabytes and terabytes.
I found this command -
hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
but it did not seem to work.
Is there a way or a command-line flag I can use to make it sort, so the output looks like this:
123T /xyz
124T /xyd
126T /vat
127G /ayf
123G /atd
Please help
regards
Mayur
hdfs dfs -du -h <PATH> | awk '{print $1$2,$3}' | sort -hr
Short explanation:
The hdfs command gets the input data.
The awk glues the first two fields (the number and its unit) together and prints them, followed by the path.
The -h of sort compares human readable numbers like 2K or 4G, while the -r reverses the sort order.
hdfs dfs -du -h <PATH> | sed 's/ //' | sort -hr
sed will strip out the space between the number and the unit, after which sort will be able to understand it.
This is a rather old question, but I stumbled across it while trying to do the same thing. As you were providing the -h (human-readable) flag, it was converting the sizes to different units to make them easier for a human to read. By leaving that flag off, we get the aggregate summary of file lengths in bytes.
sudo -u hdfs hadoop fs -du -s '/*' | sort -nr
Not as easy to read but means you can sort it correctly.
See https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#du for more details.
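If GNU coreutils numfmt is available, you can combine both ideas: sort on the raw byte counts, then convert them back to human-readable units for display (a sketch, assuming numfmt is installed):
hdfs dfs -du -s '/*' | sort -nr | awk '{print $1, $NF}' | numfmt --to=iec --field=1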
I would use a small script. It's primitive but reliable:
#!/bin/bash
PATH_TO_FOLDER="$1"
hdfs dfs -du -h $PATH_TO_FOLDER > /tmp/output
cat /tmp/output | awk '$2 ~ /^[0-9]+$/ {print $1,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "K" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "M" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "G" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "T" ) print $1,$2,$NF}' | sort -k1,1n
rm /tmp/output
Try this to sort: hdfs dfs -ls -h /path | sort -r -n -k 5
-rw-r--r-- 3 admin admin 108.5 M 2016-05-05 17:23 /user/admin/2008.csv.bz2
-rw-r--r-- 3 admin admin 3.1 M 2016-05-17 16:19 /user/admin/warand_peace.txt
Found 11 items
drwxr-xr-x - admin admin 0 2016-05-16 17:34 /user/admin/oozie-oozi
drwxr-xr-x - admin admin 0 2016-05-16 16:35 /user/admin/Jars
drwxr-xr-x - admin admin 0 2016-05-12 05:30 /user/admin/.Trash
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_21
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_20
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_19
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_18
drwx------ - admin admin 0 2016-05-16 17:38 /user/admin/.staging
