Unable to get the desired result after loading data in Hive - hadoop

I am not able to get the genre column values displayed while I am loading the data as below:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale 1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
I have tried creating the table as DDl:-
create table movie(movie_id int, movie_name string , genre string) row format delimited fields terminated by '::';
And
create table movie(movie_id int, movie_name string , genre array<string>) row format delimited fields terminated by '::' collection items terminated by '|';
But still the genre column is coming as blank.
Please help in creating the table correctly so as to place the data and get the desired result.
Output is-
hive> select * from movie;
OK
1 Toy Story (1995)
2 Jumanji (1995)
3 Grumpier Old Men (1995)
4 Waiting to Exhale 1995)
5 Father of the Bride Part II (1995)
6 Heat (1995)
7 Sabrina (1995)

You have to use Multi delimiter SerDe
P.s.
You might want to consider loading genre as array<string> instead of string
create external table movie
(
movie_id int
,movie_name string
,genre array<string>
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties
(
"field.delim" = "::"
,"collection.delim" = "|"
)
;
select * from movie
;
+----------+------------------------------------+--------------------------------------+
| movie_id | movie_name | genre |
+----------+------------------------------------+--------------------------------------+
| 1 | Toy Story (1995) | ["Animation","Children's","Comedy"] |
| 2 | Jumanji (1995) | ["Adventure","Children's","Fantasy"] |
| 3 | Grumpier Old Men (1995) | ["Comedy","Romance"] |
| 4 | Waiting to Exhale 1995) | ["Comedy","Drama"] |
| 5 | Father of the Bride Part II (1995) | ["Comedy"] |
| 6 | Heat (1995) | ["Action","Crime","Thriller"] |
| 7 | Sabrina (1995) | ["Comedy","Romance"] |
+----------+------------------------------------+--------------------------------------+

Related

Vertica database added duplicate entry with same primary key

I am running a docker image of Vertica on windows. I have created a table in vertica with this schema (student_id is primary key)
dbadmin#d1f942c8c1e0(*)=> \d testschema.student;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
------------+---------+------------+-------------+------+---------+----------+-------------+-------------
testschema | student | student_id | int | 8 | | t | t |
testschema | student | name | varchar(20) | 20 | | f | f |
testschema | student | major | varchar(20) | 20 | | f | f |
(3 rows)
student_id is a primary key. I am testing loading data from csv file using copy command.
First I used insert - insert into testschema.student values (1,'Jack','Biology');
Then I created a csv file at /home/dbadmin/vertica_test directory -
vi student.csv
2,Kate,Sociology
3,Claire,English
4,Jack,Biology
5,Mike,Comp. Sci
Then I ran this command
copy testschema.students from '/home/dbadmin/vertica_test/student.csv' delimiter ',' rejected data as table students_rejected;
I tested the result
select * from testschema.student - shows 5 rows
select * from students_rejected; - no rows
Then I creates another csv file with bad data at /home/dbadmin/vertica_test directory
vi student_bad.csv
bad_data_type_for_student_id,UnaddedStudent, UnaddedSubject
6,Cassey,Physical Education
I added data from bad csv file
copy testschema.students from '/home/dbadmin/vertica_test/student.csv' delimiter ',' rejected data as table students_rejected;
Then I tested the output
select * from testschema.student - shows 6 rows <-- only one row got added. all ok
select * from students_rejected; - shows 1 row <-- bad row's entry is here. all ok
all looks good
Then I added the bad data again without the rejected data option
copy testschema.students from '/home/dbadmin/vertica_test/student_bad.csv' delimiter ',' ;
But now the entry with student id 6 got added again!!
student_id | name | major
------------+--------+--------------------
1 | Jack | Biology
2 | Kate | Sociology
3 | Claire | English
4 | Jack | Biology
5 | Mike | Comp. Sci
6 | Cassey | Physical Education <--
6 | Cassey | Physical Education <--
Shouldn't this have got rejected?
If you created your students with a command of this type:
DROP TABLE IF EXISTS students;
CREATE TABLE students (
student_id int
, name varchar(20)
, major varchar(20)
, CONSTRAINT pk_students PRIMARY KEY(student_id)
);
that is, without the explicit keyword ENABLED, then the primary key constraint is disabled. That is, you can happily insert duplicates, but will run into an error if you later want to join to the students table via the primary key column.
With the primary key constraint enabled ...
[...]
, CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED
[...]
I think you get the desired effect.
The whole scenario:
DROP TABLE IF EXISTS students;
CREATE TABLE students (
student_id int
, name varchar(20)
, major varchar(20)
, CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED
);
INSERT INTO students
SELECT 1,'Jack' ,'Biology'
UNION ALL SELECT 2,'Kate' ,'Sociology'
UNION ALL SELECT 3,'Claire','English'
UNION ALL SELECT 4,'Jack' ,'Biology'
UNION ALL SELECT 5,'Mike' ,'Comp. Sci'
UNION ALL SELECT 6,'Cassey','Physical Education'
;
-- out OUTPUT
-- out --------
-- out 6
COMMIT;
COPY students FROM STDIN DELIMITER ','
REJECTED DATA AS TABLE students_rejected;
6,Cassey,Physical Education
\.
-- out vsql:/home/gessnerm/._vfv.sql:4: ERROR 6745:
-- out Duplicate key values: 'student_id=6'
-- out -- violates constraint 'dbadmin.students.pk_students'
SELECT * FROM students;
-- out student_id | name | major
-- out ------------+--------+--------------------
-- out 1 | Jack | Biology
-- out 2 | Kate | Sociology
-- out 3 | Claire | English
-- out 4 | Jack | Biology
-- out 5 | Mike | Comp. Sci
-- out 6 | Cassey | Physical Education
SELECT * FROM students_rejected;
-- out node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason
-- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
-- out (0 rows)
And the only reliable check seems to be the ANALYZE_CONSTRAINTS() call ...
ALTER TABLE students ALTER CONSTRAINT pk_students DISABLED;
-- out Time: First fetch (0 rows): 7.618 ms. All rows formatted: 7.632 ms
COPY students FROM STDIN DELIMITER ','
REJECTED DATA AS TABLE students_rejected;
6,Cassey,Physical Education
\.
-- out Time: First fetch (0 rows): 31.790 ms. All rows formatted: 31.791 ms
SELECT * FROM students;
-- out student_id | name | major
-- out ------------+--------+--------------------
-- out 1 | Jack | Biology
-- out 2 | Kate | Sociology
-- out 3 | Claire | English
-- out 4 | Jack | Biology
-- out 5 | Mike | Comp. Sci
-- out 6 | Cassey | Physical Education
-- out 6 | Cassey | Physical Education
SELECT * FROM students_rejected;
-- out node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason
-- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
-- out (0 rows)
SELECT ANALYZE_CONSTRAINTS('students');
-- out Schema Name | Table Name | Column Names | Constraint Name | Constraint Type | Column Values
-- out -------------+------------+--------------+-----------------+-----------------+---------------
-- out dbadmin | students | student_id | pk_students | PRIMARY | ('6')
-- out (1 row)

Convert raw query into laravel eloquent

I have this written and working as a raw SQL query, but I am trying to convert it to a more Laravel eloquent / query builder design instead of just a raw query.
My table structure like this:
Table One (Name model)
______________
| id | name |
|------------|
| 1 | bob |
| 2 | jane |
--------------
Table Two (Date Model)
_________________________________
| id | table_1_id | date |
|-------------------------------|
| 1 | 1 | 2000-01-01 |
| 2 | 1 | 2000-01-31 |
| 4 | 1 | 2000-02-28 |
| 5 | 1 | 2000-03-03 |
| 6 | 2 | 2000-01-03 |
| 7 | 2 | 2000-01-05 |
---------------------------------
I am returning only the the highest (most recent) dates from table 2 (Dates model) that match the user bob from table 1 (Name model).
For instance, in the example above, I return this from my query
2000-01-31
2000-02-28
2000-03-03
Here is what I am doing now (which works), but i'm just not sure how to use YEAR, MONTH and MAX with laravel.
DB::select(
DB::raw("
SELECT MAX(date) as max_date
FROM table_2
INNER JOIN table_1 ON table_1.id = table_2.table_1_id
WHERE table_1.name = 'bob'
GROUP BY YEAR(date), MONTH(date)
ORDER BY max_date DESC
")
);
Try this code if any problem then,
DB::table('table_1')->join('table_2', 'table_1.id','=','table_2.table_1_id')
->select(DB::raw('MAX(date) as max_date'),DB::raw('YEAR(date) year, MONTH(date) month'),'table_1.name')
->where('name','bob')
->groupBy('year','month')
->orderBy('max_date')
->get();
If any problem with above code then feel free to ask.

In SCD Type2 how to find latest record

I have table i have run the job in scdtype 2 load the data below
no | name | loc |
-----------------
1 | abc | hyd |
-----------------
2 | def | bang |
-----------------
3 | ghi | chennai |
then i have run the second run load the data given below
no | name | loc |
-----------------
1 | abc | hyd |
-----------------
2 | def | bang |
-----------------
3 | ghi | chennai |
--------------------
1 | abc | bang |
here no dates,flags,and run ids
how to find second updated record in this situtation
Thanks
I don't think you'll be able to distinguish between the updated record and the original record.
A Dimension table using Type 2 SCD requires additional columns that describes the period in which the record is valid (or current), exactly for this reason.
The solution is to ensure your dimension table has these columns (Typically ValidFrom and ValidTo dates or date/times, and sometimes an IsCurrent flag for good measure). Your ETL process would then populate these columns as part of making the Type 2 updates.

how to count number of words in each column delimited by "|" seperator using hive?

input data is
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
i need output like
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
,size(split(coalesce(Genres,''),'[^|\\s]+'))-1 as count_of_genres
from mytable
This solution covers varying use-cases, including -
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation orAdventure| |Animation )
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.

Hive: How do I join with a between dates condition?

I have table of items:
| id | dateTimeUTC | color |
+----+------------------+-------+
| 1 | 1/1/2001 1:11:11 | Red |
+----+------------------+-------+
| 2 | 2/2/2002 2:22:22 | Blue |
+----+------------------+-------+
It contains some events with a dateTime in it. I also have an events table:
| eventID | startDate | endDate |
+---------+-------------------+------------------+
| 1 | 1/1/2001 1:11:11 | 2/2/2002 2:22:22 |
+---------+-------------------+------------------+
| 2 | 3/3/2003 00:00:00 | 3/3/2003 1:11:11 |
+---------+-------------------+------------------+
I want to join the two, getting where the dateTimeUTC of the item table is in between the start and end date of the events table. Now, to do this in sql is pretty standard, but HQL not so much. Hive doesn't let you have anything but an "=" in the join clause. (Link to HIVE info here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins). Now, there was a question about a similar situation before here, but I found that it's been 4 years since then and have hoped there was a solution.
Any tips on how to make this happen?
I think you have string format for dates in tables , If yes use following ... Making date into standard format.
select * from items_x, items_date where UNIX_TIMESTAMP(dateTimeUTC,'dd/MM/yyyy HH:MM:SS') between UNIX_TIMESTAMP(startDate,'DD/MM/YYYY HH:MM:SS') and UNIX_TIMESTAMP(endDate,'DD/MM/YYYY HH:MM:SS') ;

Resources