I am trying to export data from Excel into a Hive table. One column, 'ABC', has values like '1,2,3'.
I used LATERAL VIEW explode, but it does not do anything to my data.
Here is my code snippet:
CREATE TABLE table_name
(
id string,
brand string,
data_name string,
name string,
address string,
country string,
flag string,
sample_list array<string> )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
;
LOAD DATA LOCAL INPATH 'location' INTO TABLE
table_name ;
output sample:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 ["[1,2,3]"]
Then I do:
select * from franchise_unsupress LATERAL VIEW explode(SEslist) SEslist as final_SE;
output sample:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 [1,2,3]
I also tried:
select * from franchise_unsupress lateral view explode(split(SEslist,',')) SEslist AS final_SE ;
but got an error:
FAILED: ClassCastException org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
What I need instead is:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 1
19 1 ABC SQL ABC Cornstarch IN 1 2
19 1 ABC SQL ABC Cornstarch IN 1 3
Any help will be greatly appreciated! Thank you.
The problem is that the array is parsed incorrectly and loaded as a single-element array, ["[1,2,3]"]. It should be [1,2,3], or ["1","2","3"] if the column is array<string>.
When creating the table, specify the delimiter for collection items:
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
I wanted to add my own answer.
The issue was with the input itself. The values in my input txt file had [] around them. Once the brackets were removed, it worked.
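To make the bracket problem concrete, here is a rough Python sketch (not Hive itself) of what happens to the field '[1,2,3]' under the different setups. The helper names split_collection and strip_brackets are made up for illustration, and the sketch assumes the field is simply cut on the collection delimiter.

```python
def split_collection(raw, item_delim=","):
    """Cut one text field into array items on the given delimiter."""
    return raw.split(item_delim)


def strip_brackets(raw):
    """Drop one surrounding [ ] pair, if present (hypothetical cleanup step)."""
    if raw.startswith("[") and raw.endswith("]"):
        return raw[1:-1]
    return raw


field = "[1,2,3]"

# No ',' collection delimiter declared: the whole field stays one element.
no_delim = [field]                                   # ['[1,2,3]']

# ',' declared but brackets still in the file: they stick to the items.
with_brackets = split_collection(field)              # ['[1', '2', '3]']

# ',' declared and brackets removed from the input first.
cleaned = split_collection(strip_brackets(field))    # ['1', '2', '3']
```

Only the last form gives explode() three separate elements to work with, which is why both the delimiter and the bracket removal matter.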
I have a file with data in the following format (no header row):
[date] colum1=xy colum2=abc colum4=xyz
[date] colum1=zz colum3=234 colum4=abc
The problem is that not every record has all of the variables, and the missing ones are not marked by something like two tabs. So I need to read the file using the column name in front of every data point. I'm using an Oracle database, but I can also use SAS.
Thanks in advance
Just use named input mode.
data want;
length date $10 column1-column4 $20;
input date (column1-column4) (=);
cards;
[date] column1=xy column2=abc column4=xyz
[date] column1=zz column3=234 column4=abc
;
Results:
Obs  date    column1  column2  column3  column4
1    [date]  xy       abc               xyz
2    [date]  zz                234      abc
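If SAS is not at hand, the same idea (read each name=value pair and leave the missing columns empty) can be sketched in Python. This is only an illustration: parse_named_line is a made-up helper, and it assumes the leading date token contains no whitespace.

```python
def parse_named_line(line, columns):
    """Turn '[date] col=val ...' into a dict; absent columns stay None."""
    date, *pairs = line.split()
    row = {"date": date, **{name: None for name in columns}}
    for pair in pairs:
        name, _, value = pair.partition("=")
        if name in row:          # ignore names we were not asked for
            row[name] = value
    return row


columns = ["column1", "column2", "column3", "column4"]
lines = [
    "[date] column1=xy column2=abc column4=xyz",
    "[date] column1=zz column3=234 column4=abc",
]
rows = [parse_named_line(line, columns) for line in lines]
```

Each dict in rows then has every column as a key, so it can be inserted into a fixed-width table without caring which pairs were present in the file.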
I don't get how to sort or order by a column that contains values like the following:
abc/aa
aa
bb/cba
bb/aa
cc
I need the values that contain a slash to be displayed last and those without a slash to be displayed first.
Required Output
aa
cc
cba
abc/aa
bb/aa
bb/cba
Please guide me
Thanks in Advance
You don't provide your query, but the form will be
DECLARE @Tbl TABLE (CharVal VARCHAR(50))
INSERT INTO @Tbl VALUES ('abc/aa'),('aa'),('bb/cba'),('bb/aa'),('cc')
SELECT CharVal FROM @Tbl
ORDER BY CASE WHEN PATINDEX('%/%',CharVal) > 0 THEN 1 ELSE 0 END, CharVal
Output:
CharVal
aa
cc
abc/aa
bb/aa
bb/cba
EDIT: Corrected the 0/1 reversal in the CASE expression that resulted in an incorrect sort order, thanks @Aaron_Bertrand! Also added populating a temp table with the data and showing the output.
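The two-part sort key (slash flag first, then the value itself) is not specific to SQL. As a quick sanity check of the logic, here is the same ordering sketched in Python, assuming plain lowercase ASCII values so that Python's string comparison matches the database collation:

```python
values = ["abc/aa", "aa", "bb/cba", "bb/aa", "cc"]

# Key is (has_slash, value): False sorts before True, so values without
# a slash come first, and ties are broken alphabetically.
ordered = sorted(values, key=lambda v: ("/" in v, v))
# ['aa', 'cc', 'abc/aa', 'bb/aa', 'bb/cba']
```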
I am using the following Hive query script on version 0.13.0:
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id is in reality a primary key, but note that Hive does not know this, so the above data set is perfectly valid.
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK let me see if I can ask this in a better way.
Here are my two tables
Movies list table - Consists of movies information
ID | Movie Name | Genre
1 | Movie 1 | comedy
2 | movie 2 | action
3 | movie 3 | thriller
And I have ratings table
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
1 | xyz | 5 | 12345612
1 | abc | 4 | 23232312
2 | zvc | 1 | 12321123
2 | zyx | 2 | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre | Rating Average
1 | Movie 1 | comedy | 4.5
2 | Movie 2 | action | 1.5
I am not a DB expert, but here is my understanding: when you group rows together, each selected column must either be reduced to a single scalar value or have the same value across the whole group, right?
For example, in my previous case I was grouping everything as strings. That is fine for list.id, list.name and list.genre, but rating.rating is always going to be a problem there. (I just learnt Pig along with Hive, and grouping works differently there.)
So to tackle the problem, I cast the rating to FLOAT, averaged it, and stored it in a FLOAT column. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT))
from movies.list list
LEFT JOIN movies.rating rating ON (list.id=rating.id)
GROUP BY list.id, list.name, list.genre
order by list.id DESC;
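For anyone who wants to check what this query should produce, here is a small Python sketch of the same LEFT JOIN plus average on the sample rows from this post. It is not Hive code. It aggregates the ratings per movie id first and then attaches the movie columns, which gives the same result as grouping by id, name and genre as long as id is unique in the movies list (an assumption).

```python
from collections import defaultdict

movies = [
    ("1", "Movie 1", "comedy"),
    ("2", "movie 2", "action"),
    ("3", "movie 3", "thriller"),
]
ratings = [
    ("1", "xyz", "5", "12345612"),
    ("1", "abc", "4", "23232312"),
    ("2", "zvc", "1", "12321123"),
    ("2", "zyx", "2", "12312312"),
]

# Group step: collect the ratings per movie id, cast to float.
scores_by_movie = defaultdict(list)
for movie_id, _user, rating, _timestamp in ratings:
    scores_by_movie[movie_id].append(float(rating))

# Join step: keep every movie (LEFT JOIN); no ratings means no average.
result = []
for movie_id, name, genre in movies:
    scores = scores_by_movie.get(movie_id, [])
    average = sum(scores) / len(scores) if scores else None
    result.append((movie_id, name, genre, average))
```

Note that movie 3 stays in the result with an empty average because of the LEFT JOIN. An inner join would drop it, which is closer to the desired output shown above.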
Thank you for your explanation. I might save the following question for the next thread but here is my observation:
The overall job is slower when the grouping and the join are done in one query than when they are done in two separate queries. For the same job, I changed the code to do the grouping first and then join the data, and the overall time dropped by 40 seconds, from 140 seconds to 100 seconds. Any reason for that?
Once again thank you for your explanation.
I came across same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put the "charge_province" in the group by, the issue is gone. I don't know why.
I have a single file with a structure like:
A 1 2 3
A 4 5 6
A 5 8 12
B abc cde
B and fae
B bsd oio
C 1
C 2
C 3
and would like to load the data in 3 simple tables (A (int int int), B(string string) C(int)).
Is it possible and how?
It's also fine for me, if A(string int int int) etc. with the first column of the file to be included in the table.
I'd go with option 1 as Praveen suggests. I'd create an external table of only a string, and use the FROM ( ... ) syntax to insert into multiple tables at once. I think something like the following would work
create external table source_table( line string )
stored as textfile
location '/myfile';
from ( select split( line , " ") as col_array from source_table ) cols
insert overwrite table A select col_array[1], col_array[2], col_array[3] where col_array[0] = 'A'
insert overwrite table B select col_array[1], col_array[2] where col_array[0] = 'B'
insert overwrite table C select col_array[1] where col_array[0] = 'C';
Option 1) Map the entire data to a Hive table and then use the insert overwrite table .... option to map the appropriate data to the target tables.
Option 2) Develop a MR program to split the file into multiple files and then do the mapping of the files to the target tables in Hive.
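Whichever option is used, the routing logic is the same: look at the first column and send the rest of the line to the matching table. A minimal Python sketch of that logic on the sample data follows, assuming whitespace-separated columns. route_lines is a made-up name, and the three lists stand in for the Hive tables.

```python
def route_lines(lines):
    """Split tagged lines into per-table row lists keyed by the first column."""
    tables = {"A": [], "B": [], "C": []}
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        tag, *cols = line.split()
        if tag == "A":
            tables["A"].append(tuple(int(c) for c in cols))   # int int int
        elif tag == "B":
            tables["B"].append(tuple(cols))                   # string string
        elif tag == "C":
            tables["C"].append((int(cols[0]),))               # int
    return tables


sample = """A 1 2 3
A 4 5 6
A 5 8 12
B abc cde
B and fae
B bsd oio
C 1
C 2
C 3""".splitlines()

tables = route_lines(sample)
```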
A weird request, maybe, but my boss wants me to create an admin version of a page we have that displays data from an Oracle query in a table.
The admin page, instead of displaying the data (query returns 1 row), needs to return the table name and column name
Ex: Instead of:
Name Initial
==================
Bob A
I want:
Name Initial
============================
Users.FirstName Users.MiddleInitial
I realize I can do this in code but would rather just modify the query to return the data I want so I can leave the report generation code mostly alone.
I don't want to do it in a stored procedure.
So when I spit out the data in the report using something like:
blah blah = MyDataRow("FirstName")
I can leave that as is but instead of it displaying "BOB" it would display "Users.FirstName"
And I want to do the query using select * if possible instead of listing all the columns
So for each of the columns I am querying in the * , I want to get (instead of the column value) the tablename.ColumnName or tablename|columnName
hope you are following- I am confusing myself...
pseudo:
select tablename + '.' + Columnname as WhateverTheColumnNameIs
from Table1
left join Table2 on whatever...
Join Table_Names on blah blah
Whew- after writing all this I think I will just do it on the code side.
But if you are up for it maybe a fun challenge
Oracle does not provide a built-in way (there is no pseudocolumn) to get the column names of a table as the result of a query against that table. But you might consider these two approaches:
Extract the column names from an xmltype, formed by passing a cursor expression (your query) to the xmltable() function:
-- your table
with t1(first_name, middle_name) as(
select 1,2 from dual
), -- your query
t2 as(
select * -- col1 as "t1.col1"
--, col2 as "t1.col2"
--, col3 as "t1.col3"
from t1
)
select *
from ( select q.object_value.getrootelement() as col_name
, rownum as rn
from xmltable('//*'
passing xmltype(cursor(select * from t2 where rownum = 1))
) q
where q.object_value.getrootelement() not in ('ROWSET', 'ROW')
)
pivot(
max(col_name) for rn in (1 as "name", 2 as "initial")
)
Result:
name initial
--------------- ---------------
FIRST_NAME MIDDLE_NAME
Note: For column names to be prefixed with the table name, you need to list them explicitly in the select list of the query and supply the aliases manually.
PL/SQL approach. Starting from Oracle 11g you could use the dbms_sql package, specifically its describe_columns() procedure, to get the names of the columns in a cursor (your select).
This might be what you are looking for: try selecting from the system views USER_TAB_COLS or ALL_TAB_COLS.