select all but few columns in impala - hadoop

Is there a way to replicate the below in impala?
SET hive.support.quoted.identifiers=none
INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal') SELECT `(A)?+.+` FROM MyTxtTable WHERE A='SumVal'
Basically I have a text table in Hive with 1000 fields, and I need a select that drops the field A. The above works in Hive but not in Impala, so how can I do this in Impala without listing the other 999 fields explicitly?
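For context, with quoted-identifier support set to none, Hive treats the backquoted string as a regular expression over column names, and `(A)?+.+` matches every column except A. On a hypothetical three-column version of the table, the SELECT above expands like this:
SET hive.support.quoted.identifiers=none;
-- assuming MyTxtTable has columns A, B, C
SELECT `(A)?+.+` FROM MyTxtTable;
-- is equivalent to
SELECT B, C FROM MyTxtTable;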

Related

Hive load multiple partitioned HDFS file to table

I have some twice-partitioned files in HDFS with the following structure:
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=1.0/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=1.0/data.parquet
and would like to load these into a Hive table as elegantly as possible. I know the typical solution for something like this is to load all the data into a non-partitioned table first, then transfer it all to the final table using dynamic partitioning, as mentioned here.
However, my files don't have the datekey and coeff values in the actual data; they appear only in the file path, since that's how the data is partitioned. So how would I keep track of these values when I load them into the intermediate table?
One workaround would be to do a separate load data inpath query for each coeff value and datekey. This would not need the intermediate table, but would be cumbersome and probably not optimal.
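For illustration, that per-partition workaround would look roughly like this, repeated for every datekey/coeff combination (the target table name simulations is an assumption):
LOAD DATA INPATH '/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet'
INTO TABLE simulations PARTITION (datekey=20210506, coeff=0.5);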
Are there any better ways for how to do this?
The typical solution is to build an external partitioned table on top of the HDFS directory:
create external table table_name (
column1 datatype,
column2 datatype,
...
columnN datatype
)
partitioned by (datekey int,
coeff float)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/datascience.db/simulations'
After that, recover all partitions; this command will scan the table location and create the partitions in the Hive metastore:
MSCK REPAIR TABLE table_name;
Now you can query the table columns along with the partition columns and do whatever you want with them: use the table as is, or load into another table using insert .. select, etc.:
select
column1,
column2,
...
columnN,
--partition columns
datekey,
coeff
from table_name
where datekey = 20210506
;
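For the insert .. select route, a minimal sketch; the target table simulations_parquet is hypothetical and the column list is abbreviated as in the DDL above, with dynamic partitioning carrying datekey and coeff through:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table simulations_parquet (
column1 datatype,
column2 datatype
)
partitioned by (datekey int, coeff float)
stored as parquet;
-- partition values come from the last columns of the select
insert overwrite table simulations_parquet partition (datekey, coeff)
select column1, column2, datekey, coeff
from table_name;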

Add conditional field to table in Hive or Impala

I have a massive table stored as parquet and I need to add columns based on conditions.
Is there a way to do that without having to recreate a new table in Hive or Impala?
Something like this?
ALTER TABLE xyz
ADD COLUMN flag AS (CASE WHEN ... END)
Thank you
I don't believe that Hive or Impala support computed columns. This type of calculation is often done using a view:
CREATE VIEW v_xyz AS
SELECT xyz.*,
(CASE WHEN ... END) as flag
FROM xyz;
You can then update the view at any time to adjust the logic or add new columns.
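For example, a later adjustment could be made with ALTER VIEW; the CASE conditions and the extra column here are purely hypothetical:
ALTER VIEW v_xyz AS
SELECT xyz.*,
(CASE WHEN amount > 100 THEN 1 ELSE 0 END) as flag,
(CASE WHEN region = 'EU' THEN 1 ELSE 0 END) as eu_flag
FROM xyz;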

Spark write data into partitioned Hive table very slow

I want to store a Spark DataFrame into a Hive table in normal readable text format. To do so, I first did
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
My DataFrame is like:
final_data1_df = sqlContext.sql("select a, b from final_data")
and I am trying to write it by:
final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")
but this is very slow, even slower than a Hive table write. So to resolve this I thought I would define the partitions through a Hive DDL statement and then load the data like:
sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1(
a BIGINT
)
PARTITIONED BY (b INT)
"""
)
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (b)
select * from final_data1""")
but this gives a partitioned Hive table whose data is still in Parquet format. Am I missing something here?
When you create the table explicitly, that DDL defines the table.
Normally a text file format is the default in Hive, but it could have been changed in your environment.
Add "STORED AS TEXTFILE" at the end of the CREATE statement to make sure the table is plain text.

Import selected data from oracle db to S3 using sqoop and create hive table script on AWS EMR with selected data

I am new to big data technologies. I am working on the below requirement and need help to make my work simpler.
Suppose I have 2 tables in an Oracle DB and each table has 500 columns in it. My task is to move selected columns' data from both tables (via a join query) to AWS S3 and populate the data in a Hive table on AWS EMR.
Currently, to fulfill my requirement I follow the below steps:
1) Create an external Hive table on AWS EMR with the selected columns. I know the column names, but to identify the column data types for Hive I go to the Oracle database tables, look up each column's type in Oracle, and create the Hive script.
2) Once the table is created, I write a sqoop import command with the select query and point its target directory to S3.
3) Repair the table from the S3 data.
To explain in detail:
Suppose T1 and T2 are two tables. T1 has 500 columns, from T1_C1 to T1_C500, with various data types (NUMBER, VARCHAR, DATE, etc.). Similarly, T2 also has 500 columns, from T2_C1 to T2_C500.
Now suppose I want to move some columns, for example T1_C23, T1_C230, T1_C239, T2_C236, T1_C234, T2_C223, to S3 and create the Hive table for the selected columns; to know their data types I need to look into the T1 and T2 table schemas.
Is there any simpler way to achieve this?
Of the above steps, the first one takes a lot of manual time because I need to look at the table schema, get the data types of the selected columns, and then create the Hive table.
To briefly describe the work environment:
Services running in the data center:
Oracle DB
Sqoop on a Linux machine
Sqoop talks to the Oracle DB and is configured to push the data to S3.
Services running on AWS:
S3
AWS EMR Hive
Hive talks to S3 and uses the S3 data to repair the table.
1) To ease your Hive table generation, you may use the Oracle data dictionary:
SELECT t.column_name || ' ' ||
decode(t.data_type, 'VARCHAR2', 'VARCHAR', 'NUMBER', 'DOUBLE') ||
' COMMENT '||cc.comments||',',
t.*
FROM user_tab_columns t
LEFT JOIN user_col_comments cc
ON cc.table_name = t.table_name
AND cc.column_name = t.column_name
WHERE t.table_name in ('T1','T2')
ORDER BY t.table_name, t.COLUMN_id;
The first column of this result set will be your column list for the CREATE TABLE command.
You need to modify the DECODE to correctly translate Oracle types to Hive types.
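For example, a slightly extended mapping might look like the sketch below; the type pairs shown are illustrative rather than complete, and unmatched types fall through unchanged:
decode(t.data_type,
'VARCHAR2', 'STRING',
'CHAR', 'STRING',
'NUMBER', 'DOUBLE',
'DATE', 'TIMESTAMP',
t.data_type)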
2) As I remember, sqoop exports a table easily, so you may create a view in Oracle that hides the join query inside and export that view with sqoop:
CREATE OR REPLACE VIEW V_T1_T2 AS
SELECT * FROM T1 JOIN T2 ON ...;

Create new hive table from existing external partitioned table

I have an external partitioned table with almost 500 partitions. I am trying to create another external table with the same properties as the old table. Then I want to copy all the partitions from my old table to the newly created table. Below is my create table query. My old table is stored as TEXTFILE and I want to save the new one as an ORC file.
add jar json_jarfile;
CREATE EXTERNAL TABLE new_table_orc (col1,col2,col3...col27)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (....)
STORED AS orc
LOCATION 'path';
After creating this table, I am using the below query to insert the partitions from the old table into the new one. I only want to copy a few columns from the original table to the new table.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE new_table_orc PARTITION (year,month,day) SELECT col2,col3,col6,year,month,day FROM old_table;
ALTER TABLE new_table_orc RECOVER PARTITIONS;
I am getting the below error:
FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into target table because column number/types are different 'day': Table insclause-0 has 27 columns, but query has 6 columns.
Any suggestions?
Your query has to match the number and type of columns in your new table. You have created your new table with 27 regular columns and 3 partition columns, but your query only selects six columns.
If you really only care about those six columns, then modify the new table to have only those columns. If you do want all columns, then modify your select statement to select all of those columns.
You also will not need the "recover partitions" statement. When you insert into a table with dynamic partitions, it will create those partitions both in the filesystem and in the metastore.
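A minimal sketch of the first option, keeping only the six columns; the column types here are placeholders, and the JSON SerDe is dropped because ORC brings its own serialization:
CREATE EXTERNAL TABLE new_table_orc (col2 string, col3 string, col6 string)
PARTITIONED BY (year string, month string, day string)
STORED AS ORC
LOCATION 'path';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE new_table_orc PARTITION (year, month, day)
SELECT col2, col3, col6, year, month, day FROM old_table;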
