Spark write data into partitioned Hive table very slow - hadoop

I want to store Spark dataframe into Hive table in normal readable text format. For doing so I first did
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
My DataFrame is like:
final_data1_df = sqlContext.sql("select a, b from final_data")
and I am trying to write it by:
final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")
but this is very slow, even slower than HIVE table write. So to resolve this I thought to define partition through Hive DDL statement and then load data like:
sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1(
a BIGINT
)
PARTITIONED BY (b INT)
"""
)
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (stategroup)
select * from final_data1""")
but this is giving partitioned Hive table but still parquet formatted data. Am I missing something here?

When you create the table explicitly then that DDL defines the table.
Normally text file is the default in Hive but it could have been changed in your environment.
Add "STORED AS TEXTFILE" at the end of the CREATE statement to make sure the table is plain text.

Related

Hive load multiple partitioned HDFS file to table

I have some twice-partitioned files in HDFS with the following structure:
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210506/coeff=1.0/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.5/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=0.75/data.parquet
/user/hive/warehouse/datascience.db/simulations/datekey=20210507/coeff=1.0/data.parquet
and would like to load these into a hive table as elegantly as possible. I know the typical solution for something like this is to load all the data into a non-partitioned table first, then transfer all the data to final table using dynamic partitioning as mentioned here
However, my files don't have the datekey and coeff values in the actual data, it's only in the filename since that's how it's partitioned. So how would I keep track of these values when I load them into the intermediate table?
One workaround would be to do a separate load data inpath query for each coeff value and datekey. This would not need the intermediate table, but would be cumbersome and probably not optimal.
Are there any better ways for how to do this?
Typical solution is to build external partitioned table on top of hdfs directory:
create external table table_name (
column1 datatype,
column2 datatype,
...
columnN datatype
)
partitioned by (datekey int,
coeff float)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/datascience.db/simulations'
After that, recover all partitions, this command will scan table location and create partitions in Hive metadata:
MSCK REPAIR TABLE table_name;
Now you can query table columns along with partiion columns and do whatever you want with it: use as is, or load into another table using insert .. select .. , etc:
select
column1,
column2,
...
columnN,
--partition columns
datekey,
coeff
from table_name
where datekey = 20210506
;

Hive: Does hive support partitioning and bucketing while usiing external tables

On using PARTITIONED BY or CLUSTERED BY keywords while creating Hive tables,
hive would create separate files corresponding to each partition or bucket. But for external tables is this still valid. As my understanding is data files corresponding to external files are not managed by hive. So does hive create additional files corresponding to each partition or bucket and move corresponding data in to these files.
Edit - Adding details.
Few extracts from "Hadoop: Definitive Guide" - "Chapter 17: Hive"
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested sub directories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
The above table was obviously a managed table, so hive had the ownership of data and created a directory structure for each partition as in the above tree structure.
In case of external table
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by same set of load operations -
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will hive handle these partitions. As for external tables with out partition, hive will simply point to the data file and fetch any query result by parsing the data file. But in case of loading data in to a partitioned external table, where are the partitions created.
Hope fully in hive warehouse? Can someone support or clarify this?
Suppose partitioning on date as this is a common thing to do.
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
, date String
)
PARTITIONED BY (date String)
LOCATION '/user/location/wanted/';
Then add all your partitions;
ALTER TABLE mytable ADD PARTITION( date = '2017-07-27' );
ALTER TABLE mytable ADD PARTITION( date = '2017-07-28' );
So on and so forth.
Finally you can add your data in the proper location. You will have an external partitioned file.
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next you have to run a MSCK command (metastore consistency check)
msck repair table database.table
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.

Insert partitioned data into partitioned hive table

I have stored the data in hdfs using Pig Multistorage with the column id.
So data stored as
/output/1/part-0000
/output/2/
/output/3/
Now I have created a partitioned table in hive and I want to load the data from /output folder into this partitioned table. Is there any way to achieve this?
First you create a temp hive table where you load all the data from pig output.
Then You load to your actual partitioned hive table from temp table.
Something like below:
FROM emp_external temp INSERT OVERWRITE TABLE emp_partition PARTITION(country) SELECT temp.id,temp.name,temp.dept,temp.sal,temp.country;
Else you can explore Hcatlog for this case.
not sure if you are looking to insert the data in the outputfolder (created from pig) to an existing table or loading the data in the output folder in to a new hive partitioned table.
If you want to load the data in to new hive table, you can create a new partitioned table pointing to the output folder
If you are looking to load the data into an existing hive table, then you can either create a temp table as #Aman mentioed and do a insert in to the destination table
or
You can just move/copy the files in the hdfs from output/ to hive table location.
Hope this helps
Assign a Hive schema to pig output location with partitioned columns (Alter table Add Partition) as column id. Now both are hive tables and you can use where clause over partitioned column to move over the data.

Hive, create table ___ like ___ stored as ___

I have a table in hive stored as text files. I want to move all the data into another table with the same schema but stored as sequence files.
How do I create the second table? I wanted to use the hive create table like command but it doesn't support as sequencefile
hive> create table test_sq like test_t stored as sequencefile;
FAILED: ParseException line 1:33 missing EOF at 'stored' near 'test_t'
I am looking for a programmatic way so that I can replicate the same process for more tables.
CREATE TABLE test_t LIKE test_sq;
It just copies the source table definition.The new table contains no rows. As you said you have to move all the data. In this case above query is not suitable;
try this,
CREATE TABLE test_sq row format delimited fields terminated by '|' STORED AS sequencefile AS select * from test_t;
Target cannot be partitioned table.
Target cannot be external table.
It copies the structure as well as the data
Note - if you don't want to give row format delimited then remove from query. You can give where clause also in query to copy selected rows;
Try using create + insert together.
Use the normal DDL statement to create the table.
CREATE TABLE test2 (a INT) STORED AS SEQUENCEFILE
then use
INSERT INTO test2 AS SELECT * FROM test;
test is the table with Textfile as data format and 'test2' is the table with SEQUENCEFILE data format.

Pig: get data from hive table and add partition as column

I have a partitioned Hive table that i want to load in a Pig script and would like to add partition as column also.
How can I do that?
Table definition in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS transactions
(
column1 string,
column2 string
)
PARTITIONED BY (datestamp string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/path';
Pig script:
%default INPUT_PATH '/path'
A = LOAD '$INPUT_PATH'
USING PigStorage('|')
AS (
column1:chararray,
column2:chararray,
datestamp:chararray
);
The datestamp column is not populated. Why is it so?
I am sorry I didn't get the part which says add partition as column also. Once created, partition keys behave like regular columns. What exactly do you need?
And you are loading the data directly from a given HDFS location, not as a Hive table. If you intend to use Pig to load/store data from/into a Hive table you should use HCatalog.
For example :
A = LOAD 'transactions' USING org.apache.hcatalog.pig.HCatLoader();

Resources