Databricks, What is the default location when using "writeStream"?

Assume that I want to write a table with writeStream. Where is the default location on DBFS where the table is saved?
Sample code:
spark.table("TEMP_SILVER").writeStream
.option("checkpointLocation", "dbfs:/user/AAA#gmail.com")
.trigger(availableNow=True)
.table("silver")

If you specify just silver in the table method, then you should look for your table in the following location:
dbfs:/user/hive/warehouse/silver
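If you want to confirm where the table actually ended up, a quick check from a notebook is sketched below (DESCRIBE EXTENDED is standard Spark SQL; DESCRIBE DETAIL assumes the table is a Delta table, which is the Databricks default):

-- The "Location" row of the output shows the storage path
DESCRIBE EXTENDED silver;

-- For a Delta table, this also reports the location
DESCRIBE DETAIL silver;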

Related

data deleted from hdfs after using hive load command

When loading data from HDFS into Hive using the
LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;
command, it looks like it moves the hdfs_file to the hive/warehouse dir.
Is it possible (and how?) to copy it instead of moving it, so that the file can be used by another process.
From your question I assume that you already have your data in HDFS.
So you don't need LOAD DATA, which moves the files to the default Hive location /user/hive/warehouse. You can simply define the table using the external keyword, which leaves the files in place but creates the table definition in the Hive metastore. See here:
Create Table DDL
eg.:
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Please note that the format you use might differ from the default (as mentioned by JigneshRawal in the comments). You can use your own delimiter, for example when using Sqoop:
row format delimited fields terminated by ','
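Putting the two together, a sketch of the external table over comma-delimited files (the column names and the location are the illustrative ones from the example above):

create external table table_name (
id int,
myfields string
)
row format delimited fields terminated by ','
location '/my/location/in/hdfs';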
I found that when you use EXTERNAL TABLE and LOCATION together, Hive creates the table and initially no data will be present (assuming your data location is different from the Hive LOCATION).
When you use the LOAD DATA INPATH command, the data gets MOVED (instead of copied) from the data location to the location you specified while creating the Hive table.
If a location is not given when you create the Hive table, the internal Hive warehouse location is used and the data will be moved from your source data location to the internal Hive data warehouse location (i.e. /user/hive/warehouse/).
An alternative to LOAD DATA is available in which the data will not be moved from your existing source location to the Hive data warehouse location.
You can use the ALTER TABLE command with the LOCATION option. The required command is below:
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/'
The only condition here is that the location should be a directory, not a file.
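As a sketch of the full alternative flow (the column names are illustrative; the partition column and path come from the command above, and the table must be declared as partitioned for ADD PARTITION to work):

-- Create the partitioned table first; EXTERNAL so DROP TABLE leaves the files alone
CREATE EXTERNAL TABLE table_name (
id INT,
myfields STRING
)
PARTITIONED BY (date_col STRING)
STORED AS TEXTFILE;

-- Attach the existing directory as a partition; no data is moved or copied
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/';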
Hope this will solve the problem.

Unable to understand significance of external keyword in hive

I have a few doubts which I need to be clarified:
If I create a table without the "External" keyword, but specify "location", will it be an external or internal table in Hive?
If I use the "external" keyword with a table name but do not specify the 'location', it will be saved to hive/warehouse location which is the default storage. In this case, will it be an external table?
Overall, I want to understand what makes a table external: the keyword "External" or specifying the 'location'.
Any help will be appreciated.
If I create a table without the "External" keyword, but specify
"location", will it be an external or internal table in Hive?
It will be a MANAGED table (EXTERNAL=False). You can check this using DESCRIBE FORMATTED tablename;
If I use the "external" keyword with a table name but do not specify
the 'location', it will be saved to hive/warehouse location which is
the default storage. In this case, will it be an external table?
Yes, it will be an EXTERNAL table.
what makes a table external, the keyword "External" or specifying the
'location'
Only the EXTERNAL property / keyword in CREATE TABLE makes an EXTERNAL TABLE, not the location. The EXTERNAL table property is not about the location initially. EXTERNAL versus not EXTERNAL (MANAGED) defines how DROP TABLE behaves: for an EXTERNAL table, DROP TABLE will not remove the table location; only the table metadata will be deleted. For a managed table, DROP TABLE will remove its location with all data files as well as the metadata. There are also differences in the features supported for managed and external tables.
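A small sketch of that difference (hypothetical table names and path):

-- External: DROP TABLE removes only the metastore entry; the files under /data/ext_demo stay
CREATE EXTERNAL TABLE ext_demo (id INT) LOCATION '/data/ext_demo';
DROP TABLE ext_demo;

-- Managed: DROP TABLE removes both the metadata and the data files in the table location
CREATE TABLE managed_demo (id INT);
DROP TABLE managed_demo;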
In earlier versions of Hive there were no constraints on where managed or external tables had to be located, and it was possible to create a MANAGED table outside hive.metastore.warehouse.dir. If LOCATION is not specified, Hive will use the value of hive.metastore.warehouse.dir for both managed and external tables. And you can create both managed and external tables on top of the same location: https://stackoverflow.com/a/54038932/2700344.
See also https://stackoverflow.com/a/56957960/2700344 and https://stackoverflow.com/a/67073849/2700344

creating a table with hive based on a parquet file

I have a parquet file called small stored in HDFS at the path:
/user/s/file.parquet
and I want to create a table in Hive containing its contents.
The schema of the file is very complicated and I want Hive to automatically import the schema from the file.
I want to do something like this:
CREATE EXTERNAL TABLE tableName
STORED AS PARQUET
LOCATION 'file/path'
Is this possible?
Thank you for your help.
Unfortunately it's not possible to create an external table on a single file in Hive, only on directories. If /user/s/file.parquet is the only file in the directory, you can set the location to /user/s/ and Hive will pick up your file.
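For the question above, that would look something like this (a sketch; tableName is the name from the question and /user/s/ is the directory containing file.parquet):

CREATE EXTERNAL TABLE tableName
STORED AS PARQUET
LOCATION '/user/s/';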

External hive table as parquet file returns NULL when queried

I created a .parquet file using a MapReduce job. Now I want to create an external table on top of this file. Here is the command:
CREATE EXTERNAL TABLE testparquet (
NAME STRING,
AGE INT
)
STORED AS PARQUET
LOCATION 'file location'
The table is created successfully, but when I query the table with a simple SELECT *, I get NULL for all fields. The version of Hive is 0.13.
Is there anything that I am missing?
When using external files, you need to explicitly synchronize the metadata store that knows about the schema of your data with the actual data itself.
Typically, you'll use the INVALIDATE METADATA command to force subsequent queries to re-read the data. You can also use REFRESH <table-name> if you have just one table that has been updated.
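A sketch of the commands the answer refers to; note that INVALIDATE METADATA and REFRESH are Impala statements, so this assumes the table is being queried through Impala:

-- Reload metadata for all tables before the next query
INVALIDATE METADATA;

-- Or refresh just the one table that changed
REFRESH testparquet;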

Why hive doesn't allow create external table with CTAS?

In Hive, creating an external table with CTAS is a semantic error. Why?
The table created by CTAS is atomic, while an external table means the data will not be deleted when the table is dropped; the two do not seem to conflict.
In Hive, when we create a table (NOT external), the data will be stored in /user/hive/warehouse.
But when creating an external Hive table, the file can be anywhere else; we are just pointing to that HDFS directory and exposing the data as a Hive table to run Hive queries, etc.
This SO answer covers it more precisely: Create hive table using "as select" or "like" and also specify delimiter
Am I missing something here?
Try this...You should be able to create an external table with CTAS.
CREATE TABLE ext_table LOCATION '/user/XXXXX/XXXXXX'
AS SELECT * from managed_table;
I was able to create one. I am using 0.12.
I think it's a semantic error because it misses the most important parameter of an external table definition, viz. the external location of the data file! By definition: 1. External means the data is outside Hive's control, residing outside the Hive data warehouse dir. 2. If the table is dropped, the data remains intact; only the table definition is removed from the Hive metastore. So,
i. if CTAS is used with a managed table, the new external table will have its file in the warehouse, which will be removed by DROP TABLE, making #2 wrong;
ii. if CTAS is used with another external table, the two tables will point to the same file location.
CTAS creates a managed Hive table with the new name using the schema and data of the said table.
You can convert it to an external table using:
ALTER TABLE <TABLE_NAME> SET TBLPROPERTIES('EXTERNAL'='TRUE');
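Putting the two answers together, a sketch (the table names and location are the placeholders used above):

-- CTAS into a chosen location (this creates a managed table)
CREATE TABLE ext_table
LOCATION '/user/XXXXX/XXXXXX'
AS SELECT * FROM managed_table;

-- Then flip it to external so DROP TABLE keeps the data files
ALTER TABLE ext_table SET TBLPROPERTIES('EXTERNAL'='TRUE');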
