Default storage file format in Hadoop/HDFS

I am setting up a new hadoop cluster (experimental at this stage).
I want it to be configured such that whenever a file is copied onto the cluster (either through copyFromLocal or using sqoop etc), hadoop/hdfs should store the data in parquet file format.
Is my expectation correct? Is this possible?
I thought there should be a configuration parameter somewhere at the HDFS level where I could specify which format to use when storing data, but I haven't been able to find one. Am I missing something here?

No, you're right - there's no HDFS-level configuration. You'd have to set the storage format each time you operate on some data. Imagine the damage that would be done if every file were automatically converted into Parquet: all the temporary files created by applications, any Hive/Pig scripts, and any lookup files would be ruined.
To save the output of a Sqoop command into Parquet:
sqoop import --connect JDBC_URI --table TABLE --as-parquetfile --target-dir /path/to/files
will write the data into Parquet format.
There's no way to do this with a copyFromLocal.
To convert data that's already on HDFS to Parquet, load the data into an external Hive table in its original format, create a Parquet-formatted table, and then insert the data into it:
-- Overlay a table onto the input data on HDFS
CREATE EXTERNAL TABLE input (
  id int,
  str string
)
STORED AS <the-input-data-format>
LOCATION 'hdfs://<wherever-you-put-the-data>';

-- Create a Parquet-formatted table
CREATE TABLE parquet (
  id int,
  str string
)
STORED AS PARQUET;

-- Write your input data into the Parquet table - this converts the data to Parquet
INSERT INTO TABLE parquet
SELECT * FROM input;
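If the source data is already exposed through a Hive table, a one-step shortcut (not part of the answer above, just a common alternative) is CREATE TABLE AS SELECT; a minimal sketch reusing the input table and a hypothetical target name:

CREATE TABLE parquet_ctas STORED AS PARQUET
AS SELECT * FROM input;

Note that CTAS target tables cannot be external or partitioned, so the two-step CREATE + INSERT shown above stays the more general option.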

Related

Schema on read in hive for tsv format file

I am new to Hadoop. I have data in TSV format with 50 columns and I need to store the data in Hive. How can I create the table and load the data on the fly, using schema on read, without manually writing a CREATE TABLE statement?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with the description of what data location you're going to be querying later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand such as storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must still define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of the schema information.
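For reference, the manual Hive definition is mechanical even if it is tedious; a minimal sketch with hypothetical table name, column names, and location (you would still have to spell out all 50 columns and their types):

CREATE EXTERNAL TABLE my_tsv (
  col1 STRING,
  col2 INT
  -- ... remaining columns ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/tsv/dir';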
You can use Hue:
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
Or, with Spark, you can infer the schema of the CSV/TSV file and save it as a Hive table.
val df=spark.read
.option("delimiter", "\t")
.option("header",true)
.option("inferSchema", "true") // <-- HERE
.csv("/home/cloudera/Book1.csv")

Create a HIVE table and save it to a tab-separated file?

I have some data in hdfs.
This data was migrated from a PostgreSQL database by using Sqoop.
The data is in the usual "hadoopish" layout, with files like _SUCCESS, part-m-00000, etc.
I need to create a Hive table based on this data and then I need to export this table to a single tab-separated file.
As far as I know, I can create a table this way.
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Then I can save the table as tsv file:
hive -e 'select * from some_table' > /home/myfile.tsv
I don't know how to load data from hdfs into a Hive table.
Moreover, should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?
I don't know how to load data from hdfs into Hive table
You create a table schema over an HDFS directory, which is what you're doing.
should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?
Unless you told Sqoop to create the table, you must do it manually.
export this table into a single tab-separated file.
A query might work; but unless Sqoop already wrote the data with \t as the field delimiter, you need to create another table from the first one, specifying that column separator. Then you don't even need to query the table - just run hdfs dfs -getmerge on its directory.
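A minimal sketch of that second-table approach, using the some_table name from the question and a hypothetical target name:

CREATE TABLE tsv_table
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM some_table;

Afterwards, hdfs dfs -getmerge /user/hive/warehouse/tsv_table /home/myfile.tsv concatenates the part files into a single local file (the warehouse path shown is just the default and may differ on your cluster).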

Creating an ORC file and not Hive table?

What I found from googling around are ways of creating an ORC table using Hive, but I want an ORC file on which I can run my custom map-reduce job.
Also, please let me know whether the file created by Hive under the warehouse directory for my ORC table, like /user/hive/warehouse/tbl_orc/000000_0, is just an internal table file or an actual ORC file I can use.
[Wrap-up of the discussion]
a Hive table is mapped onto an HDFS directory (or a list of directories, if the table is partitioned)
all files in that directory use the same SerDe (ORC, Parquet, AVRO, Text, etc.) and have the same column set; all together, they contain all the data available for that table
each file in that directory is the result of a previous MapReduce job -- either a Hive INSERT, a Pig dataset saved via HCatalog, a Spark dataset saved via HiveContext... or any custom job that happens to drop a file there, hopefully compliant with the table SerDe and schema (retrieved via the MetastoreClient Java API, or via the HCatalog API, whatever)
note that a single job with 3 reducers will probably create 3 new files (and maybe 1 empty file + 1 small file + 1 big file!); and a job with 24 mappers and no reducer will create 24 files, unless some kind of "merge small files" post-processing step is enabled
note also that most file names give absolutely no information about the way the file is encoded internally, they are just sequence numbers (i.e. the 5th job to add 12 files will typically create files 000004_0 to 000004_11)
All in all, processing an ORC fileset with a Java MapReduce program should be very similar to processing a Text fileset. You just have to provide the correct SerDe and the correct field mapping -- I think the compression codec is recorded explicitly in the files, so the SerDe handles it auto-magically at read time. Just remember that ORC files are not splittable at record level but at stripe level (a stripe is a bunch of records stored in columnar format, with encoding and optional compression).
Of course, that will not give you access to ORC advanced features such as vectorization or stripe pruning (somewhat similar to "smart scan" in Oracle Exadata).
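As a quick sanity check that the warehouse files really are plain ORC files, you can inspect one with Hive's built-in ORC file dump utility (path taken from the question):

hive --orcfiledump /user/hive/warehouse/tbl_orc/000000_0

It prints the file's schema, stripe layout, and compression, which is exactly the metadata a custom MapReduce job would rely on.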

Where is HIVE metadata stored by default?

I have created an external table in Hive using following:
create external table hpd_txt(
WbanNum INT,
YearMonthDay INT ,
Time INT,
HourlyPrecip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/user/hive/external';
Now this table is created in location */hive/external.
Step-1: I loaded data in this table using:
load data inpath '/input/hpd.txt' into table hpd_txt;
the data is successfully loaded in the specified path ( */external/hpd_txt)
Step-2: I delete the table from */hive/external path using following:
hadoop fs -rmr /user/hive/external/hpd_txt
Questions:
Why was the file removed from its original path? (*/input/hpd.txt is deleted from HDFS, yet the table was created in the */external path.)
After I delete the table's files from HDFS as in step 2 and run show tables; again, it still lists the table hpd_txt pointing at the external path.
So where is this coming from?
Thanks in advance.
Hive doesn't know that you deleted the files. Hive still expects to find the files in the location you specified. You can do whatever you want in HDFS and this doesn't get communicated to hive. You have to tell hive if things change.
hadoop fs -rmr /user/hive/external/hpd_txt
For instance, the above command doesn't delete the table; it just removes the files. The table still exists in the Hive metastore. If you want to delete the table, then use:
DROP TABLE IF EXISTS tablename;
Since you created the table as an external table this will drop the table from hive. The files will remain if you haven't removed them. If you want to delete an external table and the files the table is reading from you can do one of the following:
Drop the table and then remove the files
Change the table to managed and drop the table
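A minimal sketch of the second option, assuming the hpd_txt table from the question:

-- flip the table from external to managed, so that DROP also removes the files
ALTER TABLE hpd_txt SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
DROP TABLE hpd_txt;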
Finally, the default warehouse location for Hive tables is /user/hive/warehouse (the hive.metastore.warehouse.dir property); the metadata itself lives in the metastore database.
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for it. This comes in handy if you already have data generated. Otherwise, data has to be loaded (conventionally, or by creating a file in the directory pointed to by the Hive table).
When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
Source: Hive docs
So, in your step 2, removing the file /user/hive/external/hpd_txt removes the data source (the data the table points to), but the table still exists and continues to point to hdfs://localhost:9000/user/hive/external as it was created.
@Anoop: Not sure if this answers your question. Let me know if you have any further questions.
Do not use the LOAD DATA INPATH command. The LOAD operation MOVES (it does not copy) the data into the corresponding Hive table's location. Use put or copyFromLocal to copy files from the local file system into HDFS, then simply point the CREATE TABLE's LOCATION at that HDFS path once the put has run.
Deleting a table does not remove the HDFS files from disk; that is the advantage of external tables. An external Hive table only stores metadata describing how to access the data files, so if you drop the table, the data files are untouched in their HDFS location. But in the case of internal (managed) tables, both the metadata and the data will be removed when you drop the table.
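If you are unsure which kind of table you are dealing with, DESCRIBE FORMATTED shows it (look for Table Type: EXTERNAL_TABLE vs MANAGED_TABLE in the output):

DESCRIBE FORMATTED hpd_txt;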
After going through your helpful comments and other posts, I have found the answer to my question.
If I use the LOAD DATA INPATH command, it "moves" the source file to the location where the external table is created. Although that data won't be affected if the table is dropped, having the source file moved is not good. So use LOAD DATA LOCAL INPATH when loading data into internal tables.
To load data into an external table from a file already located in HDFS, use the LOCATION clause in the CREATE TABLE query to point at the source directory, for example:
create external table hpd(WbanNum string,
YearMonthDay string ,
Time string,
hourprecip string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/input/hpd/';
This sample location points to the data already present in HDFS at that path, so there is no need to use the LOAD DATA INPATH command here.
It's good practice to store source files in their own dedicated directories, so that there is no ambiguity when external tables are created and the data sits in a properly managed directory structure.
Thanks a lot for helping me understand this concept guys! Cheers!

Is it possible to import data into Hive table without copying the data

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.
Can I avoid having all my text data stored twice?
EDIT: I load it via the following command
LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Then, I can find the exact same file in:
/user/hive/warehouse/sandbox.db/test/day=20130220
I assumed it was copied.
use an external table:
CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';
if you want to use partitioning with an external table, you will be responsible for managing the partition directories.
The location specified must be an HDFS directory.
If you drop an external table, Hive WILL NOT delete the source data.
If you want to manage your raw files yourself, use external tables. If you want Hive to do it, then let Hive store the data inside its warehouse path.
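If you do use partitioning with the external table, registering a partition directory yourself might look like the following sketch. It assumes the table was created with PARTITIONED BY (day STRING) and that the logs sit in per-day subdirectories, which the question does not state:

ALTER TABLE sandbox.test ADD PARTITION (day='20130221') LOCATION '/user/logs/day=20130221/';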
Instead of having your Java application copy the data directly to HDFS, I can suggest keeping those files on the local file system and importing them into HDFS via Hive using the following command.
LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Notice the LOCAL
You can use an ALTER TABLE ... ADD PARTITION statement to avoid data duplication:
create external table if not exists TestTable (testcol string) PARTITIONED BY (year INT, month INT, day INT) row format delimited fields terminated by ',';
ALTER TABLE TestTable ADD PARTITION (year=2014, month=2, day=17) LOCATION 'hdfs://localhost:8020/data/2014/2/17/';
Hive (at least when running in true cluster mode) cannot refer to files on a local file system. Hive can automatically import the files during table creation or a load operation. The reason is that Hive runs MapReduce jobs internally to extract the data. MapReduce reads from HDFS, writes back to HDFS, and runs in distributed mode, so if a file is stored only on a local file system it cannot be used by the distributed infrastructure.
