I have some data in hdfs.
This data was migrated from a PostgreSQL database by using Sqoop.
The data has the following hadoopish format, like _SUCCESS, part-m-00000, etc.
I need to create a Hive table based on this data and then I need to export this table to a single tab-separated file.
As far as I know, I can create a table this way.
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Then I can save the table as tsv file:
hive -e 'select * from some_table' > /home/myfile.tsv
I don't know how to load data from hdfs into a Hive table.
Moreover, should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?
I don't know how to load data from hdfs into Hive table
You create a table schema over a hdfs directory like you're doing.
should I manually define the structure of a table using create or is there any automated way when all columns are created automatically?
Unless you didn't tell sqoop to create the table, you must do it manually.
export this table into a single tab-separated file.
A query might work, or unless sqoop set the delimiter to \t, then you need to create another table from the first specifying such column separator. And then, you don't even need to query the table, just run hdfs dfs -getMerge on the directory
Related
When load data from HDFS to Hive, using
LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;
command, it looks like it is moving the hdfs_file to hive/warehouse dir.
Is it possible (How?) to copy it instead of moving it, in order, for the file, to be used by another process.
from your question I assume that you already have your data in hdfs.
So you don't need to LOAD DATA, which moves the files to the default hive location /user/hive/warehouse. You can simply define the table using the externalkeyword, which leaves the files in place, but creates the table definition in the hive metastore. See here:
Create Table DDL
eg.:
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Please note that the format you use might differ from the default (as mentioned by JigneshRawal in the comments). You can use your own delimiter, for example when using Sqoop:
row format delimited fields terminated by ','
I found that, when you use EXTERNAL TABLE and LOCATION together, Hive creates table and initially no data will present (assuming your data location is different from the Hive 'LOCATION').
When you use 'LOAD DATA INPATH' command, the data get MOVED (instead of copy) from data location to location that you specified while creating Hive table.
If location is not given when you create Hive table, it uses internal Hive warehouse location and data will get moved from your source data location to internal Hive data warehouse location (i.e. /user/hive/warehouse/).
An alternative to 'LOAD DATA' is available in which the data will not be moved from your existing source location to hive data warehouse location.
You can use ALTER TABLE command with 'LOCATION' option. Here is below required command
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/'
The only condition here is, the location should be a directory instead of file.
Hope this will solve the problem.
I have created some tables in my Hadoop cluster, and I have some parquet tables with data to put it in. How do I perform this? I want to stress, that I already have empty tables, created with some DDL commands, and they are also stored as parquet, so I don't have to create tables, only to import data.
You should take advantage of a hive feature that enables you to use parquet to import data. Even if you don't want to create a new table. I think it's implied that the parquet table schema is the same as the existing empty table. If this isn't the case then below won't work as is. You will have to select the columns that you need. There
Here the table that you already have this is empty is called emptyTable located in myDatabase. The new data you want to add is located /path/to/parquet/hdfs_path_of_parquet_file
CREATE TABLE myDatabase.my_temp_table
LIKE PARQUET '/path/to/parquet/hdfs_path_of_parquet_file'
STORED AS PARQUET
LOCATION '/path/to/parquet/';
INSERT INTO myDatabase.emptyTable as
SELECT * from myDatabase.my_temp_table;
DELETE TABLE myDatabase.my_temp_table;
You said you didn't want to create tables but I think the above kinda cheats around your ask.
The other option again assuming the schema for parquet is already the same as the table definition that is empty that you already have:
ALTER TABLE myDatabase.emptyTable SET LOCATION '/path/to/parquet/';
This technically isn't creating a new table but does require altering you table you already created so I'm not sure if that's acceptable.
You said this is a hive things so I've given you hive answer but really if emptyTable table definition understands parquet in the exact format that you have the /path/to/parquet/hdfs_path_of_parquet_file in you could just drop this file into the folder defined by the table definition:
show create table myDatabase.emptyTable;
This would automatically add the data to the existing table. Provided the table definition matched. Hive is Schema on read so you don't actually need to "import" only enable hive to "interpret".
Maybe this is an easy question but, I am having a difficult time resolving the issue. At this time, I have an pseudo-distributed HDFS that contains recordings that are encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive table and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition the date is both for performance but, more so, because the "LOAD DATA ... " statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against hive table without partitioning.
Any thoughts ?
I see that you have created EXTERNAL TABLE. So you cannot add or drop partition using hive. you need to create a folder using hdfs or MR or SPARK. EXTERNAL table can only be read by hive but not managed by HDFS. You can check the hdfs location '/test/dt=20171117' and you will see that folder has not been created.
My suggestion is create the folder(partition) using "hadoop fs -mkdir '/test/20171117'" then try to query the table. although it will give 0 row. but you can add the data to that folder and read from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the elephantbird library works, but you'll want to double check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore
I have to create a hive table from data present in oracle tables.
I'm doing a sqoop, thereby converting the oracle data into HDFS files. Then I'm creating a hive table on the HDFS files.
The sqoop completes successfully and the files also get generated in the HDFS target directory.
Then I run the create table script in hive. The tables gets created. But it is an empty table, no data is seen in the hive table.
Has anyone faced a similar problem?
Hive default delimiter is ctrlA, if you don't specify any delimiter it will take default delimiter. Add below line in your hive script .
row format delimited fields terminated by '\t'
Your Hive script and your expectation is wrong. You are trying to create a partitioned table on the data that you have already imported, partitions won't work that way. If your query has no partition in it then you can able to see data.
Basically If you want partitioned table , you can't create on the under lying data like you have tried above. If you want hive partition load the data from intermediate table or that sqoop directory to your partitioned table to get Hive partitions.
I want to load a file into HDFS (as .avro file) from HIVE table.
Currently I am able to move a table as a file from HIVE to HDFS but I am not able to specify a particular format of my Target file. can some one help me in this.??
So your question is really
How do I convert a Hive table to a different storage format?
Create a new table with the same fields and types as the avro table and change the input format. Then insert into the new table from the old table.
INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable