Stop sqoop from converting datetime to bigint - hadoop

Recently I noticed that whenever I ingest from a SQL database using Sqoop, all datetime fields are converted to a bigint (epoch * 1000) instead of to String.
Important to note: I'm storing as parquet.
I have been trying a bunch of sqoop flags like "--map-column-java" but I don't want to manually define this for hundreds of columns in thousands of tables.
What flag am I missing to prevent this sqoop behaviour?
It seems that sqoop didn't do this when storing in plain text.

Instead of letting sqoop do its arcane magic on my tables, I decided to do the following:
1. Ingest into a temporary table, stored as text.
2. Create a table (if not exists) like the temporary table, stored as parquet.
3. Insert overwrite from the text-stored temporary table into the parquet-stored table.
This allows for proper date formatting without the hassle of (possibly non-existent) configuration and settings tweaking in Sqoop.
The only tradeoff is that it's slightly slower.
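For reference, a minimal HiveQL sketch of this flow; the staging.my_table_text and warehouse.my_table names are illustrative, and it assumes the Sqoop text import (step 1) has already populated the staging table:
-- step 2: create the parquet table with the same columns, if it doesn't exist yet
CREATE TABLE IF NOT EXISTS warehouse.my_table
STORED AS PARQUET
AS SELECT * FROM staging.my_table_text WHERE 1 = 0;
-- step 3: copy (or refresh) the data from the text table into the parquet table
INSERT OVERWRITE TABLE warehouse.my_table
SELECT * FROM staging.my_table_text;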

Related

How to import data from parquet file to existing Hadoop table?

I have created some tables in my Hadoop cluster, and I have some parquet files with data to put into them. How do I do this? I want to stress that I already have empty tables, created with DDL commands and stored as parquet, so I don't need to create tables, only to import data.
You should take advantage of a Hive feature that enables you to use parquet files to import data, even if you don't want to create a new table. I think it's implied that the parquet schema is the same as that of the existing empty table; if this isn't the case, the statements below won't work as-is and you will have to select the columns that you need.
Here, the empty table that you already have is called emptyTable and is located in myDatabase. The new data you want to add is located at /path/to/parquet/hdfs_path_of_parquet_file.
CREATE TABLE myDatabase.my_temp_table
LIKE PARQUET '/path/to/parquet/hdfs_path_of_parquet_file'
STORED AS PARQUET
LOCATION '/path/to/parquet/';
INSERT INTO myDatabase.emptyTable
SELECT * FROM myDatabase.my_temp_table;
DROP TABLE myDatabase.my_temp_table;
You said you didn't want to create tables but I think the above kinda cheats around your ask.
The other option, again assuming the parquet schema is already the same as the definition of the empty table you already have:
ALTER TABLE myDatabase.emptyTable SET LOCATION '/path/to/parquet/';
This technically isn't creating a new table, but it does require altering the table you already created, so I'm not sure if that's acceptable.
You said this is a Hive thing, so I've given you Hive answers, but really, if the emptyTable definition understands parquet in the exact format that /path/to/parquet/hdfs_path_of_parquet_file is in, you could just drop that file into the folder defined by the table definition:
show create table myDatabase.emptyTable;
This would automatically add the data to the existing table, provided the table definition matches. Hive is schema-on-read, so you don't actually need to "import", only to enable Hive to "interpret".

Schema on read in hive for tsv format file

I am new to Hadoop. I have data in TSV format with 50 columns and I need to store the data into Hive. How can I create the table and load the data into it on the fly, relying on schema on read, without manually writing a CREATE TABLE statement?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with the description of what data location you're going to be querying later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand such as storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must again define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of schema information.
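To make the manual route concrete, here is a minimal HiveQL sketch for a tab-separated file; the table name, the two sample columns, and the /data/my_tsv location are illustrative, and you would spell out all 50 columns:
CREATE EXTERNAL TABLE my_tsv_table (
  id INT,
  name STRING
  -- ... the remaining columns ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/my_tsv'
TBLPROPERTIES ('skip.header.line.count' = '1');  -- only if the file has a header row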
You can use Hue:
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
Or, with Spark, you can infer the schema of the CSV/TSV file and save it as a Hive table:
val df = spark.read
  .option("delimiter", "\t")
  .option("header", true)
  .option("inferSchema", "true") // <-- HERE
  .csv("/home/cloudera/Book1.csv")

df.write.saveAsTable("my_database.book1") // table name is just an example; needs a Hive-enabled SparkSession

Create a HIVE table and save it to a tab-separated file?

I have some data in hdfs.
This data was migrated from a PostgreSQL database by using Sqoop.
The data is in the usual hadoopish layout, with files like _SUCCESS, part-m-00000, etc.
I need to create a Hive table based on this data and then I need to export this table to a single tab-separated file.
As far as I know, I can create a table this way.
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Then I can save the table as a tsv file:
hive -e 'select * from some_table' > /home/myfile.tsv
I don't know how to load data from hdfs into a Hive table.
Moreover, should I manually define the structure of the table using CREATE, or is there an automated way where all columns are created automatically?
I don't know how to load data from hdfs into a Hive table
You create a table schema over an HDFS directory, like you're doing.
should I manually define the structure of the table using CREATE, or is there an automated way where all columns are created automatically?
Unless you told Sqoop to create the table for you, you must define it manually.
export this table into a single tab-separated file.
A query might work; otherwise, unless Sqoop already set the delimiter to \t, you need to create another table from the first one, specifying that column separator. After that, you don't even need to query the table, just run hdfs dfs -getmerge on its directory.
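For example, a minimal sketch of that second-table approach (the table name is illustrative):
CREATE TABLE table_name_tsv
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM table_name;
Afterwards, hdfs dfs -getmerge on table_name_tsv's warehouse directory produces a single tab-separated file.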

Handling dates in Hadoop

I'm new to the Big Data/Hadoop ecosystem and have noticed that dates are not always handled in a standard way across technologies. I plan to ingest data from Oracle into Hive tables on HDFS using Sqoop, with the Avro and Parquet file formats. Hive keeps importing my dates as BIGINT values, but I'd prefer TIMESTAMPs. I've tried using the "--map-column-hive" overrides... but it still does not work.
Looking for suggestions on the best way to handle dates for this use case.
Parquet File Format
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
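As a hedged illustration of that divide-by-1000 rule in a Hive query (sqoop_ms stands in for such a BIGINT column, and the table name is made up):
SELECT CAST(FROM_UNIXTIME(sqoop_ms DIV 1000) AS TIMESTAMP) AS event_ts
FROM mydb.sqoop_parquet_table;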
Avro File Format
Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the EXTRACT() function.
You can also use your Hive query like this to get the result in your desired TIMESTAMP format.
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
Another workaround is to import the data using --query in the sqoop command, where you can cast your column to a timestamp.
Example
--query 'SELECT CAST (INSERTION_DATE AS TIMESTAMP) FROM tablename WHERE $CONDITIONS'
If your SELECT query gets a bit long, you can use configuration files to shorten the command-line call; see the Sqoop documentation for reference.

Measure the time of loading tables with data in Hive (is it possible?)

I created a table in hive from data stored in hdfs with this command:
create external table users
(ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/data/tpch/users';
The users data stored in HDFS is about 10 GB, and the CREATE TABLE took just 1 second to create the table and load the data, which is either strange or really fast. My doubt is: can I measure the time to load tables with data in Hive using the command above with LOCATION, or does that command just create a reference to the data stored in HDFS?
So what is the correct way to check the time to load data into Hive tables?
1 second seems really fast; MySQL or another relational database would probably need 30 minutes or more to load 10 GB of data into a table.
Your create table statement is pointing to external storage for the tables, so Hive is not copying the data over. The documentation explains external tables like this:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
This is not 100% explicit, but the idea is that Hive is pointing to the table contents rather than managing it directly.
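If you want a number you can actually measure, a minimal sketch is to copy the data into a managed table, which forces Hive to read and rewrite the full 10 GB (the users_managed name is illustrative):
CREATE TABLE users_managed
STORED AS TEXTFILE
AS SELECT * FROM users;  -- the wall-clock time of this job is a real "load" time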
