Cannot read timestamp data from S3 table with Parquet data through Hive, LongWritable cannot be cast to TimestampWritableV2 - hadoop

So I'm trying to read a table that points to an S3 bucket containing Parquet files. The table DDL has input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
and output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
I'm getting this error when doing a simple select * from the table:
org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritableV2
I'm able to query the same data in Athena, and I've searched endlessly on Google. One solution I've seen is to recreate the table with the column typed as string and then convert it to timestamp when querying the table.
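The cast error usually means the Parquet files physically store that column as a 64-bit integer (for example, epoch milliseconds written by Spark or another writer), while the Hive DDL declares it as timestamp. A minimal sketch of the workaround mentioned above, assuming a hypothetical table events_raw whose event_time column holds epoch milliseconds:

-- Declare the column to match the physical Parquet type (assumption: epoch millis)
CREATE EXTERNAL TABLE events_raw (
  id STRING,
  event_time BIGINT
)
STORED AS PARQUET
LOCATION 's3://your-bucket/path/';

-- Convert at query time
SELECT id,
       CAST(from_unixtime(CAST(event_time / 1000 AS BIGINT)) AS TIMESTAMP) AS event_ts
FROM events_raw;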

Related

ClickHouse: Inserting data with missing columns from parquet file

I have hundreds of different parquet files that I want to add to a single table in a ClickHouse database. They all contain the same type of data, but some of them are missing a few columns.
Is there still a way to add the data directly from those parquet files using a query such as
cat {file_path} | clickhouse-client --query="INSERT INTO table FORMAT Parquet"?
If I try doing this, I get an error like this one:
Code: 8. DB::Exception: Column "column_name" is not presented in input data: data for INSERT was parsed from stdin
I tried giving the missing columns a NULL or DEFAULT value when creating the table, but I still get the same error, and the exception means no data from the affected parquet file is inserted.
Is there an easy way to do this with ClickHouse, or do I have to either fix my parquet files or preprocess my data and insert it with another query format that doesn't use Parquet?
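One workaround is to list only the columns each file actually contains in the INSERT statement; ClickHouse then fills the omitted columns with their DEFAULT (or NULL) values. A sketch, assuming a hypothetical table my_table where some files lack col_c:

cat {file_path} | clickhouse-client --query="INSERT INTO my_table (col_a, col_b) FORMAT Parquet"

Newer ClickHouse releases also have an input_format_parquet_allow_missing_columns setting that lets a full-table INSERT succeed despite missing columns; whether your version supports it is something to check:

cat {file_path} | clickhouse-client --input_format_parquet_allow_missing_columns=1 --query="INSERT INTO my_table FORMAT Parquet"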

How do I convert a sequence file to parquet format

I have a Hive table (test) that I need to create in the Parquet format. I will be using a bunch of sequence files to create and insert into the table.
Once the table is created, is there a way to convert it into Parquet? I mean, I know we could have done, say,
CREATE TABLE default.test( user_id STRING, location STRING)
PARTITIONED BY ( dt INT ) STORED AS PARQUET
initially while creating the table itself. However, in my case I am forced to create the table from sequence files first, because that is the format I have to begin with and cannot convert directly to Parquet.
Is there a way I could convert to Parquet after the table is created and the data inserted?
To convert from a sequence file to Parquet you need to load the data (CTAS) into a new table.
The question is tagged with presto, so I am giving you Presto syntax for this. I am including partitioning because the example in the question uses it.
CREATE TABLE test_parquet WITH (format='PARQUET', partitioned_by=ARRAY['dt']) AS
SELECT * FROM test_sequencefile;
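If you need to do this on the Hive side instead of Presto, the same idea works, except that classic Hive CTAS cannot create a partitioned table, so you create the Parquet table first and then insert with dynamic partitioning (a sketch; the table names follow the question, and the two SET statements assume dynamic partitioning is not already enabled):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE default.test_parquet (user_id STRING, location STRING)
PARTITIONED BY (dt INT)
STORED AS PARQUET;

-- The partition column must come last in the SELECT list
INSERT OVERWRITE TABLE default.test_parquet PARTITION (dt)
SELECT user_id, location, dt FROM default.test;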

How to convert existing text data in hdfs to Avro?

I have a table in HDFS stored in text format, and now I have a requirement to add a new column in the middle. I thought of loading the new data as Avro, since Avro supports schema evolution, but the previous data is still in text format.
If you already have a table, you can load it directly into an Avro table from Hive; if not, you can create a Hive table over the text file and load that into the Avro table.
Something like
create table test(fields type) row format delimited fields terminated by ',' stored as textfile location 'textfilepath';
create table avrotbl (fields type) stored as avro;
insert into avrotbl select * from test;
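Once the data is in the Avro table, adding the new column could be as simple as the statement below, a sketch that assumes the table's Avro schema is managed by Hive rather than pinned via avro.schema.literal or avro.schema.url. Note that ADD COLUMNS appends at the end of the schema; placing a column in the middle would require recreating the table.

-- Existing rows will return NULL for the new column (hypothetical column name)
alter table avrotbl add columns (new_col string);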

Data type mismatch in hive

I created an external table in Hive where divcd is a string data type. The data is stored in Avro format. Below is the issue I face while querying:
WHERE divcd = '20'
However, this returned no rows. When I tried:
WHERE divcd = 20
it returned rows.
I'm unable to sort it out. Any help would be appreciated.
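No answer is recorded here, but a likely explanation: when Hive compares a string column to an integer literal, it implicitly casts both sides to double, so values like '20.0' or ' 20' satisfy divcd = 20 numerically while failing the exact string comparison divcd = '20'. A diagnostic sketch (the table name mytable is an assumption):

-- Look at the raw values that match numerically; hidden whitespace or a
-- decimal point would explain the mismatch
SELECT divcd, length(divcd) FROM mytable WHERE divcd = 20 LIMIT 10;

-- If padding is the culprit, trim before comparing
SELECT * FROM mytable WHERE trim(divcd) = '20';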

HDFS String data to timestamp in hive table

Hi, I have data in HDFS as the string '2015-03-26T00:00:00+00:00'. If I want to load this data into a Hive table with the column declared as timestamp, I am not able to load it and I get NULL values.
If I specify the column as string, I get the data into the Hive table,
but if I specify the column as timestamp, I am not able to load the data and I get all NULL values in that column.
Eg: HDFS - '2015-03-26T00:00:00+00:00'
hive table - create table t1(my_date string)
I can get output as '2015-03-26T00:00:00+00:00'
If I specify create table t1(my_date timestamp), I see all NULL values.
Can anyone help me with this?
Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.
Go through below link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps
You have to use a staging table. In the staging table load the value as a string, and in the final table use a UDF like the one below to convert the string to a timestamp. The format pattern must match your input data; for example, for strings like '26-03-2015 14:30':
from_unixtime(unix_timestamp(column_name, 'dd-MM-yyyy HH:mm'))
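Putting that together for the ISO 8601 value in the question, a minimal sketch (table names and the location path are placeholders; the pattern uses Java date-format syntax, where XXX matches the +00:00 offset):

-- Staging table over the raw text data
create table t1_stage(my_date string)
row format delimited fields terminated by ','
location '/path/to/data';

-- Final table with a real timestamp column
create table t1(my_date timestamp);

-- Convert while loading; note from_unixtime renders the value in the session time zone
insert into t1
select cast(from_unixtime(unix_timestamp(my_date, "yyyy-MM-dd'T'HH:mm:ssXXX")) as timestamp)
from t1_stage;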
