Impala can perform a COUNT(*) but not a SELECT * from a table

I came across a bizarre Impala behaviour. I created a table in HUE from a .csv file I copied into the Hadoop cluster. I can navigate the table correctly in HUE via the Metastore Manager, but I can't run the following query in Impala, as it throws an IllegalStateException: null exception:
select *
from my_db.my_table
limit 100;
The strange thing is that the following command retrieves the correct number of rows:
select
count(*)
from my_db.my_table;

The error is caused by an invalid type. Not all Hive data types are supported in Impala: Impala has a TIMESTAMP type but no DATE type. When your table has a DATE column, it shows up as invalid_type when the table is described in Impala, and Impala cannot select that data type. To solve it, try changing the column to TIMESTAMP.
Describe <table name>;
| invalid_type | |
| invalid_type | |

I'm getting the exact same issue. I changed the query to select each column from the table individually (i.e. select col1, col2, col3...etc.) and found that Impala didn't like a date datatype column. Changing it to timestamp fixed the issue and I can now do a select * from the table.
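The column change can be made on the Hive side. A minimal sketch, assuming the offending column is a Hive DATE (the column name event_date is a placeholder):

```sql
-- In Hive: redefine the DATE column as TIMESTAMP
ALTER TABLE my_db.my_table CHANGE event_date event_date TIMESTAMP;
```

After the change, refresh Impala's view of the table (in impala-shell) with `INVALIDATE METADATA my_db.my_table;` before re-running the SELECT.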

Related

Ingesting Timestamp field from Oracle to HDFS using NIFI

I am trying to insert a table from Oracle to HDFS using Nifi. The source table in Oracle has a timestamp(6) datatype field called sd_timestamp.
The NiFi flow has the following processors:
QueryDatabaseTable:
This queries the Oracle database.
ConvertAvroSchema:
This one has input and output schemas. Both input and output schemas have the sd_timestamp datatype as String.
ConvertAvroToOrc
PutHDFS:
The table that is created in Hive also has string as the datatype for sd_timestamp. When the ingestion is done and I do a select * from the destination Hive table, I get oracle.sql.timestamp#23aff4 as the value instead of the timestamp.
Please help.
Here are details of what I did to get it working. It did not require the ConvertAvroSchema step.
Oracle table
CREATE TABLE my_table
(
entry_name varchar(10),
sd_timestamp timestamp(6)
);
Populate some data
insert into my_table values('e-1',CURRENT_TIMESTAMP);
insert into my_table values('e-2',CURRENT_TIMESTAMP);
insert into my_table values('e-3',CURRENT_TIMESTAMP);
Verify data
SELECT * FROM my_table;
ENTRY_NAME SD_TIMESTAMP
e-1 09-MAY-18 06.45.24.963327000 PM
e-2 09-MAY-18 06.45.39.291241000 PM
e-3 09-MAY-18 06.45.44.748736000 PM
NiFi Flow
Flow Design
QueryDatabaseTable configuration
ConvertAvroToOrc configuration
PutHDFS configuration
LogAttribute to see the hive.ddl attribute value
Verify results on HDFS
$ hadoop fs -ls /oracle-ingest
/oracle-ingest/50201861895275.orc
Create the Hive table to query the data, using the hive.ddl attribute value and adding the location to it
hive> CREATE EXTERNAL TABLE IF NOT EXISTS my_oracle_table
(
ENTRY_NAME STRING,
SD_TIMESTAMP STRING
)
STORED AS ORC
LOCATION '/oracle-ingest';
Query Hive table
hive> select * from my_oracle_table;
e-1 2018-05-09 18:45:24.963327
e-2 2018-05-09 18:45:39.291241
e-3 2018-05-09 18:45:44.748736
I was able to resolve the error by adding the following Java argument to the bootstrap.conf file in the NiFi conf directory:
-Doracle.jdbc.J2EE13Compliant=true
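In NiFi's bootstrap.conf, JVM arguments are added as numbered java.arg.N entries. A sketch (the argument number 15 is arbitrary; use the next free slot in your file):

```
# nifi/conf/bootstrap.conf
java.arg.15=-Doracle.jdbc.J2EE13Compliant=true
```

Restart NiFi after editing bootstrap.conf for the change to take effect.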

select all but few columns in impala

Is there a way to replicate the below in impala?
SET hive.support.quoted.identifiers=none
INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal') SELECT `(A)?+.+` FROM MyTxtTable WHERE A='SumVal'
Basically I have a text table in Hive with 1000 fields, and I need a select that drops field A. The above works in Hive but not in Impala; how can I do this in Impala without specifying all of the other 999 fields directly?
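A sketch of the usual workaround, since Impala has no equivalent of Hive's quoted-identifier regex in the select list: the columns must be spelled out, but DESCRIBE can at least generate the list for you to edit (col1, col2, col3 are placeholder names):

```sql
-- 1. Get the full column list, then delete A from it by hand (or script it)
DESCRIBE MyTxtTable;

-- 2. Spell out the remaining columns explicitly
INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal')
SELECT col1, col2, col3 /* ...all remaining columns except A... */
FROM MyTxtTable
WHERE A='SumVal';
```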

Creating view in HIVE

I want to create a view on a hive table which is partitioned . My view definition is as below:
create view schema.V1 as
select t1.*
from schema.tab1 as t1
inner join (
    select record_key, max(last_update) as last_update
    from schema.tab1
    group by record_key
) as t2
on t1.record_key = t2.record_key and t1.last_update = t2.last_update;
My table of tab1 is partitioned on quarter_id.
When I run any query on the view it gives this error:
FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "V1:t2:tab1" Table "tab1"
Regards
Jayanta Layak
Your Hive settings must be configured to execute jobs in strict mode (the default in Hive 2.x). Strict mode prevents queries on partitioned tables that lack a WHERE clause filtering on the partition column.
If you need to run a query across all partitions (a full table scan), you can set the mode to 'nonstrict'. Use this property with care, as it can trigger enormous MapReduce jobs.
set hive.mapred.mode=nonstrict;
If you don't need an entire table scan, you can simply specify the partition value in your query's WHERE clause.
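For example (the partition value is illustrative), either relax strict mode for the session or keep it and filter on the partition column:

```sql
-- Option 1: session-level override, then query the view as before
set hive.mapred.mode=nonstrict;
select * from schema.V1;

-- Option 2: keep strict mode and supply a partition predicate
-- (works only if the predicate reaches every scan of tab1, which
--  depends on your Hive version's predicate pushdown into the join)
select * from schema.V1 where quarter_id = '2019Q1';
```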

hive - how it works internally

For example:
select * from <tablename> where <condition>
select sum(<column>) from <tablename> where <condition>
Queries that filter, group, or aggregate like this generate an MR job, and we can see it in the Resource Manager UI.
Now take, for example:
show tables
show databases
select * from tablename
select count(*) from tablename
describe commands
These types of queries don't require MR jobs and will not show up in the RM, since this information is available in the Metastore as properties.
Does Hive log these somewhere? Can we identify those queries?
By default Hive stores its query logs in the /tmp/<user> directory, but you can change this with the hive.querylog.location property in hive-site.xml.
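To relocate the query log, the property goes in hive-site.xml. A sketch (the path is an example):

```xml
<!-- hive-site.xml -->
<property>
  <name>hive.querylog.location</name>
  <value>/var/log/hive/querylog</value>
</property>
```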

Hive - How to query a table to get its own name?

I want to write a query such that it returns the table name (of the table I am querying) and some other values. Something like:
select table_name, col1, col2 from table_name;
I need to do this in Hive. Any idea how I can get the table name of the table I am querying?
Basically, I am creating a lookup table that stores the table name and some other information on a daily basis in Hive. Since Hive does not (at least the version we are using) support full-fledged INSERTs, I am trying to use the workaround where we can INSERT into a table with a SELECT query that queries another table. Part of this involves actually storing the table name as well. How can this be achieved?
For the purposes of my use case, this will suffice:
select 'table_name', col1, col2 from table_name;
It returns the table name with the other columns that I will require.
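Combined with the INSERT ... SELECT workaround described in the question, that looks like the following (lookup_table and the column names are illustrative):

```sql
-- The table name is just a string literal, repeated once per source table
INSERT INTO TABLE lookup_table
SELECT 'my_source_table', col1, col2
FROM my_source_table;
```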
