How to query sorted/indexed columns in Impala

I have to build a POC with Hadoop for a database using interactive queries (a ~300 TB log database). I'm trying Impala, but I haven't found any way to use sorted or indexed data. I'm a newbie, so I don't even know whether it is possible.
How do you query sorted/indexed columns in Impala?
By the way, here is my table's DDL (simplified).
I would like fast access on the "column_to_sort" column below.
CREATE TABLE IF NOT EXISTS myTable (
unique_id STRING,
column_to_sort INT,
content STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED AS textfile;
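Impala has no secondary indexes, so there is no way to "index" column_to_sort directly. The usual workaround is Parquet storage plus the SORT BY clause (available in Impala 2.9+), so that the min/max statistics written per data file let scans skip blocks that cannot match. A sketch, assuming Impala 2.9+; the table name is illustrative:

```sql
-- Sketch (assumes Impala 2.9+): SORT BY orders rows within each data file,
-- so Parquet min/max stats on column_to_sort let scans skip row groups.
CREATE TABLE IF NOT EXISTS myTable_sorted (
  unique_id STRING,
  column_to_sort INT,
  content STRING
)
SORT BY (column_to_sort)
STORED AS PARQUET;
```

Range predicates such as `WHERE column_to_sort BETWEEN 10 AND 20` then benefit from the skipping; partitioning on a coarser column can narrow scans further.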

Related

Rules to be followed before creating a Hive partitioned table

As part of my requirement, I have to create a new Hive table and insert into it programmatically. To do that, I have the following DDL to create a Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName String,
ssn String,
hiveCount String,
sapCount String,
countDifference String,
percentDifference String,
sap_UpdTms String,
hive_UpdTms String)
COMMENT 'This table contains record count of corresponding tables of all the source systems present on Hive & SAP'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '';
Inserting data into a partition of a Hive table I can handle with an insert query from the program. In the DDL above I haven't added a "PARTITIONED BY" clause because I am not totally clear on the rules for partitioning a Hive table. The couple of rules I know are:
While inserting the data from a query, partition column should be the last one.
PARTITIONED BY column shouldn't be an existing column in the table.
Could anyone let me know if there are any other rules for partitioning a Hive table?
Also, in my case, we run the program twice a day to insert data into the table, and each run produces 8k to 10k records. I am thinking of adding a PARTITIONED BY column for the current date (just "mm/dd/yyyy") and populating it from the code.
Is there a better way to implement the partitioning for my requirement, if adding a date (as a string) is not recommended?
What you mentioned is fine, but I would recommend the yyyyMMdd format because it sorts better and is less ambiguous than seeing 03/05 and not knowing which is the day and which is the month.
If you want to run it twice a day, and you care about the time the job runs, then use PARTITIONED BY (dt STRING, hour STRING).
Also, don't use STORED AS TEXTFILE. Use Parquet or ORC instead.
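Putting that advice together, a sketch of a date-partitioned, ORC-backed version of the table and a dynamic-partition insert (columns abbreviated; `some_source_table` is a hypothetical source, and `date_format` assumes Hive 1.2+):

```sql
-- Sketch: date-partitioned, ORC-backed version of the table (columns abbreviated).
CREATE EXTERNAL TABLE IF NOT EXISTS countData_part (
  tableName STRING,
  hiveCount STRING,
  sapCount STRING
)
PARTITIONED BY (dt STRING)   -- yyyyMMdd, e.g. '20170305'
STORED AS ORC;

-- Dynamic-partition insert: the partition column comes last in the SELECT.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE countData_part PARTITION (dt)
SELECT tableName, hiveCount, sapCount,
       date_format(current_date(), 'yyyyMMdd') AS dt
FROM some_source_table;      -- hypothetical source
```

This keeps both rules above: the partition column is last in the SELECT, and dt is not one of the table's regular columns.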

HBase to Hive Mapping table is not showing up complete data

We have an HBase table with one column family and 1.5 billion records in it.
The HBase row count was retrieved with:
count '<tablename>', {CACHE => 1000000}
The HBase-to-Hive mapping was done with the command below.
create external table stagingdata(
rowkey String,
col1 String,
col2 String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key,n:col1,n:col2'
)
TBLPROPERTIES('hbase.table.name' = 'hbase_staging_data');
But when we retrieve the row count from Hive with the command below,
select count(*) from stagingdata;
it shows only about 140 million rows in the Hive-mapped table.
We tried the same approach on a smaller HBase table with 100 million records, and all of the records showed up in the Hive-mapped table.
My question is: why aren't the complete 1.5 billion records showing up in Hive?
Are we missing anything here?
Your immediate answer would be highly appreciated.
Thanks,
Madhu.
What you see in Hive is the latest version per key, not all the versions of a key:
there is currently no way to access the HBase timestamp attribute, and
queries always access data with the latest timestamp.
Hive HBase Integration
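One way to confirm this is to compare against the number of distinct row keys. HBase ships a MapReduce RowCounter job that counts rows (i.e., unique keys) rather than cell versions; a sketch, run on the cluster:

```shell
# Sketch: counts distinct row keys, one per row regardless of cell versions.
hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'hbase_staging_data'
```

If this reports roughly the same 140 million that Hive sees, the remaining ~1.36 billion cells are older versions of the same keys.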

Spark write data into partitioned Hive table very slow

I want to store a Spark DataFrame into a Hive table in a normal, readable text format. To do so, I first ran:
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
My DataFrame is like:
final_data1_df = sqlContext.sql("select a, b from final_data")
and I am trying to write it by:
final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")
but this is very slow, even slower than a Hive table write. To resolve this, I thought of defining the partitioning through a Hive DDL statement and then loading the data like:
sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1(
a BIGINT
)
PARTITIONED BY (b INT)
"""
)
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (b)
select * from final_data1""")
but this gives a partitioned Hive table whose data is still Parquet-formatted. Am I missing something here?
When you create the table explicitly, that DDL defines the table. Text file is normally the default in Hive, but that default could have been changed in your environment. Add "STORED AS TEXTFILE" at the end of the CREATE statement to make sure the table is plain text.

WHY does this simple Hive table declaration work? As if by magic

The following HQL works to create a Hive table in HDInsight that I can successfully query. But I have several questions about WHY it works:
My data rows are, in fact, terminated by carriage return/line feed, so why does COLLECTION ITEMS TERMINATED BY '\002' work? And what is \002 anyway? And no location for the blob is specified, so, again, why does this work?
All attempts at creating the same table while specifying "CREATE EXTERNAL TABLE ... LOCATION '/user/hive/warehouse/salesorderdetail'" have failed. The table is created, but no data is returned. Leave off "EXTERNAL", don't specify any location, and suddenly it works. Why?
CREATE TABLE IF NOT EXISTS default.salesorderdetail(
SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
Any insights are greatly appreciated.
UPDATE: Thanks for the help so far. Here's the exact syntax I'm using to attempt external table creation. (I've only changed the storage account name.) I don't see what I'm doing wrong.
drop table default.salesorderdetailx;
CREATE EXTERNAL TABLE default.salesorderdetailx(SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
LOCATION 'wasb://mycn-1@my.blob.core.windows.net/mycn-1/hive/warehouse/salesorderdetailx'
When you create your cluster in HDInsight, you have to specify its underlying blob storage, and Hive assumes you are referencing that blob storage. You don't need to specify a location because your query creates an internal table (see point 2 below), which is created at a default location. External tables need to specify a location in Azure blob storage (outside the cluster) so that the table's data is not deleted when the cluster is dropped. See the Hive DDL documentation for more information. As for '\002': Hive's default delimiters are \001 for fields, \002 for collection items, and \003 for map keys, so that DDL is simply spelling out the defaults.
By default, tables are created as internal, and you have to specify "EXTERNAL" to make them external tables.
Use EXTERNAL tables when:
Data is used outside Hive
You need data to be updateable in real time
Data is needed when you drop the cluster or the table
Hive should not own data and control settings, directories, etc.
Use INTERNAL tables when:
You want Hive to manage the data and storage
Short term usage (like a temp table)
Creating table based on existing table (AS SELECT)
Does the container "user/hive/warehouse/salesorderdetail" exist in your blob storage? That might explain why your external table query is failing.
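For reference, the general shape Hive expects for an Azure blob location (a sketch; container, account, and path are placeholders, not real values):

```sql
-- Sketch: external table over Azure blob storage.
-- <container>, <account>, and <path> are placeholders.
CREATE EXTERNAL TABLE default.salesorderdetailx (
  SalesOrderID INT,
  ProductID INT,
  OrderQty INT,
  LineTotal DECIMAL
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://<container>@<account>.blob.core.windows.net/<path>';
```

Note the '@' between container and storage account in the wasb URI, and that the container itself must already exist.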

Pig: get data from hive table and add partition as column

I have a partitioned Hive table that I want to load in a Pig script, and I would like to have the partition available as a column as well.
How can I do that?
Table definition in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS transactions
(
column1 string,
column2 string
)
PARTITIONED BY (datestamp string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/path';
Pig script:
%default INPUT_PATH '/path'
A = LOAD '$INPUT_PATH'
USING PigStorage('|')
AS (
column1:chararray,
column2:chararray,
datestamp:chararray
);
The datestamp column is not populated. Why is that?
I'm sorry, I didn't get the part that says "add partition as column also". Once created, partition keys behave like regular columns. What exactly do you need?
Also, you are loading the data directly from a given HDFS location, not as a Hive table. If you intend to use Pig to load/store data from/into a Hive table, you should use HCatalog.
For example :
A = LOAD 'transactions' USING org.apache.hcatalog.pig.HCatLoader();
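HCatLoader exposes partition keys as ordinary fields, so datestamp can be projected and filtered like any other column; filtering on it also prunes partitions at load time. A sketch (the partition value is hypothetical):

```pig
-- Sketch: load via HCatalog; the partition key datestamp is a regular field.
A = LOAD 'transactions' USING org.apache.hcatalog.pig.HCatLoader();
-- Filtering on the partition key prunes partitions before the data is read.
B = FILTER A BY datestamp == '20150101';   -- hypothetical partition value
C = FOREACH B GENERATE column1, column2, datestamp;
DUMP C;
```

(On newer Hive releases the loader class is org.apache.hive.hcatalog.pig.HCatLoader.)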
