Hive - External table creation - hadoop
I am learning Hive and read an article about when to use Hive external tables; it mentioned the statements below:
- To query data stored in an external system such as Amazon S3
- Avoid bringing that data into HDFS
Can anyone elaborate on the statement "Avoid bringing that data into HDFS"? The LOAD DATA LOCAL command loads a local file into HDFS, with Hive applying the format on top.
Is it possible to access data that is outside of HDFS?
"Is it possible to access the data which is out of HDFS?"
Hive can read data on any Hadoop-compatible filesystem, not only HDFS.
"Can someone elaborate on the statement 'Avoid bringing that data into HDFS'?"
With the example of S3, you can create an external table with a location of s3a://bucket/path; there is no need to bring the data into HDFS unless you really need the read speed of HDFS compared to S3. However, to persist a dataset from an ephemeral cloud cluster, results should be written back to whatever long-term storage is provided.
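As an illustrative sketch of this (the bucket, path, and columns below are hypothetical, and the cluster must already have the S3A connector and credentials configured):

```sql
-- Hypothetical example: the table reads data in place on S3;
-- nothing is copied into HDFS.
CREATE EXTERNAL TABLE logs_s3 (
  ts STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3a://my-bucket/logs/';
```

Dropping such a table removes only the metastore entry; the objects in the bucket are untouched.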
It is possible, and you can try this yourself. On CDH, I have a file extn/t.txt:
[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ cat extn/t.txt
something
[cloudera@quickstart ~]$
I can now create an external table to access this file as follows:
create external table tbl(line string)
location 'file:///home/cloudera/extn';
Describe table
INFO : OK
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| line | string | |
+-----------+------------+----------+--+
1 row selected (0.152 seconds)
0: jdbc:hive2://localhost:10000>
Select
INFO : OK
+------------+--+
| tbl.line |
+------------+--+
| something |
+------------+--+
1 row selected (0.134 seconds)
0: jdbc:hive2://localhost:10000>
Describe formatted
+-------------------------------+----------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| line | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | cloudera | NULL |
| CreateTime: | Tue Feb 20 12:49:25 PST 2018 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | file:/home/cloudera/extn | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | false |
| | EXTERNAL | TRUE |
| | numFiles | 0 |
| | numRows | -1 |
| | rawDataSize | -1 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1519159765 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+-----------------------+--+
LOAD DATA is different; please check this: External Table vs Load Data
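As a hedged sketch of the difference (table names are hypothetical, reusing the file from the example above): LOAD DATA moves or copies the file into the table's own location, while an external table reads the file where it already lives:

```sql
-- Managed table: LOAD DATA copies the local file into Hive's warehouse
-- directory; dropping the table deletes the copied data.
CREATE TABLE managed_tbl (line STRING);
LOAD DATA LOCAL INPATH '/home/cloudera/extn/t.txt' INTO TABLE managed_tbl;

-- External table: data stays at the given location; dropping the table
-- removes only the metadata, not the file.
CREATE EXTERNAL TABLE external_tbl (line STRING)
LOCATION 'file:///home/cloudera/extn';
```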
Related
Kafka Connect to Hive in HDFS - No data loaded into table
I have Kafka Connect set up to dump data from my Topic into HDFS with Hive enabled:
"confluent.topic.bootstrap.servers": "kafka-1:19092,kafka-2:29092,kafka-3:39092",
"connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
"flush.size": "3",
"hdfs.url": "hdfs://namenode:9000",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"logs.dir": "logs",
"name": "kafka to hdfs - repos",
"topics": "repos",
"topics.dir": "topics",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"hive.integration": "true",
"hive.metastore.uris": "thrift://hive-metastore:9083",
"schema.compatibility": "BACKWARD"
When running the task, data is sent to HDFS:
# hdfs dfs -ls /topics/repos/partition=0
Found 3 items
-rw-r--r-- 3 appuser supergroup 281 2022-12-02 13:42 /topics/repos/partition=0/repos+0+0000000000+0000000002.avro
-rw-r--r-- 3 appuser supergroup 294 2022-12-02 13:42 /topics/repos/partition=0/repos+0+0000000003+0000000005.avro
-rw-r--r-- 3 appuser supergroup 283 2022-12-02 13:42 /topics/repos/partition=0/repos+0+0000000006+0000000008.avro
And the table is created in the Hive metastore:
0: jdbc:hive2://localhost:10000> show tables;
+-----------+
| tab_name |
+-----------+
| repos |
+-----------+
But for some reason the table is empty, i.e. no data is loaded from the HDFS files:
0: jdbc:hive2://localhost:10000> select * from repos;
+------------------+------------------+
| repos.repo_name | repos.partition |
+------------------+------------------+
+------------------+------------------+
When looking at the configuration of the generated table, the hookup looks fine:
0: jdbc:hive2://localhost:10000> DESCRIBE FORMATTED repos;
+-------------------------------+----------------------------------------------------+-----------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+-----------------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| repo_name | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| partition | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | null | NULL |
| CreateTime: | Sat Dec 03 11:46:56 UTC 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://namenode:9000/topics/repos | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} |
| | EXTERNAL | TRUE |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numPartitions | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1670068016 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+-----------------------------+
I can't seem to find any errors written anywhere in the logs. I think it could be related to the sub-folder "partition=0", but I am not quite sure how Hive deals with this. Maybe I am lacking something in the configuration? Or is there something special that has to be done for the data to be loaded?
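One thing worth checking in a setup like this (a hedged suggestion, not a confirmed fix for this question): Hive only returns rows for partitions that are registered in the metastore, so a partition=0 sub-folder written directly to HDFS may need to be registered before SELECT sees it:

```sql
-- Register any partition directories (e.g. partition=0) that exist on
-- HDFS under the table location but are not yet in the metastore.
MSCK REPAIR TABLE repos;
-- Then re-check whether rows appear.
SELECT COUNT(*) FROM repos;
```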
query hive json serde table having nested ARRAY & STRUCT combination
Trying to query a JSON Hive table built on top of JSON data. Using json2Hive, I was able to generate the DDL and create the table after removing unnecessary fields:
create external table user_tables.sample_json_table (
  `apps` struct<
    `app`: array<struct<
      `id`: string,
      `queue`: string,
      `finalstatus`: string,
      `trackingurl`: string,
      `applicationtype`: string,
      `applicationtags`: string,
      `startedtime`: string,
      `launchtime`: string,
      `finishedtime`: string,
      `memoryseconds`: string,
      `vcoreseconds`: string,
      `resourcesecondsmap`: struct<
        `entry`: struct<
          `key`: string,
          `value`: string
        >
      >
    >>
  >
)
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/xyz/location/';
Now I'm stuck trying to figure out how to query each field from the schema below. I checked several articles, but all of them are case specific; I need a generic explanation or example of how to query each field under an array/struct :) I only care about the multiple 'app' subsection entries and would like them imported into another table with a separate field for each.
Sample json data: {"apps":{"app":[{"id":"application_282828282828_12717","user":"xyz","name":"xyz-4b6bdae2-1a0c-4772-bd8e-0d7454268b82","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12717/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"ABC,xyz_20221107070124_2beb5d90-24c7-4b1b-b977-3c9af1397195,userid=dummy","priority":0,"startedtime":1667822485626,"launchtime":1667822485767,"finishedtime":1667822553365,"elapsedtime":67739,"amcontainerlogs":"http://dingdong:8042/node/containerlogs/container_e65_282828282828_12717_01_000001/xyz","amhosthttpaddress":"dingdong:8042","amrpcaddress":"dingdong:46457","masternodeid":"dingdong:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":1264304,"vcoreseconds":79,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"1264304"},"entry":{"key":"vcores","value":"79"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12724","user":"xyz","name":"xyz-94962a3e-d230-4fd0-b68b-01b59dd3299d","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12724/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, 
killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"ZZZ_,xyz_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy","priority":0,"startedtime":1667822585231,"launchtime":1667822585437,"finishedtime":1667822631435,"elapsedtime":46204,"amcontainerlogs":"http://ding:8042/node/containerlogs/container_e65_282828282828_12724_01_000002/xyz","amhosthttpaddress":"ding:8042","amrpcaddress":"ding:46648","masternodeid":"ding:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":5603339,"vcoreseconds":430,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"5603339"},"entry":{"key":"vcores","value":"430"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"time_out","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12736","user":"xyz","name":"xyz-1a9c73ef-2992-40a5-aaad-9f0688bb04f4","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12736/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, 
killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"BLAHBLAH,xyz_20221107070609_8d261352-3efa-46c5-a5a0-8a3cd745d180,userid=dummy","priority":0,"startedtime":1667822771170,"launchtime":1667822773663,"finishedtime":1667822820351,"elapsedtime":49181,"amcontainerlogs":"http://dong:8042/node/containerlogs/container_e65_282828282828_12736_01_000001/xyz","amhosthttpaddress":"dong:8042","amrpcaddress":"dong:34266","masternodeid":"dong:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":1300011,"vcoreseconds":89,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"1300011"},"entry":{"key":"vcores","value":"89"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12735","user":"xyz","name":"xyz-d5f56a0a-9c6b-4651-8f88-6eaff5953777","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12735/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, 
killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"HAHAHA_,xyz_20221107070605_a082d9d8-912f-4278-a2ef-5dfe66089fd7,userid=dummy","priority":0,"startedtime":1667822766897,"launchtime":1667822766999,"finishedtime":1667822796759,"elapsedtime":29862,"amcontainerlogs":"http://dung:8042/node/containerlogs/container_e65_282828282828_12735_01_000001/xyz","amhosthttpaddress":"dung:8042","amrpcaddress":"dung:42765","masternodeid":"dung:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":669695,"vcoreseconds":44,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"669695"},"entry":{"key":"vcores","value":"44"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}}]}}
Sample query output:
id | queue | finalStatus | trackingurl | ....
-----------------------------------------------------------
application_282828282828_12717 | root.users.dummy | succeeded | ...
application_282828282828_12724 | root.users.dummy2 | failed | ....
For anyone looking to perform something similar, I found this article very helpful, with a clear explanation: https://community.cloudera.com/t5/Support-Questions/Complex-Json-transformation-using-Hive-functions/m-p/236476
Below is the query to parse using LATERAL VIEW with inline, in case people are in the same boat:
select ex1.* from user_tables.sample_json_table cym LATERAL VIEW OUTER inline(cym.apps.app) ex1;
| id | queue | finalstatus | trackingurl | applicationtype | applicationtags | startedtime | launchtime | finishedtime | memoryseconds | vcoreseconds | resourcesecondsmap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| application_1667627410794_12717 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12717/ | tez | \_xyz,test-app-24c7-4b1b-b977-3c9af1397195,userid=dummy1 | 1667822485626 | 1667822485767 | 1667822553365 | 1264304 | 79 | {"entry":{"key":"vcores","value":"79"}} |
| application_1667627410794_12724 | root.users.dummy3 | succeeded | http://dang:8088/proxy/application_1667627410794_12724/ | tez | \_generate_stuff,hive_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy3 | 1667822585231 | 1667822585437 | 1667822631435 | 5603339 | 430 | {"entry":{"key":"vcores","value":"430"}} |
| application_1667627410794_12736 | root.users.dummy1 | succeeded | http://dang:8088/proxy/application_1667627410794_12736/ | tez | \_sample_job,test-zzz-3efa-46c5-a5a0-8a3cd745d180,userid=dummy1 | 1667822771170 | 1667822773663 | 1667822820351 | 1300011 | 89 | {"entry":{"key":"vcores","value":"89"}} |
| application_1667627410794_12735 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12735/ | tez | \_mixed_article,placebo_2-912f-4278-a2ef-5dfe66089fd7,userid=dummy2 | 1667822766897 | 1667822766999 | 1667822796759 | 669695 | 44 | {"entry":{"key":"vcores","value":"44"}} |
Additional note: although my requirement no longer needs it, if anyone can suggest how to further parse the last field, resourcesecondsmap, to populate the map key/value, that would be great to know! Basically, use the key as the field name and the value as the actual value in the field.
Desired output:
| id | queue | finalstatus | trackingurl | applicationtype | applicationtags | startedtime | launchtime | finishedtime | memoryseconds | vcoreseconds | vcores-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| application_1667627410794_12717 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12717/ | tez | \_xyz,test-app-24c7-4b1b-b977-3c9af1397195,userid=dummy1 | 1667822485626 | 1667822485767 | 1667822553365 | 1264304 | 79 | 79 |
| application_1667627410794_12724 | root.users.dummy3 | succeeded | http://dang:8088/proxy/application_1667627410794_12724/ | tez | \_generate_stuff,hive_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy3 | 1667822585231 | 1667822585437 | 1667822631435 | 5603339 | 430 | 430 |
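For the follow-up question about resourcesecondsmap, a hedged sketch (untested, derived only from the struct schema in the DDL above): since resourcesecondsmap is declared as a struct containing a single entry struct, its fields can be projected with dot notation. Note the source JSON repeats the "entry" key, so the SerDe will surface only one of the duplicates.

```sql
-- Sketch only: projects the nested entry key/value as columns.
-- `key` and `value` are backquoted because they are reserved words.
select ex1.id,
       ex1.queue,
       ex1.resourcesecondsmap.entry.`key`   as entry_key,
       ex1.resourcesecondsmap.entry.`value` as entry_value
from user_tables.sample_json_table cym
LATERAL VIEW OUTER inline(cym.apps.app) ex1;
```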
NiFi CaptureChangeMySQL converts varchar columns to nulls
I have a problem with Apache NiFi 1.12.1. For some reason unknown to me, CaptureChangeMySQL returns many nulls. Basically, only columns which are int return correct values. I'm new to using NiFi, so I might have missed some obvious thing in configuration. I have the following table:
create table inventory.abc
(
  id int auto_increment primary key,
  first_name varchar(100) not null,
  last_name varchar(100) not null,
  age int not null
);
Processor config:
Bin log settings:
mysql> show variables like '%bin%';
+--------------------------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------------------------+--------------------------------+
| bind_address | * |
| binlog_cache_size | 32768 |
| binlog_checksum | CRC32 |
| binlog_direct_non_transactional_updates | OFF |
| binlog_error_action | ABORT_SERVER |
| binlog_format | ROW |
| binlog_group_commit_sync_delay | 0 |
| binlog_group_commit_sync_no_delay_count | 0 |
| binlog_gtid_simple_recovery | ON |
| binlog_max_flush_queue_time | 0 |
| binlog_order_commits | ON |
| binlog_row_image | FULL |
| binlog_rows_query_log_events | OFF |
| binlog_stmt_cache_size | 32768 |
| binlog_transaction_dependency_history_size | 25000 |
| binlog_transaction_dependency_tracking | COMMIT_ORDER |
| innodb_api_enable_binlog | OFF |
| innodb_locks_unsafe_for_binlog | OFF |
| log_bin | ON |
| log_bin_basename | /var/lib/mysql/mysql-bin |
| log_bin_index | /var/lib/mysql/mysql-bin.index |
| log_bin_trust_function_creators | OFF |
| log_bin_use_v1_row_events | OFF |
| log_statements_unsafe_for_binlog | ON |
| max_binlog_cache_size | 18446744073709547520 |
| max_binlog_size | 1073741824 |
| max_binlog_stmt_cache_size | 18446744073709547520 |
| sql_log_bin | ON |
| sync_binlog | 1 |
+--------------------------------------------+--------------------------------+
29 rows in set (0.00 sec)
And I get results like this:
Any idea why I get so many nulls in the output?
I thought it might be related to the Distributed Map Cache Client, but since this option is not mandatory, I don't think that's the problem.
How to inspect Druid datasources with Hive
Yesterday, I created my first Druid datasource from Hive. Today, I'm not sure that it works... First, I ran the following code to create my DB:
SET hive.druid.broker.address.default = 10.20.173.30:8082;
SET hive.druid.metadata.username = druid;
SET hive.druid.metadata.password = druid_password;
SET hive.druid.metadata.db.type = postgresql;
SET hive.druid.metadata.uri = jdbc:postgresql://10.20.173.31:5432/druid;
CREATE EXTERNAL TABLE test (
  `__time` TIMESTAMP,
  `userId` STRING,
  `lang` STRING,
  `location` STRING,
  `name` STRING
)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
I can see this datasource in my Hive architecture, but how can I know that it is a Druid datasource and not a Hive table? I tested this, but I don't know if it's a Druid datasource:
DESCRIBE FORMATTED test;
Result:
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| __time | timestamp | from deserializer |
| userid | string | from deserializer |
| lang | string | from deserializer |
| location | string | from deserializer |
| name | string | from deserializer |
| # Detailed Table Information | NULL | NULL |
| Database: | druid_datasources | NULL |
| OwnerType: | USER | NULL |
| Owner: | hive | NULL |
| CreateTime: | Tue Oct 15 12:42:22 CEST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://10.20.173.30:8020/warehouse/tablespace/external/hive/druid_datasources.db/test | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"__time\":\"true\",\"lang\":\"true\",\"location\":\"true\",\"name\":\"true\",\"userid\":\"true\"}} |
| | EXTERNAL | TRUE |
| | bucketing_version | 2 |
| | druid.datasource | druid_datasources.test |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | storage_handler | org.apache.hadoop.hive.druid.DruidStorageHandler |
| | totalSize | 0 |
| | transient_lastDdlTime | 1571136142 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.druid.serde.DruidSerDe | NULL |
| InputFormat: | null | NULL |
| OutputFormat: | null | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
Did I do it right, or is it a Hive table with Druid parameters? Can someone explain more about Hive/Druid interactions? Thanks :D
I think you registered your Druid datasource in Hive, and you can now run your queries against this table through HiveServer. Your table definition looks correct to me, so I think you managed to integrate the Druid datasource with Hive; you can see the Druid-related properties in the table. When you query the table, the processing engine used depends on the query: it may use HiveServer along with Druid, a combination of both, or one of them on a standalone basis, depending on whether the query can be converted to a Druid query. You can refer to this doc for more info on Hive/Druid interactions: https://cwiki.apache.org/confluence/display/Hive/Druid+Integration (refer to "Querying Druid from Hive").
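As a quick check (a hedged sketch; the property names are taken from the DESCRIBE FORMATTED output above), the table parameters are what identify a Druid-backed table:

```sql
-- The generated DDL of a Druid-backed table shows the Druid storage
-- handler instead of a plain file format.
SHOW CREATE TABLE test;
-- Or inspect the specific parameter directly.
SHOW TBLPROPERTIES test("storage_handler");
```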
national language supported sort in Hive
I don't have much experience with NLS in Hive. Changing the locale in the client Linux shell doesn't affect the result, and googling also didn't help resolve it. Created a table in Hive:
create table wojewodztwa (kod STRING, nazwa STRING, miasto_woj STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Loaded data:
LOAD DATA LOCAL INPATH './wojewodztwa.txt' OVERWRITE INTO TABLE wojewodztwa;
Contents of file wojewodztwa.txt:
02,dolnośląskie,Wrocław
04,kujawsko-pomorskie,Bydgoszcz i Toruń
06,lubelskie,Lublin
08,lubuskie,Gorzów Wielkopolski i Zielona Góra
10,łódzkie,Łódź
12,małopolskie,Kraków
14,mazowieckie,Warszawa
16,opolskie,Opole
18,podkarpackie,Rzeszów
20,podlaskie,Białystok
22,pomorskie,Gdańsk
24,śląskie,Katowice
26,świętokrzyskie,Kielce
28,warmińsko-mazurskie,Olsztyn
30,wielkopolskie,Poznań
32,zachodniopomorskie,Szczecin
beeline> !connect jdbc:hive2://172.16.45.211:10001 gpadmin changeme org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://172.16.45.211:10001
Connected to: Hive (version 0.11.0-gphd-2.1.1.0)
Driver: Hive (version 0.11.0-gphd-2.1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://172.16.45.211:10001> select kod,nazwa from wojewodztwa order by nazwa;
+------+----------------------+
| kod | nazwa |
+------+----------------------+
| 02 | dolnośląskie |
| 04 | kujawsko-pomorskie |
| 06 | lubelskie |
| 08 | lubuskie |
| 14 | mazowieckie |
| 12 | małopolskie |
| 16 | opolskie |
| 18 | podkarpackie |
| 20 | podlaskie |
| 22 | pomorskie |
| 28 | warmińsko-mazurskie |
| 30 | wielkopolskie |
| 32 | zachodniopomorskie |
| 10 | łódzkie |
| 24 | śląskie |
| 26 | świętokrzyskie |
+------+----------------------+
16 rows selected (19,702 seconds)
This is not the correct result: all words starting with language-specific characters end up at the end.
Hive does not support collations. Strings will sort according to Java String.compareTo rules.
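A common workaround (a hedged sketch, not part of the original answer; the substitution list is an assumption and only approximates Polish collation) is to sort on a derived key that folds the national characters onto their base letters, for example with Hive's translate() UDF, while still displaying the original values:

```sql
-- Approximate locale-aware ordering: build a sort key by mapping each
-- Polish diacritic to its base letter; the displayed nazwa is unchanged.
select kod, nazwa
from wojewodztwa
order by translate(nazwa, 'ąćęłńóśźż', 'acelnoszz');
```

This merges e.g. ł with l for sorting purposes, which is close to, but not identical to, the official Polish alphabet order; a full collation would need a UDF backed by java.text.Collator.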