Clickhouse SQL: Reshape data from long format to wide format

I'm using the ClickHouse SQL dialect. After array decomposition I have data in the following format.
| id | timestamp           | property_key | property_value |
|----|---------------------|--------------|----------------|
| 01 | 2019-09-25 16:24:38 | query        | Palmera        |
| 01 | 2019-09-25 16:24:38 | found_items  | 10             |
| 02 | 2019-09-25 13:11:09 | query        | pigeo          |
| 02 | 2019-09-25 13:11:09 | found_items  | 0              |
| 03 | 2019-09-25 16:08:13 | query        | harmon         |
| 03 | 2019-09-25 16:08:13 | found_items  | 17             |
I obtained this result with the following query:
SELECT
    id,
    timestamp,
    properties.key AS property_key,
    properties.value AS property_value
FROM (
    SELECT
        rowNumberInAllBlocks() AS id,
        timestamp,
        properties.key,
        properties.value
    FROM database.table
    WHERE timestamp BETWEEN toDateTime('2019-09-16 11:26:56')
                        AND toDateTime('2019-09-26 11:26:56')
    ORDER BY timestamp
)
ARRAY JOIN properties
WHERE properties.key IN ('query', 'found_items')
I need to extract the queries whose found_items value is equal to 0, but I can't figure out how to reshape the data from the long format to the wide format. The expected result is either of the following:
| id | timestamp           | query           | found_items |
|----|---------------------|-----------------|-------------|
| 02 | 2019-09-25 13:11:09 | pigeo           | 0           |
| 15 | 2019-09-25 16:08:13 | coche           | 0           |
| 27 | 2019-09-16 13:19:46 | panitos pampers | 0           |
OR
| id | timestamp           | property_key | property_value  |
|----|---------------------|--------------|-----------------|
| 02 | 2019-09-25 13:11:09 | query        | pigeo           |
| 15 | 2019-09-25 16:08:13 | query        | coche           |
| 27 | 2019-09-16 13:19:46 | query        | panitos pampers |

Try this query:
SELECT
    id,
    groupArray(timestamp)[1] AS timestamp,
    groupArray(properties.key)[1] AS property_key,
    groupArray(properties.value) AS property_value
FROM (
    SELECT
        rowNumberInAllBlocks() AS id,
        timestamp,
        properties.key,
        properties.value
    FROM test.test_011
    WHERE timestamp BETWEEN toDateTime('2019-09-16 11:26:56') AND toDateTime('2019-09-26 11:26:56')
        AND properties.value[indexOf(properties.key, 'found_items')] = '0'
    ORDER BY timestamp
)
ARRAY JOIN properties
WHERE properties.key IN ('query' /*, ..*/)
GROUP BY id, properties.key
ORDER BY id
/* Result
┌─id─┬───────────timestamp─┬─property_key─┬─property_value────────┐
│  0 │ 2019-09-25 13:11:09 │ query        │ ['pigeo']             │
│  1 │ 2019-09-16 13:19:46 │ query        │ ['panitos','pampers'] │
└────┴─────────────────────┴──────────────┴───────────────────────┘
*/
/* prepare test data */
CREATE TABLE test.test_011 (
    timestamp DateTime,
    properties Nested(key String, value String)
) ENGINE = Memory;

INSERT INTO test.test_011 VALUES
    (toDateTime('2019-09-25 16:24:38'), ['query', 'found_items'], ['Palmera', '10']),
    (toDateTime('2019-09-25 13:11:09'), ['query', 'found_items'], ['pigeo', '0']),
    (toDateTime('2019-09-25 16:08:13'), ['query', 'found_items'], ['harmon', '17']),
    (toDateTime('2019-09-16 13:19:46'), ['found_items', 'query', 'query'], ['0', 'panitos', 'pampers']),
    (toDateTime('2019-09-25 16:22:38'), ['query', 'query'], ['test', 'test']);
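
If you want the first (wide) layout directly, here is a minimal sketch along the same lines; it relies on the same indexOf trick as the query above and assumes each row carries at most one value per key (untested):

SELECT
    rowNumberInAllBlocks() AS id,
    timestamp,
    properties.value[indexOf(properties.key, 'query')] AS query,
    properties.value[indexOf(properties.key, 'found_items')] AS found_items
FROM test.test_011
WHERE timestamp BETWEEN toDateTime('2019-09-16 11:26:56') AND toDateTime('2019-09-26 11:26:56')
    -- indexOf returns 0 when the key is absent; arr[0] should then yield an empty
    -- string, so rows without a found_items entry are dropped by this filter
    AND properties.value[indexOf(properties.key, 'found_items')] = '0'
ORDER BY timestamp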

Related

Query Hive JSON SerDe table having nested ARRAY & STRUCT combination

I'm trying to query a Hive table built on top of JSON data. Using json2Hive I was able to generate the DDL and create the table after removing unnecessary fields.
create external table user_tables.sample_json_table (
  `apps` struct<
    `app`: array<struct<
      `id`: string,
      `queue`: string,
      `finalstatus`: string,
      `trackingurl`: string,
      `applicationtype`: string,
      `applicationtags`: string,
      `startedtime`: string,
      `launchtime`: string,
      `finishedtime`: string,
      `memoryseconds`: string,
      `vcoreseconds`: string,
      `resourcesecondsmap`: struct<
        `entry`: struct<
          `key`: string,
          `value`: string
        >
      >
    >>
  >
)
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/xyz/location/';
Now I'm stuck trying to figure out how to query each field of this schema.
I've checked several articles, but all of them are case-specific; I need a generic explanation or an example of how to query each field under an array/struct. :)
I only care about the multiple 'app' entries and would like to import them into another table, with a separate field for each attribute.
Sample json data:
{"apps":{"app":[{"id":"application_282828282828_12717","user":"xyz","name":"xyz-4b6bdae2-1a0c-4772-bd8e-0d7454268b82","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12717/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"ABC,xyz_20221107070124_2beb5d90-24c7-4b1b-b977-3c9af1397195,userid=dummy","priority":0,"startedtime":1667822485626,"launchtime":1667822485767,"finishedtime":1667822553365,"elapsedtime":67739,"amcontainerlogs":"http://dingdong:8042/node/containerlogs/container_e65_282828282828_12717_01_000001/xyz","amhosthttpaddress":"dingdong:8042","amrpcaddress":"dingdong:46457","masternodeid":"dingdong:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":1264304,"vcoreseconds":79,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"1264304"},"entry":{"key":"vcores","value":"79"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12724","user":"xyz","name":"xyz-94962a3e-d230-4fd0-b68b-01b59dd3299d","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12724/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"ZZZ_,xyz_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy","priority":0,"startedtime":1667822585231,"launchtime":1667822585437,"finishedtime":1667822631435,"elapsedtime":46204,"amcontainerlogs":"http://ding:8042/node/containerlogs/container_e65_282828282828_12724_01_000002/xyz","amhosthttpaddress":"ding:8042","amrpcaddress":"ding:46648","masternodeid":"ding:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":5603339,"vcoreseconds":430,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"5603339"},"entry":{"key":"vcores","value":"430"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"time_out","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12736","user":"xyz","name":"xyz-1a9c73ef-2992-40a5-aaad-9f0688bb04f4","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12736/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, 
killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"BLAHBLAH,xyz_20221107070609_8d261352-3efa-46c5-a5a0-8a3cd745d180,userid=dummy","priority":0,"startedtime":1667822771170,"launchtime":1667822773663,"finishedtime":1667822820351,"elapsedtime":49181,"amcontainerlogs":"http://dong:8042/node/containerlogs/container_e65_282828282828_12736_01_000001/xyz","amhosthttpaddress":"dong:8042","amrpcaddress":"dong:34266","masternodeid":"dong:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":1300011,"vcoreseconds":89,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"1300011"},"entry":{"key":"vcores","value":"89"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12735","user":"xyz","name":"xyz-d5f56a0a-9c6b-4651-8f88-6eaff5953777","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12735/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"HAHAHA_,xyz_20221107070605_a082d9d8-912f-4278-a2ef-5dfe66089fd7,userid=dummy","priority":0,"startedtime":1667822766897,"launchtime":1667822766999,"finishedtime":1667822796759,"elapsedtime":29862,"amcontainerlogs":"http://dung:8042/node/containerlogs/container_e65_282828282828_12735_01_000001/xyz","amhosthttpaddress":"dung:8042","amrpcaddress":"dung:42765","masternodeid":"dung:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":669695,"vcoreseconds":44,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"669695"},"entry":{"key":"vcores","value":"44"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}}]}}
Sample query output:
id | queue | finalStatus | trackingurl |....
-----------------------------------------------------------
application_282828282828_12717 | root.users.dummy | succeeded | ...
application_282828282828_12724 | root.users.dummy2 | failed | ....
For anyone looking to do something similar, I found this article very helpful, with a clear explanation: https://community.cloudera.com/t5/Support-Questions/Complex-Json-transformation-using-Hive-functions/m-p/236476
Below is the query that parses it using LATERAL VIEW OUTER inline, in case anyone is in the same boat:
select ex1.* from user_tables.sample_json_table cym LATERAL VIEW OUTER inline(cym.apps.app) ex1;
| id | queue | finalstatus | trackingurl | applicationtype | applicationtags | startedtime | launchtime | finishedtime | memoryseconds | vcoreseconds | resourcesecondsmap |
| ------------------------------- | ----------------- | ----------- | ------------------------------------------------------- | --------------- | --------------------------------------------------------------------------------------- | ------------- | ------------- | ------------- | ------------- | ------------ | ---------------------------------------- |
| application_1667627410794_12717 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12717/ | tez | \_xyz,test-app-24c7-4b1b-b977-3c9af1397195,userid=dummy1 | 1667822485626 | 1667822485767 | 1667822553365 | 1264304 | 79 | {"entry":{"key":"vcores","value":"79"}} |
| application_1667627410794_12724 | root.users.dummy3 | succeeded | http://dang:8088/proxy/application_1667627410794_12724/ | tez | \_generate_stuff,hive_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy3 | 1667822585231 | 1667822585437 | 1667822631435 | 5603339 | 430 | {"entry":{"key":"vcores","value":"430"}} |
| application_1667627410794_12736 | root.users.dummy1 | succeeded | http://dang:8088/proxy/application_1667627410794_12736/ | tez | \_sample_job,test-zzz-3efa-46c5-a5a0-8a3cd745d180,userid=dummy1 | 1667822771170 | 1667822773663 | 1667822820351 | 1300011 | 89 | {"entry":{"key":"vcores","value":"89"}} |
| application_1667627410794_12735 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12735/ | tez | \_mixed_article,placebo_2-912f-4278-a2ef-5dfe66089fd7,userid=dummy2 | 1667822766897 | 1667822766999 | 1667822796759 | 669695 | 44 | {"entry":{"key":"vcores","value":"44"}} |
Additional note: although my requirement no longer needs it, if anyone can suggest how to further parse the last field, resourcesecondsmap, into its key/value pairs, that would be great to know; basically, use the map key as the field name and the map value as the actual value in that field (a rough attempt is sketched after the table below):
Desired Output:
| id | queue | finalstatus | trackingurl | applicationtype | applicationtags | startedtime | launchtime | finishedtime | memoryseconds | vcoreseconds | vcores-value |
| ------------------------------- | ----------------- | ----------- | ------------------------------------------------------- | --------------- | --------------------------------------------------------------------------------------- | ------------- | ------------- | ------------- | ------------- | ------------ | ------------ |
| application_1667627410794_12717 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12717/ | tez | \_xyz,test-app-24c7-4b1b-b977-3c9af1397195,userid=dummy1 | 1667822485626 | 1667822485767 | 1667822553365 | 1264304 | 79 | 79 |
| application_1667627410794_12724 | root.users.dummy3 | succeeded | http://dang:8088/proxy/application_1667627410794_12724/ | tez | \_generate_stuff,hive_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy3 | 1667822585231 | 1667822585437 | 1667822631435 | 5603339 | 430 | 430 |
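
Not something I have tested against this data, but with the DDL above, where resourcesecondsmap.entry is declared as a single struct (so the JSON SerDe can only keep one of the duplicated "entry" keys), pulling the key/value out should just be struct dot-access on the lateral view alias. A sketch:

select ex1.id,
       ex1.queue,
       -- hypothetical column aliases; field access assumes the DDL above
       ex1.resourcesecondsmap.entry.`key`   as resource_key,
       ex1.resourcesecondsmap.entry.`value` as resource_value
from user_tables.sample_json_table cym
LATERAL VIEW OUTER inline(cym.apps.app) ex1;

To keep both the memory-mb and vcores entries you would likely need to declare entry as array<struct<`key`:string,`value`:string>> and explode it with another lateral view; how the SerDe maps the duplicated JSON key in that case is an assumption on my part.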

How to inspect Druid datasources with Hive

Yesterday I created my first Druid datasource from Hive. Today, I'm not sure it works...
First, I ran the following code to create my datasource:
SET hive.druid.broker.address.default = 10.20.173.30:8082;
SET hive.druid.metadata.username = druid;
SET hive.druid.metadata.password = druid_password;
SET hive.druid.metadata.db.type = postgresql;
SET hive.druid.metadata.uri = jdbc:postgresql://10.20.173.31:5432/druid;

CREATE EXTERNAL TABLE test (
  `__time` TIMESTAMP,
  `userId` STRING,
  `lang` STRING,
  `location` STRING,
  `name` STRING
)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler';
I can see this datasource in my Hive environment. How can I tell that it is a Druid datasource and not a plain Hive table?
I tried this, but I still don't know whether it's a Druid datasource.
DESCRIBE FORMATTED test;
Result
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| __time | timestamp | from deserializer |
| userid | string | from deserializer |
| lang | string | from deserializer |
| location | string | from deserializer |
| name | string | from deserializer |
| # Detailed Table Information | NULL | NULL |
| Database: | druid_datasources | NULL |
| OwnerType: | USER | NULL |
| Owner: | hive | NULL |
| CreateTime: | Tue Oct 15 12:42:22 CEST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://10.20.173.30:8020/warehouse/tablespace/external/hive/druid_datasources.db/test | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"__time\":\"true\",\"lang\":\"true\",\"location\":\"true\",\"name\":\"true\",\"userid\":\"true\"}} |
| | EXTERNAL | TRUE |
| | bucketing_version | 2 |
| | druid.datasource | druid_datasources.test ||
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | storage_handler | org.apache.hadoop.hive.druid.DruidStorageHandler |
| | totalSize | 0 |
| | transient_lastDdlTime | 1571136142 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.druid.serde.DruidSerDe | NULL |
| InputFormat: | null | NULL |
| OutputFormat: | null | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
Did I do it right, or is this just a Hive table with Druid parameters?
Can someone explain more about the Hive/Druid interactions?
Thanks :D
I think you registered your Druid datasource in Hive. Now you can run your queries through HiveServer on top of this table.
Your table definition looks correct to me; I think you managed to integrate the Druid datasource with Hive. You can see the Druid-related properties in the table parameters.
When you query the table, the processing engine depends on the query: Hive can use HiveServer along with Druid, a combination of both, or one of them standalone, depending on whether the query can be converted into a Druid query or not.
You can refer to this doc for more info on Hive/Druid interactions: https://cwiki.apache.org/confluence/display/Hive/Druid+Integration (see the "Querying Druid from Hive" section).
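
A quick way to double-check from Beeline (a sketch; the expected lines are taken from your DESCRIBE FORMATTED output above): a Druid-backed table carries the Druid storage handler and a druid.datasource table property, while a plain Hive table does not.

SHOW CREATE TABLE test;
-- expect to see, among other things:
--   STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
--   TBLPROPERTIES ( ... 'druid.datasource'='druid_datasources.test' ... )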

Hive date format matching

How can I match a particular date format in a Hive query? I have to get the rows whose date format differs from the one used by most rows.
E.g. most of my rows have the date format MM/dd/yyyy, and I have to list all rows that do not match that format.
+----------------------------+--------------------------+--------------------+-----------+
| AllocationActBankAccountID | GiftCardActBankAccountID | UpdateTimeStampUtc | Date |
+----------------------------+--------------------------+--------------------+-----------+
| 14 | 14 | 41:39.8 | 4/19/2016 |
| 14 | 14 | 37:16.4 | 4/20/2016 |
| 14 | 14 | 52:15.2 | 4/21/2016 |
| 14 | 14 | 52:15.2 | 2/11/2019 |
| 14 | 14 | 52:15.2 | 12-Feb-19 |*
| 14 | 14 | 41:39.8 | 2/13/2019 |
+----------------------------+--------------------------+--------------------+-----------+
I want to get the row marked with * (Date = 12-Feb-19).
select *
from mytable
Where date not rlike '^([1-9]|1[0-2])/([1-9]|[1-2][0-9]|3[0-1])/(19|20)\\d{2}$'
or
select *
from mytable
Where not
(
date rlike '^\\d{1,2}/\\d{1,2}/\\d{4}$'
and cast(split (date,'/')[0] as int) between 1 and 12
and cast(split (date,'/')[1] as int) between 1 and 31
and cast(split (date,'/')[2] as int) between 1900 and 2099
)
or
select date
from mytable
Where coalesce(from_unixtime(to_unix_timestamp(date,'M/d/y')),'0000-01-01')
not between date '1900-01-01' and date '2099-12-31'
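
As a quick sanity check of the first pattern against the sample values (a sketch; FROM-less SELECT needs a reasonably recent Hive, otherwise run the expressions against any table):

select '12-Feb-19' rlike '^([1-9]|1[0-2])/([1-9]|[1-2][0-9]|3[0-1])/(19|20)\\d{2}$' as matches_bad_row,   -- false, so NOT RLIKE keeps this row
       '4/19/2016' rlike '^([1-9]|1[0-2])/([1-9]|[1-2][0-9]|3[0-1])/(19|20)\\d{2}$' as matches_good_row;  -- true, so NOT RLIKE filters it out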

Hive - same sequence of records in array

I have a table with data at the hour level. I want to find the count of hours, plus the values of col1 and col2 for all hours collected into arrays. Input table:
+-----+-----+-----+
| hour| col1| col2|
+-----+-----+-----+
| 00 | 0.0 | a |
| 04 | 0.1 | b |
| 08 | 0.2 | c |
| 12 | 0.0 | d |
+-----+-----+-----+
I am using the query below to get the column values into arrays.
Query:
select count(hr),
       map_values(str_to_map(concat_ws(',', collect_set(concat_ws(':', reflect('java.util.UUID','randomUUID'), cast(col1 as string)))))) as col1_arr,
       map_values(str_to_map(concat_ws(',', collect_set(concat_ws(':', reflect('java.util.UUID','randomUUID'), cast(col2 as string)))))) as col2_arr
from table;
In the output I am getting, the values in col2_arr are not in the same sequence as col1_arr. Please suggest how I can get the values in the arrays for the different columns in the same sequence.
+-----------+-----------------+----------+
| count(hr) | col1_arr        | col2_arr |
+-----------+-----------------+----------+
| 4         | 0.0,0.1,0.2,0.0 | b,a,c,d  |
+-----------+-----------------+----------+
Required output:
+-----------+-----------------+----------+
| count(hr) | col1_arr        | col2_arr |
+-----------+-----------------+----------+
| 4         | 0.0,0.1,0.2,0.0 | a,b,c,d  |
+-----------+-----------------+----------+
Thanks
select count(*) as cnt
,concat_ws(',',sort_array(collect_list(hour))) as hour
,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,cast(col1 as string))))),'..:','') as col1
,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,col2)))),'..:','') as col2
from mytable
;
+-----+-------------+-------------+---------+
| cnt | hour | col1 | col2 |
+-----+-------------+-------------+---------+
| 4 | 00,04,08,12 | 0,0.1,0.2,0 | a,b,c,d |
+-----+-------------+-------------+---------+
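
The idea, as I read it, is that prefixing every value with its zero-padded hour gives all the collect_list arrays the same sort key, and regexp_replace then strips the two-character hour prefix ('..:'). A small illustration of the intermediate value before stripping, using the sample rows above (hypothetical output):

select concat_ws(',', sort_array(collect_list(concat_ws(':', hour, col2)))) as col2_keyed
from mytable;
-- col2_keyed = '00:a,04:b,08:c,12:d'   (these '..:' prefixes are what regexp_replace removes)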

National language supported sort in Hive

I don't have much experience with NLS in Hive. Changing the locale in the client Linux shell doesn't affect the result.
Googling also didn't help me resolve it.
I created a table in Hive:
create table wojewodztwa (kod STRING, nazwa STRING, miasto_woj STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Loaded the data:
LOAD DATA LOCAL INPATH './wojewodztwa.txt' OVERWRITE INTO TABLE wojewodztwa;
Contents of file wojewodztwa.txt:
02,dolnośląskie,Wrocław
04,kujawsko-pomorskie,Bydgoszcz i Toruń
06,lubelskie,Lublin
08,lubuskie,Gorzów Wielkopolski i Zielona Góra
10,łódzkie,Łódź
12,małopolskie,Kraków
14,mazowieckie,Warszawa
16,opolskie,Opole
18,podkarpackie,Rzeszów
20,podlaskie,Białystok
22,pomorskie,Gdańsk
24,śląskie,Katowice
26,świętokrzyskie,Kielce
28,warmińsko-mazurskie,Olsztyn
30,wielkopolskie,Poznań
32,zachodniopomorskie,Szczecin
beeline> !connect jdbc:hive2://172.16.45.211:10001 gpadmin changeme org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://172.16.45.211:10001
Connected to: Hive (version 0.11.0-gphd-2.1.1.0)
Driver: Hive (version 0.11.0-gphd-2.1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://172.16.45.211:10001> select kod,nazwa from wojewodztwa order by nazwa;
+------+----------------------+
| kod | nazwa |
+------+----------------------+
| 02 | dolnośląskie |
| 04 | kujawsko-pomorskie |
| 06 | lubelskie |
| 08 | lubuskie |
| 14 | mazowieckie |
| 12 | małopolskie |
| 16 | opolskie |
| 18 | podkarpackie |
| 20 | podlaskie |
| 22 | pomorskie |
| 28 | warmińsko-mazurskie |
| 30 | wielkopolskie |
| 32 | zachodniopomorskie |
| 10 | łódzkie |
| 24 | śląskie |
| 26 | świętokrzyskie |
+------+----------------------+
16 rows selected (19,702 seconds)
This is not the correct result: all words starting with language-specific characters end up at the end.
Hive does not support collations. Strings will sort according to Java String.compareTo rules.
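
If you just need the Polish letters to land near their Latin counterparts, a rough workaround (a sketch, not a real collation; it assumes translate() handles these Unicode characters correctly on your Hive build) is to order by a diacritic-folded key:

select kod, nazwa
from wojewodztwa
-- fold Polish diacritics to their base letters, for ordering only
order by translate(lower(nazwa), 'ąćęłńóśźż', 'acelnoszz'), nazwa;

Note this only approximates Polish alphabetical order: true collation treats ą, ć, etc. as separate letters that sort after their base letters.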
