How to extract JSON from JSON in ClickHouse?

I have a JSON document in my database:
{"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}]}
I need to get all unique values for every key, but I am having trouble extracting the values for keys 'd' and 'e'.
Using:
SELECT
DISTINCT JSONExtractRaw(column, 'c')
FROM t1
I get:
[{"d":3,"e":"str_1"},
{"d":4,"e":"str_2"}]
But if I apply a JSONExtract* function again to that result for key 'd' or key 'e', it returns nothing. How can I solve this?

If needed, I would use a 'safe' query like the one below, which correctly handles members that are out of order or missing. It is not particularly fast, but it is reliable.
SELECT
    json,
    a_and_b,
    d_uniq_values,
    e_uniq_values
FROM
(
    SELECT
        json,
        JSONExtract(json, 'Tuple(a Nullable(Int32), b Nullable(Int32))') a_and_b,
        JSONExtractRaw(json, 'c') c_json,
        range(JSONLength(c_json)) AS array_indices,
        arrayDistinct(arrayMap(i -> JSONExtractInt(c_json, i + 1, 'd'), array_indices)) AS d_uniq_values,
        arrayDistinct(arrayMap(i -> JSONExtractString(c_json, i + 1, 'e'), array_indices)) AS e_uniq_values
    FROM
    (
        /* test data */
        SELECT arrayJoin([
            '{}',
            '{"a":1,"b":2}',
            '{"b":1,"a":2}',
            '{"b":1}',
            '{"a":1,"b":2,"c":[]}',
            '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}]}',
            '{"b":1,"a":2,"c":[{"e":"3","d":1}, {"e":"4","d":2}]}',
            '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}, {"d":3,"e":"str_1"}, {"d":4,"e":"str_1"}, {"d":7,"e":"str_9"}]}'
        ]) AS json
    )
)
FORMAT Vertical;
/* Result:
Row 1:
──────
json: {}
a_and_b: (NULL,NULL)
d_uniq_values: []
e_uniq_values: []
Row 2:
──────
json: {"a":1,"b":2}
a_and_b: (1,2)
d_uniq_values: []
e_uniq_values: []
Row 3:
──────
json: {"b":1,"a":2}
a_and_b: (2,1)
d_uniq_values: []
e_uniq_values: []
Row 4:
──────
json: {"b":1}
a_and_b: (NULL,1)
d_uniq_values: []
e_uniq_values: []
Row 5:
──────
json: {"a":1,"b":2,"c":[]}
a_and_b: (1,2)
d_uniq_values: []
e_uniq_values: []
Row 6:
──────
json: {"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}]}
a_and_b: (1,2)
d_uniq_values: [3,4]
e_uniq_values: ['str_1','str_2']
Row 7:
──────
json: {"b":1,"a":2,"c":[{"e":"3","d":1}, {"e":"4","d":2}]}
a_and_b: (2,1)
d_uniq_values: [1,2]
e_uniq_values: ['3','4']
Row 8:
──────
json: {"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}, {"d":3,"e":"str_1"}, {"d":4,"e":"str_1"}, {"d":7,"e":"str_9"}]}
a_and_b: (1,2)
d_uniq_values: [3,4,7]
e_uniq_values: ['str_1','str_2','str_9']
*/

WITH
(
SELECT '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}]}'
) AS j
SELECT JSONExtract(j, 'c', 'Array(Tuple(Int64,String))')
┌─JSONExtract(j, 'c', 'Array(Tuple(Int64,String))')─┐
│ [(3,'str_1'),(4,'str_2')] │
└───────────────────────────────────────────────────┘
WITH
(
SELECT '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}, {"d":4,"e":"str_2"}]}'
) AS j
SELECT
JSONExtract(j, 'c', 'Array(Tuple(Int64,String))') AS t,
arrayReduce('groupUniqArray', arrayMap(x -> (x.1), t)) AS d,
arrayReduce('groupUniqArray', arrayMap(x -> (x.2), t)) AS e
┌─t─────────────────────────────────────┬─d─────┬─e─────────────────┐
│ [(3,'str_1'),(4,'str_2'),(4,'str_2')] │ [4,3] │ ['str_2','str_1'] │
└───────────────────────────────────────┴───────┴───────────────────┘

Related

How does LightGBM encode categorical features?

I have the following structure of one LightGBM tree:
{'split_index': 0,
 'split_feature': 41,
 'split_gain': 97.25859832763672,
 'threshold': '3||4||8',
 'decision_type': '==',
 'default_left': False,
 'missing_type': 'None',
 'internal_value': 0,
 'internal_weight': 0,
 'internal_count': 73194,
 'left_child': {'split_index': 1,
The feature in node 0 is categorical, and I feed this feature in the pandas "category" format.
Where can I find the correspondence between those numbers and the categories?
The numbers you see are the values of the codes attribute of your categorical features. For example:
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'a', 'b'], dtype='category')
print(s.cat.codes)
# 0    0
# 1    1
# 2    0
# 3    0
# 4    1
# dtype: int8
So in this case 0 is 'a' and 1 is 'b'.
You can build a mapping from the category code to the value with something like the following:
dict(enumerate(s.cat.categories))
# {0: 'a', 1: 'b'}
If the categories in your column don't match the ones in the model, LightGBM will update them.
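If your model was trained on a pandas DataFrame with several categorical columns, you can build a code-to-category lookup for each of them the same way. A minimal sketch, assuming a hypothetical frame and column names (df, color and size are made up for illustration):
import pandas as pd

# Hypothetical training frame with two categorical columns.
df = pd.DataFrame({
    'color': pd.Series(['red', 'blue', 'red'], dtype='category'),
    'size': pd.Series(['S', 'M', 'S'], dtype='category'),
})

# Map each categorical column to {code: category value}.
code_maps = {
    col: dict(enumerate(df[col].cat.categories))
    for col in df.select_dtypes('category').columns
}
print(code_maps)
# {'color': {0: 'blue', 1: 'red'}, 'size': {0: 'M', 1: 'S'}}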

How to replace accented characters with non-accented characters for each word in an array in ClickHouse?

I have an array of words, ['camión', 'elástico', 'Árbol'], and I want to replace the accented characters with non-accented characters for each word in the array (['camion', 'elastico', 'Arbol']).
I'm looking for something like this:
SELECT arrayMap(x -> replaceRegexpAll(x, ['á', 'é', 'í', 'ó', 'ú'], ['a', 'e', 'i', 'o', 'u']), ['camión', 'elástico', 'Árbol']) AS word
And I want this result:
['camion', 'elastico', 'arbol']
That is, replacing each accented character with its non-accented equivalent, but this doesn't work...
Any idea how to solve this?
Thanks
SELECT arrayMap(x -> arrayStringConcat(
arrayMap(y -> if((indexOf(['á', 'é', 'í', 'ó', 'ú'],y) as i) = 0, y, ['a', 'e', 'i', 'o', 'u'][i] ), extractAll(x,'.'))),
['camión', 'elástico', 'Árbol']) r
┌─r─────────────────────────────┐
│ ['camion','elastico','Árbol'] │
└───────────────────────────────┘
A newer ClickHouse release adds the functions translate(string, from_string, to_string) and translateUTF8(string, from_string, to_string).
These functions replace characters in the original string according to the mapping of each character in from_string to to_string.
SELECT arrayMap(y -> translateUTF8(y,'áéíóúÁÉÍÓÚ','aeiouAEIOU'), ['camión', 'elástico', 'Árbol']) r
r |
-----------------------------+
['camion','elastico','Arbol']|

Order records by a field value according to the order of an array of ids in Rails

As part of sorting based on the priority_id column, I have written the code below:
products = Product.where(id: pr_ids).order("priority_id IN(?)",ordered_priority_ids)
It raises the following error:
ActiveRecord::StatementInvalid (PG::SyntaxError: ERROR: syntax error at or near ")"
LINE 1: ...ducts"."id" IN ($2, $3) ORDER BY priority_id IN(?), 4, 2, 3...
^
):
Please help.
Thanks
def self.order_by_priority_ids(ids)
  return self.where(:id => 0) if ids.blank?

  values = []
  ids.each_with_index do |priority_id, index|
    values << "(#{priority_id}, #{index + 1})"
  end

  return self.joins("JOIN (VALUES #{values.join(",")}) as x (priority_id, ordering) ON #{table_name}.priority_id = x.priority_id").reorder('x.ordering')
end
And then you can use:
Product.where(id: pr_ids).order_by_priority_ids(ordered_priority_ids)

ORA-00904: "S"."AIR_TIME": invalid identifier

Why does this code raise an invalid identifier error when SUM is used on the DISTANCE and AIR_TIME columns?
When SUM is not used, the statement runs successfully, but with SUM I get the error. I need to use SUM in this statement.
MERGE INTO FACT_COMPANY_GROWTH F
USING (SELECT DISTINCT TIME_ID, FLIGHT_KEY, AEROPLANE_KEY, SUM(DISTANCE) AS TOTAL_DISTANCE, SUM(AIR_TIME) AS TOTAL_AIRTIME
FROM TRANSFORM_FLIGHT T
INNER JOIN TRANSFORM_AEROPLANE A
ON T.FK_AEROPLANE_KEY = A.AEROPLANE_KEY
INNER JOIN DIM_TIME D
ON D.YEAR = T.YEAR
AND D.MONTH = T.MONTH
GROUP BY TIME_ID, FLIGHT_KEY, AEROPLANE_KEY) S
ON (F.FK1_TIME_ID = S.TIME_ID
AND F.FK2_FLIGHT_KEY = S.FLIGHT_KEY
AND F.FK3_AEROPLANE_KEY = S.AEROPLANE_KEY
)
WHEN MATCHED THEN
UPDATE SET
F.TOTAL_AIRTIME = S.AIR_TIME,
F.TOTAL_DISTANCE = S.DISTANCE,
F.TOTAL_NO_OF_FLIGHTS = S.FLIGHT_KEY,
F.TOTAL_NO_OF_AEROPLANE = S.AEROPLANE_KEY
WHEN NOT MATCHED THEN
INSERT(FACT_ID, FK1_TIME_ID, FK2_FLIGHT_KEY, FK3_AEROPLANE_KEY, TOTAL_DISTANCE, TOTAL_AIRTIME, TOTAL_NO_OF_FLIGHTS, TOTAL_NO_OF_AEROPLANE)
VALUES
(NULL, S.TIME_ID, S.FLIGHT_KEY, S.AEROPLANE_KEY, S.DISTANCE, S.AIR_TIME, S.FLIGHT_KEY, S.AEROPLANE_KEY);
USING(
SELECT DISTINCT
TIME_ID,
FLIGHT_KEY,
AEROPLANE_KEY,
SUM(DISTANCE) AS TOTAL_DISTANCE,
SUM(AIR_TIME) AS TOTAL_AIRTIME
...) S
The problem is at UPDATE SET F.TOTAL_AIRTIME = S.AIR_TIME. Five columns are defined in S, and none of them is named AIR_TIME; use the aliases defined in the subquery instead:
UPDATE SET
  F.TOTAL_AIRTIME = S.TOTAL_AIRTIME,
  F.TOTAL_DISTANCE = S.TOTAL_DISTANCE,
The INSERT branch has the same problem: its VALUES list should reference S.TOTAL_DISTANCE and S.TOTAL_AIRTIME rather than S.DISTANCE and S.AIR_TIME.

Sort and return dict by specific list value

I found this: Sort keys in dictionary by value in a list in Python
and it is almost what I want. I want to sort exactly as is defined in the above post, i.e., by a specific item in the value list, but I want to return the entire original dictionary sorted by the specified list entry, not a list of the keys.
My latest attempt failed:
details = {'India': ['New Dehli', 'A'],
           'America': ['Washington DC', 'B'],
           'Japan': ['Tokyo', 'C']
           }
print('Country-Capital List...')
print(details)
print()
temp1 = sorted(details.items(), key=lambda value: details[value][1])
print(temp1)
The error:
{'India': ['New Dehli', 'A'], 'America': ['Washington DC', 'B'], 'Japan': ['Tokyo', 'C']}
Traceback (most recent call last):
File "C:/Users/Mark/PycharmProjects/main/main.py", line 11, in <module>
temp1 = sorted(details.items(), key=lambda value: details[value][1])
File "C:/Users/Mark/PycharmProjects/main/main.py", line 11, in <lambda>
temp1 = sorted(details.items(), key=lambda value: details[value][1])
TypeError: unhashable type: 'list'
You are trying to use the whole (key, value) pair from the dict.items() sequence as a dictionary key. Because that pair contains a list, the lookup fails: dictionary keys must be hashable, and lists are not.
Just use the value directly:
temp1 = sorted(details.items(), key=lambda item: item[1][1])
I renamed the lambda argument to item to make it clearer what is being passed in. item[1] is the value from the (key, value) pair, and item[1][1] is the second entry in each list.
Demo:
>>> details = {'India': ['New Dehli', 'A'],
... 'America': ['Washington DC', 'B'],
... 'Japan': ['Tokyo', 'C']
... }
>>> sorted(details.items(), key=lambda item: item[1][1])
[('India', ['New Dehli', 'A']), ('America', ['Washington DC', 'B']), ('Japan', ['Tokyo', 'C'])]
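If you want an actual dict back rather than a list of (key, value) pairs, you can pass the sorted pairs to dict(). A small sketch that relies on dicts preserving insertion order in Python 3.7+:
# Rebuild a dict from the sorted (key, value) pairs; insertion order is kept.
sorted_details = dict(sorted(details.items(), key=lambda item: item[1][1]))
print(sorted_details)
# {'India': ['New Dehli', 'A'], 'America': ['Washington DC', 'B'], 'Japan': ['Tokyo', 'C']}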
