using something like group_concat on clickhouse - clickhouse

I'm trying to get a result like

types | name
------+---------------------------------
fruit | banana, apple, guaba, strawberry

from a table like
table: fruits

types | name
------+-----------
fruit | banana
fruit | apple
fruit | guaba
fruit | strawberry
I know with MySQL I can use group_concat to get the result I want by using
SELECT group_concat(name), types FROM fruits
I have done my research, and people recommend using groupArray in ClickHouse to obtain a similar result, but this is not what I want, because when I use
SELECT groupArray(name), types FROM fruits GROUP BY types
it gives me a result of

types | name
------+---------------------------------------------------------------------------------------
fruit | apple, banana, banana, strawberry, strawberry, strawberry, guaba, guaba, guaba, guaba
the order of groupArray is mixed up and I can't seem to find a way to fix the order :(
Is there any way in ClickHouse to get the array of results in order? And why are there duplicated results?
I can't use groupUniqArray because sometimes my result should be
banana, apple, guaba, strawberry, strawberry (if strawberry is in the DB twice)
How do I keep the duplicated data, in order, without having it multiplied???
I also have input_time and key columns, so my table is something like

types | name       | input_time | key
------+------------+------------+-----
fruit | banana     | 01:01      | 01
fruit | apple      | 01:02      | 01
fruit | guaba      | 01:03      | 02
fruit | strawberry | 01:04      | 03
fruit | strawberry | 01:05      | 04
and, forgetting about 'types', I want the names grouped by key in the order they were saved in the DB (input_time order). How should I change my query?
I've tried
SELECT groupArray(name), key FROM fruits GROUP BY key ORDER BY input_time
but it does not give me the result I want..

Use ORDER BY in a sub-query. The order in the array is the same as the order of the rows at the previous stage of the query pipeline.
select types, arrayCompact(groupArray(name)) names, length(names) len
from (select types, name from fruits order by types, input_time)
group by types
┌─types─┬─names───────────────────────────────────┬─len─┐
│ fruit │ ['banana','apple','guaba','strawberry'] │ 4 │
└───────┴─────────────────────────────────────────┴─────┘
https://clickhouse.com/docs/en/sql-reference/functions/array-functions/#arraycompact
Removes consecutive duplicate elements from an array. The order of result values is determined by the order in the source array.
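Applied to the second schema in the question (grouping by key in input_time order, keeping duplicates), the same trick looks something like the sketch below; it assumes the table is named fruits as above:

select key, groupArray(name) names
from (select key, name from fruits order by input_time)
group by key
order by key

Here groupArray keeps duplicates as-is; arrayCompact is only needed when consecutive duplicates should be collapsed.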

Related

How to design querying multiple tags on analytics database

I would like to store custom purchase tags on each transaction. For example, if a user bought shoes, the tags might be "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags the seller is interested in querying later to understand the sales.
My idea: whenever a new tag comes in, create a new code for it (something like a hash code, but sequential). Codes start with the 26 letters "a"-"z", then continue "aa", "ab", "ac", ..., "zz", and so on. All the tags given in one transaction are then kept in a single varchar column called tag, separated by "|".
Let us assume mapping is (at application level)
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So, storing the above purchase transaction, the tag column will look like tag="|a|z|ay|bc|cq|". The seller can then count the SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. The problem is that I cannot use an index (a sort key in Redshift) for a LIKE that starts with %. How do I solve this, given that I might have 100 million records? I don't want a full table scan. Is there any solution to fix this?
Update_1:
I have not followed the bridge table concept (cross-reference table), since I want to perform GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but a bridge table would give me two rows, and then my sum() would be doubled.
I got a suggestion like the one below:
EXISTS (SELECT 1 FROM transaction_tag
        WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
in the WHERE clause, once for each tag (note: this assumes tr is an alias for the transaction table in the surrounding query).
I have not followed this, since I have to perform AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS"), or "SHOES" AND ("NIKE" OR "ADIDAS").
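For what it's worth, the AND/OR combinations can still be expressed with that approach. A sketch, assuming a bridge table transaction_tag(trans_id, tag_id), a transactions table with a hypothetical amount column, and the tag codes from the mapping above:

-- "SHOES" AND ("NIKE" OR "ADIDAS")
select sum(tr.amount)
from transactions tr
where exists (select 1 from transaction_tag tt
              where tt.trans_id = tr.trans_id and tt.tag_id = 'ay')        -- SHOES
  and exists (select 1 from transaction_tag tt
              where tt.trans_id = tr.trans_id and tt.tag_id in ('z', 'aa')) -- NIKE or ADIDAS

Each EXISTS collapses to a single true/false per transaction, so the sum() is not doubled the way a plain join against the bridge table would double it.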
Update_2:
I have not followed the bitfield approach, since I don't know whether Redshift supports it. Also, assuming my system will have at least 3,500 tags, allocating one bit for each results in about 437 bytes per transaction, even though at most 5 tags can be given for a single transaction. Any optimisation here?
Solution_1:
I have thought of adding a min value (SMALLINT) and a max value (SMALLINT) alongside the tags column, and applying an index on them.
So, something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching SHOES (ay = 51) is
maxTag <= 51 AND tag LIKE '%|ay|%'
And the query for searching SHOES (ay = 51) AND SIZE_12 (cq = 95) is
minTag >= 51 AND maxTag <= 95 AND tag LIKE '%|ay|%|cq|%'
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files get loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the object using the AWS s3api, for example (bucket and key are placeholders):
aws s3api put-object-tagging --bucket <bucket> --key <key> --tagging "TagSet=[{Key=Adidas,Value=AY}]"
Capture the tags dynamically by passing the key and value as parameters.
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1, and if not there is a 0, and so on. For every row you will have a sequence of 0s and 1s of the same length as the number of tags you have. This sequence is sortable; however, you would still need to look into the middle of it. But you will know at which specific position to look, so you don't need LIKE, just substring.
For further optimization, you can convert this bit mask to integer values (it will be unique for each sequence) and do the matching based on that, but AFAIK Redshift doesn't support that out of the box yet, so you would have to define the rules yourself.
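A sketch of that fixed-position lookup, assuming a tag_mask column holding the bit string and the numeric codes from Solution_1 (SHOES = 51, SIZE_12 = 95); the table name is hypothetical:

select count(*)
from transactions
where substring(tag_mask, 51, 1) = '1'   -- SHOES
  and substring(tag_mask, 95, 1) = '1'   -- SIZE_12

The position arithmetic replaces the unanchored LIKE, though the column still has to be scanned unless the mask is folded into something sortable, as noted above.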
UPD: It looks like the best option here is to keep the tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of (order_id, tag_id), distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Then lookups for orders with a particular tag, and further aggregations of those orders, should be efficient. There is no silver bullet for optimizing this in a flat table; at least I don't know of one that would not bring a lot of unnecessary complexity versus the "relational" solution.
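A sketch of that layout; table and column names are assumptions, not from the original:

create table order_tag (
    order_id bigint,
    tag_id   varchar(4)
)
distkey (order_id)
sortkey (tag_id);

-- orders carrying a particular tag, without double counting:
select sum(o.amount)
from orders o
where o.order_id in (select order_id
                     from order_tag
                     where tag_id = 'ay');  -- SHOES

Because the IN subquery deduplicates order ids, an order matching several tags is still counted only once.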

ElasticSearch sort by id's in array

Is there a way to sort an Elasticsearch response in the same order as the array of ids I am posting?
Example: for the array [23, 45, 67], the results should be sorted the same way the ids are: first all rows with ID 23, after that all rows with ID 45, and at the end all rows with ID 67.
Thanks
Nik
You can use scripting in sort, or the other option is to use a bool/should query where you boost the documents with these values.

Oracle query with two patterns in one expression

Input:
TABLE NAME: SEARCH_RECORD

Column A | Column B | Column C | Column D
ID       | CODE     | WORD     | CODE/WORD
-------------------------------------------
123      | 666Ani   | RAT      | 666Ani/RAT
124      | 777Cae   | CAT      | 777Cae/CAT
I need a query that checks with LIKE:
if I search column B with LIKE '%6A%' or column C with LIKE '%A%', it gives a result.
Suppose I want to do the LIKE based on a column D search.
The user will search like '%6A%'/'%AT%' (the / will always be given by the user).
Expected output:
666Ani/RAT
So I need a query for the above that returns the ID as output (a CASE query is preferable).
I need your valuable suggestions.
It can't be done with a simple LIKE.
It should work if the pattern looks like '%6A%/%AT%'; that is a valid pattern.
So you can write: columnD like '%6A%/%AT%', or columnD like first_pattern || '/' || second_pattern if they come from different variables.
Another approach, if you know for sure that there is only one / (you can check how many there are), is to use two LIKEs, using substr to get first the first and then the second part of the search string.
where
columnB like substr(match_string, 1, instr(match_string, '/') - 1)
and
columnC like substr(match_string, instr(match_string, '/') + 1)
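Putting it together, a sketch of the full query; the bind variable name is assumed:

select id
from search_record
where columnB like substr(:search, 1, instr(:search, '/') - 1)
  and columnC like substr(:search, instr(:search, '/') + 1)

With :search = '%6A%/%AT%', the first substr yields '%6A%' and the second '%AT%', so the query returns ID 123 (666Ani/RAT).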

VS 2010 reporting services grouping

I want to load the list of groups as well as the data into two separate datatables (or one, but I don't see how that's possible). Then I want to apply the grouping like this:
Groups
A
B
Bar
C
Car
Data
Ale
Beer
Bartender
Barry
Coal
Calm
Carbon
The final result after grouping should be like this.
*A
Ale
*B
*Bar
Bartender
Barry
Beer
*C
Calm
*Car
Carbon
Coal
I only have a grouping list, not the levels or anything else. The items falling under a certain group are the ones that start with the same letters as the group's name. The indentation is not a must. Hopefully my example clarifies what I need but am unable to name, which is why I can't find anything similar on Google.
The key things here are:
1. Grouping by a provided list of groups
2. There can be unlimited layers of grouping
Since every record has its children, the query should also produce a father for each record. Then there is a nice trick in the advanced grouping tab: choosing the father's column yields as many higher-level groups as needed, recursively. I learnt about that in http://blogs.microsoft.co.il/blogs/barbaro/archive/2008/12/01/creating-sum-for-a-group-with-recursion-in-ssrs.aspx
I suggest reporting from a query like this:
select gtop.category top_category,
       gsub.category sub_category,
       dtab.category data_category
from groupTable gtop
join groupTable gsub
  on gsub.category like gtop.category + '%'
left join dataTable dtab
  on dtab.category like gsub.category + '%'
where len(gtop.category) = 1
  and not exists
      (select null
       from groupTable gchk
       where gsub.category = gtop.category and
             gchk.category like gsub.category + '%' and
             gchk.category <> gsub.category and
             dtab.category like gchk.category + '%')
- with report groups on top_category and sub_category, and headings for both groups. You will probably want to hide the sub_category heading row when sub_category = top_category.
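For the father-column trick mentioned above, the father of each group can be derived from the list itself, since a group's immediate father is its longest proper prefix in the list. A sketch, assuming the same groupTable as in the query above:

select g.category,
       (select max(p.category)
        from groupTable p
        where g.category like p.category + '%'
          and p.category <> g.category) as father_category
from groupTable g

max() works here because all proper prefixes of a string sort in length order, so the lexicographic maximum is the longest one, i.e. the immediate father.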

How do I select the max child value in ActiveRecord?

I'm not even sure how to word this, so an example:
I have two models:

Chicken
  id
  name

EggCounterReadings
  id
  chicken_id
  value_on_counter
  timestamp
I don't always record a count for every chicken when I do counts.
Using ActiveRecord how do I get the latest egg count per chicken?
So if I have 1 chicken and 3 counts, the counts would be 1 today, 15 tomorrow, and 18 the next day. That chicken has laid 18 eggs, not 34
UPDATE: Found exactly what I was trying to do in MySQL: "The Rows Holding the Group-wise Maximum of a Certain Column". So I need to .find_by_sql("SELECT * FROM (SELECT * FROM EggCounterReadings WHERE <conditions> ORDER BY timestamp DESC) AS t GROUP BY chicken_id") (the derived table needs an alias in MySQL).
Given your updated question, I've changed my answer.
chicken = Chicken.first
count = chicken.egg_counter_readings.last.value_on_counter
If you don't want the latest record, but the largest egg yield, then try this:
chicken = Chicken.first
count = chicken.egg_counter_readings.maximum(:value_on_counter)
I believe that should do what you want.
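If you need the latest reading for all chickens at once rather than one at a time, the group-wise maximum from the update can be written as a self-join. A plain-SQL sketch, assuming the conventional Rails table name egg_counter_readings:

select e.*
from egg_counter_readings e
join (select chicken_id, max(timestamp) as max_ts
      from egg_counter_readings
      group by chicken_id) latest
  on latest.chicken_id = e.chicken_id
 and latest.max_ts = e.timestamp

This avoids the ORDER BY inside GROUP BY trick, whose result MySQL does not guarantee.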
