How to execute a WITH query locally in ClickHouse?

We have a ClickHouse query that can be simplified to:
SET load_balancing='in_order';
WITH (
SELECT hostName() AS h
FROM system.one
) AS x
SELECT
x,
hostName() AS h,
sum(1) AS pv
FROM cluster({cluster}, system.one)
GROUP BY
x,
h
We get this result:
x h pv
cluster_s1_r1 cluster_s1_r1 1
cluster_s1_r1 cluster_s1_r2 1
cluster_s1_r1 cluster_s2_r1 1
cluster_s1_r1 cluster_s2_r2 1
What we want is for the WITH subquery to be executed locally on each host, giving this result:
x h pv
cluster_s1_r1 cluster_s1_r1 1
cluster_s1_r2 cluster_s1_r2 1
cluster_s2_r1 cluster_s2_r1 1
cluster_s2_r2 cluster_s2_r2 1
Thanks, all.

Related

How can I extract columns name from row value of another table(oracle sql)?

I have 2 tables:
table1
no  a   b   c
x1  2   3   4
x2  10  11  12
x3  20  21  22
table2
from_val  in_out  cf_pv  term
a         out     cf     b
b         out     pv     b
c         in      cf     e
Define sum_out as the sum of those columns a, b, c in table1 whose row in table2 has in_out='out', and sum_cf as the sum of those whose row in table2 has cf_pv='cf'.
In short, the values of from_val in table2 are column names (a, b, c) of table1.
How can I calculate sum_out and sum_cf for every no in Oracle?
sum_out of x1 = 2 + 3
sum_out of x2 = 10 + 11
sum_out of x3 = 20 + 21
sum_cf of x1 = 2 + 4
sum_cf of x2 = 10 + 12
sum_cf of x3 = 20 + 22
Thanks!
In addition, I want to calculate the sum where both conditions hold (in_out='out' and cf_pv='cf'):
sum_out and cf of x1 = 2 (= a)
sum_out and cf of x2 = 10 (= a)
sum_out and cf of x3 = 20 (= a)
Sample data
WITH
tbl_1 AS
(
Select 'x1' "COL_NO", 2 "A", 3 "B", 4 "C" From Dual Union All
Select 'x2' "COL_NO", 10 "A", 11 "B", 12 "C" From Dual Union All
Select 'x3' "COL_NO", 20 "A", 21 "B", 22 "C" From Dual
),
tbl_2 AS
(
Select 'A' "FROM_VAL", 'out' "IN_OUT", 'cf' "CF_PV", 'begin' "TERM" From Dual Union All
Select 'B' "FROM_VAL", 'out' "IN_OUT", 'pv' "CF_PV", 'begin' "TERM" From Dual Union All
Select 'C' "FROM_VAL", 'in' "IN_OUT", 'cf' "CF_PV", 'end' "TERM" From Dual
),
Create a CTE (formulas) that generates the formulas for IN_OUT = 'out' and for CF_PV = 'cf':
formulas AS
(
Select
CASE WHEN IN_OUT = 'out' THEN IN_OUT END "IN_OUT",
LISTAGG(FROM_VAL, ' + ') WITHIN GROUP (ORDER BY FROM_VAL) OVER(PARTITION BY IN_OUT) "IN_OUT_FORMULA",
CASE WHEN CF_PV = 'cf' THEN CF_PV END "CF_PV",
LISTAGG(FROM_VAL, ' + ') WITHIN GROUP (ORDER BY FROM_VAL) OVER(PARTITION BY CF_PV) "CF_PV_FORMULA"
From
tbl_2
),
IN_OUT  IN_OUT_FORMULA  CF_PV   CF_PV_FORMULA
(null)  C               cf      A + C
out     A + B           cf      A + C
out     A + B           (null)  B
Another CTE (grid) to connect COL_NO to formulas
grid AS
(
Select
t1.COL_NO,
CASE WHEN f1.IN_OUT = 'out' THEN f1.IN_OUT END "IN_OUT", CASE WHEN f1.IN_OUT = 'out' THEN f1.IN_OUT_FORMULA END "IN_OUT_FORMULA",
CASE WHEN f1.CF_PV = 'cf' THEN f1.CF_PV END "CF_PV", CASE WHEN f1.CF_PV = 'cf' THEN f1.CF_PV_FORMULA END "CF_PV_FORMULA"
From
tbl_1 t1
Left Join
formulas f1 ON(f1.IN_OUT Is Not Null AND f1.CF_PV Is Not Null)
)
COL_NO  IN_OUT  IN_OUT_FORMULA  CF_PV  CF_PV_FORMULA
x1      out     A + B           cf     A + C
x2      out     A + B           cf     A + C
x3      out     A + B           cf     A + C
Main SQL to get the final result
SELECT
g.COL_NO,
g.IN_OUT,
g.IN_OUT_FORMULA,
CASE WHEN g.IN_OUT = 'out' And INSTR(IN_OUT_FORMULA, 'A') > 0 THEN A ELSE 0 END +
CASE WHEN g.IN_OUT = 'out' And INSTR(IN_OUT_FORMULA, 'B') > 0 THEN B ELSE 0 END +
CASE WHEN g.IN_OUT = 'out' And INSTR(IN_OUT_FORMULA, 'C') > 0 THEN C ELSE 0 END "CALC_OUT",
--
g.CF_PV,
g.CF_PV_FORMULA,
CASE WHEN g.CF_PV = 'cf' And INSTR(CF_PV_FORMULA, 'A') > 0 THEN A ELSE 0 END +
CASE WHEN g.CF_PV = 'cf' And INSTR(CF_PV_FORMULA, 'B') > 0 THEN B ELSE 0 END +
CASE WHEN g.CF_PV = 'cf' And INSTR(CF_PV_FORMULA, 'C') > 0 THEN C ELSE 0 END "CALC_CF"
FROM
grid g
INNER JOIN
tbl_1 t1 ON(g.COL_NO = t1.COL_NO)
Result:
COL_NO  IN_OUT  IN_OUT_FORMULA  CALC_OUT  CF_PV  CF_PV_FORMULA  CALC_CF
x1      out     A + B           5         cf     A + C          6
x2      out     A + B           21        cf     A + C          22
x3      out     A + B           41        cf     A + C          42
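For comparison, the same lookup can be sketched outside the database. The Python below is only an illustration built on the sample data from the question; the names `table1`, `table2`, and `sum_where` are made up here and are not part of the SQL answer. It treats each from_val in table2 as a key into a table1 row and sums the columns whose rule matches.

```python
# Sample data copied from the question (column names in lower case).
table1 = {
    "x1": {"a": 2, "b": 3, "c": 4},
    "x2": {"a": 10, "b": 11, "c": 12},
    "x3": {"a": 20, "b": 21, "c": 22},
}
table2 = [
    {"from_val": "a", "in_out": "out", "cf_pv": "cf"},
    {"from_val": "b", "in_out": "out", "cf_pv": "pv"},
    {"from_val": "c", "in_out": "in", "cf_pv": "cf"},
]


def sum_where(row, **conditions):
    """Sum the table1 columns whose table2 rule matches every condition."""
    columns = [
        rule["from_val"]
        for rule in table2
        if all(rule[key] == value for key, value in conditions.items())
    ]
    return sum(row[col] for col in columns)


for no, row in table1.items():
    # Prints the row key, then sum_out, then sum_cf.
    print(no, sum_where(row, in_out="out"), sum_where(row, cf_pv="cf"))
```

Passing both conditions, as in `sum_where(row, in_out="out", cf_pv="cf")`, covers the additional case from the question where only column a qualifies.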

How to create a pivot table from a CSV file having subgroups and getting the count of the last values using shell script?

I want to group the rows by the first column, then form subgroups on the following columns, and get the count of each value in the last column.
For example: main group A, subgroups D, J, P, with the count of P within that subgroup, plus the total count over the last column.
I am able to form the groups, but the subgroups seem a little hard. Any help on how to get this is appreciated.
Input:
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
Output:
A D J P 1
      Q 1
    K P 2
A E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
Here's a different approach, a shell script that uses sqlite to calculate the group counts (Requires 3.25 or newer because it uses window functions):
#!/bin/sh
file="$1"
sqlite3 -batch -noheader <<EOF
CREATE TABLE data(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);
.mode csv
.import "$file" data
.mode list
.separator " "
SELECT (CASE c1 WHEN lag(c1, 1) OVER (PARTITION BY c1 ORDER BY c1) THEN ' ' ELSE c1 END)
, (CASE c2 WHEN lag(c2, 1) OVER (PARTITION BY c1,c2 ORDER BY c1,c2) THEN ' ' ELSE c2 END)
, (CASE c3 WHEN lag(c3, 1) OVER (PARTITION BY c1,c2,c3 ORDER BY c1,c2,c3) THEN ' ' ELSE c3 END)
, c4
, count(*)
FROM data
GROUP BY c1, c2, c3, c4
ORDER BY c1, c2, c3, c4;
SELECT 'Total ' || count(*) FROM data;
EOF
Running this gives:
$ ./group.sh example.csv
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
Also a one-liner using datamash, though it doesn't include the fancy output format:
$ datamash -st, groupby 1,2,3,4 count 4 < example.csv | tr , ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
Using Perl
Script
perl -0777 -lne '
s/^(.+?)$/$x++;$kv{$1}++/mge;
foreach my $k (sort keys %kv)
{ $q=$c=$k;
while(length($p) > 0)
{
last if $c=~/^$p/g;
$q=substr($c,length($p)-1);
$p=~s/(.$)//;
}
printf( "%9s\n", "$q $kv{$k}") ;
$p=$k;
}
print "Total $x";
' anurag.txt
Output:
A,D,J,P 1
      Q 1
    K,P 2
  E,J,Q 2
    K,Q 1
B,F,L,R 2
    M,S 1
C,H,N,T 2
    O,U 2
Total 14
$ cat tst.awk
BEGIN { FS="," }
!($0 in cnt) { recs[++numRecs] = $0 }
{ cnt[$0]++ }
END {
    for (recNr=1; recNr<=numRecs; recNr++) {
        rec = recs[recNr]
        split(rec,f)
        newVal = 0
        for (i=1; i<=NF; i++) {
            if (f[i] != p[i]) {
                newVal = 1
            }
            printf "%s%s", (newVal ? f[i] : " "), OFS
            p[i] = f[i]
        }
        print cnt[rec]
        tot += cnt[rec]
    }
    print "Total", tot+0
}
$ awk -f tst.awk file
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
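The approach of tst.awk above (count identical rows, then blank out the fields repeated from the previous row) can also be sketched in Python. This is an illustration only: the function name `pivot_lines` and the shortened sample are made up here, and unlike tst.awk it sorts the rows instead of keeping first-seen order.

```python
import csv
import io
from collections import Counter


def pivot_lines(text):
    """Count identical CSV rows and blank out fields repeated from the previous row."""
    counts = Counter(tuple(row) for row in csv.reader(io.StringIO(text)) if row)
    lines = []
    previous = ()
    for row in sorted(counts):
        changed = False
        cells = []
        for i, field in enumerate(row):
            # Once one field differs, every field to its right is printed too.
            if i >= len(previous) or field != previous[i]:
                changed = True
            cells.append(field if changed else " ")
        lines.append(" ".join(cells) + " " + str(counts[row]))
        previous = row
    lines.append("Total " + str(sum(counts.values())))
    return lines


# Only the rows of group A from the question, to keep the sample short.
sample = "A,D,J,P\nA,D,J,Q\nA,D,K,P\nA,D,K,P\nA,E,J,Q\nA,E,K,Q\nA,E,J,Q\n"
print("\n".join(pivot_lines(sample)))
```

The `changed` flag plays the role of `newVal` in the awk script: a field is blanked only while everything to its left is still equal to the previous row.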
I'll propose a multi-stage solution in the spirit of the Unix toolset.
First, create a sorted, counted, de-delimited data format:
$ sort file | uniq -c | awk '{print $2,$1}' | tr ',' ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
Now the task is to remove the longest common left substring from consecutive lines:
... | awk 'NR==1 {p=$0}
NR>1 {k=0;
while(p~t=substr($0,1,++k));
gsub(/./," ",t); sub(/^ /,"",t);
p=$0; $0=t substr(p,k)}1'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Whether it's easier to understand than a single script remains to be seen.
Now I have an answer that produces exactly your example output... :-)
$ cat ABCD
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
$ awk '{a[$0]+=1}END{for(i in a) print i","a[i];print "Total",NR}' ABCD |\
sort | \
awk -F, '
/Total/{print;next}
{c=0; for(i=1;i<=4;i++){if($i!=p[i])c=1; printf "%s ", (c?$i:" "); p[i]=$i}
print $5}'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
$
The first awk script reads every line and increments the element of the array a indexed by the whole line. At the end (the END rule) it loops over the indices of a and prints each index with its value, which is the number of times that line occurs in the data. Finally it prints the total number of lines processed, which awk keeps in the variable NR (number of records).
The second awk script either prints the total line and skips further processing, or compares each comma-separated field with the corresponding field of the previous line. It prints a space while the fields still match and, from the first field that differs onward, prints the fields themselves, so a value such as Q is not blanked when the field to its left has changed.

Hive inner joins wrong result

I have two tables, table1 and table2:
hive> select * from table1 where dt=20171020;
OK
a 1 1 p 10 20171020
b 2 2 q 10 20171020
c 3 3 r 10 20171020
d 4 4 r 10 20171020
hive> select * from table2 where dt=20171020;
OK
a 1 1 p 10 20171020
b 2 2 t 10 20171020
c 3 3 r 10 20171020
hive> select * from table1 t1
> join table2 t2
> on t1.c1=t2.c1
> where
> t1.dt=20171020 and t2.dt=20171020 and
> t1.c2 <> t2.c2 or t1.c3 <> t2.c3 or t1.c4 <> t2.c4 or t1.c5 <> t2.c5;
Result:
a 1 1 p 20 20171016 a 1 1 p 10 20171015
a 1 1 p 20 20171016 a 1 1 p 10 20171020
b 2 2 q 20 20171016 b 2 2 t 10 20171015
b 2 2 q 20 20171016 b 2 2 t 10 20171020
c 3 3 r 20 20171016 c 3 3 r 10 20171015
c 3 3 r 20 20171016 c 3 3 r 10 20171020
b 2 2 q 10 20171020 b 2 2 t 10 20171015
b 2 2 q 10 20171020 b 2 2 t 10 20171020
a 19 19 p 20 20171019 a 1 1 p 10 20171015
a 19 19 p 20 20171019 a 1 1 p 10 20171020
I want only the following row, because it is the one that changed. How does Hive evaluate the join in the code above?
b 2 2 q 10 20171020
Try this. Your join should be on the date as well, and the OR conditions need parentheses, because AND binds more tightly than OR:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t1.c1 = t2.c1
AND t1.dt = t2.dt
WHERE t1.dt = 20171020
AND ( t1.c2 <> t2.c2
OR t1.c3 <> t2.c3
OR t1.c4 <> t2.c4
OR t1.c5 <> t2.c5 );
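The parentheses matter because AND binds more tightly than OR, so without them the date filter only guards the first comparison. Python's `and` and `or` have the same relative precedence, which allows a small sketch of the effect. The dictionary layout and the two sample pairs below are hypothetical, reduced from the rows in the question to the fields that matter.

```python
WANTED = 20171020

# Two joined row pairs, reduced to the relevant fields (hypothetical layout).
pairs = [
    # Pair from the wanted date whose c4 differs (q vs t).
    {"t1_dt": 20171020, "t2_dt": 20171020,
     "t1_c4": "q", "t2_c4": "t", "t1_c5": 10, "t2_c5": 10},
    # Pair from other dates whose c5 differs (20 vs 10).
    {"t1_dt": 20171016, "t2_dt": 20171015,
     "t1_c4": "p", "t2_c4": "p", "t1_c5": 20, "t2_c5": 10},
]


def without_parens(p):
    # Parsed as (dates match AND c4 differs) OR (c5 differs).
    return (p["t1_dt"] == WANTED and p["t2_dt"] == WANTED
            and p["t1_c4"] != p["t2_c4"] or p["t1_c5"] != p["t2_c5"])


def with_parens(p):
    # The date filter now applies to every comparison.
    return (p["t1_dt"] == WANTED and p["t2_dt"] == WANTED
            and (p["t1_c4"] != p["t2_c4"] or p["t1_c5"] != p["t2_c5"]))


print([without_parens(p) for p in pairs])
print([with_parens(p) for p in pairs])
```

The second pair passes the unparenthesized predicate even though neither date matches, which is how rows from other dates ended up in the original result.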

Prefix-function computation in Knuth-Morris-Pratt Algorithm

Consider the following string:
1 2 3 4 5 6 7 8 9 10 11
a b c d a b c d a b x
What is its prefix function? A friend and I each computed it and got different results. Mine is:
a b c d a b c d a b x
0 0 0 0 1 2 3 4 5 6 2
and his:
a b c d a b c d a b x
0 0 0 0 1 2 3 4 1 2 0
If I am wrong, why is that?
The prefix table should be:
a b c d a b c d a b x
0 0 0 0 1 2 3 4 5 6 0
so both versions given are not right.
For the last entry of your table
a b c d a b c d a b x
0 0 0 0 1 2 3 4 5 6 2
                    ^
                    |
                 this one
to be correct, the length-2 suffix of a b c d a b c d a b x, which is b x, would also have to be its length-2 prefix, which is a b instead.
For the nonzero entries of the prefix table, the corresponding prefixes and suffixes are marked below:
a 0
a b 0
a b c 0
a b c d 0
a b c d a 1
-
        =
a b c d a b 2
---
        ===
a b c d a b c 3
-----
        =====
a b c d a b c d 4
-------
        =======
a b c d a b c d a 5
---------
        =========
a b c d a b c d a b 6
-----------
        ===========
a b c d a b c d a b x 0
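The table can be reproduced with the usual linear-time computation of the prefix function. The following Python is a minimal sketch, not taken from any of the answers; the name `prefix_function` is chosen here for illustration.

```python
def prefix_function(s):
    """Return pi, where pi[i] is the length of the longest proper prefix
    of s[:i + 1] that is also a suffix of it."""
    pi = [0] * len(s)
    k = 0  # length of the prefix/suffix match carried over from the previous position
    for i in range(1, len(s)):
        # Fall back to shorter matches until one can be extended by s[i].
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("abcdabcdabx"))
# [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 0]
```

At the final x the match of length 6 falls back to length 2 and then to 0, and since x differs from the first character a, the last entry is 0.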
My KMP function in java:
public int[] KMP(String val) {
    int i = 0;
    int j = -1;
    int[] result = new int[val.length() + 1];
    result[0] = -1;
    while (i < val.length()) {
        while (j >= 0 && val.charAt(j) != val.charAt(i)) {
            j = result[j];
        }
        j++;
        i++;
        result[i] = j;
    }
    return result;
}
Result for prefix arrays:
[-1, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 0]
Neither of your answers is correct. The prefix function, or partial match table, would be the following:
a b c d a b c d a b x
0 0 0 0 1 2 3 4 5 6 0
Your answer is correct up to index 10, but the last entry is wrong. The value at index 11 of the partial match table is 0 because no proper prefix of the string up to index 11 matches a proper suffix of it: every suffix at this position ends with x, and no proper prefix ends with x.
If you have trouble understanding what the prefix function or partial match table actually means, take a look at this document. It has a very good explanation. Hope it helps.
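The definition used in this answer can also be checked directly by brute force. The sketch below is quadratic per entry and meant only as a check of the definition, not as the KMP preprocessing step; `prefix_value` is a name made up here.

```python
def prefix_value(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    # Try the longest candidate first; proper means shorter than s itself.
    for length in range(len(s) - 1, 0, -1):
        if s[:length] == s[-length:]:
            return length
    return 0


text = "abcdabcdabx"
table = [prefix_value(text[:end]) for end in range(1, len(text) + 1)]
print(table)
# [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 0]
```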
Both of your answers are wrong. The correct one is:
a b c d a b c d a b x
0 0 0 0 1 2 3 4 5 6 0

Conditional Filter in GROUP BY in Pig

I have the following dataset, in which I need to merge multiple rows into one if they have the same key. At the same time, I need to pick among the multiple tuples that get grouped.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, if there are multiple tuples in the collected bag, I want to keep only those which have f3==2. Here is the final data I want:
((1,N1),{(1,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,20)})
Any idea how to achieve this?
I did it the way I described in the comment above. Here is how:
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
CNT1 IF (cnt > 1),
CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
row = FILTER $2 BY (f3 == 2);
GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)
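The same rule (leave single-tuple groups untouched, keep only f3 == 2 in larger groups) can be sketched in Python for comparison. This is an illustration on the question's data, not a replacement for the Pig script; the function name `pick` is made up here.

```python
from itertools import groupby

# Rows from the question as (f1, f2, f3, f4).
rows = [
    (1, "N1", 1, 10), (1, "N1", 2, 15),
    (2, "N1", 1, 10),
    (3, "N1", 1, 10), (3, "N1", 2, 15),
    (4, "N2", 1, 10),
    (5, "N3", 1, 10), (5, "N3", 2, 20),
]


def group_key(row):
    return (row[0], row[1])


def pick(rows):
    """Keep single-row groups as they are; in larger groups keep only f3 == 2."""
    result = []
    # groupby needs its input sorted by the grouping key.
    for _, group in groupby(sorted(rows, key=group_key), key=group_key):
        bag = list(group)
        if len(bag) > 1:
            bag = [row for row in bag if row[2] == 2]
        result.extend(bag)
    return result


for row in pick(rows):
    print(row)
```

Unlike the SPLIT and UNION in the Pig script, this keeps the groups in key order, so the rows come out sorted by f1.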
