How can I check how many words two strings have in common in BigQuery?

I have a list of documents in a BigQuery table. Some of them have very similar names. I need to check each pair of documents and see how many words they have in common, so I can suggest eliminating one of them.
For instance:
Spreadsheets
Quality Control.xlsx
Product Structure.xlsx
Invoices Sent April.xslx
Invoices Sent March.xlsx
Total Costs April.xlsx
Total Costs March.xlsx
Process of Quality Control.xlsx
I would like a result like this:
Spreadsheet |Matching Spreadsheet |Words
Quality Control.xlsx |Process of Quality Control.xlsx |2
Product Structure.xlsx |null |null
Invoices Sent April.xslx |Invoices Sent March.xlsx |2
Invoices Sent March.xlsx |Invoices Sent April.xlsx |2
Total Costs April.xlsx |Total Costs March.xlsx |2
Total Costs March.xlsx |Total Costs April.xlsx |2
Process of Quality Control.xlsx |Quality Control.xlsx |2

Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.spreadsheets` AS (
SELECT 1 AS id, 'Quality Control.xlsx' AS spreadsheet UNION ALL
SELECT 2, 'Product Structure.xlsx' UNION ALL
SELECT 3, 'Invoices Sent April.xslx' UNION ALL
SELECT 4, 'Invoices Sent March.xlsx' UNION ALL
SELECT 5, 'Total Costs April.xlsx' UNION ALL
SELECT 6, 'Total Costs March.xlsx' UNION ALL
SELECT 7, 'Process of Quality Control.xlsx'
)
SELECT
id, s1 spreadsheet, IF(words = 0, NULL, s2) matching_spreadsheet, words
FROM (
SELECT
id, s1,
ARRAY_AGG(STRUCT(s2, words) ORDER BY words DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT t1.id, t1.spreadsheet s1, t2.spreadsheet s2,
( SELECT COUNTIF(word != 'xlsx')
FROM UNNEST(REGEXP_EXTRACT_ALL(t1.spreadsheet, r'\w+')) word
JOIN UNNEST(REGEXP_EXTRACT_ALL(t2.spreadsheet, r'\w+')) word
USING(word)) words
FROM `project.dataset.spreadsheets` t1
CROSS JOIN `project.dataset.spreadsheets` t2
WHERE t1.spreadsheet != t2.spreadsheet
)
GROUP BY id, s1
)
-- ORDER BY id
with the result:
Row id spreadsheet matching_spreadsheet words
1 1 Quality Control.xlsx Process of Quality Control.xlsx 2
2 2 Product Structure.xlsx null 0
3 3 Invoices Sent April.xslx Invoices Sent March.xlsx 2
4 4 Invoices Sent March.xlsx Invoices Sent April.xslx 2
5 5 Total Costs April.xlsx Total Costs March.xlsx 2
6 6 Total Costs March.xlsx Total Costs April.xlsx 2
7 7 Process of Quality Control.xlsx Quality Control.xlsx 2
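To make the matching logic easier to follow outside BigQuery, here is a minimal Python sketch of the same idea: tokenize each name with `\w+`, ignore the `xlsx` extension, and keep the other name with the most shared words. The function names `content_words` and `best_matches` are hypothetical, not part of the answer above.

```python
import re

def content_words(name, ignored=frozenset({"xlsx"})):
    """Return the distinct word tokens of a file name, minus ignored tokens."""
    return set(re.findall(r"\w+", name)) - ignored

def best_matches(names):
    """For each name, return (name, best_match_or_None, shared_word_count)."""
    results = []
    for name in names:
        best, best_count = None, 0
        for other in names:
            if other == name:
                continue  # mirrors WHERE t1.spreadsheet != t2.spreadsheet
            count = len(content_words(name) & content_words(other))
            if count > best_count:  # ties keep the first candidate found
                best, best_count = other, count
        results.append((name, best, best_count))
    return results

spreadsheets = [
    "Quality Control.xlsx",
    "Product Structure.xlsx",
    "Invoices Sent April.xslx",
    "Invoices Sent March.xlsx",
    "Total Costs April.xlsx",
    "Total Costs March.xlsx",
    "Process of Quality Control.xlsx",
]

for row in best_matches(spreadsheets):
    print(row)
```

This sketch counts distinct shared words, whereas the SQL join counts every matching pair of tokens, so the two could differ for names that repeat a word.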

Related

How to Retrieve all the recursive parent of child row in Oracle SQL?

I have hierarchical Category-Product data. It is a 3-level hierarchy, and all products are always assigned to the last level. I want to display a drill-down report of all the products grouped by Category, Sub Category and Sub Sub Category. Only those categories for which we have product results should be displayed on the report. (The result products are decided by some other criteria, out of scope for this question.)
How can I get all the Category data up to the root level in Oracle?
Sample Data
CategoryId Name Parent
1 Clothing NULL
2 Men's Wear 1
3 Shirt 2
4 T-Shirt 2
5 Women's Wear 1
6 Salwar 5
7 Saree 5
8 Electronics NULL
9 Computers 8
10 Mobiles 8
Products table will have Category Id reference. Ex. 3, 4 or 6, 7 etc
I want to retrieve only the categories, up to the root level, for which we have products. I have the query below, but I am not sure whether it is good practice to specify multiple values in the START WITH clause. Is there a better option?
SELECT DISTINCT CategoryId,Name,Parent
FROM tblCategory
START WITH CategoryId IN (3,6)
CONNECT BY CategoryId = PRIOR Parent
In the above query I have specified only two categories, but in the real world there could be thousands. Below is the result data, showing only the categories for the selected products.
Output:
CategoryId Name Parent
1 Clothing NULL
2 Men's Wear 1
3 Shirt 2
5 Women's Wear 1
6 Salwar 5
So you basically solved your problem. You can list the IDs as you did, or you can store them somewhere and use them in an IN subquery, like here for instance:
with tblCategory(CategoryId, Name, Parent) as (
select 1, 'Clothing', null from dual union all
select 2, 'Men''s Wear', 1 from dual union all
select 3, 'Shirt', 2 from dual union all
select 4, 'T-Shirt', 2 from dual union all
select 5, 'Women''s Wear', 1 from dual union all
select 6, 'Salwar', 5 from dual union all
select 7, 'Saree', 5 from dual union all
select 8, 'Electronics', null from dual union all
select 9, 'Computers', 8 from dual union all
select 10, 'Mobiles', 8 from dual ),
ids(cid) as (select 3 from dual union all select 6 from dual)
select distinct categoryid, name, parent
from tblcategory
start with categoryid in (select cid from ids)
connect by categoryid = prior parent
Result:
CATEGORYID NAME PARENT
---------- ------------ ----------
6 Salwar 5
3 Shirt 2
5 Women's Wear 1
2 Men's Wear 1
1 Clothing
You could also produce more readable output like here:
select connect_by_root(categoryid) root,
sys_connect_by_path(name, ' => ') path
from tblcategory
where connect_by_isleaf = 1
start with categoryid in (select cid from ids)
connect by categoryid = prior parent
Result:
ROOT PATH
------ --------------------------------------------------------------------------------
3 => Shirt => Men's Wear => Clothing
6 => Salwar => Women's Wear => Clothing
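As a rough illustration of what START WITH ... CONNECT BY categoryid = PRIOR parent does here, the following Python sketch walks parent pointers upward from each starting category and collects every node visited. The dictionary layout and the function name `with_ancestors` are assumptions made for this example only.

```python
# category id -> (name, parent id); same sample data as the question
categories = {
    1: ("Clothing", None),
    2: ("Men's Wear", 1),
    3: ("Shirt", 2),
    4: ("T-Shirt", 2),
    5: ("Women's Wear", 1),
    6: ("Salwar", 5),
    7: ("Saree", 5),
    8: ("Electronics", None),
    9: ("Computers", 8),
    10: ("Mobiles", 8),
}

def with_ancestors(start_ids, categories):
    """Return the start categories plus all of their ancestors, de-duplicated."""
    seen = set()
    for cid in start_ids:
        # Climb until the root, or until a node already collected via another start id.
        while cid is not None and cid not in seen:
            seen.add(cid)
            cid = categories[cid][1]
    return sorted(seen)

for cid in with_ancestors([3, 6], categories):
    name, parent = categories[cid]
    print(cid, name, parent)
```

The `seen` set plays the role of DISTINCT in the SQL: shared ancestors such as Clothing are reported once, however many start ids lead to them.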

Informatica scenario on accrual values

Source table:
header_id line_id account_type accrual_tier
101 1 expense NULL
101 2 liability TAX
101 3 Liability Tax
101 4 Liability GYC
102 1 liability C&B
102 2 expense NULL
102 3 expense NULL
102 4 ASSET ABC
102 5 OWNERS PQR
102 6 liability C&B
102 7 liability DET
EXPECTED OUTPUT - target table:
header_id line_id account_type accrual_tier
101 1 expense TAX
101 2 liability TAX
101 3 Liability TAX
101 4 liability GYC
102 1 liability C&B
102 2 expense C&B
102 3 expense C&B
102 4 ASSET ABC
102 5 OWNERS PQR
102 6 liability C&B
102 7 liability DET
Every header has multiple lines, and every line has one of four account types: expense, liability, asset or owners. accrual_tier has a value for all account types except expense, for which it is null.
The requirement is to populate accrual_tier for every line with its own accrual_tier value,
except for the expense type: there, accrual_tier should be the most frequently occurring accrual_tier among the liability lines of the same header.
For example, for header 101 the most frequent accrual_tier among the liability lines is TAX, so every expense line under that header is assigned TAX.
1) Sort the data by header_id,account_type in descending order (to make sure first record of each header_id is liability)
2) In expression do the following things:
I/p Port: Header_id
Variable port temp=IIF(Header_id=V_Header_id OR ISNULL(V_Header_id) ,IIF(account_type='liability',accrual_tier,NULL),NULL)
accrual_tier: IIF(Header_id=V_Header_id OR ISNULL(V_Header_id) && account_type='expense',temp,NULL)
Variable port V_Header_id=Header_id
You can use an analytic max() function to find the accrual_tier for the liability record for each header_id:
select header_id
, line_id
, account_type
, max(case when account_type = 'liability' then accrual_tier end)
over (partition by header_id) as accrual_tier
from source_table
order by header_id
, line_id
/
Alternatively we can solve this with a FIRST_VALUE() function, which needs marginally less typing :):
select header_id
, line_id
, account_type
, first_value (accrual_tier ignore nulls)
over (partition by header_id) as accrual_tier
from source_table
order by header_id
, line_id
/
Here is a LiveSQL demo (free Oracle TechNet account required).
Note that both of these solutions depend on the relational integrity of the source data. If a header has multiple lines where account_type='liability', you may get unexpected substitutions. Likewise, if there are lines where account_type='expense' and accrual_tier is not null, both queries will override the original value.
----------
You have revised your requirement, but it's not complete: you appear to be determining the substituted value of accrual_tier for expenses by some criteria which you have not included in your question. So this is probably not the solution you need, but it's the best I can do until you post a complete problem:
select header_id
, line_id
, account_type
, case when account_type != 'expense' then
accrual_tier
else
last_value (accrual_tier ignore nulls) over (partition by header_id order by accrual_tier)
end as accrual_tier
from source_table
order by header_id
, line_id
/
You tagged your question [informatica] but I don't have access to that product; I have assumed working Oracle SQL will solve your problem.
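The stated requirement (expense lines take the most frequent accrual_tier among the liability lines of the same header) can be sketched in plain Python as below. This is only an illustration of the rule, not an Informatica mapping or the SQL above; the function name `fill_expense_tiers` and the case-insensitive comparison of account types and tiers are assumptions, made because the sample data mixes `liability`/`Liability` and `TAX`/`Tax`.

```python
from collections import Counter

# (header_id, line_id, account_type, accrual_tier), as in the source table
rows = [
    (101, 1, "expense", None),
    (101, 2, "liability", "TAX"),
    (101, 3, "Liability", "Tax"),
    (101, 4, "Liability", "GYC"),
    (102, 1, "liability", "C&B"),
    (102, 2, "expense", None),
    (102, 3, "expense", None),
    (102, 4, "ASSET", "ABC"),
    (102, 5, "OWNERS", "PQR"),
    (102, 6, "liability", "C&B"),
    (102, 7, "liability", "DET"),
]

def fill_expense_tiers(rows):
    """Give each expense line the most frequent liability tier of its header."""
    counts = {}
    for header_id, _, account_type, tier in rows:
        if account_type.lower() == "liability" and tier is not None:
            # Assumption: 'Tax' and 'TAX' are the same tier.
            counts.setdefault(header_id, Counter())[tier.upper()] += 1

    filled = []
    for header_id, line_id, account_type, tier in rows:
        if account_type.lower() == "expense" and header_id in counts:
            # On a tie, most_common returns the tier that was seen first.
            tier = counts[header_id].most_common(1)[0][0]
        filled.append((header_id, line_id, account_type, tier))
    return filled

for row in fill_expense_tiers(rows):
    print(row)
```

Non-expense lines are passed through unchanged, so a header with no liability lines would leave its expense lines null.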

Showing the data of two rows into one row

I have a table with data as
-----------------------------------------------------------------------------
CUSTOMER CSAC CIRCUIT VALUE TOWN POST_CODE
-----------------------------------------------------------------------------
RCE | CSAC125896 | ICUK809605 | 100 MBPS | BASILDON | SS15 5FS
RCE | CSAC125896 | ICUK809605 | 100 MBPS | BASILDON | SS15 6AA
I want the second post code to be displayed in the same row when the CSAC values are the same, like this:
-----------------------------------------------------------------------------
CUSTOMER CSAC CIRCUIT VALUE TOWN POST_CODE POST_CODE2
-----------------------------------------------------------------------------
RCE|CSAC125896 |ICUK809605 |100 MBPS | BASILDON | SS15 5FS | SS15 6AA
How can I achieve this result? I have tried using transpose but didn't get the desired result.
SELECT CUSTOMER,
CSAC,
CIRCUIT,
VALUE,
TOWN,
MAX( CASE RN WHEN 1 THEN POST_CODE END ) AS POST_CODE_1,
MAX( CASE RN WHEN 2 THEN POST_CODE END ) AS POST_CODE_2
FROM (
SELECT t.*,
ROW_NUMBER() OVER (
PARTITION BY CUSTOMER, CSAC, CIRCUIT, VALUE, TOWN
ORDER BY POST_CODE
) AS RN
FROM table_name t
)
GROUP BY CUSTOMER,
CSAC,
CIRCUIT,
VALUE,
TOWN;
Output:
CUSTOMER CSAC CIRCUIT VALUE TOWN POST_CODE_1 POST_CODE_2
-------- ---------- ---------- -------- -------- ----------- -----------
RCE CSAC125896 ICUK809605 100 MBPS BASILDON SS15 5FS SS15 6AA
Assuming that you can have more than two rows with the same field values but different POST_CODE values, you cannot know in advance how many columns your query needs to return.
With a slightly different approach, you can try:
select CUSTOMER, CSAC, CIRCUIT, VALUE, TOWN,
listagg(POST_CODE, ', ') within group (order by post_code)
from your_table
group by CUSTOMER, CSAC, CIRCUIT, VALUE, TOWN
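For readers less familiar with listagg, this short Python sketch shows the same grouping idea: rows that agree on every column except POST_CODE are collapsed into one row with a comma-separated, sorted list of post codes. The function name `collect_post_codes` is hypothetical.

```python
from collections import defaultdict

# (customer, csac, circuit, value, town, post_code)
rows = [
    ("RCE", "CSAC125896", "ICUK809605", "100 MBPS", "BASILDON", "SS15 5FS"),
    ("RCE", "CSAC125896", "ICUK809605", "100 MBPS", "BASILDON", "SS15 6AA"),
]

def collect_post_codes(rows):
    """Group rows by all columns except post_code and join the post codes."""
    grouped = defaultdict(list)
    for *key, post_code in rows:
        grouped[tuple(key)].append(post_code)
    # sorted() corresponds to WITHIN GROUP (ORDER BY post_code)
    return {key: ", ".join(sorted(codes)) for key, codes in grouped.items()}

for key, post_codes in collect_post_codes(rows).items():
    print(*key, post_codes, sep=" | ")
```

Unlike the two-column pivot, this works for any number of post codes per group, at the cost of returning them in a single column.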

query on sales for each day and previous sale for each product also

I want to show the sale of a product and also the previous sale of that product; if the product has not been sold before that date, the previous sale column should be null. My source table has id, name, sales_date, quantity and unit_price, and the resulting table should contain id, name, sales_date, current_sale (the sale on that day) and previous_sale. The previous sale is per product, not the same for all products.
Let's say you have these data in your table:
ID NAME SALES_DATE QUANTITY UNIT_PRICE
---- ----- ----------- -------- --------------
1 p1 2015-07-24 10 5,00
2 p2 2015-07-24 14 10,00
3 p1 2015-07-28 15 4,00
4 p2 2015-07-29 7 11,00
5 p3 2015-07-29 3 2,00
This query, using the lag() function, generates the output you described (plus the column prev_date):
select id, name, sales_date, unit_price*quantity current_sale,
lag(sales_date) over (partition by name order by sales_date) prev_date,
lag(unit_price*quantity) over (partition by name order by sales_date) prev_sale
from sales order by name, sales_date
SQLFiddle demo
I used partition by name, but if "id" in the input table means "product id", it's better to use partition by id.
Output:
ID NAME SALES_DATE CURRENT_SALE PREV_DATE PREV_SALE
---- ----- ----------- ------------ ----------- ----------
1 p1 2015-07-24 50
3 p1 2015-07-28 60 2015-07-24 50
2 p2 2015-07-24 140
4 p2 2015-07-29 77 2015-07-24 140
5 p3 2015-07-29 6
Edit: If there are several entries per product per day, you need some form of aggregation. The most obvious is sum, as in the example below; you can also use min, max or avg.
select name, sales_date, sale current_sale, cnt,
lag(sales_date) over (partition by name order by sales_date) prev_date,
lag(sale) over (partition by name order by sales_date) prev_sale,
lag(cnt) over (partition by name order by sales_date) prev_cnt
from (
select name, trunc(sales_date) sales_date, sum(unit_price*quantity) sale, count(1) cnt
from sales group by name, trunc(sales_date)
)
order by name, sales_date
SQLFiddle demo
I also added the columns cnt and prev_cnt, showing the number of rows for that product on the current and previous days.
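What lag() over (partition by name order by sales_date) computes can be sketched in plain Python: sort by product and date, then remember the last sale seen for each product. The function name `with_previous_sale` is hypothetical, and the sketch assumes at most one row per product per day, as in the first query.

```python
# (id, name, sales_date, quantity, unit_price), same sample data as above
sales = [
    (1, "p1", "2015-07-24", 10, 5.0),
    (2, "p2", "2015-07-24", 14, 10.0),
    (3, "p1", "2015-07-28", 15, 4.0),
    (4, "p2", "2015-07-29", 7, 11.0),
    (5, "p3", "2015-07-29", 3, 2.0),
]

def with_previous_sale(sales):
    """Return (id, name, date, current_sale, prev_date, prev_sale) per row."""
    last_seen = {}  # product name -> (date, sale) of its most recent row
    out = []
    # ISO dates sort correctly as strings, so (name, date) gives the window order.
    for sale_id, name, date, quantity, unit_price in sorted(
        sales, key=lambda r: (r[1], r[2])
    ):
        current = quantity * unit_price
        prev_date, prev_sale = last_seen.get(name, (None, None))
        out.append((sale_id, name, date, current, prev_date, prev_sale))
        last_seen[name] = (date, current)
    return out

for row in with_previous_sale(sales):
    print(row)
```

The first row of each product gets None for both previous columns, which matches the null that lag() returns when there is no earlier row in the partition.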

Hive: Sum over a specified group (HiveQL)

I have a table:
key product_code cost
1 UK 20
1 US 10
1 EU 5
2 UK 3
2 EU 6
I would like to find the sum of all products for each group of "key" and append to each row. For example for key = 1, find the sum of costs of all products (20+10+5=35) and then append result to all rows which correspond to the key = 1. So end result:
key product_code cost total_costs
1 UK 20 35
1 US 10 35
1 EU 5 35
2 UK 3 9
2 EU 6 9
I would prefer to do this without using a sub-join, as this would be inefficient. My best idea would be to use the over function in conjunction with the sum function, but I can't get it to work. My best try:
SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;
I've had a look at the docs, but they're so cryptic that I have no idea how to do it. I'm using Hive v0.12.0, HDP v2.0.6, Hortonworks Hadoop distribution.
Similar to @VB_'s answer, use the BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause.
The HiveQL query is therefore:
SELECT key, product_code,
SUM(cost) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total_costs
FROM test;
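You could use a window frame such as BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to avoid a self join. Note that this particular frame produces a running total within each partition; replace CURRENT ROW with UNBOUNDED FOLLOWING to get the full total per key.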
Code as below:
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
The analytics function sum gives cumulative sums. For example, if you did:
select key, product_code, cost, sum(cost) over (partition by key) as total_costs from test
then you would get:
key product_code cost total_costs
1 UK 20 20
1 US 10 30
1 EU 5 35
2 UK 3 3
2 EU 6 9
which, it seems, is not what you want.
Instead, you should use the aggregation function sum, combined with a self join to accomplish this:
select test.key, test.product_code, test.cost, agg.total_cost
from (
select key, sum(cost) as total_cost
from test
group by key
) agg
join test
on agg.key = test.key;
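This query gives me the correct result; with no ORDER BY in the OVER clause, the sum covers the whole partition rather than a running total: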
select key, product_code, cost, sum(cost) over (partition by key) as total_costs from zone;
A similar answer (using the Oracle emp table):
select deptno, ename, sal, sum(sal) over(partition by deptno) from emp;
The output will look like this:
deptno ename sal sum_window_0
10 MILLER 1300 8750
10 KING 5000 8750
10 CLARK 2450 8750
20 SCOTT 3000 10875
20 FORD 3000 10875
20 ADAMS 1100 10875
20 JONES 2975 10875
20 SMITH 800 10875
30 BLAKE 2850 9400
30 MARTIN 1250 9400
30 ALLEN 1600 9400
30 WARD 1250 9400
30 TURNER 1500 9400
30 JAMES 950 9400
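Since the answers in this thread disagree on whether a windowed sum is cumulative, here is a small self-contained check. It uses Python's built-in `sqlite3` module as a stand-in for Hive (an assumption; it requires an SQLite build with window function support, 3.25 or later), with the same `test` data as the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "key" is quoted because it is also an SQL keyword.
conn.execute('CREATE TABLE test ("key" INTEGER, product_code TEXT, cost INTEGER)')
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [(1, "UK", 20), (1, "US", 10), (1, "EU", 5), (2, "UK", 3), (2, "EU", 6)],
)

# No ORDER BY inside OVER: the frame is the whole partition, so every row
# of a key gets the same total.
totals = conn.execute(
    """
    SELECT "key", product_code, cost,
           SUM(cost) OVER (PARTITION BY "key") AS total_costs
    FROM test
    ORDER BY "key", cost DESC
    """
).fetchall()

# ORDER BY plus a frame ending at CURRENT ROW: a running total.
running = conn.execute(
    """
    SELECT "key", product_code, cost,
           SUM(cost) OVER (PARTITION BY "key" ORDER BY cost DESC
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS running_cost
    FROM test
    ORDER BY "key", cost DESC
    """
).fetchall()

print(totals)
print(running)
```

In this sketch the partition-only window yields 35 and 9 on every row, while the ordered window accumulates, which is consistent with standard SQL window semantics; behavior on a specific Hive version should still be verified there.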
The table above looked like this:
key product_code cost
1 UK 20
1 US 10
1 EU 5
2 UK 3
2 EU 6
The user wanted a table with the total costs, like the following:
key product_code cost total_costs
1 UK 20 35
1 US 10 35
1 EU 5 35
2 UK 3 9
2 EU 6 9
Therefore we used the following query:
SELECT key, product_code,
SUM(cost) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;
So far so good.
I want one more column, counting the occurrences of each country:
key product_code cost total_costs occurences
1 UK 20 35 2
1 US 10 35 1
1 EU 5 35 2
2 UK 3 9 2
2 EU 6 9 2
Therefore I used the following query:
SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as total_costs
COUNT(product code) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as occurences
FROM test;
Sadly, this is not working; I get a cryptic error. To rule out a mistake in my query, I want to ask whether I did something wrong.
Thanks
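Three things in the query above look like likely causes: there is no comma between the two window expressions, `product code` is written with a space instead of `product_code`, and counting per country needs PARTITION BY product_code rather than PARTITION BY key. Below is a hedged sketch of a corrected query, run against SQLite through Python's `sqlite3` module as a stand-in for Hive (an assumption; window functions need SQLite 3.25 or later). The column is taken to be `cost`, as in the table shown, and the alias `occurrences` is a spelling choice made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "key" is quoted because it is also an SQL keyword.
conn.execute('CREATE TABLE test ("key" INTEGER, product_code TEXT, cost INTEGER)')
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [(1, "UK", 20), (1, "US", 10), (1, "EU", 5), (2, "UK", 3), (2, "EU", 6)],
)

rows = conn.execute(
    """
    SELECT "key", product_code, cost,
           SUM(cost) OVER (PARTITION BY "key")        AS total_costs,
           COUNT(*)  OVER (PARTITION BY product_code) AS occurrences
    FROM test
    ORDER BY "key", cost DESC
    """
).fetchall()

for row in rows:
    print(row)
```

Each window expression has its own PARTITION BY, so the total is computed per key while the count is computed per product_code across all keys.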
