How to join and find a value in Pig? - hadoop

How do I combine these two tables and find the IDs whose amount is greater than 1600 for the NDakota region?
customers:
1 alaska robert
2 boston lilly
3 NDakota Michael
4 NDakota Will
5 NDakota Mark
sales:
1A 1 09/09/2012 1200
2A 2 8/9/2016 3400
3B 3 4/5/2016 2300
customers = LOAD '/home/vis/Documents/customers' using PigStorage(' ') AS(cust_id:int,region:chararray,name:chararray);
sales = LOAD '/home/vis/Documents/sales' using PigStorage(' ')
AS(sales_id:int,cust_id:int,date:datetime,amount:int);
salesNA = FILTER customers BY region =='NDakota';
joined = JOIN sales BY cust_id,salesNA BY cust_id;
grouped = GROUP joined BY cust_id;
summed= FOREACH grouped GENERATE GROUP,SUM(sales.amount);
bigSpenders= FILTER summed BY 1$>1600;
DUMP sorted;
I am receiving errors as

From the Apache Pig docs:
Use the disambiguate operator ( :: ) to identify field names after
JOIN, COGROUP, CROSS, or FLATTEN operators.
The code snippet below should suffice to achieve the objective; let me know if you see any issues.
customers = LOAD 'customers.txt' using PigStorage(' ') AS(cust_id:int,region:chararray,name:chararray);
sales = LOAD 'sales.txt' using PigStorage(' ') AS(sales_id:chararray,cust_id:int,date:chararray,amount:int);
custNA = FILTER customers BY region =='NDakota';
joined = JOIN sales BY cust_id,custNA BY cust_id;
req_data = FILTER joined BY amount > 1600;
DUMP req_data;
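To sanity-check what the script above should return, here is a minimal Python sketch of the same filter, join, filter pipeline, run on the sample rows from the question. It only illustrates the logic and is not Pig code. The variable names are made up for this example, and the joined tuple keeps cust_id once, whereas Pig would keep both copies.

```python
# Sample rows copied from the question; all names below are illustrative only.
customers = [
    (1, "alaska", "robert"),
    (2, "boston", "lilly"),
    (3, "NDakota", "Michael"),
    (4, "NDakota", "Will"),
    (5, "NDakota", "Mark"),
]
sales = [
    ("1A", 1, "09/09/2012", 1200),
    ("2A", 2, "8/9/2016", 3400),
    ("3B", 3, "4/5/2016", 2300),
]

# FILTER customers BY region == 'NDakota'
cust_na = {
    cust_id: (region, name)
    for cust_id, region, name in customers
    if region == "NDakota"
}

# JOIN sales BY cust_id, custNA BY cust_id (inner join on cust_id)
joined = [
    (sales_id, cust_id, date, amount) + cust_na[cust_id]
    for sales_id, cust_id, date, amount in sales
    if cust_id in cust_na
]

# FILTER joined BY amount > 1600
req_data = [row for row in joined if row[3] > 1600]

for row in req_data:
    print(row)
```

On the sample data only sale 3B qualifies. The sales_id values such as 1A are not integers, which is presumably why the snippet above loads that column as chararray.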

Related

CTX index in Oracle is quite slow?

I am using a CTX index on an Oracle column.
I have two tables:
xyz has 6 million records.
abc has 100k records.
What I want to achieve is this.
Scenario: I have to union them, then order the data by date, and then paginate the result.
select page_loc, blurb_id, article_id, hit_date,val_rank from (
SELECT REGEXP_REPLACE(regexp_replace(page_loc, '^.*mercer\.com'),'\?.*$') AS page_loc,
blurb_id, article_id, hit_date,RANK() OVER (ORDER BY hit_date DESC) AS val_rank
from (
SELECT page_loc, blurb_id, article_id, hit_date
FROM abc u INNER JOIN person p ON u.person_id=p.person_id
WHERE (CONTAINS(u.page_loc,v_pattern) > 0 )
UNION ALL
SELECT e page_loc, blurb_id, article_id, hit_date
FROM xyz u INNER JOIN person p ON u.person_id=p.person_id
WHERE ( CONTAINS(u.page_loc,v_pattern) > 0)
) p
left join
company c on c.company_id = p.company_id
) a
where val_rank between (v_page_no -1) * 10 and v_page_no * 10
v_pattern --> search text
v_page_no --> page number
It takes a long time when a v_pattern matches many records, e.g. more than 100k.
For example, for a word that matches 2 million records, it takes more than 20 minutes.
Any idea what I can do? Or is this not possible with Oracle?
NOTE: when I analyzed it, I found that the ORDER BY and the pagination are what make it slow. How can I overcome that?

Exclude Debit Transactions

I need to select records from a transactions table, excluding certain transactions.
Below is a sample of my tables.
TB_ACCOUNTS
CUSTCD  ACCTNO  PRDCD
100     10001   SATF
100     10002   SATF
200     10003   CUS
TB_TRANSACTIONS
TXNNO  TXDATE     ACCTNO  CUSTOMER_NO  TXAMT  CASHFLOWTYPE
TX1    18-Jul-16  10001   100          5000   CR
TX2    18-Jul-16  10002   100          5000   DR
TX3    18-Jul-16  10003   200          3000   DR
TX4    18-Jul-16  10001   100          3000   CR
I want to select credit transactions where PRDCD is 'SATF' and exclude transfers between a customer's own accounts. For example, customer 100 has two accounts with PRDCD 'SATF'. I want my query to exclude the credit transaction of amount 5000, because the debit account also belongs to the same customer, but include the credit transaction of amount 3000, because the debit account belongs to a different customer and its type is not SATF.
So far I have the query below, but the output I'm getting is completely wrong.
select * from TB_TRANSACTIONS AB inner join TB_ACCOUNTS AC
on AB.ACCTNO=AC.ACCTNO
where AB.CASHFLOWTYPE='CR'
and AC.PRDCD='SATF'
and AB.TXNNO=
(select TXNNO from TB_TRANSACTIONS A, TB_ACCOUNTS B
where A.ACCTNO=B.ACCTNO
and A.TXAMT=AB.TXAMT
and A.CASHFLOWTYPE='DR'
and B.PRDCD=AC.PRDCD)
How to achieve the desired result?
You could use something like this:
select txnno, txdate, acctno, customer_no, txamt, cashflowtype, prdcd
from (
select t.*, a.prdcd,
count(distinct customer_no) over (partition by txdate, txamt) cnt
from tb_transactions t join tb_accounts a on t.acctno = a.acctno )
where cnt = 2 and cashflowtype = 'CR' and prdcd = 'SATF'
Here I assumed that txamt is unique for each date. I strongly suspect that this assumption may be wrong, so be warned.
But there is nothing except this column that tells us that two rows belong to the same operation.
In the first query I used the analytic version of count(). Other solutions are also possible, for instance (not) exists:
select *
from tb_transactions t join tb_accounts a on t.acctno = a.acctno
where
not exists (
select 1 from tb_transactions tt join tb_accounts ta on tt.acctno = ta.acctno
where tt.txdate = t.txdate and tt.txamt = t.txamt
and tt.cashflowtype = 'DR' and tt.customer_no = t.customer_no)
and cashflowtype = 'CR' and prdcd = 'SATF'
(sorry for any language mistakes)
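As a quick check of the NOT EXISTS variant above, the following sketch loads the sample rows from the question into an in-memory SQLite database and runs the same query. SQLite stands in for Oracle here purely for illustration (an assumption, since the question does not say how the tables are defined), and the column types are guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tb_accounts (custcd INTEGER, acctno INTEGER, prdcd TEXT);
INSERT INTO tb_accounts VALUES
    (100, 10001, 'SATF'),
    (100, 10002, 'SATF'),
    (200, 10003, 'CUS');

CREATE TABLE tb_transactions (
    txnno TEXT, txdate TEXT, acctno INTEGER,
    customer_no INTEGER, txamt INTEGER, cashflowtype TEXT);
INSERT INTO tb_transactions VALUES
    ('TX1', '18-Jul-16', 10001, 100, 5000, 'CR'),
    ('TX2', '18-Jul-16', 10002, 100, 5000, 'DR'),
    ('TX3', '18-Jul-16', 10003, 200, 3000, 'DR'),
    ('TX4', '18-Jul-16', 10001, 100, 3000, 'CR');
""")

# Same logic as the NOT EXISTS query: keep SATF credits that have no
# same-date, same-amount debit belonging to the same customer.
rows = conn.execute("""
    SELECT t.txnno, t.txamt
    FROM tb_transactions t JOIN tb_accounts a ON t.acctno = a.acctno
    WHERE NOT EXISTS (
            SELECT 1
            FROM tb_transactions tt JOIN tb_accounts ta ON tt.acctno = ta.acctno
            WHERE tt.txdate = t.txdate AND tt.txamt = t.txamt
              AND tt.cashflowtype = 'DR' AND tt.customer_no = t.customer_no)
      AND t.cashflowtype = 'CR' AND a.prdcd = 'SATF'
    ORDER BY t.txnno
""").fetchall()

print(rows)
```

On the sample data this keeps only TX4 (amount 3000), which matches the result the question asks for. It still relies on the date-plus-amount pairing assumption mentioned above.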

SQL query multiple joins not working

I am using an Oracle database and trying to run the following query, but it gives the error:
"ERROR at line 17: ORA-00904: "FRH"."NS": invalid identifier"
What is the problem with it?
Following is the query:
SELECT *
FROM
(SELECT *
FROM ROOMS R
WHERE R.Prix<'50') FRM
JOIN
(SELECT *
FROM
(SELECT *
FROM HOTELS H
WHERE H.CatH=2) FH
JOIN
(SELECT *
FROM RESORTS R
WHERE TypeS='montagne') FR
ON FH.NS=FR.NS) FRH
ON (FRH.NS=FRM.NS AND FRH.NH=FRM.NH);
Thanks in advance
You have way too many nested selects here. Your query can be simplified to:
SELECT *
FROM rooms rm
JOIN hotels ht ON ht.ns = rm.ns AND ht.nh = rm.nh
JOIN resorts rs ON rs.ns = ht.ns
WHERE rm.prix < 50
AND ht.cath = 2
AND rs.types = 'montagne';
I am not entirely sure which tables need to be joined using just the ns column and which need both the ns and nh columns, because you have obfuscated your query so much and did not show us the table definitions.
Alternatively you can move the restrictions on the joined tables into the join condition. This isn't necessary for the inner joins you are using, but could be needed if you ever want to change that to an outer join:
SELECT *
FROM rooms rm
JOIN hotels ht ON ht.ns = rm.ns AND ht.nh = rm.nh AND ht.cath = 2
JOIN resorts rs ON rs.ns = ht.ns AND rs.types = 'montagne'
WHERE rm.prix < 50;
You should also not compare numbers and strings. Assuming rooms.prix is a number column, the condition R.Prix<'50' is wrong. You need to compare the number to a number: r.prix < 50.
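To illustrate why the placement of the restriction matters once a join becomes an outer join, here is a small sketch using an in-memory SQLite database. The table layout, the nch (room number) column and the rows are hypothetical, since the question does not show the table definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and rows, for illustration only.
CREATE TABLE rooms  (ns INTEGER, nh INTEGER, nch INTEGER, prix INTEGER);
CREATE TABLE hotels (ns INTEGER, nh INTEGER, cath INTEGER);
INSERT INTO hotels VALUES (1, 1, 2), (1, 2, 3);
INSERT INTO rooms  VALUES (1, 1, 10, 40), (1, 2, 20, 45);
""")

# Restriction in WHERE: a room whose hotel is not category 2 is filtered out
# after the join, so the LEFT JOIN effectively behaves like an inner join.
in_where = conn.execute("""
    SELECT rm.nch, ht.cath
    FROM rooms rm
    LEFT JOIN hotels ht ON ht.ns = rm.ns AND ht.nh = rm.nh
    WHERE rm.prix < 50 AND ht.cath = 2
    ORDER BY rm.nch
""").fetchall()

# Restriction in ON: the room is kept, with NULL for the hotel columns.
in_on = conn.execute("""
    SELECT rm.nch, ht.cath
    FROM rooms rm
    LEFT JOIN hotels ht ON ht.ns = rm.ns AND ht.nh = rm.nh AND ht.cath = 2
    WHERE rm.prix < 50
    ORDER BY rm.nch
""").fetchall()

print(in_where)
print(in_on)
```

With these made-up rows the first query returns only room 10, while the second also returns room 20 with no hotel attached. For plain inner joins the two placements give the same rows.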

In Elasticsearch, how can I establish join query with conditions and later perform percentile and count functions?

I have a set of tables in my database: table A holds a set of categories and table B a set of repositories. A and B are related by categoryid. Table C holds a set of properties for a repoId. Tables C and B are associated by repoId.
Table C can have multiple values for a repoId.
The data in table C is a property, say a number string like 12345XXXX (at most 10 characters). I have to find the most common first 6 characters among the values in table C, and the count of repoIds associated with that 6-character prefix, for a particular categoryid in table A.
Table A (set of categories) ---------> Table B (set of repositories, associated with A by categoryid) ---------> Table C (set of FMProperties against a repoId)
Currently this is achieved using joins and substring queries on these tables, and it is very slow.
I have to implement this functionality using Elasticsearch, but I don't have a clear view of how to start.
Do I create separate documents / indexes for tables A, B and C, or fetch the info using a SQL query and create a single document?
And how can we apply the analytics part explained above?
I am very new to this technology, but I am following the tutorials provided on the Elasticsearch site.
Please find below the MySQL query for this logic:
(select 'fms' as fmstype, C.fmscode as fmsCode,
count(C.repoId) as countOffms from tableC C, tableB B
where B.repoId = C.repoId and B.categoryid = 175
group by C.fmscode
order by countOffms desc
limit 1)
UNION ALL
(select 'fms6' as fmstype, t1.fmscode, t2.countOffms from tableC t1
inner join
(
select substring(C.fmscode,1,6) as first6,
count(C.repoId) as countOffms from tableC C, tableB B
where B.repoId = C.repoId and B.categoryid = 175 and length(C.fmscode) = 6
group by substring(C.fmscode,1,6) order by countOffms desc
limit 1 ) t2
ON
substring(t1.fmscode,1,6) = t2.first6 and length(t1.fmscode) = 6
group by t1.fmscode
order by count(t1.fmscode) desc
limit 1)
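To make the prefix-counting requirement concrete, here is a minimal Python sketch of the logic as described in the prose: for one category, find the most common 6-character prefix of fmscode and how many rows carry it. The rows, the names table_b and table_c as Python objects, and the helper top_prefix are all hypothetical. This is not an Elasticsearch query, only a reference for what any Elasticsearch-based approach would need to reproduce.

```python
from collections import Counter

# Hypothetical rows standing in for tableB (repoId -> categoryid)
# and tableC (repoId, fmscode).
table_b = {1: 175, 2: 175, 3: 175, 4: 200}
table_c = [
    (1, "1234567890"),
    (2, "1234560000"),
    (3, "9999991111"),
    (4, "1234567890"),
]


def top_prefix(category_id, prefix_len=6):
    """Return (prefix, count) for the most common fmscode prefix in one category."""
    counts = Counter(
        code[:prefix_len]
        for repo_id, code in table_c
        if table_b.get(repo_id) == category_id
    )
    return counts.most_common(1)[0]


print(top_prefix(175))
```

With these made-up rows, category 175 has two codes starting with 123456 and one starting with 999999, so the sketch reports the prefix 123456 with a count of 2.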

T-SQL: what is the best way to retrieve records matching specific criteria?

I have a table (Cars) which stores some characteristics of a car, like EngineNo, LastProductionStepId, BodyNo, ...
Besides, I have another table (CarSteps) which stores all the steps that a specific car should pass through during its manufacturing process, like Engine Assigning (Id = 2), Engraving (3), PrePaint (4), Paint (5), AfterPaint (6), Confirmation (7), Delivery (8).
I would like to get all cars that are currently between PrePaint and Confirmation:
select cr.Id, cr.BodyNo, cr.LastStepId
from Cars cr WITH (NOLOCK)
inner join CarSteps steps WITH (NOLOCK) on cr.Id = trace.CarId and cr.LastStepId=trace.StepId
where
cr.LastStepId >= 4
AND cr.[Status] = 1
AND steps.[Status] = 1
AND not exists ( select *
from CarSteps steps1 WITH (NOLOCK)
where steps1.CarId = cr.Id
AND steps1.StepId >= 7 AND steps1.Status = 1
)
Because CarSteps has many records (44 million), the query is slow.
What is your opinion? Is there a better way to get those cars?
Looking at your query, I see a join from Cars to CarSteps, but the join condition references trace.CarId and trace.StepId, and trace is not defined in your query.
from
Cars cr WITH (NOLOCK) inner join
CarSteps steps WITH (NOLOCK) on
cr.Id = trace.CarId and
cr.LastStepId=trace.StepId
I may be able to help if I better understand the full query.
What does the execution plan reveal?

Resources