In PIG how to remove similar values - hadoop

In my Pig script I have an id column and two country columns, country1 and country2. Some of the values in the country columns are similar, as shown below. How do I filter out rows where the two country values are similar, that is, where they share at least 2 consecutive characters?
Ex:
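-- 'file' is a placeholder path; the delimiter and schema are assumed from the output below
a = LOAD 'file' USING PigStorage(',') AS (id:chararray, country1:chararray, country2:chararray);
b = FOREACH a GENERATE id, country1, country2;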
output:
id1, us, usa
id2, gb, gba
id3, in, ind
id4, in, usa
expected output:
id4, in, usa

Use SUBSTRING to get the first two characters of the 3rd column and compare that with the 2nd column value.
B = FILTER A BY LOWER($1) != SUBSTRING(LOWER($2), 0, 2);
DUMP B;
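To make the comparison concrete, here is a minimal Python sketch of the same row-level test, run against the sample rows from the question. The names rows, is_distinct and kept are illustrative only and not part of the Pig script.

```python
# Rows as (id, country1, country2), copied from the question's sample output.
rows = [
    ("id1", "us", "usa"),
    ("id2", "gb", "gba"),
    ("id3", "in", "ind"),
    ("id4", "in", "usa"),
]

def is_distinct(country1, country2):
    # Same test as the FILTER: lower-cased country1 against the
    # first two characters of lower-cased country2.
    return country1.lower() != country2.lower()[:2]

# Keep only the rows whose two country values do not line up.
kept = [row for row in rows if is_distinct(row[1], row[2])]
print(kept)  # [('id4', 'in', 'usa')]
```

Note that this only catches the case where country1 equals the first two characters of country2, which is all the sample data needs.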

Related

How to Flatten and get expected output shown below from pig after Group by

Sample Data:
ID marks date
12345 12 20210204
12345 13 20210204
12345 2 20210204
Input:
(12345,{(12345,12,20210204),(12345,13,20210204),(12345,2,20210204)})
Output needed:
(12345,27,20210204)
Second element is the aggregated value.
Help is Appreciated
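-- grouped and data are placeholder relation names (input, output and sample are reserved words in Pig):
-- grouped = GROUP data BY ID;
result = FOREACH grouped GENERATE
group AS ID,
SUM(data.marks) AS mark_sum,
MIN(data.date) AS first_date;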
You may need to tweak based on your relation and field names. You might also want to group by the date field too if these are all the same.
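As a cross-check of what the aggregation should produce, the following Python sketch applies the same group, sum and min steps to the three sample records. The variable names are assumptions chosen for illustration.

```python
from collections import defaultdict

# (ID, marks, date) tuples from the sample data in the question.
records = [
    (12345, 12, 20210204),
    (12345, 13, 20210204),
    (12345, 2, 20210204),
]

# Group by ID, like GROUP ... BY ID in Pig.
groups = defaultdict(list)
for rec_id, marks, date in records:
    groups[rec_id].append((marks, date))

# One output tuple per group: (ID, SUM(marks), MIN(date)).
result = [
    (rec_id, sum(m for m, _ in items), min(d for _, d in items))
    for rec_id, items in groups.items()
]
print(result)  # [(12345, 27, 20210204)]
```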

Oracle SQL: Select rows from table A with fallback to joined table A and B. (union, group by,...)

The requirement may seem a bit odd, but bear with me: Let's say I have a list of my employees like this:
pid name
-------------------------
1 Smith-Gordon
2 Hansen
3 Simpson
And a table of previous names (if e.g. Mrs Smith-Gordon and Mr Hansen had one or more different names before they were married, respectively), employeehist:
pid oldname
-------------------------
1 Smith
2 Taylor
2 Baker
What I want now is to be able to search for names and get results from both tables like this:
a) Search for "Simpson%" -> Get a result like "3, Simpson"
b) Search for "Hansen%" -> Get a result like "2, Hansen"
c) Search for "Taylor%" -> Get a result like "2, Hansen, matched on previous Taylor"
d) Search for "Smith%" -> Get a result like "1, Smith-Gordon"
In other words, I want the current record, plus the old name if that was where the pertinent match occurred.
What I tried so far:
1) Naively join the history to the current employees: The searches b), c) and d) will always contain something in the oldname column, so I can't tell where the match occurred. I also get duplicate hits for Mr Hansen.
2) I tried to UNION a first select on employees (containing a dummy NULL AS oldname) with a second select joining employeehist with employees which will return me a nice hit for search b) without an oldname and one with an oldname for c), but now I predictably get duplicates in d).
Any thoughts?
You can use the following query with a parameter:
SELECT e.pid,
CASE
WHEN e.name LIKE :search_key THEN e.name
WHEN eh.oldname LIKE :search_key THEN e.name || ' matched on previous ' || eh.oldname
END
FROM employees e
LEFT JOIN employeehist eh on (e.pid = eh.pid)
WHERE e.name LIKE :search_key OR eh.oldname LIKE :search_key
I have come up with this solution:
SELECT * FROM ( /* (3) outer filter query */
SELECT e.pid, e.name, /* (1) query combining current and matching old names */
CASE
WHEN e.name LIKE :search_key THEN 'Y'
ELSE 'N'
END AS primary_match,
(
SELECT oldname /* (2) subquery that gives me one or no matching old name */
FROM employeehist eh
WHERE eh.pid = e.pid
AND eh.oldname LIKE :search_key
AND ROWNUM=1
) AS oldname
FROM employees e
) combined
WHERE combined.primary_match = 'Y' OR combined.oldname IS NOT NULL;
There's one primary select (1) that gets me all current ids and names, and adds a CASE column indicating whether the name matched. Additionally, it runs a subquery (2) that gets me one matching old name (only one even if there are several, NULL if there are none). With that in hand I can use an outer select (3) that filters away rows with no matches.
This would return e.g. for search key "Smith%"
pid | name | primary_match | oldname
1 | Smith-Gordon | Y | Smith
or for "Taylor%"
pid | name | primary_match | oldname
2 | Hansen | N | Taylor
I'm not sure how elegant it is, but it works as I want:
I get one result per matching current pid, no matter how many old names that pid has, matching or not. No duplicates.
I can distinguish between results that matched on the current name and those that ("only" or "also") matched on old names.
I don't need to define my matching condition twice because it gets rolled into that CASE column and I can filter on that.
There's obviously room for improvement: The subquery (2) could be made to return an aggregate of all matching old names (or the newest or oldest, I have a column for that).
But this works for me.
I have found a better solution than my previous one. My problem was that I couldn't GROUP BY pid and "squash" differing oldname rows. I'm quite sure I remember that this was possible in MySQL, but Oracle only ever gave me "ORA-00979: not a GROUP BY expression". Strict but fair.
The solution is apparently to give Oracle a strategy for dealing with those rows:
SELECT pid, name,
MIN(oldname) KEEP (DENSE_RANK FIRST ORDER BY oldname NULLS FIRST) as oldname
/*(3) outer select combines current and old hits, and "squashes" duplicates, preferring current hits where available*/
FROM (
SELECT e.pid, e.name, null AS oldname /*(1) hits in current names*/
FROM employees e
WHERE e.name LIKE :search_key
UNION ALL
SELECT e.pid, e.name, eh.oldname /* (2) hits in old names*/
FROM employeehist eh
JOIN employees e ON e.pid = eh.pid
WHERE eh.oldname LIKE :search_key
) combined
GROUP BY pid, name;
The idea is simple: Run a query (1) that gives all matches in current names (plus a dummy "oldname" column with NULLs), then a query (2) that gives all matches in old names (complete with their joined current names to display). Then simply combine those, and remove the duplicates by pid (and name, because Oracle, but that's identical by definition) giving preference to rows where oldname is NULL.
This would return e.g. for search key "Smith%"
pid | name | oldname
1 | Smith-Gordon | NULL
which is exactly what I want. If there's a pid with a current and an old match, I don't care about the old one. Or for "Taylor%":
pid | name | oldname
2 | Hansen | Taylor
This query also appears to be roughly 10 times faster than my other solution - I guess because it avoids subqueries that depend on the current pid.
So the only odd thing is that I need to use MIN(oldname) instead of some form of identity. I get that Oracle needs an aggregate function here, but the whole point of the KEEP ... FIRST exercise is to only have one row anyway, no?
But it works, and it's fast, so I won't complain.
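For readers who want to trace the "combine, then squash per pid" idea outside Oracle, here is a rough Python sketch of the same three steps using the sample data from the question. It is an illustration under assumptions, not a translation of the query: the search function is hypothetical, and LIKE 'x%' is approximated by a case-sensitive prefix match.

```python
employees = {1: "Smith-Gordon", 2: "Hansen", 3: "Simpson"}
employeehist = [(1, "Smith"), (2, "Taylor"), (2, "Baker")]

def search(prefix):
    # (1) hits in current names, with a dummy old name of None
    combined = [(pid, name, None)
                for pid, name in employees.items() if name.startswith(prefix)]
    # (2) hits in old names, joined back to the current name
    combined += [(pid, employees[pid], old)
                 for pid, old in employeehist if old.startswith(prefix)]
    # (3) keep one row per pid: None ranks first (like NULLS FIRST),
    # otherwise the smallest matching old name wins (like MIN).
    best = {}
    for pid, name, old in combined:
        rank = (old is not None, old or "")
        if pid not in best or rank < best[pid][0]:
            best[pid] = (rank, (pid, name, old))
    return [row for _, row in sorted(best.values(), key=lambda item: item[1][0])]

print(search("Smith"))   # [(1, 'Smith-Gordon', None)]
print(search("Taylor"))  # [(2, 'Hansen', 'Taylor')]
```

The rank tuple plays the role of the ORDER BY inside KEEP: it decides which row of each pid group survives, and the aggregate is then applied only to the survivors.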

Where two or more values match condition?

I have been asked this question;
You list county names and the surnames of the representatives if the representatives in the counties have the same surname.
and I have the following tables;
***REPRESENTATIVE***
REPI SURNAME FIRSTNAME COUNTY CONS
---- ---------- ---------- ---------- ----
R100 Gorege Larry kent CON1
R101 shneebly john kent CON2
R102 shneebly steve kent CON3
I can't seem to figure out the correct way to ask Oracle to display a surname that exists more than twice where the surnames are in the same county.
I know how to ask WHERE something = something, but that doesn't ask what I want to know.
It sounds like you want to use the HAVING clause after doing a GROUP BY
SELECT surname, county, count(*)
FROM your_table
GROUP BY surname, county
HAVING count(*) > 1;
If you really mean "more than twice" as you wrote, you'd want HAVING count(*) > 2, but then none of your sample data would be returned.
In words, this SQL statement says
Group the data into buckets by surname and county. Each distinct combination of surname and county is a separate bucket.
Count the number of rows in each bucket
Return those buckets where there are at least two rows
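The three steps above can be tried quickly with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for Oracle here, and the table name representative and the lower-case column names are assumptions based on the sample table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE representative "
    "(repi TEXT, surname TEXT, firstname TEXT, county TEXT, cons TEXT)"
)
conn.executemany(
    "INSERT INTO representative VALUES (?, ?, ?, ?, ?)",
    [
        ("R100", "Gorege", "Larry", "kent", "CON1"),
        ("R101", "shneebly", "john", "kent", "CON2"),
        ("R102", "shneebly", "steve", "kent", "CON3"),
    ],
)

# Bucket by (surname, county), count each bucket, keep buckets with 2+ rows.
shared = conn.execute(
    "SELECT surname, county, COUNT(*) FROM representative "
    "GROUP BY surname, county HAVING COUNT(*) > 1"
).fetchall()
print(shared)  # [('shneebly', 'kent', 2)]
```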

Substring inside string

Suppose this is my table:
ID STRING
1 'ABC'
2 'DAE'
3 'BYYYYYY'
4 'H'
I want to select all rows that have at least one of the characters in the STRING column somewhere in another row's STRING variable.
For example, 1 and 2 have an A in common and 1 and 3 have a B in common, but 4 does not have any characters in common with any of the other rows. So my query should return only the first three lines.
I don't need to know with which line it matched.
Thanks!
@A.B.Cade: Good solution, but it could be done without any DISTINCT or join.
SELECT * FROM test t1
WHERE EXISTS
(
SELECT * FROM test t2
WHERE t1.id<>t2.id AND
regexp_like(t1.string, '['|| replace(t2.string, '.[]', '\.\[\]')||']')
)
The query won't compare the string with extra rows since it'll stop the comparison as soon as 1 match is found for the current row...
See fiddle.
@GolezTrol's answer is a good one, but here is another approach:
select distinct t1."ID", t1."STRING"
from table1 t1, table1 t2
where t1."ID" <> t2."ID"
and regexp_like(t1."STRING", '['|| t2."STRING"||']')
First take a Cartesian product of the table
Then make sure you're not comparing the same string to itself
Then create a regexp from one string for comparing to the other - [<string1>] means that the string must contain one of the letters in the [ ], which are all from string1
Here is a fiddle
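The logic both regexp answers rely on (does this row share at least one character with any other row?) can be sketched in a few lines of Python using set intersection. The names rows, shares_a_character and matching are illustrative assumptions; any() stops at the first matching row, much like EXISTS.

```python
# id -> string, copied from the sample table in the question.
rows = {1: "ABC", 2: "DAE", 3: "BYYYYYY", 4: "H"}

def shares_a_character(row_id):
    chars = set(rows[row_id])
    # True as soon as one other row has a character in common.
    return any(
        chars & set(other)
        for other_id, other in rows.items()
        if other_id != row_id
    )

matching = [row_id for row_id in rows if shares_a_character(row_id)]
print(matching)  # [1, 2, 3]
```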
Like this:
select distinct
id, name
from
(select distinct
x.id,
x.NAME,
length(x.NAME) as leng,
substr(x.name, level, 1) as namechar
from
YourTable x
start with
level = 0
connect by
level <= length(x.name)) y
where
exists
(select
'x'
from
YourTable z
where
instr(z.name, y.namechar) > 0 and
z.id <> y.id)
order by
id
What it does:
First, (inner select) use the table with a number generator that returns a number for each letter in the name. Now each record in YourTable is returned Length(Name) times, each with another number. That generated number is used to isolate that letter (substr).
Then (subselect in top level where clause) check if records exist that contain that isolated letter. Distinct is needed, because records are returned more than once if more than one letter matches. You could add namechar to the outer select field list to see the letters that match.

Select all rows from SQL based upon existence of multiple rows (sequence numbers)

Let's say I have table data similar to the following:
123456 John Doe 1 Green 2001
234567 Jane Doe 1 Yellow 2001
234567 Jane Doe 2 Red 2001
345678 Jim Doe 1 Red 2001
What I am attempting to do is isolate only the records for Jane Doe, based upon the fact that she has more than one row in this table (more than one sequence number).
I cannot isolate based upon ID, names, colors, years, etc...
The number 1 in the sequence tells me that is the first record and I need to be able to display that record, as well as the number 2 record -- The change record.
If the table is called users, and the fields are called ID, fname, lname, seq_no, color, date, how would I write the code to select only records that have more than one row in this table? For example:
I want the query to display this only based upon the existence of the multiple rows:
234567 Jane Doe 1 Yellow 2001
234567 Jane Doe 2 Red 2001
In PL/SQL
First, to find the IDs that have multiple rows you would use:
SELECT ID FROM users GROUP BY ID HAVING COUNT(*) > 1
So you could get all the records for all those people with
SELECT * FROM users WHERE ID IN (SELECT ID FROM users GROUP BY ID HAVING COUNT(*) > 1)
If you know that the second sequence number will always be 2 and that the "2" record will never be deleted, you might find something like:
SELECT * FROM users WHERE ID IN (SELECT ID FROM users WHERE seq_no = 2)
to be faster, but you had better be sure those requirements are guaranteed to be met in your database (and you would want a compound index on (seq_no, ID)).
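Here is a small runnable check of the IN (... GROUP BY ... HAVING ...) form against the sample rows, using Python's sqlite3 module as a stand-in for Oracle. The table and column names follow the ones given in the question; the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE users (ID INTEGER, fname TEXT, lname TEXT, '
    'seq_no INTEGER, color TEXT, "date" INTEGER)'
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?, ?)",
    [
        (123456, "John", "Doe", 1, "Green", 2001),
        (234567, "Jane", "Doe", 1, "Yellow", 2001),
        (234567, "Jane", "Doe", 2, "Red", 2001),
        (345678, "Jim", "Doe", 1, "Red", 2001),
    ],
)

# Inner query finds IDs with more than one row; outer query returns all their rows.
multi = conn.execute(
    "SELECT ID, fname, lname, seq_no, color FROM users "
    "WHERE ID IN (SELECT ID FROM users GROUP BY ID HAVING COUNT(*) > 1) "
    "ORDER BY seq_no"
).fetchall()
print(multi)
# [(234567, 'Jane', 'Doe', 1, 'Yellow'), (234567, 'Jane', 'Doe', 2, 'Red')]
```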
Try something like the following. It's a single tablescan, as opposed to 2 like the others.
SELECT * FROM (
SELECT t1.*, COUNT(*) OVER (PARTITION BY ID) mycount FROM users t1
)
WHERE mycount >1;
INNER JOIN
JOIN:
SELECT u1.ID, u1.fname, u1.lname, u1.seq_no, u1.color, u1.date
FROM users u1 JOIN users u2 ON (u1.ID = u2.ID and u2.seq_no = 2)
WHERE:
SELECT u1.ID, u1.fname, u1.lname, u1.seq_no, u1.color, u1.date
FROM users u1, users u2
WHERE
u1.ID = u2.ID AND
u2.seq_no = 2
Check out the HAVING clause for a summary query. You can specify stuff like
HAVING COUNT(*) >= 2
and so forth.
