Extract values from SQL script output which are not equal to zero - bash

I am basically doing row counts of tables with the same names between 2 different databases.
Our SQL script is something like this:
select (select count(1) from source.abc@remotedb) - (select count(1) from target.bcd) from dual;
We have almost 2000 scripts similar to the one above,
and the output is like the following:
select count(1) from source.abc@remotedb) - (select count(1) from target.abc
----------------------------------------------------------------------------
0
select count(1) from source.opo@remotedb) - (select count(1) from target.opo
----------------------------------------------------------------------------
26
select count(1) from source.asd@remotedb) - (select count(1) from target.asd
----------------------------------------------------------------------------
-95
Now, using bash/shell scripting, I want to print to a separate file only those three-line blocks where the numeric value is NOT equal to 0.
Example:
$ cat final_result.txt
select count(1) from source.opo@remotedb) - (select count(1) from target.opo
----------------------------------------------------------------------------
26
select count(1) from source.asd@remotedb) - (select count(1) from target.asd
----------------------------------------------------------------------------
-95

grep -E -B2 '^-?[1-9][0-9]*$' fileinput > final_result.txt
The pattern is anchored so it only matches lines consisting of a non-zero (possibly negative) integer; -B2 prints the two lines Before each matched line (the query and the dashes). Note that GNU grep inserts a -- separator line between groups.

Maybe something like
egrep -B2 '^-?[1-9][0-9]*$' fileinput > final_result.txt

I would do it like this:
cat result_of_sql | grep -v '^$' | paste - - - | awk -F"\t" '$3!=0{print $1"\n"$2"\n"$3}' > final_result.txt
where
grep -v '^$' gets rid of empty lines,
paste - - - aggregates every 3 lines into 1, tab-separated, and
awk -F"\t" '$3!=0{print $1"\n"$2"\n"$3}' does the awk magic: for every block whose value is not 0, print the three original lines back.
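If the grep/paste plumbing gets fiddly, the same filter is easy to sketch in Python; the three-lines-per-block layout and the file name fileinput are taken from the example above:

```python
def nonzero_blocks(lines):
    """Yield the 3-line blocks (query, dashes, value) whose value is non-zero."""
    rows = [ln for ln in lines if ln.strip()]   # drop blank separator lines
    for i in range(0, len(rows) - 2, 3):        # walk the output 3 lines at a time
        query, dashes, value = rows[i:i + 3]
        if int(value) != 0:                     # -95, 26, ... all parse as int
            yield query
            yield dashes
            yield value

# Usage (file name from the question):
# with open("fileinput") as f:
#     print("\n".join(nonzero_blocks(f.read().splitlines())))
```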

Delete all text after the last matching word

I want to delete everything after the last WHERE condition.
My input is
DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE T1.a=T2.b)
Want the output as
DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE
I have tried it with a sed command:
output=`echo "DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE T1.a=T2.b)" | sed -n -e 's/[Ww][Hh][Ee][Rr][Ee].*//p'`
but I got the output
DELETE FROM abc T1
With sed, using BRE:
sed 's/\(.*WHERE\).*/\1/;s/\(.*where\).*/\1/;' <<< "DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE T1.a=T2.b)"
With GNU sed, using the i (for case insensitive) modifier:
sed 's/\(.*where\).*/\1/i' <<< "DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE T1.a=T2.b)"
or the alternation | operator (no /i needed):
sed -r 's/(.*(where|WHERE)).*/\1/' <<< "DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE T1.a=T2.b)"
output:
DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE
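The reason these work is that .* is greedy: .*WHERE consumes as much as possible, so the match extends to the last WHERE, not the first. A quick sanity check of the same idea in Python's re (its greedy semantics match sed's here):

```python
import re

stmt = "DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE T1.a=T2.b)"

# Greedy (.*where) captures up to the LAST case-insensitive "where";
# the trailing .* throws the rest away.
kept = re.sub(r"(?i)(.*where).*", r"\1", stmt)
print(kept)  # DELETE FROM abc T1 WHERE EXISTS (SELECT 1 FROM cdef T2 WHERE
```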

Oracle - Find Best Match between Two tables

My team and I are curious to determine the best way we can match two different sets of data. There are no keys that can be joined on, as this data is coming from two separate sources that know nothing about each other. We import this data into two Oracle tables, and once that is done we can begin to look for matches.
Both tables contain a full list of properties (as in real estate). We need to match up the properties in Table1 to any potential matching properties found in Table2: for each and every record in Table1, search Table2 for a potential match and determine the probability of that match. My team and I have decided that the best way to do this would be to compare the address fields from each of the two tables.
The one catch is that Table1 provides the address in a parsed format, allocating the address number, address street and even the address type into separate columns, while Table2 only contains one column to hold the address. Each table has City, State and Zip columns that can be compared individually.
For Example - See Below Table1 and Table2:
Notice that the Primary Keys in my pseudo tables below are Key1 and Key2 matching the tables they are in.
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+ + TABLE1 + + + + + +
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| Key1 | Addr_Number | Addr_Street | Addr_Type | City | State | Zip |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
| 1001 | 148 | Panas | Road | Robinson | CA | 76050 |
| 1005 | 110 | 48th | Street | San Juan | NJ | 8691 |
| 1009 | 8571 | Commerce | Loop | Vallejo | UT | 83651 |
| 1059 | 714 | Nettleton | Avenue | Vista | TX | 29671 |
| 1185 | 1587 | Orchard | Drive | Albuquerque | PA | 77338 |
+---------------+---------------+---------------+---------------+---------------+-------+-------+
+---------------+----------------------+---------------+---------------+---------------+
+ + TABLE2 + + + +
+---------------+----------------------+---------------+---------------+---------------+
| Key2 | Address | City | State | Zip |
+---------------+----------------------+---------------+---------------+---------------+
| Ax89f | 148 Panas Road | Robinson | CA | 76050 |
| B184a | 110 48th Street | San Juan | NJ | 08691 |
| B99ff | 8571 Commerce Lp | Vallejo | UT | 83651 |
| D81bc | 714 Nettleton Ave | Vista | TX | 29671 |
| F84a2 | 1587 Orachard Dr | Albuquerqu | PA | 77338 |
+---------------+----------------------+---------------+---------------+---------------+
The goal here is to provide an output to the user that simply displays ALL of the records from Table1 and the highest-matched record found in Table2. There could of course be many records that could be a potential match, but we want to keep this a one-to-one relationship and not produce duplicates in this initial output. The output should just be one record out of Table1 matched to the best find in Table2.
See below an example of the Desired output I am attempting to create:
+--------+-------+----------------+---------------------------+
+ + + Matched_Output + +
+--------+-------+----------------+---------------------------+
| Key1 | Key2 | Percent_Match | num_Matched_Records > 90% |
+--------+-------+----------------+---------------------------+
| 1001 | Ax89f | 100% | 5 | --All Parsed Values Match
| 1005 | B184a | 98% | 4 | --Zip Code prefixed with Zero in Table 2
| 1009 | B99ff | 95% | 3 | --Loop Vs Lp
| 1059 | D81bc | 95% | 2 | --Avenue Vs Ave
| 1185 | F84a2 | 97% | 2 | --City Spelled Wrong in Table 2 and Drive vs Dr
+--------+-------+----------------+---------------------------+
In the output I want to see Key1 from Table1 and the matched record right next to it, showing that it matches the record in Table2 with Key2. Next, we need to know how well these two records match. There could be many records in Table2 that show a probability of matching a record in Table1. In fact, every single record in Table2 can be assigned a percentage, all the way from 0% up to a 100% match.
So now to the main question:
How does one obtain this percentage?
How do I parse the Address column in Table2 so that I can compare each of the individual columns that make up the address in Table1, and then apply a comparison algorithm to each parsed value?
So far this is what my team and I have come up with (brainstorming, spitballing, whatever you want to call it).
We have taken a look at a couple of the built-in Oracle functions to obtain the percentages we are looking for, as well as trying to utilize regular expressions. If I could hit up Google and get some of their search algorithms I would. Obviously I don't have that luxury and must design my own.
regexp_count(table2_city,'(^| )'||REPLACE(table1_city,' ','|')||'($| )') city_score,
regexp_count(table2_city,'(^| )') city_max,
to_char((city_score/city_max)*100, '999G999G999G999G990D00')||'%' city_perc,
The above was just what my team and I used as a proof of concept. We have simply selected these values out of the two tables and run the regexp_count function against those columns. Here are a few other functions that we have taken a look at:
SOUNDEX
REGEXP_LIKE
REGEXP_REPLACE
These functions are great but I'm not sure they can be used in a Single Query between both tables to produce the desired output.
Another idea is that we could create a Function() that takes as its parameters the address fields we want to compare. That function would then search Table2 for the highest probable match and return to the user the Key2 value out of Table2.
Function(Addr_Number, Addr_Street, Addr_type, City, State) RETURN table2.key2
For example maybe something like this 'could' work:
Select tb1.key1, table2Function(tb1.Addr_Number, tb1.Addr_Street, tb1.Addr_type, tb1.City, tb1.State) As Key2
From Table1 tb1;
Lastly, just know that there are roughly 15k records currently in Table1 and 20k records in Table2. Again, each record in Table1 needs to be checked against each record in Table2 for a potential match.
I'm all ears. And thank you in advance for your feedback.
Use the UTL_MATCH package:
Oracle Setup:
CREATE TABLE Table1 ( Key1, Addr_Number, Addr_Street, Addr_Type, City, State, Zip ) AS
SELECT 1001, 148, 'Panas', 'Road', 'Robinson', 'CA', 76050 FROM DUAL UNION ALL
SELECT 1005, 110, '48th', 'Street', 'San Juan', 'NJ', 8691 FROM DUAL UNION ALL
SELECT 1009, 8571, 'Commerce', 'Loop', 'Vallejo', 'UT', 83651 FROM DUAL UNION ALL
SELECT 1059, 714, 'Nettleton', 'Avenue', 'Vista', 'TX', 29671 FROM DUAL UNION ALL
SELECT 1185, 1587, 'Orchard', 'Drive', 'Albuquerque', 'PA', 77338 FROM DUAL;
CREATE TABLE Table2 ( Key2, Address, City, State, Zip ) AS
SELECT 'Ax89f', '148 Panas Road', 'Robinson', 'CA', '76050' FROM DUAL UNION ALL
SELECT 'B184a', '110 48th Street', 'San Juan', 'NJ', '08691' FROM DUAL UNION ALL
SELECT 'B99ff', '8571 Commerce Lp', 'Vallejo', 'UT', '83651' FROM DUAL UNION ALL
SELECT 'D81bc', '714 Nettleton Ave', 'Vista', 'TX', '29671' FROM DUAL UNION ALL
SELECT 'F84a2', '1587 Orachard Dr', 'Albuquerqu', 'PA', '77338' FROM DUAL;
Query:
SELECT Key1,
Key2,
UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type
|| ' ' || A.City || ' ' || A.State || ' ' || A.Zip,
B.Address || ' ' || B.City || ' ' || B.State || ' ' || B.Zip
) AS Percent_Match,
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type,
B.Address
) >= 90
THEN 1
ELSE 0
END
+
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.City, B.City ) >= 90
THEN 1
ELSE 0
END
+
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.State, B.State ) >= 90
THEN 1
ELSE 0
END
+
CASE WHEN UTL_MATCH.EDIT_DISTANCE_SIMILARITY( A.Zip, B.Zip ) >= 90
THEN 1
ELSE 0
END AS Num_Matched
FROM Table1 A
INNER JOIN
Table2 B
ON ( SYS.UTL_MATCH.EDIT_DISTANCE_SIMILARITY(
A.Addr_Number || ' ' || A.Addr_Street || ' ' || A.Addr_Type
|| ' ' || A.City || ' ' || A.State || ' ' || A.Zip,
B.Address || ' ' || B.City || ' ' || B.State || ' ' || B.Zip
) > 80 );
Output:
KEY1 KEY2 PERCENT_MATCH NUM_MATCHED
---------- ----- ------------- -----------
1001 Ax89f 100 4
1005 B184a 97 3
1009 B99ff 95 3
1059 D81bc 92 3
1185 F84a2 88 3
A few thoughts.
First, you may want to take a look at the utl_match package:
https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
Then: you will surely want to match by ZIP code and state first, perhaps adding leading zeros to ZIP codes where needed - although apparently one of your concerns is typos, not just different packaging of the input data. If there are typos in the ZIP code you can more or less deal with that, but if there are typos in the state, that really sucks.
You may want to score the similarity by city, but often that won't help. For example, for all practical purposes Brooklyn, NY should be seen as matching New York City, NY but there's no way you can do that in your project. So I would put a very low weight on matching by city.
A similar comment applies to the address type; perhaps you can create a small table with equivalencies, such as Street, Str, Str. or Lane, Ln, Ln. But the fact is, people are often not consistent when they give you an address; they may say "Clover Street" to one source and "Clover Avenue" to another. So you may be better off comparing only the street number and the street name.
Good luck!
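If it helps to prototype the scoring outside the database first: UTL_MATCH.EDIT_DISTANCE_SIMILARITY is, roughly, 100 minus the Levenshtein edit distance expressed as a percentage of the longer string's length. A small Python sketch of that idea (an approximation of the Oracle function, not its exact rounding):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Rough analogue of UTL_MATCH.EDIT_DISTANCE_SIMILARITY: 0..100."""
    longest = max(len(a), len(b)) or 1
    return round((1 - levenshtein(a, b) / longest) * 100)

# e.g. similarity("8571 Commerce Loop", "8571 Commerce Lp") scores in the high 80s,
# which is why the 'Loop vs Lp' row still clears an 80% join threshold.
```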

How to get latest two rows with certain value by date in SQL [duplicate]

This question already has answers here:
Get top results for each group (in Oracle)
(5 answers)
Closed last year.
My question is that I have a certain table with some varchar2 values and an insert date.
What I want to do is get the latest two such entries, grouped by this varchar2 value.
Is it possible to use some top(2) instead of max in an Oracle group by?
EDIT: Updated to not count duplicate date values for the same varchar2.
Replaced RANK() with DENSE_RANK() so that it assigns consecutive ranks, then used DISTINCT to eliminate the duplicates.
You can use DENSE_RANK()
SELECT DISTINCT TXT, ENTRY_DATE
FROM (SELECT txt,
entry_date,
DENSE_RANK () OVER (PARTITION BY txt ORDER BY entry_date DESC)
AS myRank
FROM tmp_txt) Q1
WHERE Q1.MYRANK < 3
ORDER BY txt, entry_date DESC
Input:
txt | entry_date
xyz | 03/11/2014
xyz | 25/11/2014
abc | 19/11/2014
abc | 04/11/2014
xyz | 20/11/2014
abc | 02/11/2014
abc | 28/11/2014
xyz | 25/11/2014
abc | 28/11/2014
Result:
txt | entry_date
abc | 28/11/2014
abc | 19/11/2014
xyz | 25/11/2014
xyz | 20/11/2014
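The DENSE_RANK/DISTINCT logic is easy to sanity-check outside the database. A Python sketch of "latest two distinct dates per txt" over the sample input (string comparison works here only because all the dates share the same month and year; parse real dates properly):

```python
from collections import defaultdict

rows = [("xyz", "03/11/2014"), ("xyz", "25/11/2014"), ("abc", "19/11/2014"),
        ("abc", "04/11/2014"), ("xyz", "20/11/2014"), ("abc", "02/11/2014"),
        ("abc", "28/11/2014"), ("xyz", "25/11/2014"), ("abc", "28/11/2014")]

def top_two_per_group(rows):
    by_txt = defaultdict(set)      # the set is the DISTINCT: duplicate dates collapse
    for txt, d in rows:
        by_txt[txt].add(d)
    out = []
    for txt in sorted(by_txt):     # ORDER BY txt
        for d in sorted(by_txt[txt], reverse=True)[:2]:  # entry_date DESC, rank < 3
            out.append((txt, d))
    return out

print(top_two_per_group(rows))
# [('abc', '28/11/2014'), ('abc', '19/11/2014'), ('xyz', '25/11/2014'), ('xyz', '20/11/2014')]
```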

vsql/Vertica: how to copy the input text file's name into the destination table

I have to copy an input text file (text_file.txt) to a table (table_a). I also need to include the input file's name in the table.
my code is:
\set t_pwd `pwd`
\set input_file '\'':t_pwd'/text_file.txt\''
copy table_a
( column1
,column2
,column3
,FileName :input_file
)
from :input_file
The last line does not copy the input text file's name into the table.
How can I copy the input text file's name into the table (without manually typing the file name)?
Solution 1
This might not be the perfect solution for your job, but I think it will do the job:
You can get the file name and store it in a TBL variable, then add this variable at the end of each line in the CSV file that you are about to load into Vertica.
Now, depending on your CSV file size, this can be quite time- and CPU-consuming.
export TBL=`ls -1 | grep '\.txt$'`; sed -e 's/$/,'$TBL'/' -i $TBL
Example:
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10
[dbadmin@bih001 ~]$ export TBL=`ls -1 | grep load`; sed -e 's/$/|'$TBL'/' -i $TBL
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10|load_data1
Solution 2
You can use a DEFAULT CONSTRAINT, see example:
1. Create your table with a DEFAULT CONSTRAINT
[dbadmin@bih001 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> create table TBL (id int ,CSV_FILE_NAME varchar(200) default 'TBL');
CREATE TABLE
dbadmin=> \dt
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
See the DEFAULT CONSTRAINT it has the 'TBL' default value
dbadmin=> \d TBL
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+---------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | 'TBL' | f | f |
(2 rows)
2. Now set up your COPY variables, insert some data, and alter the DEFAULT CONSTRAINT value to your current :input_file value.
dbadmin=> \set t_pwd `pwd`
dbadmin=> \set CSV_FILE `ls -1 | grep load*`
dbadmin=> \set input_file '\'':t_pwd'/':CSV_FILE'\''
dbadmin=>
dbadmin=>
dbadmin=> insert into TBL values(1);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+---------------
1 | TBL
(1 row)
dbadmin=> ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
ALTER TABLE
dbadmin=> \dt TBL;
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
dbadmin=> \d TBL;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+----------------------------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | '/home/dbadmin/load_data1' | f | f |
(2 rows)
dbadmin=> insert into TBL values(2);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+--------------------------
1 | TBL
2 | /home/dbadmin/load_data1
(2 rows)
Now you can implement this in your copy script.
Example:
\set t_pwd `pwd`
\set CSV_FILE `ls -1 | grep load*`
\set input_file '\'':t_pwd'/':CSV_FILE'\''
ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
copy TBL from :input_file DELIMITER '|' DIRECT;
Solution 3
Use the LOAD_STREAMS table
Example:
When loading a table give it a stream name - this way you can identify the file name / stream name:
COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';
Here is how you can query your load_streams table:
=> SELECT stream_name, table_name, load_start, accepted_row_count,
rejected_row_count, read_bytes, unsorted_row_count, sorted_row_count,
sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
Makes sense? Hope this helped!
If you do not need to do it purely from inside vsql, it might be possible to cheat a bit and export the logic outside Vertica, in bash for example:
FILE=text_file.txt
(
while read LINE; do
echo "$LINE|$FILE"
done < "$FILE"
) | vsql -c 'copy table_a (...) FROM STDIN'
That way you basically COPY FROM STDIN, adding the filename to each line before it even reaches Vertica.
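The same "stamp each line before it reaches the loader" idea, sketched in Python for when the file is too big for a shell while-read loop (the file name, delimiter and COPY target are the ones assumed in the examples above):

```python
def stamp(lines, filename, delimiter="|"):
    """Append `filename` as an extra last column to every input line."""
    return [line.rstrip("\n") + delimiter + filename for line in lines]

# Hypothetical wiring, mirroring the bash version above:
#   python stamp.py text_file.txt | vsql -c 'copy table_a (...) FROM STDIN'
# where stamp.py prints stamp(open(sys.argv[1]), sys.argv[1]) line by line.
```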

Is there a Hive equivalent of SQL "not like"

While Hive supports positive like queries, e.g.:
select * from table_name where column_name like 'root~%';
Hive does not support negative like queries, e.g.:
select * from table_name where column_name not like 'root~%';
Does anyone know an equivalent solution that Hive does support?
Try this:
Where Not (Col_Name like '%whatever%')
also works with rlike:
Where Not (Col_Name rlike '.*whatever.*')
NOT LIKE has been supported since Hive version 0.8.0; check the JIRA ticket:
https://issues.apache.org/jira/browse/HIVE-1740
In SQL:
select * from table_name where column_name not like '%something%';
In Hive:
select * from table_name where not (column_name like '%something%');
Check out https://cwiki.apache.org/confluence/display/Hive/LanguageManual if you haven't. I reference it all the time when I'm writing queries for hive.
I haven't done anything where I'm trying to match part of a word, but you might check out RLIKE (in this section https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#Relational_Operators)
This is probably a bit of a hack job, but you could do a sub query where you check if it matches the positive value and do a CASE (http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Conditional_Functions) to have a known value for the main query to check against to see if it matches or not.
Another option is to write a UDF which does the checking.
I'm just brainstorming while sitting at home with no access to Hive, so I may be missing something obvious. :)
Hope that helps in some fashion or another. \^_^/
EDIT: Adding in additional method from my comment below.
For your provided example: colName RLIKE '[^r][^o][^o][^t]~\w'. That may not be the optimal regex, but it is something to look into instead of sub-queries.
Using regexp_extract works as well:
select * from table_name where regexp_extract(my_column, ('myword'), 0) = ''
Actually, you can make it like this:
select * from table_name where not column_name like 'root~%';
In Impala you can use != for not like:
columnname != value
As @Sanjiv answered,
Hive does support not like:
0: hive> select * from dwtmp.load_test;
+--------------------+----------------------+
| load_test.item_id | load_test.item_name |
+--------------------+----------------------+
| 18282782 | NW |
| 1929SEGH2 | BSTN |
| 172u8562 | PLA |
| 121232 | JHK |
| 3443453 | AG |
| 198WS238 | AGS |
+--------------------+----------------------+
6 rows selected (0.224 seconds)
0: hive> select * from dwtmp.load_test where item_name like '%ST%';
+--------------------+----------------------+
| load_test.item_id | load_test.item_name |
+--------------------+----------------------+
| 1929SEGH2 | BSTN |
+--------------------+----------------------+
1 row selected (0.271 seconds)
0: hive> select * from dwtmp.load_test where item_name not like '%ST%';
+--------------------+----------------------+
| load_test.item_id | load_test.item_name |
+--------------------+----------------------+
| 18282782 | NW |
| 172u8562 | PLA |
| 121232 | JHK |
| 3443453 | AG |
| 198WS238 | AGS |
+--------------------+----------------------+
5 rows selected (0.247 seconds)
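For the RLIKE-based workarounds above, it can help to see how a LIKE pattern maps onto a regex. A Python sketch of the usual translation (% becomes .*, _ becomes ., everything else is escaped) - an illustration only, not Hive's implementation:

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an anchored regex string."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")       # % matches any run of characters
        elif ch == "_":
            parts.append(".")        # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

def not_like(value, pattern):
    """Emulate `value NOT LIKE pattern`."""
    return re.fullmatch(like_to_regex(pattern), value) is None

print(not_like("BSTN", "%ST%"))  # False: BSTN would match LIKE '%ST%'
print(not_like("PLA", "%ST%"))   # True:  PLA survives the NOT LIKE filter
```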
