Parsing and replacing columns in a file - bash

INPUT:
I have an input file in which the first 10 characters of each line represent two fields: the first 4 characters (Field A) and the next 6 characters (Field B). The file contains about 400K records.
I have a Mapping table which contains about 25M rows and looks like this:
Field A  Field B  SomeStringA  SomeStringB
1628     836791   1234         783901
afgd     ahutwe   1278         ashjkl
...
and so on.
Field A and Field B combined form the primary key for the table.
PROBLEM STATEMENT:
Replace:
Field A by SomeStringA
Field B by SomeStringB
in the input file. SomeStringA and SomeStringB are exactly the same width as Field A and B respectively.
Here's what I'm trying:
Approach 1:
Sort and Dump the mapping table into a file
spool dump_file
select * from mapping order by fieldA, fieldB;
spool off
exit;
Strip the input file and get the first 10 chars
cut -c1-10 input_file > input_file_stripped
Do something to find the spooled lines that begin with the same 10-character string and, when they match, replace the first 10 characters of the input_file line with characters 10-20 of the spooled line. This is where I'm stuck.
Approach 2:
Take the input file and get the first 10 chars
cut -c1-10 input_file > input_file_stripped
Use sqlldr and load into a temp_table.
Select matching records from the mapping table and spool
spool matching_records
select m.* from mapping m, temp t where m.fieldA=t.fieldA and m.fieldB=t.fieldB;
spool off
exit;
Now how do I replace these in the original file ?
Given the high number of records to process, how can this be done, and done fast?
Notes:
This is not a one-time activity; it has to be done daily, so scale is important
The mapping table is unlikely to change
I have Python, shell scripting, and an Oracle database available. Any combination of these is fine.
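For the final replacement step, a dictionary lookup in Python should comfortably handle 400K records a day. Here is a minimal sketch, assuming the mapping (or just the matching subset from Approach 2) has been spooled to a fixed-width file where characters 1-10 hold FieldA+FieldB and characters 11-20 hold SomeStringA+SomeStringB; the file names are placeholders:

# Build a lookup table from the spooled mapping dump.
# Assumes fixed-width rows: key in chars 1-10, replacement in chars 11-20.
mapping = {}
with open("dump_file") as f:
    for line in f:
        mapping[line[0:10]] = line[10:20]

# Rewrite the input file, swapping the first 10 chars where a mapping exists.
with open("input_file") as src, open("output_file", "w") as dst:
    for line in src:
        key = line[0:10]
        dst.write(mapping.get(key, key) + line[10:])

Loading all 25M mapping rows into a dict costs a few gigabytes of RAM; if that is too much, spool only the matching rows as in Approach 2 (at most 400K of them) and build the dict from that much smaller file.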

Related

Suppress leading zeros from an Oracle table extract to a file

I am extracting data from an Oracle table to a text file, and I have the number columns below. When I select these columns into a file, the output keeps all the leading zeros, which I want to suppress.
Select ltrim(col_1,'0'),ltrim(col_2,'0'),ltrim(col_3,'0') from table1
Datatypes:
Col_1 -- NUMBER(10,2)
Col_2 -- NUMBER(38,0)
Col_3 -- NUMBER(15,1)
Current Output:
00000303.44|0| 00000000000008.2
00000000.00|26| 00000000000030.2
00000473.40|0| 00000000000010.0
Expected Output:
303.44|0|8.2
0|26|30.2
473.4|0|10
Please let me know if I need to change the datatype to get the expected output. I even tried TO_CHAR(TRIM(LEADING '0' FROM col_name)) but did not get the expected output.
This is caused by the datatypes set in the last output stage of your DataStage job. When a column is set as a decimal, DataStage will fill the remaining positions with leading zeros up to the size of your decimal field.
The easiest way to get around this is to place a transform prior to the file output stage and convert all the columns to varchar at that last stage, trimming all the leading zeros.
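If the flat file has already been extracted, the same trimming can also be done as a post-process outside the job. A minimal Python sketch (assuming pipe-delimited, non-negative values; file names are placeholders):

def strip_zeros(field):
    # Trim leading zeros, and trailing zeros after a decimal point.
    s = field.strip()
    if "." in s:
        s = s.rstrip("0").rstrip(".")
    return s.lstrip("0") or "0"

with open("extract.txt") as src, open("extract_clean.txt", "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("|")
        dst.write("|".join(strip_zeros(f) for f in fields) + "\n")

On the sample rows above this yields 303.44|0|8.2, 0|26|30.2 and 473.4|0|10, matching the expected output.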
Since the data is apparently not stored as a number but as VARCHAR/VARCHAR2, conversion is required; you can use TO_NUMBER to address this. Using one of your sample values in the case below:
select
    to_number(00000000000008.2)    as num1,
    to_number('00000000000008.2')  as chr1,
    trim(00000000000008.2)         as num2,
    trim('00000000000008.2')       as chr2,
    ltrim(00000000000008.2, '0')   as num3,
    ltrim('00000000000008.2', '0') as char3
from dual;

fast infix search and count on a 40 million row table in postgresql

I'm new to database administration, but I need to create a database view, and the DB admin requires that it run in 5 minutes or less. My database is PostgreSQL 9.1.1 on Red Hat 4.4, 64-bit Linux; I'm unsure about the hardware specifications. One of the tables has 40 million rows. From that table, I have a column of directory paths which I must group by about 20 string patterns, counting the occurrences of each. The patterns require an infix search, as a pattern can appear in the middle or at the end of a path. The patterns also have a priority, as in WHEN %str1% THEN 'str1', WHEN %str2% THEN 'str2', and str1, str2, str3, etc. can occur in the same path, i.e.
path
/usr/myblock/str1/str2
/usr/myblock/something/str2
/usr/myblock/str1/something/str3
What I did so far was to build a table out of CASE statements, then join it back to the original table by LIKE, then SELECT id, pattern, count(pattern). The query runtime was terrible, taking 5 minutes to retrieve from 5.5K rows. My query looks like this:
WITH a AS (
SELECT CASE
WHEN path ~ '^/usr/myblock/(.*)str1(.*)' THEN 'str1'
WHEN path ~ '^/usr/myblock/(.*)str2$' THEN 'str2'
WHEN path ~ '^/usr/myblock/(.*)str3$' THEN 'str3'
.... --multiple other case conditions
WHEN path ~ '^/usr/myblock/' THEN 'others'
ELSE 'n/a'
END as flow
FROM mega_t WHERE left(path,13)='/usr/myblock/' limit 5)
SELECT id, a.flow, count(*) AS flow_count FROM a
JOIN mega_t ON path LIKE '%' || a.flow || '%'
WHERE (some_conditions) AND to_timestamp(test_runs.created_at::double precision)
> ('now'::text::date - '1 mon'::interval) --collect last 1 month's results only
GROUP BY id, a.flow;
My expected output for that simple case would be:
id | flow | flow_count
1 | str1 | 2
2 | str2 | 1
What is a better way to search for substrings like this and count occurrences? I can't use ts_stat, nor 'SELECT count(path) WHERE path LIKE %str1%', because of the if-else priority it needs. I read about creating trigram indexes, but I think that is overkill for my patterns. I hope this question is clear and useful. One more thing I should add: the 40-million-row table is updated every few seconds or minutes, while the view will be accessed every eight hours daily.
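For what it's worth, the first-match-wins priority that the CASE expression encodes is easy to state outside SQL. A small Python illustration of the intended semantics, using the example paths above (end-of-string anchors ignored for brevity):

from collections import Counter

PATTERNS = ["str1", "str2", "str3"]  # priority order: first match wins

def classify(path):
    if not path.startswith("/usr/myblock/"):
        return "n/a"
    for pattern in PATTERNS:
        if pattern in path:
            return pattern
    return "others"

paths = [
    "/usr/myblock/str1/str2",
    "/usr/myblock/something/str2",
    "/usr/myblock/str1/something/str3",
]
print(Counter(classify(p) for p in paths))
# prints Counter({'str1': 2, 'str2': 1}), matching the expected output

Whatever SQL form the view ends up taking, it has to reproduce exactly this short-circuit behaviour.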

concatenate multiple fields in sqlldr

I am working with sqlldr (SQL*Loader) in Oracle 11g.
I am trying to concatenate 3 fields into a single field. Has anyone done this?
Example:
TABLE - "CELLINFO", where the fields are (mobile_no, service, longitude).
The data given is (+9198449844,idea,110,25,50), i.e. (mobile_no, service, grad, min, sec).
While loading the data into the table, I need to concatenate the last 3 fields (grad, min, sec) into the longitude field of the table.
I can't edit the file manually because I have thousands of records to load.
I also tried using ||, + and concat(), but I am not able to make it work.
The ctl file may be:
load data
append
into table cellinfo
fields terminated by ","
(
    mobile_no,
    service,
    grad BOUNDFILLER,
    min BOUNDFILLER,
    sec BOUNDFILLER,
    longitude ":grad || :min || :sec"
)
supposing cellinfo(mobile_no, service, longitude).
There is some nice info on orafaq.
Alternatively, you can modify your input:
awk -F"," '{print $1","$2","$3":"$4":"$5}' inputfile > outputfile
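The same preprocessing can be done in Python if awk is not convenient; a sketch assuming the five comma-separated fields shown above (file names are placeholders):

# Rewrite (mobile_no, service, grad, min, sec) as
# (mobile_no, service, grad:min:sec), mirroring the awk one-liner.
with open("inputfile") as src, open("outputfile", "w") as dst:
    for line in src:
        f = line.rstrip("\n").split(",")
        dst.write(f"{f[0]},{f[1]},{f[2]}:{f[3]}:{f[4]}\n")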

How to split FoxPro records?

I have 60,000 records in a dbf file in FoxPro. I want to split it into three tables of 20,000 records each (20,000 * 3 = 60,000).
How can I achieve this?
I am new to FoxPro. I am using Visual FoxPro 5.0.
Thanks in advance.
You must issue a SKIP command when using the COPY command to make sure you are starting on the next record.
USE MyTable
GO TOP
COPY TO NewTable1 NEXT 20000
SKIP 1
COPY TO NewTable2 NEXT 20000
SKIP 1
COPY TO NewTable3 NEXT 20000
Todd's suggestion will work if you don't care how the records are split. If you want to divide them up based on their content, you'll want to do something like Stuart's first suggestion, though his exact answer will only work if the IDs for the records run from 1 to 60,000 in order.
What's the ultimate goal here? Why divide the table up?
Tamar
You can directly select from the first table:
SELECT * FROM MyBigTable INTO TABLE SmallTable1 WHERE ID < 20000
SELECT * FROM MyBigTable INTO TABLE SmallTable2 WHERE BETWEEN(ID, 20000, 39999)
SELECT * FROM MyBigTable INTO TABLE SmallTable3 WHERE ID > 39999
If you want more control, though, or you need to manipulate the data, you can use xBase code, something like this:
SELECT MyBigTable
scan
    scatter name oRecord memo
    do case
        case oRecord.Id < 20000
            select SmallTable1
        case oRecord.Id < 40000
            select SmallTable2
        otherwise
            select SmallTable3
    endcase
    append blank
    gather name oRecord memo
endscan
It's been a while since I used VFP and I don't have it here, so apologies for any syntax errors.
use in 0 YourTable
select YourTable
go top
copy to NewTable1 next 20000
skip
copy to NewTable2 next 20000
skip
copy to NewTable3 next 20000
If you wanted to split based on record numbers, try this:
SELECT * FROM table INTO TABLE tbl1 WHERE RECNO() <= 20000
SELECT * FROM table INTO TABLE tbl2 WHERE BETWEEN(RECNO(), 20001, 40000)
SELECT * FROM table INTO TABLE tbl3 WHERE RECNO() > 40000

Oracle - dynamic column name in select statement

Question:
Is it possible to have a column name in a select statement changed based on a value in its result set?
For example, if a year value in a result set is less than 1950, name the column OldYear, otherwise name the column NewYear. The year value in the result set is guaranteed to be the same for all records.
I'm thinking this is impossible, but here is my failed attempt to test the idea:
select 1 as
(case
when 2 = 1 then "name1";
when 1 = 1 then "name2")
from dual;
You can't vary a column name per row of a result set. This is basic to relational databases. The names of columns are part of the table "header" and a name applies to the column under it for all rows.
Re comment: OK, maybe the OP Americus means that the result is known to be exactly one row. But regardless, SQL has no syntax to support a dynamic column alias. Column aliases must be constant in a query.
Even dynamic SQL doesn't help, because you'd have to run the query twice. Once to get the value, and a second time to re-run the query with a different column alias.
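To make the two-pass idea concrete, here is a sketch in Python; cx_Oracle, the credentials, and the table name are illustrative assumptions, not part of the question:

import cx_Oracle  # assumed driver; any DB-API connection works the same way

conn = cx_Oracle.connect("user/password@db")  # placeholder credentials
cur = conn.cursor()

# Pass 1: fetch the value that decides the column name.
cur.execute("SELECT MIN(year) FROM some_table_with_years")
(year,) = cur.fetchone()

# Pass 2: re-run with the alias chosen in code. Interpolating the alias
# is safe here because it is one of two fixed strings, never user input.
alias = "OldYear" if year < 1950 else "NewYear"
cur.execute(f'SELECT year AS "{alias}" FROM some_table_with_years')
rows = cur.fetchall()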
The "correct" way to do this in SQL is to have both columns, and have the column that is inappropriate be NULL, such as:
SELECT
CASE WHEN year < 1950 THEN year ELSE NULL END AS OldYear,
CASE WHEN year >= 1950 THEN year ELSE NULL END AS NewYear
FROM some_table_with_years;
There is no good reason to change the column name dynamically - it's analogous to the name of a variable in procedural code - it's just a label that you might refer to later in your code, so you don't want it to change at runtime.
I'm guessing what you're really after is a way to format the output (e.g. for printing in a report) differently depending on the data. In that case I would generate the heading text as a separate column in the query, e.g.:
SELECT 1 AS mydata
,case
when 2 = 1 then 'name1'
when 1 = 1 then 'name2'
end AS myheader
FROM dual;
Then the calling procedure would take the values returned for mydata and myheader and format them for output as required.
You will need something similar to this:
select 'select ' || CASE WHEN YEAR < 1950 THEN 'OLDYEAR' ELSE 'NEWYEAR' END || ' FROM TABLE1' from TABLE_WITH_DATA
This solution requires that you launch SQL*Plus and a .sql file from a .bat file, or use some other method with the appropriate Oracle credentials. The .bat file can be kicked off manually, from a server scheduled task, a Control-M job, etc.
Output is a .csv file. This also requires that you replace all commas in the output with some other character, or risk column/data mismatches in the output.
The trick is that your column headers and data are selected in two different SELECT statements.
It isn't perfect, but it does work, and it's the closest to standard Oracle SQL that I've found for a dynamic column header outside of a development environment. We use this extensively to generate recurring daily/weekly/monthly reports for users without resorting to a GUI. Output is saved to a shared network drive directory/SharePoint.
REM BEGIN runExtract1.bat file -----------------------------------------
sqlplus username/password@database @C:\DailyExtracts\Extract1.sql > C:\DailyExtracts\Extract1.log
exit
REM END runExtract1.bat file -------------------------------------------
REM BEGIN Extract1.sql file --------------------------------------------
set colsep ,
set pagesize 0
set trimspool on
set linesize 4000
column dt new_val X
select to_char(sysdate,'MON-YYYY') dt from dual;
spool c:\DailyExtracts\&X._Extract1.csv
select '&X-Project_id', 'datacolumn2-Project_Name', 'datacolumn3-Plant_id' from dual;
select
PROJ_ID
||','||
replace(PROJ_NAME,',',';')-- "Project Name"
||','||
PLANT_ID-- "Plant ID"
from PROJECTS
where ADDED_DATE >= TO_DATE('01-'||(select to_char(sysdate,'MON-YYYY') from dual));
spool off
exit
/
REM ------------------------------------------------------------------
CSV OUTPUT (opened in Excel and copy/pasted):
old 1: select '&X-Project_id' 'datacolumn2-Project_Name' 'datacolumn3-Plant_id' from dual
new 1: select 'MAR-2018-Project_id' 'datacolumn2-Project_Name' 'datacolumn3-Plant_id' from dual
MAR-2018-Project_id datacolumn2-Project_Name datacolumn3-Plant_id
31415 name1 1007
31415 name1 2032
32123 name2 3302
32123 name2 3384
32963 name3 2530
33629 name4 1161
34180 name5 1173
34180 name5 1205
...
...
etc...
135 rows selected.
