CSV - How to add columns based on an existing column? - bash

What is the best way to do this and how?
I gather things called sed, AWK and bash may be relevant.
I have used AWK once for one command, the others never.
I have searched and other apparently similar questions do not have an answer I need.
I have columns I have called fields in a CSV file:
_________________________
field1 | field2 | field3|
-------------------------
1990AB | 123456 | 123456|
-------------------------
I want to add fields based on these three original fields to appear as follows:
_______________________________________________________
field1 | field2 | field3 | field1a | field2a | field3a |
-------------------------------------------------------
1990AB | 123456 | 123456| 1990 | 12345 | 12345 |
-------------------------------------------------------
where:
field1a 1990 column 1 first 4 always digits then alpha
field2a 12345 column 2 is always 6 digits
field3a 12345 column 3 is always 6 digits
These are one-time-per-file actions, prior to database import.
macosx has about 6 million records. 2nd attempt at this question as my first was apparently not good. In this area I am a 100% novice.

awk to the rescue!
this should be easy to read even if you have no prior experience with awk
$ awk -F, -v OFS=, 'NR==1 {for(i=1;i<=3;i++) $(++NF)=$i"a"}
NR>1 {$(++NF)=substr($1,1,4);
$(++NF)=substr($2,1,5);
$(++NF)=substr($3,1,5)}1' file
NR is line number, special treatment for header, NF is number of fields, here incrementing for each additional column and $i is field value at position i. The last 1 is shorthand for printing the line. Initial options are for setting input field delimiter (F) and output field delimiter (OFS) to comma.

Related

How to put pivot table using Shell script

I have data in a CSV file as below...
Emailid Storeid
a#gmail.com 2000
b#gmail.com 2001
c#gmail.com 2000
d#gmail.com 2000
e#gmail.com 2001
I am expecting below output, basically finding out how many email ids are there for each store.
StoreID Emailcount
2000 3
2001 2
So far i tried to solve my issue
IFS=","
while read f1 f2
do
awk -F, '{ A[$1]+=$2 } END { OFS=","; for (x in A) print x,A[x]; }' > /home/ec2-user/storewiseemials.csv
done < temp4.csv
With the above shell script i am not getting desired output, Can you guys please help me?
Using miller (https://github.com/johnkerl/miller) and starting from this (I have used a CSV, because I do not know if you use a tab or a white space as separator)
Emailid,Storeid
a#gmail.com,2000
b#gmail.com,2001
c#gmail.com,2000
d#gmail.com,2000
e#gmail.com,2001
and running
mlr --csv count-distinct -f Storeid -o Emailcount input >output
you will have
+---------+------------+
| Storeid | Emailcount |
+---------+------------+
| 2000 | 3 |
| 2001 | 2 |
+---------+------------+

Powerquery - appending the same table to itself using differing columns

So I have a list of properties and a list of the next four servicing dates
e.g:
Property| Last | Next1 | Next2 | Next3 | Next4 |
123 Road| 01-2019 |03-2019| 05-2019| 07-2019| 09-2019|
444 Str | 01-2019 |07-2019| 01-2020| 07-2020| 01-2021|
etc.
I want to see:
Property | Date
123 Road | 01-2019
444 Str | 01-2019
123 Road | 03-2019
123 Road | 05-2019
123 Road | 07-2019
444 Str | 07-2019
etc.
In SQL this would be a union join, in powerquery. I think it's an append, but I'm not sure how to go about it. i.e. how to select columns from a table, then append a table with a different selection. I can append the full table easily, but not certain columns.
Select the date columns and do Transform > Unpivot Columns.
Then you can rename the Value column to Date, remove the Attribute column if you want, and sort as desired.

retrieve and add two numbers of files

In my file I have following structure :-
A | 12 | 10
B | 90 | 112
C | 54 | 34
What I have to do is I have to add column 2 and column 3 and print the result with column 1.
output:-
A | 22
B | 202
C | 88
I retrieve the two columns but dont know how to add
What I did is :-
cut -d ' | ' -f3,5 myfile.txt
How to add those columns and display.
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
echo $f1 "|" $((f2+f3))
done < file
You can do this easily with awk.
awk '{print $1," | ",($3+$5)'} myfile.txt
wil work perhaps.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS =/= OFS in this case due to formatting
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.

Bash and awk: converting a field from 12 hour to 24 hour clock time

I have a large txt file space delimited which I split into 18 smaller files (each with their own number of columns). This split is based on a delimiter i.e. whenever the timestamp hits midnight. So effectively, I'll end up with a 18 files in the form of (note, ignore the dashes and pipes, I've used them to improve the readability):
file1
time ----------- valueA - valueB
12:00:00 AM | 54.13 | 239.12
12:00:01 AM | 51.83 | 119.93
..
file18
time ---------- valueA - valueB - valueC - valueD
12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12
12:00:01 AM | 23.92 | 121.92 | 201.23 | 892.12
..
Once I split the file I then perform some processing on each of the files using AWK so in short there's 2 stages the 'split stage' and the 'processing stage'.
Unfortunately, the timestamp contained in the large txt file is in 1 of 2 formats. Either the desirable 24 hour format of "00:00:01" or the undesirable 12 hour format of "12:00:01 AM".
As a result, I'm trying to convert all formats to be 24 hours and I'm not sure how to do this. I'm also not sure whether to attempt this at the split stage using bash or at the process stage using AWK. I know that the following function converts 12 hour to 24 hr
'date --date="12:00:01 AM" +%T'
however, I'm not sure how to incorporate this into my shell script were I'm using 'while read line' at the 'split stage' or whether I should do the time conversion in AWK (if possible?) at the 'processing stage'.
see the test below, is it helpful for you?
kent$ echo "12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12 "\
|awk -F'|' 'BEGIN{OFS="|"}{("date --date=\""$1"\" +%T") |getline $1;print }'
output
00:00:00| 54.92 | 239.12 | 231.23 | 882.12

Oracle SQL Loader split data into different tables

I have a Data file that looks like this:
1 2 3 4 5 6
FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980
FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980
FirstName3 | LastName3 | 2344327 | Address1 | PhoneNumber1 | 1/1/1980
FirstName4 | LastName4 | 5998943 | Address1 | PhoneNumber1 | 1/1/1980
FirstName5 | LastName5 | 9854531 | Address1 | PhoneNumber1 | 1/1/1980
My DB has 2 Tables, one for PERSON and one for ADDRESS, so I need to store columns 1,2,3 and 6 in PERSON and column 4 and 5 in ADDRESS. All examples provided in the SQL Loader documentation address this case but only for fixed size columns, and my data file is pipe delimited (and spiting this into 2 different data files is not an option).
Do someone knows how to do this?
As always help will be deeply appreciated.
Another option may be to set up the file as an external table and then run inserts selecting the columns you want from the external table.
options(skip=1)
load data
infile "csv file path"
insert into table person
fields terminated by ','
optionally enclosed by '"'
trialling nullcols(1,2,3,6)
insert into table address
fields terminated by ','
optionally enclosed by '"'
trialling nullcols(4,5)
Even if SQLLoader doesn't support this (I'm not sure) nothing stops you from pre-processing it with say awk and then loading. For example:
cat 1.dat | awk -F '|' '{print $1 $2 $3 $6}' > person.dat
cat 1.dat | awk -F '|' '{print $4 $5}' > address.dat

Resources