Split csv file by multiple columns values and keep header - bash

I'm trying to split a big TSV file into smaller parts depending on column values, but I need to keep the header in every file created by the split. How can I do this?
I've tried some solutions, but they solve my problem only for particular files:
awk -F'\t' 'NR==1 {h=$0};NR>1{print ((!a[$5]++ && !a[$9]++ && !a[$10]++)? h ORS $0 : $0) > "file_first-" $5 "_second-" $9 "_third-" $10 ".tsv"}' file.tsv
I expect the header in each file, but for now it appears only in files where $5, $9 and $10 follow a pattern such as 1 1 1, 2 2 2, ..., and not for the other permutations.

You probably want to have the following per-line logic:
calculate output_file
If !header_sent[output_file]
Print Header to output_file
set header_sent[output_file]
EndIf
print current line to output_file
An implementation in AWK is below. It can be converted into a one-liner by removing the comments and shortening the variable names. Run it with -F'\t' so that the fields are split on tabs.
NR == 1 { header=$0 }
NR > 1 {
output_file = "file_first-" $5 "_second-" $9 "_third-" $10 ".tsv"
# Send header, if not sent to this file yet.
if (!header_sent[output_file] ) {
print header > output_file
header_sent[output_file] = 1
}
# Print the current line
print $0 > output_file
}
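The per-line logic above can be tried end to end as follows. This is a minimal sketch under stated assumptions: the sample file demo.tsv, its column names and the outdir variable are made up for illustration. It also closes each output file after writing (a variation that is not part of the answer above), which avoids running into the open-file limit when the split produces many files, at the cost of some speed.

```shell
# Hypothetical sample data: a header plus three rows with 10 tab-separated columns.
workdir=$(mktemp -d)
printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\n' >  "$workdir/demo.tsv"
printf 'a\tb\tc\td\t1\tf\tg\th\t2\t3\n'            >> "$workdir/demo.tsv"
printf 'a\tb\tc\td\t2\tf\tg\th\t1\t3\n'            >> "$workdir/demo.tsv"
printf 'x\ty\tz\tw\t1\tf\tg\th\t2\t3\n'            >> "$workdir/demo.tsv"

awk -F'\t' -v outdir="$workdir" '
NR == 1 { header = $0; next }
{
    output_file = outdir "/file_first-" $5 "_second-" $9 "_third-" $10 ".tsv"
    # First row for this file name: create the file and write the header.
    if (!header_sent[output_file]++) {
        print header > output_file
        close(output_file)
    }
    # Append the row, then close so only one output file is open at a time.
    print $0 >> output_file
    close(output_file)
}' "$workdir/demo.tsv"

ls "$workdir"
```

With this sample, the rows with (1, 2, 3) in columns 5, 9 and 10 end up in one file and the row with (2, 1, 3) in another, each starting with the header line.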

Related

Add location to duplicate names in a CSV file using Bash

Using Bash create user logins. Add the location if the name is duplicated. Location should be added to the original name, as well as to the duplicates.
id,location,name,login
1,KP,Lacie,
2,US,Pamella,
3,CY,Korrie,
4,NI,Korrie,
5,BT,Queenie,
6,AW,Donnie,
7,GP,Pamella,
8,KP,Pamella,
9,LC,Pamella,
10,GM,Ericka,
The result should look like this:
id,location,name,login
1,KP,Lacie,lacie@mail.com
2,US,Pamella,uspamella@mail.com
3,CY,Korrie,cykorrie@mail.com
4,NI,Korrie,nikorrie@mail.com
5,BT,Queenie,queenie@mail.com
6,AW,Donnie,donnie@mail.com
7,GP,Pamella,gppamella@mail.com
8,KP,Pamella,kppamella@mail.com
9,LC,Pamella,lcpamella@mail.com
10,GM,Ericka,ericka@mail.com
I used AWK to process the csv file.
cat data.csv | awk 'BEGIN {FS=OFS=","};
NR > 1 {
split($3, name)
$4 = tolower($3)
split($4, login)
for (k in login) {
!a[login[k]]++ ? sub(login[k], login[k]"@mail.com", $4) : sub(login[k], tolower($2)login[k]"@mail.com", $4)
}
}; 1' > data_new.csv
The script adds location values only to further duplicates.
id,location,name,login
1,KP,Lacie,lacie@mail.com
2,US,Pamella,pamella@mail.com
3,CY,Korrie,korrie@mail.com
4,NI,Korrie,nikorrie@mail.com
5,BT,Queenie,queenie@mail.com
6,AW,Donnie,donnie@mail.com
7,GP,Pamella,gppamella@mail.com
8,KP,Pamella,kppamella@mail.com
9,LC,Pamella,lcpamella@mail.com
10,GM,Ericka,ericka@mail.com
How do I add location to the initial one?
A common solution is to have Awk process the same file twice if you need to know whether there are duplicates down the line.
Notice also that this requires you to avoid the useless use of cat.
awk 'BEGIN {FS=OFS=","};
NR == FNR { ++seen[$3]; next }
FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "@mail.com" }
1' data.csv data.csv >data_new.csv
NR==FNR is true when you read the file the first time. We simply count the number of occurrences of $3 in seen for the second pass.
Then in the second pass, we can just look at the current entry in seen to figure out whether or not we need to add the prefix.
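If the data arrives on a pipe and cannot be read twice, a single-pass variant is possible by buffering the rows in memory and emitting them in END. The following is only a sketch of that alternative, not the approach of the answer above; the array names row and seen and the shortened sample input are assumptions for illustration.

```shell
out=$(printf '%s\n' 'id,location,name,login' '1,KP,Lacie,' '2,US,Pamella,' '3,GP,Pamella,' |
awk 'BEGIN { FS = OFS = "," }
NR == 1 { print; next }                 # pass the header through unchanged
{ row[NR] = $0; ++seen[$3] }            # buffer each data row and count its name
END {
    for (i = 2; i <= NR; i++) {
        split(row[i], f, ",")           # f[2] = location, f[3] = name
        login = (seen[f[3]] > 1 ? tolower(f[2]) : "") tolower(f[3]) "@mail.com"
        print f[1], f[2], f[3], login
    }
}')
printf '%s\n' "$out"
```

The trade-off is memory: the whole file is held in the row array, whereas reading the file twice keeps only the per-name counts.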

How to assign awk result variable to an array and is it possible to use awk inside another awk in loop

I've started to learn bash and am totally stuck on this task. I have a comma-separated CSV file with records like:
id,location_id,organization_id,service_id,name,title,email,department
1,1,,,Name surname,department1 department2 department3,,
2,1,,,name Surname,department1,,
3,2,,,Name Surname,"department1 department2, department3",, e.t.c.
I need to format it this way: name and surname must start with a capital letter
add an email record that consists of the first letter of the name and full surname in lowercase
create a new csv with records from the old csv with corrected fields.
I split the CSV into fields using awk (because some fields contain a comma between quotes, e.g. "department1 department2, department3").
#!/bin/bash
input="$HOME/test.csv"
exec 0<$input
while read line; do
awk -v FPAT='"[^"]*"|[^,]*' '{
...
}' $input)
done
inside awk {...} (NF=8 for each record), I tried to use certain field values ($1 $2 $3 $4 $5 $6 $7 $8):
#it doesn't work
IFS=' ' read -a name_surname<<<$5 # Field 5 match to *name* in heading of csv
# Could I use inner awk with field values of outer awk ($5) to separate the field value of outer awk $5 ?
# as an example:
# $5="${awk '{${1^}${2^}}' $5}"
# where ${1^} and ${2^} fields of inner awk
name_surname[0]=${name_surname[0]^}
name_surname[1]=${name_surname[1]^}
$5="${name_surname[0]}' '${name_surname[1]}"
email_name=${name_surname[0]:0:1}
email_surname=${name_surname[1]}
domain='@domain'
$7="${email_name,}${email_surname,,}$domain" # match to field 7 *email* in heading of csv
How can I add the field values ($1 $2 $3 $4 $5 $6 $7 $8) to an array and call the function join on each loop iteration to add a record to the new CSV file?
function join { local IFS="$1"; shift; echo "$*"; }
result=$(join , ${arr[@]})
echo $result >> new.csv
This may be what you're trying to do (using gawk for FPAT as you already were doing) but without more representative sample input and the expected output it's a guess:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN {
OFS = ","
FPAT = "[^"OFS"]*|\"[^\"]*\""
}
NR > 1 {
n = split($5,name,/\s*/)
$7 = tolower(substr(name[1],1,1) name[n]) "@example.com"
print
}
' "${@:--}"
$ ./tst.sh test.csv
1,1,,,Name surname,department1 department2 department3,nsurname@example.com,
2,1,,,name Surname,department1,nsurname@example.com,
3,2,,,Name Surname,"department1 department2, department3",nsurname@example.com,
I put the awk script inside a shell script since that looks like what you want. You don't need to do that, though: you could just save the awk script in a file and invoke it with awk -f.
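The question also asks for the name and surname to start with a capital letter, which the script above does not do. One way to sketch that step in plain awk is a small helper function; the name capitalize is hypothetical and not part of the answer, and the two input lines are made-up samples.

```shell
out=$(printf '%s\n' 'name surname' 'Name SURNAME' |
awk '
# Hypothetical helper: upper-case the first letter of a word, lower-case the rest.
function capitalize(word) {
    return toupper(substr(word, 1, 1)) tolower(substr(word, 2))
}
{
    n = split($0, part, " ")            # " " splits on runs of whitespace
    result = ""
    for (i = 1; i <= n; i++)
        result = result (i > 1 ? " " : "") capitalize(part[i])
    print result
}')
printf '%s\n' "$out"
```

In the script above, the same loop could presumably be applied to $5 before $7 is built, so that the corrected name is what gets printed.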
Ed Morton's answer works completely.
In case it helps someone, I added one more check: if the CSV file produces more than one email address with the same local part, an index number is appended to the local part. The output is sent to a file.
#!/usr/bin/env bash
input="$HOME/test.csv"
exec 0<$input
awk '
BEGIN {
OFS = ","
FPAT = "[^"OFS"]*|\"[^\"]*\""
}
(NR == 1) {print} #header of csv
(NR > 1) {
if (length($0) > 1) { #exclude empty lines
count = 0
n = split($5,name,/\s*/)
email_local_part = tolower(substr(name[1],1,1) name[n])
#array stores emails from csv file
a[i++] = email_local_part
#find amount of occurrences of the same email address
for (el in a) {
# compare whole strings; a prefix match would also count e.g. "jsmithson" for "jsmith"
if (a[el] == email_local_part) { count++ }
}
#add number of occurrence to email address
if (count == 1) { $7 = email_local_part "@abc.com" }
else { --count; $7 = email_local_part count "@abc.com" }
print
}
}
' "${@:--}" > new.csv
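The loop over every previously stored address makes the work grow with the square of the number of records. A possible alternative, sketched here and not taken from the script above, keeps a per-address counter in an associative array; the array name seen_count and the three sample names are assumptions for illustration.

```shell
out=$(printf '%s\n' 'Name Surname' 'Nina Surname' 'Bob Stone' |
awk '
{
    n = split($0, name, " ")
    local_part = tolower(substr(name[1], 1, 1) name[n])
    # k is the number of times this local part occurred before this record.
    k = seen_count[local_part]++
    print local_part (k ? k : "") "@abc.com"
}')
printf '%s\n' "$out"
```

As in the script above, the first occurrence gets no number and the second gets 1, but each record now costs a single array lookup.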

Editing text in Bash

I am trying to edit text in Bash. I got to a point where I can no longer continue, and I need help.
The text I need to edit:
Symbol Name Sector Market Cap, $K Last Links
AAPL
Apple Inc
Computers and Technology
2,006,722,560
118.03
AMGN
Amgen Inc
Medical
132,594,808
227.76
AXP
American Express Company
Finance
91,986,280
114.24
BA
Boeing Company
Aerospace
114,768,960
203.30
The text I need:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
I already tried :
sed 's/$/,/' BIPSukol.txt > BIPSukol1.txt | awk 'NR==1{print}' BIPSukol1.txt | awk '(NR-1)%5{printf "%s ", $0;next;}1' BIPSukol1.txt | sed 's/.$//'
But it doesn't quite do the job.
(BIPSukol1.txt is the name of the file I am editing)
The biggest problem is that you do not have consistent delimiters between your fields. Some have commas, some don't, and some are just a combination of three fields that happen to run together.
The tool you want is awk. It lets you treat the first line differently and then shape the output that follows with counters you keep within the script. In awk you write rules (what comes between the outer {...}), and awk applies your rules in the order they are written. This allows you to fix up your haphazard format and arrive at the desired output.
The first rule, FNR==1, applies to the first line. It loops over the fields, finds the problematic "Market Cap, $K" heading and treats it as one field, skipping beyond it to output the remaining headings. It stores n = NF - 3 (the header has 8 whitespace-separated words, and you have 5 lines of data for each Symbol), sets count to zero, and skips to the next record.
When count==n, the next rule is triggered. It outputs the records stored in the a[] array, zeros count, and deletes the a[] array for refilling.
The next rule applies to every record (line) of input from the second on. It normalizes the whitespace in the record by forcing awk to rebuild it with $1 = $1 (leading and trailing whitespace is dropped and runs of whitespace become a single space), then stores the record in the array, incrementing count.
The last rule, END, is a special rule that runs after all records are processed (it lets you sum final tallies or output final lines of data). Here it is used to output the records that remain in a[] when the end of the file is reached.
Putting it all together in another cut at awk:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
for (i=1;i<=n;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
delete a
count = 0
}
{
$1 = $1
a[++count] = $0
}
END {
for (i=1;i<=count;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
}
' file
Example Use/Output
Note: you can simply select-copy the script above and then middle-mouse-paste it into an xterm with the directory set so it contains file (you will need to rename file to whatever your input filename is)
$ awk '
> FNR==1 {
> for (i=1;i<=NF;i++)
> if ($i == "Market") {
> printf ",Market Cap $K"
> i = i + 2
> }
> else
> printf (i>1?",%s":"%s"), $i
> print ""
> n = NF-3
> count = 0
> next
> }
> count==n {
> for (i=1;i<=n;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> delete a
> count = 0
> }
> {
> $1 = $1
> a[++count] = $0
> }
> END {
> for (i=1;i<=count;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> }
> ' file
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
(note: it is unclear why you want the "Links" heading included since there is no information for that field -- but that is how your desired output is specified)
More Efficient No Array
You always have afterthoughts that creep in after you post an answer, no different than remembering a better way to answer a question as you are walking out of an exam, or thinking about the one additional question you wished you would have asked after you excuse a witness or rest your case at trial. (there was some song that captured it -- a little bit ironic :)
The following does essentially the same thing, but without using arrays. Instead of buffering the information in an array and outputting it all at once, it simply outputs each piece as soon as it is formatted. It was one of those afterthoughts:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
print ""
count = 0
}
{
$1 = $1
printf (++count>1?",%s":"%s"), $0
}
END { print "" }
' file
(same output)
With your shown samples, please try the following (written and tested in GNU awk). It assumes (judging by the OP's attempts) that after the header of Input_file you want to join every 5 lines into a single line.
awk '
BEGIN{
OFS=","
}
FNR==1{
NF--
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
Or, if your awk doesn't support NF--, try the following.
awk '
BEGIN{
OFS=","
}
FNR==1{
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +Links( +)?$/,"",lastPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
NOTE: Your header (first line) needs special handling because we can't simply turn every space into a comma, so this solution takes care of it as per the shown samples.
With GNU awk, if your first line is always the same:
echo 'Symbol,Name,Sector,Market Cap $K,Last,Links'
awk 'NR>1 && NF=5' RS='\n ' ORS='\n' FS='\n' OFS=',' file
Output:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Find which row having less columns using awk

I have a file where 4 fields are expected in each row. If a row has fewer fields, I want to write that information to a log file together with the row number.
Filed1line1| Filed2line1| Filed3line1| Filed4line1
Filed1line2| Filed2line2|
Filed1line3| Filed2line3| Filed3line3| Filed4line3
Something like: Row number 2 is having 3 fields for file a.txt
Can we achieve this using awk?
Actually, I am using the code snippet below. If the number of fields is not 4, I write the row to a bad file. That works well, but I am unable to write the NR value to the log.
awk -F'|' -v DGFNM="$IN_DIR$DGFNAME" -v DBFNM="$IN_DIR$DBFNAME" '
$1 == "DTL" {
if (NF == 4) {
print substr($0, 5) > DGFNM
} else {
print > DBFNM
print NR >> $logfile
}
}
' "$IN_DIR$IN_FILE"
Easy: NF is the number of fields in the record and NR is the record number.
Something like: awk -F'|' '{ if (NF < 4) { print "Row " NR " has " NF " fields"; } }' - there are shorter ways, but I prefer longer code that is easier to read ;-)
See this question for some info on printing to different output files: is it possible to print different lines to different output files using awk
To answer your edited question: $logfile is inside the single quotes, so the shell does not expand it to your shell variable logfile. It is not an awk variable either: awk reads it as a field reference on the unset variable logfile, which is $0. Try print NR >> "some_file" in the awk script, and then rename some_file to $logfile later.
Another option would be to generate the awk file with the expanded $logfile already in place instead of trying to do it inline.
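A further option is to pass the shell variable into awk with -v, the same mechanism the question's command already uses for DGFNM and DBFNM. A minimal sketch, assuming a hypothetical awk variable name LOGFNM and a made-up three-line sample file a.txt:

```shell
workdir=$(mktemp -d)
logfile="$workdir/bad_rows.log"
printf '%s\n' 'F1| F2| F3| F4' 'F1| F2|' 'F1| F2| F3| F4' > "$workdir/a.txt"

awk -F'|' -v LOGFNM="$logfile" '
# Any row that does not have exactly 4 fields is reported with its row number.
NF != 4 {
    print "Row number " NR " is having " NF " fields for file " FILENAME >> LOGFNM
}
' "$workdir/a.txt"

cat "$logfile"
```

Here the second row ends with a trailing |, so awk counts an empty third field and reports 3 fields for row 2.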

Unix Script to add header in awk script resulting in header on every other line

I am trying to add a header to a split file but with this code the header is appearing every other line:
awk -F, '{print "eid,devicetype,meterid,lat,lng" > $7"-"$6".csv"}{print $1",", $2",", $3",", $4",", $5"," >> $7"-"$6".csv"}' path/filename
The awk code by itself works, but I need to add a header to each file. The script splits the file based on the values in columns 6 and 7 and names each output file with those values. It then drops columns 6 and 7 and puts only columns 1 - 5 in the output file. This is on Unix in a shell script run from PowerCenter.
I am sure it is probably a simple fix for others more experienced.
awk '
BEGIN { FS=OFS="," }
{ fname = $7 "-" $6 ".csv" }
!seen[fname]++ { print "eid", "devicetype", "meterid", "lat", "lng" > fname}
{ print $1, $2, $3, $4, $5 > fname }
' path/filename
You can use:
awk -F, '!a[$7,$6]++{print "eid,devicetype,meterid,lat,lng" > $7 "-" $6 ".csv"}
{print $1,$2,$3,$4,$5 > $7 "-" $6 ".csv"}' OFS=, /path/filename.csv
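!a[$7,$6]++ is true only the first time a given combination of $7 and $6 is seen, which makes sure that the header is printed once, before the first record of each output file.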
