Awk read two files (first may be empty) - bash

I have the following script:
awk '
# Write 1st file into array
NR == FNR {
array[NR] = $0;
next;
}
# Process 2nd file
{
...
} ' file1 file2
What I want is to store the 1st file in an array and later use this array while processing the 2nd file. The first file may be empty; my problem is that when awk reads an empty file, it does not execute any user-level awk program code for it and skips to the second file. While awk is reading the 2nd file, NR == FNR is then still true, and the awk program writes the 2nd file into the array.
How can I avoid this, so that only the first file is put into the array, if it exists?

Use this condition to safeguard against the empty-file scenario:
ARGV[1]==FILENAME && FNR==NR {
array[NR] = $0
next
}
ARGV[1] is set to the first filename on the awk command line, and FILENAME holds the name of the file currently being processed.
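A minimal sketch of the complete script with the guard in place (file1 and file2 stand for whatever names you actually pass):
awk '
# Store the 1st file in an array; the ARGV[1] check prevents this
# block from matching file2 when file1 is empty
ARGV[1] == FILENAME && FNR == NR {
    array[NR] = $0
    next
}
# Process the 2nd file; array is empty if file1 was empty
{
    if (FNR in array)
        print $0, array[FNR]
    else
        print $0
}
' file1 file2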

Related

Add location to duplicate names in a CSV file using Bash

Using Bash, create user logins. Add the location if the name is duplicated. The location should be added to the original name as well as to the duplicates.
id,location,name,login
1,KP,Lacie,
2,US,Pamella,
3,CY,Korrie,
4,NI,Korrie,
5,BT,Queenie,
6,AW,Donnie,
7,GP,Pamella,
8,KP,Pamella,
9,LC,Pamella,
10,GM,Ericka,
The result should look like this:
id,location,name,login
1,KP,Lacie,lacie@mail.com
2,US,Pamella,uspamella@mail.com
3,CY,Korrie,cykorrie@mail.com
4,NI,Korrie,nikorrie@mail.com
5,BT,Queenie,queenie@mail.com
6,AW,Donnie,donnie@mail.com
7,GP,Pamella,gppamella@mail.com
8,KP,Pamella,kppamella@mail.com
9,LC,Pamella,lcpamella@mail.com
10,GM,Ericka,ericka@mail.com
I used AWK to process the csv file.
cat data.csv | awk 'BEGIN {FS=OFS=","};
NR > 1 {
    split($3, name)
    $4 = tolower($3)
    split($4, login)
    for (k in login) {
        !a[login[k]]++ ? sub(login[k], login[k]"@mail.com", $4) : sub(login[k], tolower($2)login[k]"@mail.com", $4)
    }
}; 1' > data_new.csv
The script adds the location prefix only to subsequent duplicates, not to the first occurrence.
id,location,name,login
1,KP,Lacie,lacie@mail.com
2,US,Pamella,pamella@mail.com
3,CY,Korrie,korrie@mail.com
4,NI,Korrie,nikorrie@mail.com
5,BT,Queenie,queenie@mail.com
6,AW,Donnie,donnie@mail.com
7,GP,Pamella,gppamella@mail.com
8,KP,Pamella,kppamella@mail.com
9,LC,Pamella,lcpamella@mail.com
10,GM,Ericka,ericka@mail.com
How do I add location to the initial one?
A common solution is to have awk process the same file twice when you need to know, before printing a line, whether duplicates appear further down.
Notice also that this version avoids the useless use of cat.
awk 'BEGIN {FS=OFS=","};
NR == FNR { ++seen[$3]; next }
FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "@mail.com" }
1' data.csv data.csv >data_new.csv
NR==FNR is true while the file is being read the first time; we simply count the number of occurrences of $3 in seen for the second pass.
In the second pass, we can then look at the current entry in seen to decide whether or not we need to add the prefix.
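As a minimal illustration of the two-pass idiom on a throwaway file (t.txt is a made-up name):
$ printf 'a\nb\na\n' > t.txt
$ awk 'NR == FNR { ++seen[$0]; next } { print $0, (seen[$0] > 1 ? "dup" : "unique") }' t.txt t.txt
a dup
b unique
a dup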

Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example; otherwise just pipe it through sort first), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
It has the advantage over the other solutions posted as of this edit of not holding the whole data set in memory. This comes at the cost of requiring sorted input (and sorting is what would need to hold lots of data in memory if the input weren't already sorted).
Explanation:
the first line initializes the currentHost and currentApps variables to the values of the first line of the input
the second line handles a line with the same host as the previous one: the app mentioned on the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the information for the previous host is printed, then the variables are reinitialized to the values of the current input line
the last line prints the information for the current host once the end of the input has been reached
It could probably be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
See it in action!
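If your input isn't already grouped by host, a sketch of the pre-sort (assuming the script above is saved as merge.awk, a hypothetical name, and the input file is named data):
sort -t, -k1,1 data | awk -F, -f merge.awk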
$ awk '
BEGIN { FS=","; ORS="" }                             # take over all output line endings ourselves
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }  # new host: newline before it (none before the first), then ";" before its first app
{ print OFS $2; OFS=FS }                             # print the app; later apps on the same host get ","
END { print ors }                                    # final trailing newline, if any input was read
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read -r host app
do
    [ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
    hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[@]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique element in the first column and appends every element from the second column to that entry.
When the file has been parsed, the content of the array is printed.
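Spread out and commented, the same one-liner reads as follows (a sketch; the behavior is unchanged):
awk -F, '
{
    # append $2 to the entry for host $1, comma-separated;
    # the ternary avoids a leading comma before the first app
    a[$1] = ($1 in a ? a[$1] "," : "") $2
}
END {
    # note: for (i in a) makes no guarantee about host order
    for (i in a)
        printf "%s;%s\n", i, a[i]
}' file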

How to Compare two files line by line and output the whole line if different

I have two sorted files in question:
1) one is a control file (ctrl.txt) generated by an external process
2) the other is a line-count file (count.txt) that I generate using `wc -l`
$more ctrl.txt
Thunderbird|1000
Mustang|2000
Hurricane|3000
$more count.txt
Thunder_bird|1000
MUSTANG|2000
Hurricane|3001
I want to compare these two files ignoring wrinkles in column 1 (the filenames), such as the "_" (in Thunder_bird) or the upper case (in MUSTANG), so that my output shows only the file below, the only one for which the counts don't match.
Hurricane|3000
I have this idea to compare only the second column from both files and output the whole line if they differ.
I have seen other examples in AWK but I could not get anything to work.
Could you please try the following awk and let me know if this helps you.
awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' ctrl.txt count.txt
Adding a non-one-liner form of the solution too now.
awk -F"|" '
FNR==NR{
gsub(/_/,"");
a[tolower($1)]=$2;
next}
{ gsub(/_/,"") }
((tolower($1) in a) && $2!=a[tolower($1)])
' cntrl.txt count.txt
Explanation: Adding an explanation of the above code here too.
awk -F"|" ' ##Setting field separator as |(pipe) here for all lines in Input_file(s).
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when the first Input_file (ctrl.txt in this case) is being read. The following instructions are executed while this condition is TRUE.
gsub(/_/,"") ##Using gsub to globally substitute _ with NULL in the current line.
a[tolower($1)]=$2 ##Creating an array named a whose index is the first field in LOWER CASE, to avoid confusion, and whose value is $2 of the current line.
next} ##next is a built-in awk keyword that skips all further instructions (to make sure they only run when the 2nd Input_file, count.txt, is being read).
{ gsub(/_/,"") } ##Statements from here run while the 2nd Input_file is being read; using gsub to remove all occurrences of _ from the line.
((tolower($1) in a) && $2!=a[tolower($1)]) ##Checking whether the lower-case form of $1 is present in array a and the current line's $2 is NOT equal to that array value. If this condition is TRUE the current line is printed, since no action is given and printing the line is the default.
' ctrl.txt count.txt ##Mentioning the Input_file names which we have to pass to awk.
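Assuming the multi-line version above is saved as compare.awk (a hypothetical name), running it against the sample files prints the count.txt line whose count does not match:
$ awk -F"|" -f compare.awk ctrl.txt count.txt
Hurricane|3001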

How to alter the number of columns (with awk) only if a string is in the 1st column of the line, while printing the changed lines and the whole text

I want to reduce the number of columns, keeping only the 1st and the last one, for each line containing a >.
But then I want to print the whole file again, with the changed lines, like this:
>TRF [name1]
AAAAAAAAAAAAAAAAAAAAAAAAAAATTGGA
ATGGGGGGGGGGGGGGGGGGGGGGGGGC
I have tried this code, but it only returns the changed lines. Thanks.
awk '$1 ~ />/ { print $1" "$NF}' file
You can use:
awk '$1 ~ />/ { $0 = $1 " " $NF} 1' file
The default action 1 at the end prints all lines of the input.
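For example, given a hypothetical input whose header lines carry extra columns (made up here to be consistent with the desired output above):
$ cat file
>TRF some description here [name1]
AAAAAAAAAAAAAAAAAAAAAAAAAAATTGGA
ATGGGGGGGGGGGGGGGGGGGGGGGGGC
$ awk '$1 ~ />/ { $0 = $1 " " $NF} 1' file
>TRF [name1]
AAAAAAAAAAAAAAAAAAAAAAAAAAATTGGA
ATGGGGGGGGGGGGGGGGGGGGGGGGGC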

How can I make awk process the BEGIN block for each file it parses?

I have an awk script that I'm running against a pair of files. I'm calling it like this:
awk -f script.awk file1 file2
script.awk looks something like this:
BEGIN {FS=":"}
{ if( NR == 1 )
{
var=$2
FS=" "
}
else print var,"|",$0
}
The first line of each file is colon-delimited. For every other line, I want it to return to the default whitespace field separator.
This works fine for the first file, but fails after that because FS is not reset to : for each new file; the BEGIN block is only processed once.
tl;dr: is there a way to make awk process the BEGIN block once for each file I pass it?
I'm running this on Cygwin bash, in case that matters.
If you're using gawk version 4 or later, there's the BEGINFILE block. From the manual:
BEGINFILE and ENDFILE are additional special patterns whose bodies are executed before reading the first record of each command line input file and after reading the last record of each file. Inside the BEGINFILE rule, the value of ERRNO will be the empty string if the file could be opened successfully. Otherwise, there is some problem with the file and the code should use nextfile to skip it. If that is not done, gawk produces its usual fatal error for files that cannot be opened.
For example:
touch a b c
awk 'BEGINFILE { print "Processing: " FILENAME }' a b c
Output:
Processing: a
Processing: b
Processing: c
Edit - a more portable way
As noted by DennisWilliamson, you can achieve a similar effect with an FNR == 1 block at the beginning of your script. In addition to that, you can change FS directly from the command line, e.g.:
awk -f script.awk FS=':' file1 FS=' ' file2
Here each FS assignment takes effect for the file name(s) that follow it on the command line, and the variable retains that value until the next assignment.
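A quick way to see the per-file assignments take effect (the file names and contents here are made up):
$ printf 'a:b:c\n' > f1; printf 'a b c\n' > f2
$ awk '{ print NF, $2 }' FS=':' f1 FS=' ' f2
3 b
3 b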
Instead of:
BEGIN {FS=":"}
use:
FNR == 1 {FS=":"}
The FNR variable should do the trick for you. It's the same as NR, except that it is scoped to the current file: it resets to 1 for every input file.
http://unstableme.blogspot.ca/2009/01/difference-between-awk-nr-and-fnr.html
http://www.unix.com/shell-programming-scripting/46931-awk-different-between-nr-fnr.html
When you want a POSIX-compliant version, the best approach is:
(FNR == 1) { FS=":"; $0=$0 }
This states that, if the file record number (FNR) equals one, we reset the field separator FS. However, you also need to reassign $0 so that the fields and the NF built-in variable are recomputed with the new FS.
This is equivalent to the GNU awk 4.x BEGINFILE if and only if the record separator (RS) stays unchanged.
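Applied to the script from the question, a sketch of the portable version (adapting the $0=$0 idiom; same file names as before):
awk '
FNR == 1 {
    FS = ":"     # colon-delimited first line of each file
    $0 = $0      # re-split the current record with the new FS
    var = $2
    FS = " "     # back to the default separator for the remaining lines
    next
}
{ print var, "|", $0 }
' file1 file2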
