Adding file information to an AWK comparison - bash

I'm using awk to perform a file comparison against a file listing in found.txt
while read line; do
    awk 'FNR==NR{a[$1]++;next} $1 in a' $line compare.txt >> $CHECKFILE
done < found.txt
found.txt contains full path information to a number of files that may contain the data. While I am able to determine that the data exists in both files and output that data to $CHECKFILE, I want to be able to include the line from found.txt (the filename) in which the match was found.
In other words I end up with something like:
File " /xxxx/yyy/zzz/data.txt "contains the following lines in found.txt $line
just not sure how to get the /xxxx/yyy/zzz/data.txt information into the stream.
Appended for clarification:
The file found.txt contains the full path information to several files on the system
/path/to/data/directory1/file.txt
/path/to/data/directory2/file2.txt
/path/to/data/directory3/file3.txt
each of the files has a list of parameters that need to be checked for existence before appending additional information to them later in the script.
so for example, file.txt contains the following fields
parameter1 = true
parameter2 = false
...
parameter35 = true
the compare.txt file contains a number of parameters as well.
So if parameter35 (or any other parameter) shows up in one of the three files, I get its output dropped into $CHECKFILE.
Both of the scripts (yours and the one I posted) will give me that output, but I would also like to echo the line that is being read at that point in the loop. It sounds like I should just be able to pipe it in somehow, but my awk expertise is limited.

It's not really clear what you want but try this (no shell loop required):
awk '
ARGIND==1 { ARGV[ARGC] = $0; ARGC++; next }
ARGIND==2 { keys[$1]; next }
$1 in keys { print FILENAME, $1 }
' found.txt compare.txt > "$CHECKFILE"
ARGIND is gawk-specific; if you don't have it, add FNR==1{ARGIND++} as the first rule of the script.
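For a non-gawk awk, the same script with that emulation added would look something like this (a sketch, untested):
awk '
FNR==1     { ARGIND++ }                        # emulate gawk ARGIND
ARGIND==1  { ARGV[ARGC] = $0; ARGC++; next }   # found.txt: queue each listed file as an input file
ARGIND==2  { keys[$1]; next }                  # compare.txt: remember the keys
$1 in keys { print FILENAME, $1 }              # remaining files: report matches with their filename
' found.txt compare.txt > "$CHECKFILE"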

Pass the name into awk inside a variable like this:
awk -v file="$line" '{... print "File: " file }'
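Folded into your original loop, that might look like this (a sketch using the question's variable names; the print format is just an example):
while read -r line; do
    awk -v file="$line" '
        FNR==NR { a[$1]++; next }                       # first file: the data file named in found.txt
        $1 in a { print "File " file " contains: " $0 } # second file: matching lines from compare.txt
    ' "$line" compare.txt >> "$CHECKFILE"
done < found.txt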

Related

How to make and name multiple text files after using the cut command?

I have about 50 data text files that I need to remove several columns from.
I have been using the cut command to remove columns and rename the files individually, but I will have many more of these files and need a way to do it at scale.
Currently I have been using:
cut -f1,6,7,8 filename.txt >> filename_Fixed.txt
And I am able to remove the columns from all the files using:
cut -f1,6,7,8 *.txt
But I'm only able to get all the output in the terminal, or to write it all to a single text file.
What I want is to edit several files using cut to remove the required columns:
filename1.txt
filename2.txt
filename3.txt
filename4.txt
.
.
.
And get the edited output to write to individual files:
filename_Fixed1.txt
filename_Fixed2.txt
filename_Fixed3.txt
filename_Fixed4.txt
.
.
.
But I haven't been able to find a way to write the output to new text files. I'm new to using the command line and not much of a coder, so maybe I don't know what terms to search for; I haven't found anything through Google searches that has helped. It seems like it should be simple, but I am struggling.
In desperation, I did try this bit of code, knowing it wouldn't work:
cut -f1,6,7,8 *.txt >> ( FILENAME ".fixed" )
I found the portion after ">>" nested in an awk command that output multiple files.
I also tried (again knowing it wouldn't work) to wild card the output files but got an ambiguous redirect error.
Did you try a for loop?
for f in *.txt ; do
    cut -f 1,6,7,8 "$f" > "$(basename "$f" .txt)_fixed.txt"
done
(N.B. I can't test the basename call right now; you can replace it with "${f}_fixed".)
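For reference, basename strips the directory part and the given suffix, so for a hypothetical file it behaves like this:
$ basename /some/dir/filename3.txt .txt
filename3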
You can also process it all in awk itself which would make the process much more efficient, especially for large numbers of files, for example:
awk '
NF < 8 {
    print "contains less than 8 fields: ", FILENAME
    next
}
{   fn=FILENAME
    idx=match(fn, /[0-9]+.*$/)
    if (idx == 0) {
        print "no numeric suffix for file: ", fn
        next;
    }
    newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx)
    print $1,$6,$7,$8 > newfn
}
' *.txt
The script contains two rules (the pattern { action } pairs). The first:
NF < 8 {
    print "contains less than 8 fields: ", FILENAME
    next
}
simply checks that the current line contains at least 8 fields (since you want field 8 as your last field). If a line has fewer than 8 fields, it prints the warning and next skips on to the next input record.
The second rule:
{   fn=FILENAME
    idx=match(fn, /[0-9]+.*$/)
    if (idx == 0) {
        print "no numeric suffix for file: ", fn
        next;
    }
    newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx)
    print $1,$6,$7,$8 > newfn
}
fn=FILENAME stores the current filename as fn to cut down on typing,
idx=match(fn, /[0-9]+.*$/) locates the index where the numeric suffix of the filename begins (e.g. where "3.txt" starts),
if (idx == 0) then a numeric suffix was not found, so warn and move on to the next record,
newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx) forms the new filename from the non-numeric prefix (e.g. "filename"), adds "_Fixed" with string concatenation and then adds the numeric suffix, and finally
print $1,$6,$7,$8 > newfn prints fields (columns) 1,6,7,8, redirecting the output to the new filename.
For more information on each of the string-functions used above, see the GNU awk User's Guide - 9.1.3 String-Manipulation Functions
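As a quick check of the filename manipulation in isolation (using a hypothetical name filename3.txt):
awk 'BEGIN { fn="filename3.txt"; idx=match(fn, /[0-9]+.*$/); print substr(fn,1,idx-1) "_Fixed" substr(fn,idx) }'
# prints: filename_Fixed3.txt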
If I understand what you were attempting, this should be able to handle as many files as you have -- so long as each filename has a numeric suffix to place "_Fixed" before, and each line has at least 8 fields (columns). You can just copy/middle-mouse-paste the entire command at the command line to test.

How to collate multiple files in AWK?

I am trying to collate a series of .csv log files that are named by date (e.g., 2019-02-24.csv). There are a bunch of them, so I'm trying to script the process. I've crafted an AWK script that combines individual files:
awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFICE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> usage_history.csv
But I am failing when I try to string the AWK commands together with a control loop in BASH:
for i in {01..28}; do echo "awk ' FNR==1 { while (/\"_time\",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-$i.csv >> user_history.csv"; done
When I run this, it prints out the correct commands to the command line, but the awk scripts are not executed (they only get printed). If I run it without echo, I get errors telling me that the file doesn't exist; though all files are present:
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> user_history.csv: No such file or directory
What am I missing in my loop?
Here is a condensed sample of the command and the error messages:
$ for i in {01..02}; do "awk ' FNR==1 { while (/\"_time\",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-$i.csv >> user_history.csv"; done
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> user_history.csv: No such file or directory
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-02.csv >> user_history.csv: No such file or directory
Could you please try the following.
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-[0-9]*.csv >> user_history.csv
Here are the reasons for using this approach:
1- Using a for loop and calling awk on every iteration is overkill. awk can read multiple files in a single invocation, so we should take advantage of that.
2- As for the getline part you tried in your code: if we want to skip lines containing a string, we can simply negate the match with !/string_to_be_skipped/, which selects only the lines that do NOT contain that string.
3- When passing the (multiple) files to a single awk command I used 2019-01-[0-9]*.csv. Since you have NOT said whether a file is created every day, listing the files explicitly would make awk fail on any missing one. For example, say I use the following awk command after intentionally removing the file 2019-01-02.csv:
awk '........' 2019-01-{01..29}.csv
awk: cannot open 2019-01-02.csv (No such file or directory)
So to avoid that kind of situation I used 2019-01-[0-9]*.csv, which expands only to the files that actually exist, so the command never complains that some file is missing.
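The difference is easy to see in the shell itself (a hypothetical directory where only two of the files exist):
$ ls
2019-01-01.csv  2019-01-03.csv
$ echo 2019-01-{01..03}.csv     # brace expansion generates every name, existing or not
2019-01-01.csv 2019-01-02.csv 2019-01-03.csv
$ echo 2019-01-[0-9]*.csv       # the glob expands only to files that exist
2019-01-01.csv 2019-01-03.csv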
Try this:
for i in {01..28}; do awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-$i.csv >>user_history.csv;done
The commands after do should not be quoted.
And what you were doing essentially amounts to ignoring the header lines.
The {print} after 1 is unnecessary -- a lone 1 already implies {print}; the 1 just supplies a true condition.
-- When there's only a pattern but no action block, the action defaults to {print}.
-- And a bare regexp is equivalent to $0 ~ /regex/; here it is negated.
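For example, the following two commands are equivalent ways to drop the header lines (sketched with a hypothetical, shortened pattern):
awk '!/"_time",PIN/' 2019-01-01.csv
awk '$0 !~ /"_time",PIN/ { print }' 2019-01-01.csv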
If there's no other command inside the loop, you can simplify the loop with one awk command:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-{01..28}.csv >>user_history.csv
But this one will throw an error and stop executing when one of the files does not exist.
Another way is:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-[0-3][0-9].csv >>user_history.csv
This one only matches existing filenames, instead of looping over a generated list.
It won't stop executing or throw an error, so if a file is missing you wouldn't know. And it will match extra files if they exist.
For example, it will read 2019-01-34.csv if it exists.
So if you want the warnings (warnings won't affect the results), but don't want the command to stop, use the first for-loop version.
Pitfalls:
[0-3][1-9] won't match 10, 20 or 30, but will match 32 to 39.
[0-9]* will match numbers of any length, and the matched files are ordered as strings, so for example 20 to 29 come before 3.
Thanks to @Tiw and @RavinderSingh13 for their guidance. Here is the final awk script that is working well for my case, where I have daily files from multiple days, months, and years (only 2018 and 2019 in this case):
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 201[8-9]-[0-1][0-2]-[0-3][0-9].csv >> user_history.csv

Grep list (file) from another file

I'm new to bash and trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Suggested output is something like:
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
The code I tried was:
#!/bin/bash
for i in *.txt    # cycle through all files containing pattern lists
do
    for q in "$i";    # cycle through list
    do
        echo $q >>output.${i};
        grep -f "${q}" base.csv >>output.${i};
        echo "\n";
    done
done
But the output is only the filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to, and I have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
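You can see that with a quick test (hypothetical value):
$ i=File1.txt
$ for q in "$i"; do echo "q=$q"; done
q=File1.txt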
If you want the output to be grouped per pattern, here's a while read loop which looks for one pattern at a time:
while read -r pat; do
    echo "$pat"
    grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{   for(j=1; j<=i; ++j)
        if($0 ~ a[j]) {
            print FILENAME ":" FNR ":" $0 >> ("output." a[j])
            next
        }
}' File1.txt base.csv
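If you stay with grep instead, and the IDs should only match as whole words, one option (assuming your grep supports the common -w extension) addresses the 123DEABCXYZ concern mentioned above:
grep -wFf File1.txt base.csv > output.txt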
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt    # cycle through all files containing pattern lists
do
    while read -r q    # read file line by line
    do
        echo "$q" >>"output.${i}"
        grep "$q" base.csv >>"output.${i}"
        echo
    done < "${i}"
done
Here is one that splits the words out of file2 (comma-separated, with quotes and spaces stripped off) into an array (word[]) and stores, for each word, the record names (line 1 etc.) it appears in, comma-separated:
awk '
NR==FNR {
    n=split($0,tmp,/[" ]*(,|$)[" ]*/)    # split words
    for(i=2;i<=n;i++)                    # after first
        if(tmp[i]!="")                   # non-empties
            word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1]    # hash rownames
    record[tmp[1]]=$0                    # store records
    next
}
($1 in word) {                           # word found
    n=split(word[$1],tmp,",")            # get record names
    print $1 ":"                         # output word
    for(i=1;i<=n;i++)                    # and records
        print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehavior (it got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
Gave up for a while and then eventually tried another way.
Since the basic goal was to cycle through the pattern list files, search for the patterns in a huge file, and write out specific columns from the lines found, I simply wrote:
for i in *.txt    # cycle through files w/ patterns
do
    grep -F -f "$i" bigfile.csv >> "${i}.out1"    # greps all patterns from the current file
    cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2"     # cuts columns of interest and writes them to another file
done
I'm aware that this code could be improved with some fancy pipeline features, but it works perfectly as is; hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names, as I initially requested.

Bash script to efficiently return two file names that both contain a string found in a list

I'm trying to find duplicates of a string ID across files. Each of these IDs are unique and should be used in only one file. I am trying to verify that each ID is only used once, and the script should tell me the ID which is duplicated and in which files.
This is an example of the set.csv file
"Read-only",,"T","ID6776","3.1.1","Text","?"
"Read-only",,"T","ID4294","3.1.1.1","Text","?"
"Read-only","ID","T","ID7294","a )","Text","?"
"Read-only","ID","F","ID8641","b )","Text","?"
"Read-only","ID","F","ID8642","c )","Text","?"
"Read-only","ID","T","ID9209","d )","Text","?"
"Read-only","ID","F","ID3759","3.1.1.2","Text","?"
"Read-only",,"F","ID2156","3.1.1.3","
This is the very inefficient code I wrote
for ID in $(grep 'ID\"\,\"[TF]' set.csv | cut -c 23-31); do
    for FILE1 in *.txt; do
        for FILE2 in *.txt; do
            if [[ $FILE1 -nt $FILE2 && `grep -E '$ID' $FILE1 $FILE2` ]]; then
                echo $ID + $FILE1 + $FILE2;
            fi;
        done;
    done;
done
Essentially I'm only interested in ID#s that are identified as "ID" in the CSV which would be 7294, 8641, 8642, 9209, 3759 but not the others. If File1 and File2 both contain the same ID from this set then it would print out the duplicated ID and each file that it is found in.
There might be thousands of IDs, and files so my exponential approach isn't at all preferred. If Bash isn't up to it I'll move to sets, hashmaps and a logarithmic searching algorithm in another language... but if the shell can do it I'd like to know how.
Thanks!
Edit: A bonus would be to find which IDs from set.csv aren't used at all. Pseudocode in another language might be: create a set of all the IDs in the CSV, then make another set and add to it the IDs found in the files, then compare the sets. Can bash accomplish something like this?
A linear option would be to use awk to store discovered identifiers with their corresponding filename, then report when an identifier is found again. Assuming the *.txt files follow the same quoted, comma-separated format as set.csv:
awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
    id=substr($4,4,4)
    if(ids[id]) {
        print id " is in " ids[id] " and " FILENAME;
    } else {
        ids[id]=FILENAME;
    }
}' *.txt
The awk script looks through every *.txt file; it splits the fields based on commas (-F,). If field 2 is "ID" and field 3 is "T" or "F", then it extracts the numeric ID from field 4. If that ID has been seen before, it reports the previous file and the current filename; otherwise, it saves the id with an association to the current filename.
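For the bonus question (IDs from set.csv that are never used), one possible sketch — not part of the answer above, and assuming the IDs appear literally as e.g. ID7294 in the *.txt files — is to load the IDs of interest first and report the leftovers in an END block:
awk -F, '
NR==FNR {                                   # first file: set.csv
    if ($2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\""))
        wanted[substr($4,4,4)]              # remember the IDs of interest
    next
}
{
    for (id in wanted)
        if (index($0, "ID" id)) seen[id]    # mark IDs that appear in any *.txt line
}
END {
    for (id in wanted)
        if (!(id in seen)) print id " is never used"
}' set.csv *.txt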

To extract a string from filename and insert it into the file

I want to write a bash script that extracts a string from the file name and inserts that string into a specific location in the same file.
For example:
Under the /root directory there are different date directories, 20160201, 20160202, 20160203, and under each one there is a corresponding file: abc20160201.dat, abc20160202.dat, abc20160203.dat.
My requirement is that I need to extract the date from each file name first, and then insert that date into the second column of each record in the file.
For extracting the date I am using
f=abc20160201.dat
s=`echo $f | cut -c 4-11`
echo "$f -> $s"
and for inserting the date I am using
awk 'BEGIN { OFS = "~"; ORS = "\n" ; date="20160201" ; IFS = "~"} { $1=date"~"$1 ; print } ' file > tempdate
But in my awk command the date is coming in the first column. Please let me know what I am doing wrong here.
The file on which this operation is being done is a delimited file with fields separated by ~ characters.
Or if anybody has a better solution for this, please let me know.
The variable for the input field separator is FS, not IFS. Consequently, the input line is not being split at all, hence when you add the date after field 1, it appears at the end of the line.
You should be able to use:
f=abc20160201.dat
s=$(echo $f | cut -c 4-11)
awk -v date="$s" 'BEGIN { FS = OFS = "~" } { $1 = $1 OFS date; print }' $f
That generates the modified output to standard output. AFAIK, awk doesn't have an overwrite option, so if you want to modify the files 'in place', you'll write the output of the script to a temporary file, and then copy or move the temporary file over the original (removing the temporary if you copied). Copying preserves both hard links and symbolic links (and owner, group, permissions); moving doesn't. If the file names are neither symlinks nor linked files, moving is simpler. (Copying always 'works', but the copy takes longer than a move, requires the remove, and there's a longer window while the over-writing copy could leave you with an incomplete file if interrupted.)
Generalizing a bit:
for file in /root/2016????/*.dat
do
    tmp=$(mktemp "$(dirname "$file")/tmp.XXXXXX")
    awk -v date="$(basename "$file" | cut -c 4-11)" \
        'BEGIN { FS = OFS = "~" } { $1 = $1 OFS date; print }' "$file" >"$tmp"
    mv "$tmp" "$file"
done
One of the reasons for preferring $(…) over back-quotes is that it is much easier to manage nested operations and quoting with $(…). The mktemp command creates the temporary file in the same directory as the source file; you could legitimately decide to use mktemp "${TMPDIR:-/tmp}/tmp.XXXXXX" instead. A still more general script would iterate over "$@" (the arguments it is passed), but it might need to validate that the base name of each file matches the format you require/expect.
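A sketch of that more general, argument-driven version (the filename check here is only an example of the kind of validation meant above):
#!/bin/bash
for file in "$@"
do
    base=$(basename "$file")
    case $base in
        abc[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].dat) ;;            # expected pattern (example)
        *) echo "skipping $file: unexpected name" >&2; continue ;;
    esac
    tmp=$(mktemp "${TMPDIR:-/tmp}/tmp.XXXXXX")
    awk -v date="$(echo "$base" | cut -c 4-11)" \
        'BEGIN { FS = OFS = "~" } { $1 = $1 OFS date; print }' "$file" > "$tmp"
    mv "$tmp" "$file"
done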
Adding code to deal with cleaning up on interrupts, or selecting between copy and move, is left as an exercise for the reader. Note that the script makes no attempt to detect whether it has been run on a file before. If you run it three times on the same file, you'll end up with columns 2-4 all containing the date.
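One possible guard against repeated runs — a sketch, not part of the script above — is to skip records whose second field already equals the date, i.e. replace the awk line in the loop with:
awk -v date="$(basename "$file" | cut -c 4-11)" \
    'BEGIN { FS = OFS = "~" } $2 != date { $1 = $1 OFS date } { print }' "$file" >"$tmp"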
