awk script to delete single record, not just group of records - bash

I have an awk command that outputs entries absent from $NEWFILE but found in $OLDFILE:
awk -F "|" 'NR==FNR{a[$4]++}!a[$4]' $NEWFILE $OLDFILE > $OUTFILE
This command works great when all of an entity's entries (rows sharing a unique identifier in field 4) are missing from $NEWFILE. However, it fails when only one of the entity's entries, but not all of them, has been removed from $NEWFILE.
Anyone have a suggestion about how I can tweak this awk command to output all the entries absent from $NEWFILE but found in $OLDFILE, regardless of whether all the entries for an entity are removed?
Sample data: newfile, oldfile

Short and sweet: Use diff. You can diff oldfile newfile | grep '^< ' | cut -b3- to limit the output to what you want.
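Spelled out with the question's variables, that might look like this (a sketch; diff compares whole lines, and the lines prefixed with "< " are the ones present only in $OLDFILE):
diff "$OLDFILE" "$NEWFILE" | grep '^< ' | cut -b3- > "$OUTFILE"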

If I understand you correctly, this is what you want
awk -F "|" 'NR==FNR{a[$1 $2 $3 $4]++}!a[$1 $2 $3 $4]' NEWFILE OLDFILE > OUTFILE
Since NEWFILE doesn't have the URLs present in OLDFILE, the unique row identifier is the composite of the first four fields; and because those URLs differ, a simple diff won't do.
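If the four fields could ever collide when glued together with no separator (for example "ab"+"c" versus "a"+"bc"), a slightly safer variant (my own tweak, not part of the answer above) is to join them with awk's SUBSEP by using a comma in the subscript:
awk -F "|" 'NR==FNR{a[$1,$2,$3,$4]++}!a[$1,$2,$3,$4]' NEWFILE OLDFILE > OUTFILE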

AWK is a line-by-line interpreter; that is the reason only one line is removed while the others stay in place. You can do two things:
If you can, filter with an expression that is common to the lines.
For each line of one file, run a loop that iterates over the other file and does the comparison for you (see the sketch after this list).
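A rough sketch of that looping idea in plain bash (my reading of it: it walks $OLDFILE and checks each whole line against $NEWFILE, since that direction produces the wanted output; it re-reads $NEWFILE once per line, so it is slow on large files):
while IFS= read -r line; do
    grep -qxF -- "$line" "$NEWFILE" || printf '%s\n' "$line"
done < "$OLDFILE" > "$OUTFILE"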

Must you use awk? May we simply employ join instead, which is really what you're doing here, no?
$ join -v2 -t'|' -j4 <(sort -t'|' -k4 newfile) <(sort -t'|' -k4 oldfile) | tee outfile
P-1-01541|22|Professor|University of Alabama at Birmingham|http://www.uab.edu/
P-1-01541|22|Short-Term Scholar|University of Alabama at Birmingham|http://www.uab.edu/
This of course assumes you're joining on column 4, and, as with most rudimentary joins you don't want to re-iterate, the input must be sorted first.

Related

Bash: sort rows within a file by timestamp

I am new to bash scripting and I have written a script that matches a regex and prints the matching lines to a file.
However, each line contains multiple columns, one of which is the timestamp column, which appears in the form YYYYMMDDHHMMSSTTT (to the millisecond) as shown below.
20180301050630663,ABC,,,,,,,,,,
20180301050630664,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630666,ABC,,,,,,,,,,
20180301050630667,ABC,,,,,,,,,,
20180301050630668,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630661,ABC,,,,,,,,,,
20180301050630662,ABC,,,,,,,,,,
My code is written as follow:
awk -F "," -v OFS=","'{if($2=="ABC"){print}}' < $i>> "$filename"
How can I modify my code such that it can sort the rows by timestamp (YYYYMMDDHHMMSSTTT) in ascending order before printing to file?
You can use a very simple sort command, e.g.
sort yourfile
If you want to ensure sort only looks at the timestamp, you can tell sort to use only the first comma-separated field as the sort key, e.g.
sort -t, -k1,1 yourfile
Example Use/Output
With your data saved in a file named log, you could do:
$ sort -t, -k1,1 log
20180301050630661,ABC,,,,,,,,,,
20180301050630662,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630664,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630666,ABC,,,,,,,,,,
20180301050630667,ABC,,,,,,,,,,
20180301050630668,ABC,,,,,,,,,,
Let me know if you have any problems.
Just add a pipeline.
awk -F "," '$2=="ABC"' < "$i" |
sort -n >> "$filename"
In the general case, to sort on column 234, try sort -t, -k234,234n
Notice also the quoting around "$i", like you already have around "$filename", and the simplifications of the Awk script.
If you are using gawk you can do:
$ awk -F "," -v OFS="," '$2=="ABC"{a[$1]=$0} # Filter lines that have "ABC"
END{ # set the sort method
PROCINFO["sorted_in"] = "#ind_num_asc"
for (e in a) print a[e] # traverse the array of lines
}' file
An alternative is to use sed and sort:
sed -n '/^[0-9]*,ABC,/p' file | sort -t, -k1 -n
Keep in mind that both of these methods are unrelated to the shell used. Bash is just executing the tools (sed, awk, sort, etc) that are otherwise part of the OS.
Bash itself could do the sort in pure Bash but it would be long and slow.
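For the curious, here is roughly what that pure-Bash route could look like: a simple insertion sort on the first comma-separated field, assuming the log file from the example above. It works, but it makes the "long and slow" point for anything beyond toy input:
#!/bin/bash
# pure-Bash sketch: insertion sort on the first comma-separated field
mapfile -t lines < log
sorted=()
for line in "${lines[@]}"; do
    key=${line%%,*}                              # the timestamp field
    i=${#sorted[@]}
    while (( i > 0 )) && [[ ${sorted[i-1]%%,*} > $key ]]; do
        sorted[i]=${sorted[i-1]}                 # shift bigger keys right
        (( i-- ))
    done
    sorted[i]=$line
done
printf '%s\n' "${sorted[@]}"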

Compare column1 in File with column1 in File2, output {Column1 File1} that does not exist in file 2

Below is my file 1 content:
123|yid|def|
456|kks|jkl|
789|mno|vsasd|
and this is my file 2 content
123|abc|def|
456|ghi|jkl|
789|mno|pqr|
134|rst|uvw|
The only thing I want to compare between File 1 and File 2 is column 1. Based on the files above, the output should be only:
134|rst|uvw|
Line-to-line comparisons are not the answer, since columns 2 and 3 contain different things and only column 1 contains exactly the same thing in both files.
How can I achieve this?
Currently I'm using this in my code:
#sort FILEs first before comparing
sort $FILE_1 > $FILE_1_sorted
sort $FILE_2 > $FILE_2_sorted
for oid in $(cat $FILE_1_sorted |awk -F"|" '{print $1}');
do
echo "output oid $oid"
#for every oid in FILE 1, compare it with oid FILE 2 and output the difference
grep -v diff "^${oid}|" $FILE_1 $FILE_2 | grep \< | cut -d \ -f 2 > $FILE_1_tmp
You can do this in Awk very easily!
awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
Awk works by processing input lines one at a time. There are special clauses that Awk provides, BEGIN{} and END{}, which enclose actions to be run before and after processing the input.
So the part BEGIN{FS=OFS="|"} runs before any file processing happens, and FS and OFS are special variables in Awk that stand for the input and output field separators. Since you have provided a file that is delimited by |, you need to parse it by setting FS="|", and to print it back with | you set OFS="|".
The main part of the command comes after the BEGIN clause. The part FNR==NR is meant to process the first file argument provided on the command line, because NR keeps counting line numbers across both files combined while FNR restarts for each file, so the condition holds only while the first file is read. For each $1 in the first file, the value is hashed into the array called unique, and then when the second file is processed, the part !($1 in unique) prints only those lines whose $1 value is not in the hashed array, dropping the ones whose key was already seen.
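With the sample files from the question, this should print only the line whose first column never appears in file1:
$ awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
134|rst|uvw|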
Here is another one-liner that uses join, sort and grep:
join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2) |\
grep -E -v '.*\|.*\|.*\|.*\|'
join does two things here. It pairs all lines from both files with matching keys and, with the -a 2 option, also prints the unmatched lines from file2.
Since join requires input files to be sorted, we sort them.
Finally, grep removes all lines that contain more than three fields from the output.
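Run against the sample files above, the whole pipeline should leave exactly the line that has no partner in file1:
134|rst|uvw|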

Unix cut: Print same Field twice

Say I have a file, a.csv:
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (please don't ask me why)
ram,doc,doc,
shaym,eng,eng,
I am using the cut command:
cut -d',' -f1,4,4 a.csv
But the output is still
ram,doc
shaym,eng
So it seems cut can only print a field once. I need to print the same field twice or n times.
Why do I need this? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
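With a.csv from the question this should print ram,doc,doc and shaym,eng,eng. If you also want the trailing comma shown in the desired output, print an extra empty field (my variation, not part of the original answer):
awk -F , -v OFS=, '{print $1, $4, $4, ""}' a.csv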
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
I had the same problem, but instead of spelling out all the columns in awk, I just used this (to duplicate the 2nd column):
awk -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'

BASH Subtracting Files on Key line by line

I just want to subtract one CSV file from another, but not only where the lines are exactly the same. Instead of comparing whole lines, I'd like to check whether the lines match in one field.
e.g. the first file
EMAIL;NAME;SALUTATION;ID
foo#bar.com;Foo;Mr;1
bar#foo.com;Bar;Ms;2
and the second file
EMAIL;NAME
foo#bar.com;Foo
the resultfile should be
EMAIL;NAME;SALUTATION;ID
bar#foo.com;Bar;Ms;2
I think you know what I mean ;)
How is that possible in bash? It's easy for me to do this in Java, but I'd really like to learn how to do it in bash. I can already subtract by comparing whole lines using sort:
#!/bin/bash
echo "Subtracting Files..."
sort "/tmp/list1.csv" "/tmp/list2.csv" "/tmp/list2.csv" | uniq -u >> /tmp/subList.csv
echo "Files successfully subtracted."
But the lines aren't identical tuples, so I have to compare the lines by key.
Any suggestions? Thanks a lot. Nils
One possible solution that comes to mind is this one (working with bash):
grep -v -f <(cut -d ";" -f1 /tmp/list2.csv) /tmp/list1.csv
That means:
cut -d ";" -f1 /tmp/list2.csv: Extract the first column of the second file.
grep -f some_file: Use a file as pattern source.
<(some_command): This is a process substitution. It executes the command and feeds the output to a named pipe which then can be used as file input to grep -f.
grep -v: Print only the lines not matching the pattern(s).
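With the two sample files from the question, this should print only the unmatched data line. Note that the header row disappears as well, because the EMAIL pattern taken from the second file's header matches it:
$ grep -v -f <(cut -d ";" -f1 /tmp/list2.csv) /tmp/list1.csv
bar#foo.com;Bar;Ms;2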
Update: the solution to the question, via join and awk.
join --header -1 1 -2 1 -t";" --nocheck-order -v 1 1.csv 2.csv | awk 'NR==1 {print gensub(";[^;]+$","","g"); next} 1'
For comparison, these were the inverse answers (they print the matching lines instead):
$ join -1 1 -2 1 -t";" --nocheck-order -o 1.1,1.2,1.3,1.4 1.csv 2.csv
EMAIL;NAME;SALUTATION;ID
foo#bar.com;Foo;Mr;1
join to the rescue.
Or, without -o, skip printing the NAME field with awk:
$ join -1 1 -2 1 -t";" --nocheck-order 1.csv 2.csv | awk 'BEGIN {FS=";" ; OFS=";"} {$NF=""; print }'
(But it still prints a trailing ";" after the last field.)
HTH

Bash: sort text file by last field value

I have a text file containing ~300k rows. Each row has a varying number of comma-delimited fields, the last of which is guaranteed numerical. I want to sort the file by this last numerical field. I can't do:
sort -t, -n -k 2 file.in > file.out
as the number of fields in each row is not constant. I think sed or awk may be the answer, but I'm not sure how. E.g.:
awk -F, '{print $NF}' file.in
gives me the last column value, but how to use this to sort the file?
Use awk to put the numeric key up front. $NF is the last field of the current record. Sort. Use sed to remove the duplicate key.
awk -F, '{ print $NF, $0 }' yourfile | sort -n -k1 | sed 's/^[0-9][0-9]* //'
vim file.in -c '%sort n /.*,\zs/' -c 'saveas file.out' -c 'q'
Maybe reverse the fields of each line in the file before sorting? Something like
perl -ne 'chomp; print(join(",",reverse(split(","))),"\n")' |
sort -t, -n -k1 |
perl -ne 'chomp; print(join(",",reverse(split(","))),"\n")'
should do it, as long as commas are never quoted in any way. If this is a full-fledged CSV file (in which commas can be escaped with a backslash or protected by quotes) then you need a real CSV parser.
Perl one-liner:
@lines=<STDIN>;foreach(sort{($a=~/.*,(\d+)/)[0]<=>($b=~/.*,(\d+)/)[0]}@lines){print;}
I'm going to throw mine in here as an alternative (and I couldn't get awk to work) :)
sample file:
Call of Doody 1322
Seam the Ripper 1329
Mafia Bots 1 1109
Chicken Fingers 1243
Batup Light 1221
Hunter F Tomcat 1140
Tober 0833
code:
for i in $(sed -e 's/.* \([0-9]*\)$/\1/' file.txt | sort); do grep " $i$" file.txt; done > file_sort.txt
Python one-liner:
python -c "print ''.join(sorted(open('filename'), key=lambda l: int(l.split(',')[-1])))"
