Detecting semi-duplicate records in Bash/AWK

Right now I have a script that rifles through tabulated data, cross-referencing record by record using AWK. But I've run into a problem: AWK is great for line-by-line comparisons over formatted data, but I also want to detect semi-duplicate records. Unfortunately, uniq will not work by itself, as the records are not 100% carbon copies.
This is an ordered list, sorted by the second and third columns. What I want to detect is identical values in columns 3, 6 and 7.
Here's an example:
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth
JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth
The second number is different while the other information is exactly the same, so uniq alone will not find it.
Is there something in AWK that lets me reference the previous line? I already have this code block from AWK going line by line. (EDIT: the awk statement was an older version that was terrible.)
awk '{printf "%s", $0; if($6 != $7 && $9 != "Void" && $5 == "N") {printf "****\n"} else {printf "\n"}}' /tmp/verbout.txt

Is there something in AWK that lets me reference the previous line?
No, but there's nothing stopping you from explicitly saving certain info from the last line and using that later:
{
    if (last3 != $3 || last6 != $6 || last7 != $7) {
        print
    } else {
        # handle duplicate here
    }
    last3 = $3
    last6 = $6
    last7 = $7
}
The lastN variables all (effectively) default to an empty string at the start; we compare each line against them and print the line if any of the three fields differ.
Then we store the fields from the current line for use with the next one.
That is, of course, assuming duplicates should only be detected if they're consecutive. If you want to remove duplicates when order doesn't matter, you can sort on those fields first.
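For instance (a sketch, assuming the key fields really are 3, 6 and 7, and reusing the sample lines from the question), sorting on those fields first makes any duplicates adjacent, so the consecutive comparison catches them:

```shell
printf '%s\n' \
  'JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth' \
  'JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth' |
sort -k3,3 -k6,6n -k7,7n |
awk '{ key = $3 FS $6 FS $7                  # build the comparison key
       print $0 (key == prev ? " ****" : "") # flag a repeat of the previous key
       prev = key }'                         # remember the key for the next line
```

The second line comes out flagged with ****, mirroring the marker style of the original one-liner.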
If order needs to be maintained, you can use an associative array to store the fact that the key has been seen before, something like:
{
    seenkey = $3 " " $6 " " $7
    if (seen[seenkey] == 0) {
        print
        seen[seenkey] = 1
    } else {
        # handle duplicate here
    }
}

One way of doing this with awk is
$ awk '{print $0, (a[$3,$6,$7]++ ? "duplicate" : "")}' file
This will mark the duplicate records; note that you don't need to sort the file.
If you want to print just the unique records, the idiomatic way is
$ awk '!a[$3,$6,$7]++' file
Again, sorting is not required.
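Run against the two sample lines from the question, the dedup idiom keeps only the first occurrence:

```shell
printf '%s\n' \
  'JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth' \
  'JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth' |
awk '!a[$3,$6,$7]++'   # print a line only the first time its key is seen
```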

Add location to duplicate names in a CSV file using Bash

Using Bash, create user logins. Add the location if the name is duplicated. The location should be added to the original name as well as to the duplicates.
id,location,name,login
1,KP,Lacie,
2,US,Pamella,
3,CY,Korrie,
4,NI,Korrie,
5,BT,Queenie,
6,AW,Donnie,
7,GP,Pamella,
8,KP,Pamella,
9,LC,Pamella,
10,GM,Ericka,
The result should look like this:
id,location,name,login
1,KP,Lacie,lacie#mail.com
2,US,Pamella,uspamella#mail.com
3,CY,Korrie,cykorrie#mail.com
4,NI,Korrie,nikorrie#mail.com
5,BT,Queenie,queenie#mail.com
6,AW,Donnie,donnie#mail.com
7,GP,Pamella,gppamella#mail.com
8,KP,Pamella,kppamella#mail.com
9,LC,Pamella,lcpamella#mail.com
10,GM,Ericka,ericka#mail.com
I used AWK to process the csv file.
cat data.csv | awk 'BEGIN {FS=OFS=","};
NR > 1 {
split($3, name)
$4 = tolower($3)
split($4, login)
for (k in login) {
!a[login[k]]++ ? sub(login[k], login[k]"#mail.com", $4) : sub(login[k], tolower($2)login[k]"#mail.com", $4)
}
}; 1' > data_new.csv
The script adds location values only to the later duplicates.
id,location,name,login
1,KP,Lacie,lacie#mail.com
2,US,Pamella,pamella#mail.com
3,CY,Korrie,korrie#mail.com
4,NI,Korrie,nikorrie#mail.com
5,BT,Queenie,queenie#mail.com
6,AW,Donnie,donnie#mail.com
7,GP,Pamella,gppamella#mail.com
8,KP,Pamella,kppamella#mail.com
9,LC,Pamella,lcpamella#mail.com
10,GM,Ericka,ericka#mail.com
How do I add location to the initial one?
A common solution is to have Awk process the same file twice when you need to know whether there are duplicates further down the line.
Notice also that this requires you to avoid the useless use of cat: awk has to open the file by name, twice.
awk 'BEGIN {FS=OFS=","};
NR == FNR { ++seen[$3]; next }
FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "#mail.com" }
1' data.csv data.csv >data_new.csv
NR==FNR is true only while the file is being read for the first time. We simply count the number of occurrences of $3 in seen for the second pass.
Then, in the second pass, we just look at the current entry in seen to decide whether or not we need to add the prefix.
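To see it work end to end (a sketch; the demo file name and the subset of rows are arbitrary):

```shell
# hypothetical demo file with one duplicated name (Korrie)
cat > /tmp/demo.csv <<'EOF'
id,location,name,login
3,CY,Korrie,
4,NI,Korrie,
5,BT,Queenie,
EOF
awk 'BEGIN {FS=OFS=","}
     NR == FNR { ++seen[$3]; next }   # first pass: count each name
     FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "#mail.com" }
     1' /tmp/demo.csv /tmp/demo.csv
```

Duplicated names get the location prefix on every occurrence, including the first, while Queenie stays bare.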

How is $0 used in awk, and how does it work?

read n
awk '
BEGIN { sum = 0 }
{ if ($0 % 2 == 0) { sum += $0 } }
END { print sum }'
Here I sum the even numbers. What I want is: first I give as input how many numbers there are (the count), and then the numbers I want to check for evenness and add up.
e.g.:
3
6
7
8
Output: 14
Here 3 is the count, followed by the numbers I want to check. The code runs and the output is correct, but I want to know how $0 skipped the count value (the 3) and summed only the remaining numbers.
Please update your question to be meaningful: there is no relationship between $0 and the Unix operating system, as choroba already pointed out in his comment. You obviously want to know the meaning of $0 in the awk programming language. From the awk man page, in the section about Fields:
$0 is the whole record, including leading and trailing whitespace.
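A tiny illustration of the difference between the whole record and its fields (made-up input):

```shell
echo 'foo bar baz' |
awk '{ print "record: " $0     # $0 is the entire input line
       print "field2: " $2 }'  # $1..$NF are the individual fields
```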
You're reading the count but not using it in the script.
A rewrite could be:
$ awk 'NR==1 {n=$1; next}      # read the count and skip to the next line
       !($1 % 2) {sum += $1}   # add up the even numbers
       NR > n {print sum; exit}' file   # done once the line number passes the count
In awk, $0 refers to the record (here, the line), and $i to field i = 1, 2, 3, ...
An even number is one that leaves remainder 0 when divided by 2; NR is the current record (line) number.
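Fed the sample input from the question, the counted rewrite prints 14 (the NR==1 rule consumes the count line):

```shell
printf '%s\n' 3 6 7 8 |
awk 'NR==1 {n=$1; next}     # line 1 is the count
     !($1 % 2) {sum += $1}  # accumulate even numbers (6 and 8)
     NR > n {print sum; exit}'
```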

Formatting output using awk

I've a file with following content:
A 28713.64 27736.1000
B 9835.32
C 38548.96
Now I need to check if the first column of the last row is 'C'; if so, the third-column value from the first row should be printed in the third column against 'C'.
Expected Output:
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
I tried the following, but it's not working:
awk '{if ($1 == "C") ; print $1,$2,$3}' file_name
Any help is most welcome!
This works for the given example:
awk 'NR==1{v=$3} $1=="C"{$0=$0 FS v} 7' file | column -t
The trailing 7 is just a true condition (like the idiomatic 1) that triggers the default print action. If you want to append the 3rd-column value from the A row to the C row, change NR==1 into $1=="A".
The column -t part is just for making the output pretty. :-)
EDIT: Per OP's comment, OP wants to take the value from the very first line and match the string C on the very last line of the Input_file; if that's the case, then try the following.
awk '
FNR==1{
value=$NF
print
next
}
prev{
print prev
}
{
prev=$0
prev_first=$1
}
END{
if(prev_first=="C"){
print prev,value
}
else{
print
}
}' file | column -t
Assuming that your actual Input_file is the same as the samples shown and you want to pick the last-field value from the first line, whose first column is A:
awk '$1=="A" && FNR==1{value=$NF} $1=="C"{print $0,value;next} 1' Input_file| column -t
Output will be as follows.
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
POSIX dictates that "assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS."
So...
awk 'NR==1{x=$3} $1=="C"{$3=x} 1' input.txt
Note that the output is not formatted well, but that's likely the case with most of the solutions here. You could pipe the output through column, as Ravinder suggested. Or you could control things precisely by printing your data with printf.
awk 'NR==1{x=$3} $1=="C"{$3=x} {printf "%-2s%-26s%s\n",$1,$2,$3}' input.txt
If your lines can be expressed in a printf format, you'll be able to avoid the unpredictability of column -t and save the overhead of a pipe.
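As a quick check of the quoted POSIX behavior (toy input): assigning $4 on a two-field record raises NF to 4, creates an empty $3, and rebuilds $0 with OFS between every field:

```shell
echo 'a b' |
awk '{ $4 = "d"      # forces NF from 2 to 4 and a recompute of $0
       print NF
       print $0 }'   # note the doubled space where the empty $3 sits
```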

Eliminate useless repeats of values from CSV for line charting

Given a CSV file with contents similar to this:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
What is the best way, using bash or awk scripting, to tidy it up and remove all useless zeros? By useless I mean: this data will be used for line charts on web pages, but reading the entire CSV file in the browser via JavaScript/jQuery etc. is very slow. It would be more efficient to eliminate the useless zeros before uploading the file. If I removed all the zeros outright, the lines would just show peak to peak to peak instead of real lines from zero up to some larger value and back to zero, followed by a gap until the next value greater than zero.
As you can see, there are 3 groups in the data. Any time there are 3 identical values in a row, for example for GRP1, I'd like to remove the middle (2nd) one. In reality this should work for values greater than zero as well: if the same value were found every 10 seconds, say 10 times in a row, it would be good to keep both ends and remove items 2 through 9.
The line chart would look the same, but the data would be much smaller to deal with. Ideally I could do this with a shell script on disk before reading the input file.
So (just looking at GRP1) instead of:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:31,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:41,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
The script would eliminate all useless 3 values...and leave only:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
Or... Another Expected Result using 0 this time...instead of 3 as the common consecutive value for GRP2...
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:31,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:41,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
The script would eliminate all useless 0 values...and leave only:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
karakfa's answer gets me close, but after applying the awk to one group (and then eliminating some duplicates that also showed up for some reason) I still end up with portions similar to this:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:06:51,DTE,DTE,TOTAL,1
2017-05-02,00:07:01,DTE,DTE,TOTAL,1
2017-05-02,00:07:51,DTE,DTE,TOTAL,1
2017-05-02,00:08:01,DTE,DTE,TOTAL,1
2017-05-02,00:08:51,DTE,DTE,TOTAL,1
2017-05-02,00:09:01,DTE,DTE,TOTAL,1
2017-05-02,00:09:51,DTE,DTE,TOTAL,1
2017-05-02,00:10:01,DTE,DTE,TOTAL,1
2017-05-02,00:10:51,DTE,DTE,TOTAL,1
2017-05-02,00:11:01,DTE,DTE,TOTAL,1
2017-05-02,00:11:51,DTE,DTE,TOTAL,1
2017-05-02,00:12:01,DTE,DTE,TOTAL,1
2017-05-02,00:12:51,DTE,DTE,TOTAL,1
2017-05-02,00:13:01,DTE,DTE,TOTAL,1
2017-05-02,00:13:51,DTE,DTE,TOTAL,1
2017-05-02,00:14:01,DTE,DTE,TOTAL,1
2017-05-02,00:14:51,DTE,DTE,TOTAL,1
2017-05-02,00:15:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
Would be wonderful to get to this instead:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
That's one ill-posed question, but I'll take a crack at the title, if you don't mind:
$ awk -F, ' {
if($3 OFS $4 OFS $6 in first)
last[$3 OFS $4 OFS $6]=$0
else
first[$3 OFS $4 OFS $6]=$0 }
END {
for(i in first) {
print first[i]
if(i in last)
print last[i] }
}' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
Basically it keeps the first and, if one exists, the last occurrence of each unique combination of the 3rd, 4th and 6th fields.
Edit: In the new light of the word consecutive, how about this awful hack:
$ awk -F, '
(p!=$3 OFS $4 OFS $6) {
if(NR>1 && lp<(NR-1))
print q
print $0
lp=NR }
{
p=$3 OFS $4 OFS $6
q=$0 }
' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
and output for the second data:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
and the third:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
Simple awk approach:
awk -F, '$NF!=0' inputfile
The output:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
$NF!=0 - takes into account only those lines which don't have 0 as their last field value
awk to the rescue!
$ awk -F'[,:]' '$4==pt+10 && $NF==p {pt=$4; pl=$0; next}
pl {print pl}
{pt=$4;p=$NF}1' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
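A variant of the same run-compression idea (a sketch, not any answer above verbatim): keep only the first and last line of each consecutive run of identical group/value keys, using a one-line hold buffer:

```shell
printf '%s\n' \
  '2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0' \
  '2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0' \
  '2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0' \
  '2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2' |
awk -F, '{ key = $3 FS $4 FS $6 }
         key != prev { if (held != "") print held   # flush the end of the previous run
                       print                        # start of the new run
                       held = ""; prev = key; next }
         { held = $0 }                              # candidate last line of this run
         END { if (held != "") print held }'        # flush a run that ends the file
```

On this input it keeps the 00:00:01 and 00:00:21 zero rows and drops the interior 00:00:11 row.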

AWK array parsing issue

My two input files are pipe separated.
File 1 :
a|b|c|d|1|44
File 2 :
44|ab|cd|1
I want to store all the values from the first file in an array.
awk -F\| 'FNR==NR {a[$6]=$0;next}'
So if I store it the above way, is it possible to index into the array? Say I want $3 of File 1. How can I get that from a[]?
Also, will I be able to access the array values after I come out of that awk?
Thanks
I'll answer the question as it is stated, but I have to wonder whether it is complete. You state that you have a second input file, but it doesn't play a role in your actual question.
1) It would probably be most sensible to store the fields individually, as in
awk -F \| '{ for(i = 1; i < NF; ++i) a[$NF,i] = $i } END { print a[44,3] }' filename
See here for details on multidimensional arrays in awk. You could also use the split function:
awk -F \| '{ a[$NF] = $0 } END { split(a[44], fields); print fields[3] }'
but I don't see the sense in it here.
2) No. At most you can print the data in a way that the surrounding shell understands and use command substitution to build a shell array from it, but POSIX shell doesn't know arrays at all, and bash only knows one-dimensional arrays. If you require that sort of functionality, you should probably use a more powerful scripting language such as Perl or Python.
If, and I'm wildly guessing here, you want to use the array built from the first file while processing the second, you don't have to quit awk for this. A common pattern is
awk -F \| 'FNR == NR { for(i = 1; i < NF; ++i) { a[$NF,i] = $i }; next } { code for the second file here }' file1 file2
Here FNR == NR is a condition that is only true when the first file is processed (the number of the record in the current file is the same as the number of the record overall; this is only true in the first file).
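Putting the two pieces of the question together (a sketch; the file names and paths are illustrative), the first file's fields can be looked up by key while reading the second:

```shell
# sample data from the question, written to hypothetical temp files
printf 'a|b|c|d|1|44\n' > /tmp/f1.txt
printf '44|ab|cd|1\n'   > /tmp/f2.txt
awk -F '|' '
  FNR == NR { for (i = 1; i < NF; ++i) a[$NF, i] = $i; next }  # index file 1 by its last field
  { print "file-1 $3 for key " $1 ": " a[$1, 3] }              # look it up from file 2
' /tmp/f1.txt /tmp/f2.txt
```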
To keep it simple, you can reach your goal of storing (and accessing) values in an array without using awk:
arr=($(tr '|' ' ' < yourFilename))  # store fields in an array named arr
# accessing individual elements
echo "${arr[0]}"
echo "${arr[4]}"
# ...or accessing all elements
for n in "${arr[@]}"
do
    echo "$n"
done
...even though I wonder if that's what you are looking for. The initial question is not really clear.
