I want to delete any lines that have the same number at the end, for example:
Input:
abc 77777
rgtds 77777
aswa 77777
gdf 845
sdf 845
ytn 963
fgnb 963
Output:
abc 77777
gdf 845
ytn 963
Note: every line with a repeated number must be deleted, but one of the lines that had that number must stay.
I want to convert this text file to my output:
Input:
c:/files/company/aj/psohz.mp4 905
c:/files/company/rs/oxija.mp4 905
c:/files/company/nw/kzlkg.mp4 905
c:/files/company/wn/wpqov.mp4 905
c:/files/company/qi/jzdjg.mp4 905
c:/files/company/kq/dadfr..mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/fx/jszmn.jpg 7839
c:/files/company/me/plsqx.mp4 7839
c:/files/company/xm/uswjb.mp4 7839
c:/files/company/ay/pnnhu.pdf 8636184
c:/files/company/os/glwou.pdf 8636184
c:/files/company/px/kucdu.pdf 8636184
Output:
c:/files/company/kq/dadfr..mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/ay/pnnhu.pdf 8636184
If the same numbers are always grouped together, you can use uniq (tested with the version from GNU coreutils):
uniq -f1 input.txt
-f1 means skip the first field when checking for duplicates.
Note that it returns the first element of each group, i.e. psohz instead of dadfr in your example. It's not clear what element of each group you wanted, as you returned the last one from the first group, but the first element of the other groups.
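If you actually want the last line of each group instead (as with dadfr in the 905 group), one sketch is to reverse the file with tac, deduplicate, and reverse back:
tac input.txt | uniq -f1 | tac
Note that this keeps the last member of every group, so the 7839 group would then yield uswjb rather than xmpye.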
If the same numbers aren't grouped together, use sort to group them together:
sort -k2 -su input.txt
-s means stable, i.e. you'll always get the first element of each group, but the groups won't be sorted in the original order in the output
-u means unique
-k2 means use only field 2 in comparisons
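On the sample input above this would print something like the following; the keys are compared as strings, so 7839 sorts before 8636184, which sorts before 905:
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/ay/pnnhu.pdf 8636184
c:/files/company/aj/psohz.mp4 905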
If you want the first element of each group, with the output kept in the same order as the input, you can use perl:
perl -ane 'print unless $seen{ $F[1] }++' -- input.txt
-n reads the input line by line
-a splits the input on whitespace into the @F array
The second column of every line is saved as a key in the %seen hash. When a number is seen for the first time, the line is printed; any following occurrence won't be, as $seen{ $F[1] } will already be greater than 0, i.e. true.
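Written out as a plain script, the one-liner is roughly what the -n and -a switches expand to (a sketch):
perl -e '
    while (<>) {                        # -n: read the input line by line
        my @F = split;                  # -a: split the line on whitespace into @F
        print unless $seen{ $F[1] }++;  # print only the first line seen for each number
    }
' input.txt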
If you know that there are always just two columns (i.e., no blanks in the filename) and that the lines with the same number are always in the same block, you can use uniq:
$ uniq -f1 infile
c:/files/company/aj/psohz.mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/ay/pnnhu.pdf 8636184
-f1 says to ignore the first field when asserting uniqueness.
If you don't know about blanks, and the same numbers might be anywhere in the file, you can use awk:
$ awk '!a[$NF]++' infile
c:/files/company/aj/psohz.mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/ay/pnnhu.pdf 8636184
This counts the number of occurrences of the last field of each line, and if that number is zero before incrementing, the line gets printed. It's a compact way of expressing
awk '{ if (a[$NF] == 0) { print; a[$NF] += 1 } }' infile
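If you instead want to keep the last line for each number (your expected output kept dadfr, the last of the 905 group), a sketch using the same idea on a reversed file:
$ tac infile | awk '!a[$NF]++' | tac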
This is my space delimited file:
bob.txt 32.1 0.99 34 56
ann.txt 35 45 23 45
I would like to remove all rows having a floating number in the second column, so the output is:
ann.txt 35 45 23 45
I tried to use grep, but I do not know how to specify the column in which it should look:
grep -vE '\-?[0-9]+|\-?[0-9]+\.[0-9]+' file.txt > out.txt
Using awk:
awk '$2 !~ /^[[:digit:]]+[.][[:digit:]]+$/ { print }' file
Print every line whose second whitespace-delimited field does not match one or more digits, a dot and one or more digits.
how to specify the column in which it should look:
Match the first column (and the separating spaces) explicitly, so the rest of the pattern applies to the second column:
grep -vE '^[^ ]+ +-?[0-9]+\.'
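Applied to the sample file, this should leave only the ann.txt row:
grep -vE '^[^ ]+ +-?[0-9]+\.' file.txt > out.txt
out.txt then contains:
ann.txt 35 45 23 45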
Given this CSV input file:
Id Name Address Phone
---------------------
100 Abc NewYork 1234567890
101 Def San Antonio 9876543210
102 ghi Chicago 7412589630
103 GHJ Los Angeles 7896541259
How do we grep/command for the value using the key?
If the key is 100, the expected output is NewYork.
You can try this:
grep 100 filename.csv | cut -d, -f3
Output:
NewYork
This will search the whole file for the value 100, and return all the values in the 3rd column of the matching rows.
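Note that grep 100 matches the string 100 anywhere on a line, not just in the Id column, so with other data it could also hit a phone number or an address. If the file really is comma separated, anchoring the pattern to the first field is safer (a sketch):
grep '^100,' filename.csv | cut -d, -f3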
With GNU grep:
grep -Po '^100.....\K...........' file
or shorter:
grep -Po '^100.{5}\K.{11}' file
Output:
NewYork
Awk splits lines by whitespace sequences (by default).
You could use that to write a condition on the first column.
In your example input, it looks like the data is not CSV but fixed-width columns (except the header). If that's the case, then you can extract the name of the city as a substring:
awk '$1 == 100 { print substr($0, 9, 11); }' input.csv
Here 9 is the starting position of the city column, and 11 is its length.
If, on the other hand, your input file is not what you pasted but really CSV (comma separated values), and there are no embedded commas or newline characters in the fields, then you can write it like this:
awk -F, '$1 == 100 { print $3 }' input.csv
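For example, if the rows really were comma separated (a hypothetical reformatting of the sample data):
100,Abc,NewYork,1234567890
101,Def,San Antonio,9876543210
the command above would print:
NewYork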
I have an AWK command to modify to get a unique count of records in a file based on primary keys. Inside the record file there are 21 elements, with columns 1 and 18 being the PKs. The records are all on one row; the record separator is \^ and the field separator is |. This is what I have so far, but it is still giving me the total # of records in the file, not the unique count:
awk 'BEGIN{RS="\\^";FS="\\|";} {a[ $1 $18 ]++;}END{print length(a);}' filename
Sample Data:
1|01212121|0|OUTGOING| | | | | |57 OHARE DR|not available|DALLAS|TX|03560|US|1131142334825|1|Jan 15 2004 11:12:06:576AM|Jan 15 2004 2:54:41:226PM|SYSTEM|\^
There are 2 million rows of this sort of data and I have 30 duplicates.
Expected output should be: 1999970
Use GNU awk for multi-char RS and use SUBSEP between your array index component fields to make the result unique:
awk 'BEGIN{RS="\\^"; FS="|"} NF>1{a[$1,$18]} END{print length(a)}' filename
You need the NF>1 test if your input file/line ends with \^\n instead of just \n. We know it does end with \n because you said that wc -l on the file returns 1, and wc -l only counts \ns; your one sample input line ends in \^, so that all leads me to believe that your file ends with \^\n, and so the NF>1 test is necessary to avoid including the blank record after the final \^.
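As a sanity check, the same record splitting can be used to count all records and compare that against the unique count (a sketch, GNU awk assumed for the multi-character RS):
awk 'BEGIN{RS="\\^"; FS="|"} NF>1{total++} END{print total}' filename
On the data described above this should print about 2000000, while the unique count prints 1999970.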
In standard awk, at least, the record separator RS can only hold a single character. Since everything is crammed into a single line, you need to choose the last character of your data rows as RS and discard the last field (which consists of \).
Fix it like this:
awk 'BEGIN{RS="^";FS="|"} {a[$1,$18]++} END{print length(a)}' filename
Note that awk will now split on every ^ it encounters in the input. Should you need to split only on \^, I'd suggest the following:
sed 's/\\^/\n/g' filename |awk 'BEGIN{FS="|"} {a[$1,$18]++} END{print length(a)}'
Edit:
Incorporated remarks from #Ed.
I have a file that has lines like this. I'd like to uniq it where every unique item consists of 2 lines, so since
bob
100
is here twice, I would only print it one time. Help please. Thanks.
bob
100
bill
130
joe
123
bob
100
joe
120
Try this:
printf "%s %s\n" $(< file) | sort -u | tr " " "\n"
Output:
bill
130
bob
100
joe
120
joe
123
With bash builtins:
declare -A a # declare associative array
while read name; do read value; a[$name $value]=; done < file
printf "%s\n" ${!a[#]} # print array keys
Output:
joe
120
joe
123
bob
100
bill
130
Try sed:
sed 'N;s/\n/ /' file | sort -u | tr ' ' '\n'
N: read next line, and append to current line
;: command separator
s/\n/ /: replace eol with space
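For reference, on the sample input the first sed stage alone pairs the lines up like this (before sort -u and tr run):
bob 100
bill 130
joe 123
bob 100
joe 120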
I would use awk:
awk 'NR%2{l=$0}!(NR%2){seen[l"\n"$0]}END{for(i in seen)print i}' input
Let me explain the command in a multi-line version:
# On odd line numbers, store the current line in l.
# Note that line numbers start at 1 in awk.
NR%2 {l=$0}
# On even line numbers, create an index in an associative array.
# The index is the last line plus the current line.
# Duplicates would simply overwrite themselves.
!(NR%2) {seen[l"\n"$0]}
# After the last line of input has been processed iterate
# through the array and print the indexes
END {for(i in seen)print i}
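If the pairs should come out in their original input order rather than the arbitrary for (i in seen) order, a variant of the same idea (a sketch) prints each pair the first time it is seen:
awk 'NR%2{l=$0} !(NR%2) && !seen[l"\n"$0]++ {print l"\n"$0}' input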
In a text file, how do I print out only the lines where the first column is duplicated but the 2nd column is different? I want to reconcile these differences. Possibly using awk/sed/bash?
Input:
Jon AAA
Jon BBB
Ellen CCC
Ellen CCC
Output:
Jon AAA
Jon BBB
Note that the real file is not sorted.
Thanks for any help.
This should do it (I broke the one-liner into 3 lines for better readability):
awk '!($1 in a) {a[$1]=$2;next}
$1 in a && $2!=a[$1]{p[$1 FS $2];p[$1 FS a[$1]]}
END{for(x in p)print x}' file
Line 1 saves $1 and $2 into array a, if $1 is seen for the first time.
Line 2: for an already-seen $1 with a different $2, put both lines into array p, so that the same $1,$2 combination won't be printed multiple times.
Line 3 (the END block) prints the indexes of array p.
sort file | uniq -u
Will only print the unique lines.
This might work for you:
sort file | uniq -u | rev | uniq -Df1 | rev
This sorts the file, removes any fully duplicated lines, reverses each line, keeps only the lines that share a key with another line (uniq -Df1 ignores the first field of the reversed line, i.e. the reversed 2nd column, and prints all lines whose remaining text, the reversed key, is duplicated), and then reverses the lines back to their original orientation.
This will drop duplicate lines and lines with singleton keys.
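On the sample input the intermediate stages look like this: sort file | uniq -u leaves
Jon AAA
Jon BBB
and after rev these become
AAA noJ
BBB noJ
uniq -Df1 then ignores the first (reversed value) field, sees the same remaining key noJ on both lines and prints them both as duplicates, and the final rev restores the original orientation.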
Just a normal whole-line dedup should work:
awk '!a[$0]++' test