If Partial Duplicate on Line, Remove Line - bash

I have a file with 400+ lines, but some of the lines have partial duplicates. Below is a simplified version.
file.txt:
A_12_23 A_12_34 B_12_23 B_12_34
A_1_34 A_23_34 B_1_12 B_1_23
The fields are whitespace-separated; the letter before the first underscore is an identifier and everything after that underscore is its value. A partial duplicate occurs when one of the A fields has the same value after the identifier as one of the B fields. The lines are sorted so that the A fields always come before the B fields, and there are no other identifiers.
What I would like to do is remove any line with a partial duplicate.
output.txt:
A_1_34 A_23_34 B_1_12 B_1_23
How would I go about doing this? I know how to remove exact duplicates on a line by:
awk '$1!=$2' file.txt > output.txt # Can use various combinations if needed
I am not sure how to handle partial duplicates. For example, 12_23 appears twice on the first line, so I want that line deleted. Detecting a value that appears twice is enough, since that also catches values repeated more often.
Please let me know how I can improve this question. Thanks in advance!

Slightly generalizing the answer by malarres, here is a regex which looks for any value after A which also occurs after B, followed by space or newline. The number of digit groups in each field is arbitrary, but this does assume that all A values are before all B values, and that these tokens only occur at the beginning of a field.
grep -Ev 'A_([^_ ]+(_[^ _]+)*) (.* )?B_\1( |$)'
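Applied to the sample file.txt from the question, this keeps only the line with no repeated value:
$ grep -Ev 'A_([^_ ]+(_[^ _]+)*) (.* )?B_\1( |$)' file.txt
A_1_34 A_23_34 B_1_12 B_1_23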

Rather than awk you can use grep for that
$ grep -v -E '._(.._..).*\1' file.txt
-v to print lines NOT matching
'._(.._..).*\1' looks for a repetition of the pattern .._.. (note this assumes each value is exactly two characters, an underscore, and two more characters)
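Run on the example file this gives the desired output:
$ grep -v -E '._(.._..).*\1' file.txt
A_1_34 A_23_34 B_1_12 B_1_23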

Strip the first two characters of each field and check for duplicates; if there are none, print the line. You can modify the last argument of substr to exclude a different number of initial characters.
awk '{delete a; for (i=1;i<=NF;i++) if (a[substr($i,3)]++) next} 1' file
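On the example file this should print only the second line (note that deleting a whole array with delete a is an extension, though it is supported by gawk, mawk and BWK awk):
$ awk '{delete a; for (i=1;i<=NF;i++) if (a[substr($i,3)]++) next} 1' file.txt
A_1_34 A_23_34 B_1_12 B_1_23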

Related

Adding a variable number of lines to a file in Bash

I have a file with lines of a format XXXXXX_N where N is some number. For example:
41010401_1
42023920_3
45788_1
I would like to add N-1 lines before every line where N>1, so that the XXXXXX value in question gets a line for every value from 1 up to and including the original N:
41010401_1
42023920_1
42023920_2
42023920_3
45788_1
I thought about doing it with sed, but I'm not sure how to conditionally append a varying number of lines whose values depend on what sed reads.
Is sed even the correct command to deal with this problem?
Any help would be appreciated.
One way in awk is to set the field separator to underscore and, whenever the 2nd field is greater than 1, print all the missing records in a loop, like below.
$ awk 'BEGIN{FS=OFS="_"} $2>1{for(i=1;i<$2;i++) print $1,i} 1' file
41010401_1
42023920_1
42023920_2
42023920_3
45788_1
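sed is probably not the most natural tool for this; if you want to stay in the shell, a plain bash loop does the same expansion (a minimal sketch, assuming every line has exactly the XXXXXX_N form shown):
while IFS=_ read -r id n; do
  # emit id_1 .. id_N for each input line
  for ((i=1; i<=n; i++)); do printf '%s_%s\n' "$id" "$i"; done
done < file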

Remove duplicated entries in a table based on the first column (which consists of two values separated by a colon)

I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split each line on either a space or : as the field separator and group the lines by the word after the colon.
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the individual words on the line, and the part !unique[$2]++ builds a hash table keyed on the value of $2. The count is incremented every time a value is seen in $2, so the negation ! keeps a line from being printed again once its key has already been seen.
Defining the field separator as a regex with the -F flag might not be supported on all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The part above assumes you want to deduplicate the file based on the word after the :, but to deduplicate on the whole first column, just do
awk '!unique[$1]++' file
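On the sample input this prints exactly the desired output, keeping the first line seen for each chr:position key:
$ awk '!unique[$1]++' file
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864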
Since your input data is pretty simple, the command is going to be very easy.
sort file.txt | uniq -w7
This just sorts the file and runs uniq comparing only the first 7 characters (so it assumes the chr:position key is always exactly 7 characters long). Here those characters are digits and a colon; if letters appear and case should be ignored, add -i to the command.
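For what it's worth, the original sort attempt also works once the second key is dropped, so that -u compares column 1 only; adding -s keeps the first-seen entry among duplicates (a sketch, not from the original answers):
sort -t' ' -k1,1 -s -u input > output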

Replace string after first semicolon while retaining the string after that

I have a result file, values separated by ; as below:
137;AJP14028.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
and I want to change the second value (AJP14028.1_VP35) to only AJP14028, without the ".1_VP35" at the back. So the result will be:
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Any idea on how to do this? I am trying to solve this using either sed or awk but I am not really familiar with them yet.
With that input, and focusing on the second field, you can use awk:
$ awk 'BEGIN{FS=OFS=";"} {split($2, arr, /\.1/); $2=arr[1]} 1' file
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Explanation:
BEGIN{FS=OFS=";"} sets FS and OFS to ";". This splits the input on the ; character and sets the output field separator to that same character.
split($2, arr, /\.1/) splits the second field on a literal .1 and places the result in an array.
$2=arr[1] is an awk idiom that resets the second field, $2, to the trimmed value. A side effect is that the whole record, $0, is rebuilt using the output field separator, OFS.
1 at the end is another awkism -- print the current record.
If you just have the fixed string .1_VP35 to remove (and you do not care whether it is field specific) you can just use sed:
sed 's/\.1_VP35//' file
awk '{sub(/\.1_VP35/,"")}1' file
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
sed -r 's/(^[^.]*)(\.[^;]*)(.*)/\1\3/' inputfile
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Here, back-referencing is used to divide the input line into three groups, delimited by (). Later they are referred to as \1 and so on.
The first group matches from the start of the line up to (but not including) the first dot.
The second group matches from the first dot up to the first semicolon.
The third group matches everything that follows.
This might work for you (GNU sed):
sed 's/\(;[^.]*\)[^;]*/\1/' file
Capture, as a back reference, the first ; and everything after it that is not a ., then delete the rest of that field (everything from there up to the next ;).
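For comparison, a field-aware awk variant of the same idea, trimming everything from the first dot in the second field (a sketch in the spirit of the answers above, not from the original post):
awk 'BEGIN{FS=OFS=";"} {sub(/\..*/,"",$2)} 1' file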

Remove rows with too many delimiters

I have a file with fields separated by the '`' character, but sometimes the actual data also contains this character. How can I remove all the erroneous rows and retain only the good-quality data?
A sample row is below. Towards the end, 'fff`ff' is the erroneous column; in such a case the row should be eliminated.
xxx`1000165811`2012`2012_q2`05/09/2012 22:02:00`1343`04/07/2004 00:00:00`05/09/2012 00:00:00````F`1`1.000000`9.620000`1.0000````fff`Not`Free`Free`1.000000`9.620000`0.000000`1.0000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`56565666`255.590000`21`0`0.000000```ddd`dddd`FA May 2012 ddd`0.000000`0.000000`0.000000`0.000000`0.000000`05/30/2012 00:00:00`05/30/2012 00:00:00`1.000000`ddd`ddd`OW`DL`dd dd dd`ddd`dd`dd dd`dd dd`0.000000`0.000000``````````0.000000`````````Non_Mobile`9.620000`1.000000`1`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`9.620000`9.620000`0.000000`0.000000`0.000000`0.000000`28.590000`6.990000`**fff`ff**`````````9.620000`1.000000`1
You need to know what the correct number of delimiters in a line is. You need to count the actual number of delimiters in each line, and reject those lines where the actual count is not the correct number.
Assuming the correct number of separators is n=5, you could try:
n=5
grep -E '^[^`]*(`[^`]*){'"$n"'}$' data
The regex uses extended regular expressions (-E). The regex matches the start of the line, zero or more non-back-ticks, then a sequence of n occurrences of a back tick followed by zero or more non-back-ticks, followed by the end of line. Because the back-tick is a shell metacharacter, it is best to enclose most of the regular expression in single quotes. The variable $n could be used without the double quotes around it, but it's generally best to enclose variables in double quotes. Clearly, you can also use this version too:
grep -E '^([^`]*`){'"$n"'}[^`]*$' data
Given a data file data:
AA`BB`CC`DD`EE`FF
AABB`CC`DD`EE`FF
A`A`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
``````
````
Welcome`to`the`land`of`insanity
The output of the command is:
AA`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
Welcome`to`the`land`of`insanity
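If you also want to inspect the rows that were rejected, the same pattern with -v inverts the selection:
grep -Ev '^[^`]*(`[^`]*){'"$n"'}$' data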
grep -v "[^`]`[^`]`[^`]`"
you need to have one more times that the correct lines would have
In the spirit of "Be careful what you ask for", here is a "one-liner" (spread over three lines for readability) that will do what was asked, using only awk and assuming that $FILE is the relevant filename.
awk -F'`' -v file="$FILE" '
BEGIN{ while(getline<file){if (min==""||NF<min){min=NF}}}
NF==min' "$FILE"
This incantation first determines the minimum number of delimiters per line (without sorting the file), and then rejects all lines with more than that many.
(This is similar to Ed Morton's proposal, but without the bug :-)

How to make awk ignore the field delimiter inside double quotes? [duplicate]

This question already has answers here:
Escaping separator within double quotes, in awk
(3 answers)
Closed 7 years ago.
I need to delete 2 columns in a comma-separated values file.
Consider the following lines in the CSV file:
"abc#xyz.com,www.example.com",field2,field3,field4
"def#xyz.com",field2,field3,field4
Now, the result I want at the end:
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
I used the following command:
awk 'BEGIN{FS=OFS=","}{print $1,$4}'
But the embedded comma inside the quotes is creating a problem. Following is the result I am getting:
"abc#xyz.com,field3
"def#xyz.com",field4
Now my question is how do I make awk ignore the "," which are inside the double quotes?
From the GNU awk manual (http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content), FPAT describes the contents of a field rather than the separator, so a quoted field containing commas stays intact:
$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4}' file
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
and see "What's the most robust way to efficiently parse CSV using awk?" for parsing CSVs more generally, including those with newlines, etc. within fields.
This is not a bash/awk solution, but I recommend CSVKit, which can be installed by pip install csvkit. It provides a collection of command line tools to work specifically with CSV, including csvcut, which does exactly what you ask for:
csvcut --columns=1,4 <<EOF
"abc#xyz.com,www.example.com",field2,field3,field4
"def#xyz.com",field2,field3,field4
EOF
Output:
"abc#xyz.com,www.example.com",field4
def#xyz.com,field4
It strips the unnecessary quotes, which I suppose shouldn't be a problem.
Read the docs of CSVKit here on RTD. ThoughtBot has a nice little blog post introducing this tool, which is where I learnt about CSVKit.
In your sample input file, it is the first field, and only the first field, that is quoted. If this is true in general, then consider the following as a method for deleting the second and third columns:
$ awk -F, '{for (i=1;i<=NF;i++){printf "%s%s",(i>1)?",":"",$i; if ($i ~ /"$/)i=i+2};print""}' file
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
As mentioned in the comments, awk does not natively understand quoted separators. This solution works around that by looking for the first field that ends with a quote. It then skips the two fields that follow.
The Details
for (i=1;i<=NF;i++)
This starts a for over each field i.
printf "%s%s",(i>1)?",":"",$i
This prints field i. If it is not the first field, the field is preceded by a comma.
if ($i ~ /"$/)i=i+2
If the current field ends with a double-quote, this then increments the field counter by 2. This is how we skip over fields 2 and 3.
print""
After we are done with the for loop, this prints a newline.
This awk (GNU awk, since it uses the three-argument form of match()) should work regardless of where the quoted field is, and works on escaped quotes as well.
awk '{while(match($0,/"[^"]+",|([^,]+(,|$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
Input
"abc#xyz.com,www.example.com",field2,field3,field4
"def#xyz.com",field2,field3,field4
field1,"abc#xyz.com,www.example.com",field3,field4
Output
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
field1,field4
It even works on
field1,"field,2","but this field has ""escaped"\" quotes",field4
which the mighty FPAT variable fails on!
Explanation
while(match($0,/"[^"]+",|([^,]+(,|$))/,a))
Starts a while loop that continues as long as the match succeeds (i.e. there is still a field left).
match() finds the first occurrence of the regex, which happens to match one field, and stores it in array a.
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]
Sets $0 to begin just past the matched field and adds the matched field to the next position in array b.
print b[1] b[4];x=0}
Prints the fields you want from b and sets x back to zero for the next line.
Flaws
Will fail if a field contains both escaped quotes and a comma.
Edit
Updated to support empty fields
awk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
