Finding missing IDs - bash

I have two files below which contain each line an ID. However, one of the files contains two IDs less.
$> grep ">" output.racon-1.fasta | wc -l
6492
$ grep ">" output.racon-2.fasta | wc -l
6490
How is possible which two IDs are missing?
FILE 1
$ grep ">" output.racon-1.fasta | head
>utg000001l
>utg000002l
>utg000003l
>utg000004l
>utg000005l
>utg000006l
>utg000007l
>utg000008l
>utg000009l
>utg000010l
$ grep ">" output.racon-1.fasta | tail
>utg006483l
>utg006484l
>utg006485l
>utg006486l
>utg006487l
>utg006488l
>utg006489l
>utg006490l
>utg006491l
>utg006492l
FILE 2
$ grep ">" output.racon-2.fasta | head
>utg000001l
>utg000002l
>utg000003l
>utg000004l
>utg000005l
>utg000006l
>utg000007l
>utg000008l
>utg000009l
>utg000010l
$ grep ">" output.racon-2.fasta | tail
>utg006483l
>utg006484l
>utg006485l
>utg006486l
>utg006487l
>utg006488l
>utg006489l
>utg006490l
>utg006491l
>utg006492l
Thank you in advance,

A simple diff with sort could do the job :
diff <(grep ">" output.racon-1.fasta | sort) <(grep ">" output.racon-2.fasta | sort)

As an alternative to using diff you can consider using join. If the files are sorted, this can tell you: (without options) the lines they have in common; using -v1 the lines the first file has that are not present in the second file; using -v2 the lines that are only present in the second file.
So, in your instance, if you believe that the second file is a subset of the first file, you could retrieve the addition lines in the first file with
join -v1 <(grep ">" output.racon-1.fasta) <(grep ">" output.racon-2.fasta)
or (if the files are not sorted already)
join -v1 <(grep ">" output.racon-1.fasta | sort) <(grep ">" output.racon-2.fasta | sort)
[We're using process substitution (the <(...) expressions) to feed the results of your grep commands to join.]
Note however, that if the second file is not a subset of the first, you'll either want to examine the output of the equivalent -v2 lines or take the information from diff.

Related

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space separated substring on a newline.
tr -d '[:punct:] to remove punctuations
sort and uniq to make a sorted file to use with comm which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in `cat file1 | tr -d '[:punct:]'`; do grep -wq $A file2 || echo $A; done
flags used for grep: q for quiet (don't need output), w for word match
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2

Find unique URLs in a file

Situation
I have many URLs in a file, and I need to find out how many unique URLs exist.
I would like to run either a bash script or a command.
myfile.log
/home/myfiles/www/wp-content/als/xm-sf0ab5df9c1262f2130a9b313192deca4-f0ab5df9c1262f2130a9b313192deca4-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,18,17
/home/myfiles/www/wp-content/als/xm-s4bf050d47df5bfaf0486a50a8528cb16-4bf050d47df5bfaf0486a50a8528cb16-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,15,14
/home/myfiles/www/wp-content/als/xm-sad122bf22152ba4823a520cc2fe59f40-ad122bf22152ba4823a520cc2fe59f40-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,17,16
/home/myfiles/www/wp-content/als/xm-s3c0f031eebceb0fd5c4334ecef15292d-3c0f031eebceb0fd5c4334ecef15292d-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,12,11
/home/myfiles/www/wp-content/als/xm-sff661e8c3b4f94957926d5434d0ad549-ff661e8c3b4f94957926d5434d0ad549-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s32c41ec2a5440ad220008b9abfe9add2-32c41ec2a5440ad220008b9abfe9add2-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s28787ca2f4372ddb3616d3fd53c161ab-28787ca2f4372ddb3616d3fd53c161ab-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,22,21
/home/myfiles/www/wp-content/als/xm-s89a7b68158e38391da9f0de1e636c0d5-89a7b68158e38391da9f0de1e636c0d5-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,13,12
/home/myfiles/www/wp-content/als/xm-sc4b14e10f6151995f21334061ff1d139-c4b14e10f6151995f21334061ff1d139-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,13,12
/home/myfiles/www/wp-content/als/xm-se589d47d163e43fa0c0d68e824e2c286-e589d47d163e43fa0c0d68e824e2c286-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s52f897a623c539d09bfb988bfb153888-52f897a623c539d09bfb988bfb153888-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,14,13
/home/myfiles/www/wp-content/als/xm-sccf27a904c5b88e96a3522b2e1180fed-ccf27a904c5b88e96a3522b2e1180fed-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,18,17
/home/myfiles/www/wp-content/als/xm-s6874bf9d589708764dab754e5af06ddf-6874bf9d589708764dab754e5af06ddf-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s46c55ec8387dbdedd7a83b3ad541cdc1-46c55ec8387dbdedd7a83b3ad541cdc1-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s08cfdc15f5935b947bbaa93c7193d496-08cfdc15f5935b947bbaa93c7193d496-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,9,8
/home/myfiles/www/wp-content/als/xm-s86e267bd359c12de262c0279cee0c941-86e267bd359c12de262c0279cee0c941-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,15,14
/home/myfiles/www/wp-content/als/xm-s5aa60354d134b87842918d760ec8bc30-5aa60354d134b87842918d760ec8bc30-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,14,13
Desired Result:
Unique Urls: 4
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l
Output:
4
See: man cut, man sort
An awk solution would be
awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
4
If you happen to use GNU awk then below would also give you the same result
awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
4
Or even short as pointed out in this cracker comment by #cyrus
awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
4
which uses awk multiple field separator functionality with more idiomatic awk.
Note: See the [ awk manual ] for more info.
Parse with sed, and since file appears to be already sorted,
(with respect to URLs), just run uniq, and count it:
echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
Use GNU grep to extract URLs:
echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
Output (either method):
Unique URLs: 4
tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
tr , '|' < myfile.log translates all commas into pipe characters
sort -u -t '|' -k 2,2 sorts unique (-u), pipe delimited (-t '|'), in the second field only (-k 2,2)
wc -l counts the unique lines

Find unique words

Suppose there is one file.txt in which below content text is written:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that can extract the first unique word before the first /.
So as output, I want ABC and EFG to be written in one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
Try this :
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
Another easy way:
cut -d"/" -f1 file.txt|uniq > out.txt
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
The first line grabs any string until a slash / and outputs it into newfile.txt.
The second line sorts the text, removing any duplicate strings you might have.

Simple diff/patch script for sorted unique file

How could I write a simple diff resp. patch script for applying additions and deletions to a list of lines in a file?
This could be a original file (it is sorted and each line is unique):
a
b
d
a simple patch file could look like this (or somehow as simple):
+ c
+ e
- b
The resulting file should look like (or in any other order, since sort could be applied anyways):
a
c
d
e
The normal patch formats can not be used since they include context, which might alter in this case.
Bash alternatives that read input files only once:
To generate patch you can:
comm -3 a.txt b.txt | sed 's/^\t/+ /;t;s/^/- /'
Because comm delimeters outputs from different files using tab, we can use that tab to detect if line should be added or removed.
To apply patch you can:
{ <patch.txt tee >(grep '^+ ' | cut -c3- >&5) |
grep '^- ' | cut -c3- | comm -13 - a.txt; } 5> >(cat)
The tee splits the input, that is the patch file, into two streams. The first part has + filtered and is outputted to file descriptor 5. The file descriptor 5 is opened to just >(cat) so it is just outputted on stdout. The second part has the minus - filtered and it is joined with a.txt and outputted. Because output should be line buffered, it should work.
A shell solution using comm, awk, and grep to apply such a patch would be:
A=a.txt B=b.txt P=patch.txt; { grep '^-' $P | cut -c 3- | comm -23 $A - ; grep '^+' $P | cut -c 3- } | sort -u > $B
to generate the patch file would be:
A=a.txt B=b.txt P=patch.txt; { comm -13 $A $B | awk '{print "+ " $0}' ; comm -23 $A $B | awk '{print "- " $0}' } > $P
Since nobody could give me an answer, I've created a small python script, which does exactly this job. https://github.com/white-gecko/simplepatch
To apply such a patch call it with (where outfile.txt is generated)
./simplepatch.py -m patch -i infile.txt -p patchfile.txt -o outfile.txt
To generate a patch/diff call it with (where patchfile.txt is generated)
./simplepatch.py -m diff -i infile.txt -o outfile.txt -p patchfile.txt

Find multible missing lines in csv using diff

One part of my problem was solved with this answer:
Threadlink
, but an important part of my problem was unsolved!
After using
diff a.csv b.csv | grep -E -A1 '^[0-9]+d[0-9]+$' | grep -v '^--$' | sed -n '0~2 p' | sed -re 's,^< (.*)$,\1,g'
several times i found something left.
Sometimes multible following lines are deleted.
If only one line was deleted there are something like this found:
3663d3661
For multible lines it is:
3724,3725d3718
So i changed the diff call to:
diff a.csv b.csv | grep -E -A1 '^[0-9]+\,*[0-9]*d[0-9]+$' | grep -v '^--$' | sed -n '0~2 p' | sed -re 's,^< (.*)$,\1,g'
This works for the first of multiple deleted lines.
My question is:
How could i get all deletet lines (maybe 5 following lines) in such a case?
What did i have to change in the diff call?
diff a.csv b.csv | sed -n '/^[0-9]\+d[0-9]*/,/^[0-9]\+[^d]*$/{/^[0-9]\+/d;s/^< //;p}'
will do that.
/^[0-9]\+d[0-9]*/,/^[0-9]\+[^d]*$/
will find the string range which are deleted
/^[0-9]\+/d
will delete all 6842d6844,6772
s/^< //
will replace all < in the beginning of lines
and p will print the line.

Resources