Find multiple missing lines in CSV using diff - shell

One part of my problem was solved by the answer in this thread (Threadlink), but an important part remained unsolved.
After using
diff a.csv b.csv | grep -E -A1 '^[0-9]+d[0-9]+$' | grep -v '^--$' | sed -n '0~2 p' | sed -re 's,^< (.*)$,\1,g'
several times, I found that some lines were still missed.
Sometimes several consecutive lines are deleted.
If only one line was deleted, diff reports something like this:
3663d3661
For multiple lines it is:
3724,3725d3718
So I changed the diff call to:
diff a.csv b.csv | grep -E -A1 '^[0-9]+\,*[0-9]*d[0-9]+$' | grep -v '^--$' | sed -n '0~2 p' | sed -re 's,^< (.*)$,\1,g'
This works only for the first of several deleted lines.
My question is:
How can I get all deleted lines (maybe 5 consecutive lines) in such a case?
What do I have to change in the diff call?

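diff a.csv b.csv | sed -n '/^[0-9,]\+d[0-9]*/,/^[0-9]\+[^d]*$/{/^[0-9]\+/d;s/^< //;p}'
will do that.
/^[0-9,]\+d[0-9]*/,/^[0-9]\+[^d]*$/
selects the range of lines from a deletion header (a single line such as 3663d3661, or several lines such as 3724,3725d3718) up to the next header that is not a deletion. The comma in [0-9,] is needed so that multi-line headers also start the range.
/^[0-9]\+/d
deletes the header lines themselves, such as 3724,3725d3718.
s/^< //
strips the leading "< " from the remaining lines,
and p prints each line.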
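If the sed range is hard to follow, here is an alternative sketch with awk. It remembers which kind of hunk it is currently in and prints the "< " lines only inside deletion hunks, so changed lines are skipped. The function name deleted_lines is only an illustrative choice, and the sketch assumes diff's default (normal) output format.

```shell
# Print every line that exists in file $1 but was deleted in file $2.
# Assumes the default "normal" diff output (headers like 3d2 or 6,7d4).
deleted_lines() {
  diff "$1" "$2" | awk '
    # Hunk header, e.g. 3663d3661, 3724,3725d3718, 12a13 or 5,6c7:
    # remember whether this hunk is a pure deletion.
    /^[0-9]+(,[0-9]+)?[acd][0-9]+(,[0-9]+)?$/ { in_delete = ($0 ~ /d/); next }
    # Inside a deletion hunk, removed lines are prefixed with "< ".
    in_delete && /^< / { print substr($0, 3) }
  '
}
```

Usage would be something like deleted_lines a.csv b.csv. Because the header check covers both the single-line and the N,M forms, the number of consecutive deleted lines does not matter.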

Related

GenBank file manipulation with bash

I have this GenBank file and need help manipulating it.
Here is a randomly picked part of the file:
CDS complement(1750..1956)
/gene="MAMA_L4"
/note="similar to MIMI_L9"
/codon_start=1
/product="hypothetical protein"
/protein_id="AEQ60146.1"
/translation="MHFLDDDNDESNNCFDDKEKARDKIIIDMLNLIIGKKKTSYKCL
DYILSEQEYKFAILSIVENSIFLF"
misc_feature complement(2020..2235)
/note="MAMA_L5; similar to replication origin binding
protein (fragment)"
gene complement(2461..2718)
/gene="MAMA_L6"
CDS complement(2461..2718)
/gene="MAMA_L6"
/codon_start=1
/product="T5orf172 domain-containing protein"
/protein_id="AEQ60147.1"
/translation="MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDV
IITYFQYTDKAKKVESDLKEKLSKCRITNIKGNLSEWIVID"
My goal is to extract the contents of /translation= and /product=, like the following:
T5orf172 domain-containing protein
MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDVIITYFQYTDKAKKVESDLKEKLSKCRITNIKGNLSEWIVID
(The part of the /translation= value that continues on the next line of the file is what causes my problem.)
I am trying to write a bash script, so I was thinking of applying something like:
grep -w /product= genebank.file |cut -d= -f2| sed 's/"//'g > File1
grep -w /translation= genebank.file |cut -d= -f2| sed 's/"//'g > File2
paste File1 File2
The problem is that for the translation entries grep returns only the first line, so the output stops before the continuation line:
T5orf172 domain-containing protein MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDV
Can anybody help me to step over this issue? Thank you in advance!
With GNU sed:
sed -En '/^\s*\/(product|translation)="/{
s///
:a
/"$/! { N; s/\n\s*//; ba; }
s/"$//p
}' file |
sed 'N; s/\n/\t/'
Note: This assumes the second occurrence of the delimiter " is immediately followed by a newline in the input file.
I haven't fully tested this but if you add -A1 to your grep command you'll get one line after the match.
grep -w /product= genebank.file |cut -d= -f2| sed 's/"//'g > File1
grep -A1 -w /translation= genebank.file |cut -d= -f2| sed 's/^ *//g' > File2
paste File1 File2
You would need to delete that extra newline but that should get you close.

Combine multiple text files (row wise) into columns

I have multiple text files that I want to merge columnwise.
For example:
File 1
0.698501 -0.0747351 0.122993 -2.13516
File 2
-5.27203 -3.5916 -0.871368 1.53945
I want the output file to be like:
0.698501, -5.27203
-0.0747351, -3.5916
0.122993, -0.871368
-2.13516, 1.53945
Is there a one-line bash command that can accomplish this?
I'd appreciate any help.
---Lyndz
With awk:
awk 'NR==1 {split($0,a1," "); next} {n=split($0,a2," ")} END{for(i=1;i<=n;i++) print a1[i] ", " a2[i]}' file1 file2
Output:
0.698501, -5.27203
-0.0747351, -3.5916
0.122993, -0.871368
-2.13516, 1.53945
paste <(cat file1 | sed -E 's/ +/&,\n/g') <(cat file2 | sed -E 's/ +/&\n/g') | column -s $',' -t | sed -E 's/\s+/, /g' | sed -E 's/, $//g'
It got a bit complicated; it can probably be done more simply.
P.S.: Please look up the man pages of each command to see what they do.
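A possibly simpler variant, as a sketch: turn each row into a column with tr, then glue the two columns together with paste. It assumes exactly two files, each holding one row of space-separated values without leading spaces. The function name merge_columns and the temporary file are only illustrative.

```shell
# Merge two single-row files into two comma-separated columns.
merge_columns() {
  tmp=$(mktemp) || return 1
  # Second file: one value per line (runs of spaces collapse into one newline).
  tr -s ' ' '\n' < "$2" > "$tmp"
  # First file likewise, then pair the lines up with "," and widen it to ", ".
  tr -s ' ' '\n' < "$1" | paste -d, - "$tmp" | sed 's/,/, /'
  rm -f "$tmp"
}
```

Calling merge_columns file1 file2 on the sample files should print the four "value, value" lines shown in the question. In bash the temporary file could be replaced by process substitution.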

Finding missing IDs

I have the two files below, each of which contains one ID per line. However, one of the files contains two fewer IDs.
$ grep ">" output.racon-1.fasta | wc -l
6492
$ grep ">" output.racon-2.fasta | wc -l
6490
How can I find out which two IDs are missing?
FILE 1
$ grep ">" output.racon-1.fasta | head
>utg000001l
>utg000002l
>utg000003l
>utg000004l
>utg000005l
>utg000006l
>utg000007l
>utg000008l
>utg000009l
>utg000010l
$ grep ">" output.racon-1.fasta | tail
>utg006483l
>utg006484l
>utg006485l
>utg006486l
>utg006487l
>utg006488l
>utg006489l
>utg006490l
>utg006491l
>utg006492l
FILE 2
$ grep ">" output.racon-2.fasta | head
>utg000001l
>utg000002l
>utg000003l
>utg000004l
>utg000005l
>utg000006l
>utg000007l
>utg000008l
>utg000009l
>utg000010l
$ grep ">" output.racon-2.fasta | tail
>utg006483l
>utg006484l
>utg006485l
>utg006486l
>utg006487l
>utg006488l
>utg006489l
>utg006490l
>utg006491l
>utg006492l
Thank you in advance,
A simple diff with sort could do the job:
diff <(grep ">" output.racon-1.fasta | sort) <(grep ">" output.racon-2.fasta | sort)
As an alternative to using diff you can consider using join. If the files are sorted, this can tell you: (without options) the lines they have in common; using -v1 the lines the first file has that are not present in the second file; using -v2 the lines that are only present in the second file.
So, in your instance, if you believe that the second file is a subset of the first file, you could retrieve the additional lines in the first file with
join -v1 <(grep ">" output.racon-1.fasta) <(grep ">" output.racon-2.fasta)
or (if the files are not sorted already)
join -v1 <(grep ">" output.racon-1.fasta | sort) <(grep ">" output.racon-2.fasta | sort)
[We're using process substitution (the <(...) expressions) to feed the results of your grep commands to join.]
Note however, that if the second file is not a subset of the first, you'll either want to examine the output of the equivalent -v2 lines or take the information from diff.
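comm is another option, shown here as a sketch. It compares whole sorted lines (join compares only the first field), and -23 suppresses the lines unique to the second input and the common lines, leaving the lines found only in the first input. The helper name ids_only_in_first is made up for illustration.

```shell
# Print the header lines (those containing ">") that occur in FASTA file $1
# but not in FASTA file $2.
ids_only_in_first() {
  tmp=$(mktemp) || return 1
  grep '>' "$2" | sort > "$tmp"
  # comm needs sorted input; "-" makes it read the first list from stdin.
  grep '>' "$1" | sort | comm -23 - "$tmp"
  rm -f "$tmp"
}
```

For the files in the question this would be called as ids_only_in_first output.racon-1.fasta output.racon-2.fasta. Swapping the arguments shows IDs that exist only in the second file.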

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space separated substring on a newline.
tr -d '[:punct:]' to remove punctuation.
sort and uniq to make a sorted file to use with comm which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in $(tr -d '[:punct:]' < file1); do grep -wqi -- "$A" file2 || echo "$A"; done
Flags used for grep: -q for quiet (no output needed), -w for whole-word match, -i for case-insensitive matching (so that "This" matches "this").
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if($i!="" && !(tolower($i) in a)) # if the field is a word that is not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2
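As for why the original comm attempt likely fails: as far as I know, GNU comm has no -i option (it seems to be a BSD extension), and comm expects both inputs to be sorted, which file_2.txt is not. A sketch that keeps the comm idea is to lowercase and sort both sides first. The function name words_not_listed is hypothetical, and only ASCII letters are treated as word characters.

```shell
# Print the words of text file $1 that do not appear in word list $2,
# ignoring case. The output is lowercase and sorted, not in text order.
words_not_listed() {
  tmp=$(mktemp) || return 1
  # Normalise the word list: lowercase, sorted, unique.
  tr 'A-Z' 'a-z' < "$2" | sort -u > "$tmp"
  # Split the text into one word per line, drop empty lines, lowercase,
  # sort, then keep only the lines unique to the text side.
  tr -cs 'A-Za-z' '\n' < "$1" | sed '/^$/d' | tr 'A-Z' 'a-z' | sort -u |
    comm -23 - "$tmp"
  rm -f "$tmp"
}
```

With the sample files, words_not_listed file_1.txt file_2.txt should list a, file, is, lines, of and several, one per line. Unlike the loop and awk answers, this loses the original word order and duplicates.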

Simple diff/patch script for sorted unique file

How could I write a simple diff and patch script for applying additions and deletions to a list of lines in a file?
This could be an original file (it is sorted and each line is unique):
a
b
d
A simple patch file could look like this (or something similarly simple):
+ c
+ e
- b
The resulting file should look like (or in any other order, since sort could be applied anyways):
a
c
d
e
The normal patch formats cannot be used since they include context, which might change in this case.
Bash alternatives that read input files only once:
To generate patch you can:
comm -3 a.txt b.txt | sed 's/^\t/+ /;t;s/^/- /'
Because comm separates the output from the two files using a tab, we can use that tab to detect whether a line should be added or removed.
To apply patch you can:
{ <patch.txt tee >(grep '^+ ' | cut -c3- >&5) |
grep '^- ' | cut -c3- | comm -13 - a.txt; } 5> >(cat)
The tee splits the input, that is the patch file, into two streams. In the first stream the + lines are selected and written to file descriptor 5, which is opened on >(cat), so they simply end up on stdout. In the second stream the - lines are selected and compared against a.txt with comm -13, which outputs the lines of a.txt that are not being removed. Because the output should be line buffered, it should work.
A shell solution using comm, awk, and grep to apply such a patch would be (note the ; before the closing brace, which the shell requires):
A=a.txt B=b.txt P=patch.txt; { grep '^-' "$P" | cut -c 3- | comm -23 "$A" - ; grep '^+' "$P" | cut -c 3- ; } | sort -u > "$B"
To generate the patch file:
A=a.txt B=b.txt P=patch.txt; { comm -13 "$A" "$B" | awk '{print "+ " $0}' ; comm -23 "$A" "$B" | awk '{print "- " $0}' ; } > "$P"
Since nobody could give me an answer, I've created a small Python script that does exactly this job: https://github.com/white-gecko/simplepatch
To apply such a patch call it with (where outfile.txt is generated)
./simplepatch.py -m patch -i infile.txt -p patchfile.txt -o outfile.txt
To generate a patch/diff call it with (where patchfile.txt is generated)
./simplepatch.py -m diff -i infile.txt -o outfile.txt -p patchfile.txt
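For completeness, a shell-only sketch of the apply step that does not require the - lines of the patch to be sorted: grep -vxFf drops every line of the original that exactly matches a removed line, and the + lines are appended before sorting. The function name apply_simple_patch is only illustrative, and reading the pattern list from stdin with -f - is assumed to be available (it is in GNU grep).

```shell
# Apply a "+ line" / "- line" patch ($2) to a sorted, unique file ($1)
# and print the patched, sorted result on stdout.
apply_simple_patch() {
  {
    # Lines to remove become fixed-string, whole-line patterns for grep -v.
    sed -n 's/^- //p' "$2" | grep -vxFf - "$1"
    # Lines to add are emitted as they are.
    sed -n 's/^+ //p' "$2"
  } | sort -u
}
```

With the example from the question, apply_simple_patch a.txt patch.txt should print a, c, d and e on separate lines.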
