Compare execution logs, ignoring the execution times - bash

I'm new to Linux and bash commands, and I think someone with more experience could help me. I want to compare two text files containing execution logs, but some lines (not all of them) begin with a time token like this:
12345 ps line 1 content
23456 ps line 2 content
line 3 content
345 ps line 4 content
Those tokens have different values in each log, but for this comparison I don't care about them; I just want to compare the line contents and ignore the tokens. I could use 'sed' to generate new files without those tokens and then compare them, but I intend to do this repeatedly, and it would save me some time to use a single command or one sh file. I've tried combining 'sed' and 'diff', but without success. Would anyone be able to help me?

You can use the following sed one-liner to remove the time token from the beginning of each line:
sed 's/^[0-9]* ps//g' file1
To diff two such files (minus the timestamps), you can use process substitution:
diff <(sed 's/^[0-9]* ps//g' file1) <(sed 's/^[0-9]* ps//g' file2)
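Since the question asks for something reusable, the same idea can be wrapped in a small function. This is only a sketch: the names strip_ts and difflogs are made up for illustration, it assumes mktemp is available, and it uses temporary files instead of process substitution so that it also runs in a plain POSIX sh. The pattern here requires at least one digit and also removes the space after "ps", which differs slightly from the one-liner above.

```shell
#!/bin/sh
# strip_ts (hypothetical helper): print FILE with a leading "<digits> ps "
# token removed from every line that has one; other lines pass through.
strip_ts() {
    sed 's/^[0-9][0-9]* ps //' "$1"
}

# difflogs (hypothetical wrapper): diff two logs after stripping the tokens.
# Returns diff's exit status: 0 if the contents match, 1 if they differ.
difflogs() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 2
    strip_ts "$1" > "$tmp1"
    strip_ts "$2" > "$tmp2"
    diff "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}
```

Saved in a file and sourced, it would be called as difflogs file1 file2.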

Untested since you didn't show 2 input files and the expected output but from your description I THINK this would do what you want:
awk '
{ sub(/^[[:digit:]]+[[:space:]]*/,"") }
NR==FNR { file1[FNR] = $0; next }
{ print ($0 == file1[FNR] ? "==" : "!="), $0 }
' file1 file2
If that doesn't do it, post some small sample input and expected output.

Related

Use grep only on specific columns in many files?

Basically, I have one file with patterns and I want every line to be searched in all text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition. I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern (the entire line), followed by the names of all the text files in which a match was found, each with its entire matching line (not just the first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take an input file, but the problem is that it takes every pattern in the pattern file and searches for them in a given text file before moving onto the next file, which makes the above output more difficult. So I thought it would be better to loop through each line in a file, print the line, and then search for the line in the many files, seeing if the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
echo $line >> output.txt
zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way to specify the first two columns for both the pattern line and for the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
Few lines from pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
Few lines from a file in searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
In reality, both files are much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
If anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern:
while read -r column1 column2 rest_of_the_line
do
echo "$column1 $column2 $rest_of_the_line"
zgrep -w -l "^$column1\s*$column2" many_files/*txt
done < pattern_file.txt >> output.txt
read is able to parse lines into multiple variables passed as parameters, the last of which gets the rest of the line. It separates fields around the characters of $IFS, the Internal Field Separator (by default tabs, spaces and linefeeds; it can be overridden for the read command by using while IFS='...' read ...).
Using -r avoids unwanted escape processing and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids a useless use of cat. Since the output of all the commands inside the while is redirected, I also put the redirection on the while rather than on each individual command.
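To get the grouped output the question asks for (each pattern line followed by file:line for every match), one pass of awk over all files can collect the matches per key and print them at the end. This is a rough sketch only: the function name match_by_two_columns is invented here, it assumes uncompressed, whitespace-separated text files (compressed files would have to be decompressed first, since awk needs FILENAME to label the matches), and it keeps all matches in memory.

```shell
#!/bin/sh
# match_by_two_columns PATTERN_FILE FILE...   (hypothetical helper)
# Prints every pattern line, followed by "file:line" for each line in the
# searched files whose first two columns equal those of the pattern.
match_by_two_columns() {
    patterns=$1
    shift
    awk '
        FNR == 1 { fileno++ }                  # count input files as they start
        fileno == 1 {                          # first file: remember the patterns
            key = $1 SUBSEP $2
            if (!(key in pat)) { order[++n] = key; pat[key] = $0 }
            next
        }
        ($1 SUBSEP $2) in pat {                # other files: collect the matches
            key = $1 SUBSEP $2
            hits[key] = hits[key] FILENAME ":" $0 "\n"
        }
        END {                                  # print patterns in original order
            for (i = 1; i <= n; i++) {
                print pat[order[i]]
                printf "%s", hits[order[i]]
            }
        }
    ' "$patterns" "$@"
}
```

It would be called as match_by_two_columns pattern_file.txt many_files/*txt > output.txt.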

Remove multiple sequences from fasta file

I have a text file of character sequences, each consisting of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, i.e. each listed header plus the line that follows it. I did it with sed as follows:
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works but takes quite long since I am loading the whole file several times with sed, and it is quite big. Any idea on how I could speed up this process?
The question as posed is easy to answer, but the answer will not help you when you handle generic FASTA files. FASTA files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The FASTA file format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with the greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
Most of the presented methods will fail on a multi-FASTA file with multi-line sequences.
The following handles multi-line sequences:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if (index($0, h)) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of Ed Morton and anubhava, but the difference here is that the file headers.txt may contain only a part of the header. Because this is a substring match, an entry such as >header1 also removes >header12.
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines each record is long while the 2nd one relies on every record starting with >. If you know both then which one you use is a style choice.
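To illustrate the difference, here is the second (flag-based) approach wrapped in a function and applied to records that span several lines. It is a sketch under assumptions: remove_records is a name made up for this example, the list must contain the headers exactly as they appear in the FASTA file, and the list must not be empty (with an empty first file, NR==FNR would also be true for the FASTA file).

```shell
#!/bin/sh
# remove_records LIST FASTA   (hypothetical helper)
# Drops every record whose header line appears verbatim in LIST, no matter
# how many sequence lines follow the header.
remove_records() {
    awk 'NR == FNR { drop[$0]; next }        # first file: headers to remove
         /^>/      { skip = ($0 in drop) }   # at each header, decide once
         !skip' "$1" "$2"                    # print only the records we keep
}
```

For example, remove_records list file > kept.fasta writes the remaining records to a new file.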
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that script to the first file:
sed -f commands.sed firstFile.txt
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
One option is to create a long sed expression:
sedcmd=
while read -r line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt
This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...).
Using a script file (as #daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.
Try GNU sed:
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
Prepend the time command to both scripts to compare their speed,
i.e. time while read line;do... and time sed -.... In my test this finished in less than half the time of the OP's version.
This can easily be done with bbtools. The seqs2remove.txt file should be one header per line exactly as they appear in the large.fasta file.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

Fastest way -- Appending a line to a file only if it does not already exist

Given the question "Appending a line to a file only if it does not already exist",
is there a faster way than the solution provided by #drAlberT?
grep -q -F 'string' foo.bar || echo 'string' >> foo.bar
I have implemented the above solution and have to run it against a 500k-line file (i.e. check whether a line is already in a set of 500k lines). Moreover, I have to run this process many times, maybe 10-50 million times. Needless to say, it's rather slow: each run takes 25-30 ms on my server (so 3-10+ days of runtime in total).
EDIT: the flow is the following: I have a file with 500k lines, each time I run, I get maybe 10-30 new lines and I check if they are already there or not. If not I add them, then I repeat many times. The order of my 500k lines files is important as I'm going through it with another process.
EDIT2: the 500k lines file is always containing unique lines, and I only care about "full lines", no substrings.
Thanks a lot!
A few suggested improvements:
Try using awk instead of grep so that you can both detect the string and write it in one action;
If you do use grep, don't use a Bash loop to feed each potential match to grep and then append that one word to the file. Instead, read all the potential lines into grep as patterns (using -f file_name) and print the matches. Then invert the matches and append the inverted match. See the last pipeline in this answer;
Exit as soon as you see the string (for a single string) rather than continuing to loop over a big file;
Don't call the script millions of times with one or just a few lines -- organize the glue script (in Bash I suppose) so that the core script is called once or a few times with all the lines instead;
Perhaps use multicores since the files are not dependent on each other. Maybe with GNU Parallel (or you could use Python or Ruby or Perl that has support for threads).
Consider this awk for a single line to add:
$ awk -v line=line_to_append 'FNR==NR && line==$0{f=1; exit}
END{if (!f) print line >> FILENAME}' file
Or for multiple lines:
$ awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines file
Some timings using a copy of the Unix words file (235,886 lines) with a five line lines file that has two overlaps:
$ echo "frob
knob
kabbob
stew
big slob" > lines
$ time awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines words
real 0m0.056s
user 0m0.051s
sys 0m0.003s
$ tail words
zythum
Zyzomys
Zyzzogeton
frob
kabbob
big slob
Edit 2
Try this as being the best of both:
$ time grep -x -f lines words |
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines >> words
real 0m0.012s
user 0m0.010s
sys 0m0.003s
Explanation:
grep -x -f lines words: find the lines that ARE in words
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines: invert those into the lines that are NOT in words
>> words: append those to the file
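A variant of the same idea, not the pipeline above: swap the roles of the two files and let grep do the inversion itself with -F (fixed strings), -x (whole-line match) and -v (invert), using the big file as the pattern list. The function name append_missing is made up for this sketch; it assumes full-line matches are what is wanted (as the question's EDIT2 says) and that the batch of new lines is small enough to hold in a shell variable.

```shell
#!/bin/sh
# append_missing TARGET NEW_LINES   (hypothetical helper)
# Appends to TARGET those lines of NEW_LINES that TARGET does not contain yet.
append_missing() {
    target=$1
    new=$2
    # grep keeps the lines of $new that match no full line of $target;
    # awk removes repeats inside $new so nothing is appended twice.
    # The result is captured first, so $target is not read while being written.
    missing=$(grep -Fxv -f "$target" "$new" | awk '!seen[$0]++')
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing" >> "$target"
    fi
}
```

The order of the existing lines in the target file is left untouched, and new lines are added at the end in the order they first appear in the batch.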
Turning the millions of passes over the file into a script with millions of actions will save you a lot of overhead. Searching for a single label at each pass over the file is incredibly inefficient; you can search for as many labels as you can comfortably fit into memory in a single pass over the file.
Something along the following lines, perhaps.
awk 'NR==FNR { a[$0]++; next }
$0 in a { delete a[$0] }
1
END { for (k in a) print k }' strings bigfile >bigfile.new
If you can't fit strings in memory all at once, splitting that into suitable chunks will obviously allow you to finish this in as many passes as you have chunks.
On the other hand, if you have already (effectively) divided the input set into sets of 10-30 labels, you can obviously only search for those 10-30 in one pass. Still, this should provide you with a speed improvement on the order of 10-30 times.
This assumes that a "line" is always a full line. If the label can be a substring of a line in the input file, or vice versa, this will need some refactoring.
If duplicates are not valid in the file, just append them all and filter out the duplicates:
cat myfile mynewlines | awk '!n[$0]++' > mynewfile
This will allow appending millions of lines in seconds.
If order additionally doesn't matter and your files are more than a few gigabytes, you can use sort -u instead.
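Have the script read new lines from stdin after consuming the original file. All lines are stored in an associative array (without any compression such as md5sum).
The 'x' prefix on the array keys is there to handle awkward inputs such as an empty line, and printf is used instead of echo so that a line such as '-e' is written literally; better ways probably exist.
#!/bin/bash
declare -A aa
# Load the existing lines; the "x" prefix keeps every key non-empty.
while IFS= read -r line; do aa["x$line"]=1
done < file.txt
# Read candidate lines from stdin and append only the unseen ones.
while IFS= read -r line; do
if [[ -z ${aa["x$line"]} ]]; then
aa["x$line"]=1
printf '%s\n' "$line" >> file.txt
fi
done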

Bash: how to optimize/parallelize a search through two large files to replace strings?

I'm trying to figure out a way to speed up a pattern search and replace between two large text files (>10Mb). File1 has two columns with unique names in each row. File2 has one column that contains one of the shared names in File1, in no particular order, with some text underneath that spans a variable number of lines. They look something like this:
File1:
uniquename1 sharedname1
uniquename2 sharedname2
...
File2:
>sharedname45
dklajfwiffwf
flkewjfjfw
>sharedname196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf
My goal is to use File1 to replace the sharedname variables with their corresponding uniquename, as follows:
New File2:
>uniquename45
dklajfwif
flkewjfj
>uniquename196
lkdsjafwij
eflkwejf
This is what I've tried so far:
while read -r uniquenames sharednames; do
sed -i "s/$sharednames/$uniquenames/g" $File2
done < $File1
It works, but it's ridiculously slow, trudging through those big files. CPU usage is the rate-limiting step, so I tried to parallelize the modification to use the 8 cores at my disposal, but couldn't get it to work. I also tried splitting File1 and File2 into smaller chunks and running the batches simultaneously, but I couldn't get that to work, either. How would you implement this in parallel? Or do you see a different way of doing it?
Any suggestions would be welcomed.
UPDATE 1
Fantastic! Great answers thanks to #Cyrus and #JJoao and suggestions by other commentators. I implemented both in my script, on the recommendation of #JJoao to test the compute times, and it's an improvement (~3 hours instead of ~5). However, I'm just doing text file manipulation so I don't see how it should be taking any more than a couple of minutes. So, I'm still working on making better use of the available CPUs, so I'm tinkering with the suggestions to see if I can speed it up further.
UPDATE 2: correction to UPDATE 1
I included the modifications in my script and ran it as such, but a chunk of my own code was slowing it down. Instead, I ran the suggested bits of code individually on the target intermediary files. Here's what I saw:
Time for #Cyrus' sed to complete
real 70m47.484s
user 70m43.304s
sys 0m1.092s
Time for #JJoao's Perl script to complete
real 0m1.769s
user 0m0.572s
sys 0m0.244s
Looks like I'll be using the Perl script. Thanks for helping, everyone!
UPDATE 3
Here's the time taken by #Cyrus' improved sed command:
time sed -f <(sed -E 's|(.*) (.*)|s/^\2/>\1/|' File1 | tr "\n" ";") File2
real 21m43.555s
user 21m41.780s
sys 0m1.140s
With GNU sed and bash:
sed -f <(sed -E 's|(.*) (.*)|s/>\2/>\1/|' File1) File2
Update:
An attempt to speed it up:
sed -f <(sed -E 's|(.*) (.*)|s/^>\2/>\1/|' File1 | tr "\n" ";") File2
#!/usr/bin/perl
use strict;
my $file1=shift;
my %dic=();
open(F1,$file1) or die("can't find replacement file\n");
while(<F1>){ # read File1 into dic
if(/(\S+)\s+(\S+)/){$dic{$2}=$1}
}
while(<>){ # for all File2 lines
s/(?<=>)(.*)/ $dic{$1} || $1/e; # sub ">id" by >dic{id}
print
}
I prefer #cyrus' solution, but if you need to do that often you can use the previous perl script (chmod + install) as a dict-replacement command.
Usage: dict-replacement File1 File* > output
It would be nice if you could tell us the time of the various solutions...
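The dictionary approach of the Perl script can also be sketched in awk, which fits a pure shell workflow: load File1 into a lookup table once, then rewrite only the header lines of File2 in a single pass. The function name rename_headers is invented for this example, and it assumes File1 has exactly two whitespace-separated columns (uniquename, then sharedname) and that each header in File2 is ">" followed by the shared name and nothing else.

```shell
#!/bin/sh
# rename_headers FILE1 FILE2   (hypothetical helper)
# Replaces ">sharedname" headers in FILE2 with ">uniquename" using FILE1.
rename_headers() {
    awk 'NR == FNR { name[$2] = $1; next }       # File1: sharedname -> uniquename
         /^>/ {                                  # header line in File2
             id = substr($0, 2)                  # the text after ">"
             if (id in name) $0 = ">" name[id]   # unknown names stay unchanged
         }
         { print }' "$1" "$2"
}
```

Each line costs one hash lookup instead of one regex per File1 entry, which is the same reason the Perl version is so much faster than the generated sed script.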

Bash grep in file which is in another file

I have 2 files. One contains this:
file1.txt
632121S0 126.78.202.250 1
131145S0 126.178.20.250 1
The other contains this:
file2.txt
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
313359S2 126.137.37.250 OBS
I want to end up with a third file which contains :
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
That is, only the lines which start with the same string in both files. I can't remember how to do it. I tried several combinations of grep, egrep and find, but I still cannot get them to work properly...
Can you help, please?
You can use this awk:
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
It is based on the idea of two-file processing, looping through the files like this:
first, loop through the first file, storing the first field of each line as a key in the array a.
then, loop through the second file, checking if its first field is in the array a. If that is true, the line is printed.
To do this with grep, you need to use a process substitution:
grep -f <(cut -d' ' -f1 file1.txt) file2.txt
grep -f uses a file as a list of patterns to search for within file2. In this case, instead of passing file1 unaltered, process substitution is used to output only the first column of the file.
If you have a lot of these lines, then the utility join would likely be useful.
join - join lines of two files on a common field
Here's a set of examples.
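As a rough sketch of the join approach: join expects both inputs to be sorted on the join field, so the keys from the first file and the lines of the second file are sorted first. The function name common_lines is made up here, the files are assumed to be separated by single spaces, and the output comes out in sorted key order rather than in the original order of file2.txt.

```shell
#!/bin/sh
# common_lines FILE1 FILE2   (hypothetical helper)
# Prints the lines of FILE2 whose first field also starts a line of FILE1.
common_lines() {
    t1=$(mktemp) && t2=$(mktemp) || return 2
    cut -d' ' -f1 "$1" | LC_ALL=C sort -u > "$t1"   # keys only, sorted, unique
    LC_ALL=C sort -k1,1 "$2" > "$t2"                # FILE2 sorted on field 1
    LC_ALL=C join "$t1" "$t2"                       # keep lines with a shared key
    rm -f "$t1" "$t2"
}
```

With the sample files from the question, common_lines file1.txt file2.txt prints the two OBS lines for 131145S0 and 632121S0.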
