awk - skip lines of subdomains if domain already matched - bash

Let's assume there is an already ordered list of domains like:
tld.aa.
tld.aa.do.notshowup.0
tld.aa.do.notshowup.0.1
tld.aa.do.notshowup.0.1.1
tld.aa.do.notshowup.too
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.xxxxx.donotshowup
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
which later acts as a blacklist.
Per a specific requirement, all lines with a trailing '.' indicate
that all deeper subdomains of that specific domain should not appear
in the blacklist itself... so the desired output of the example
above should be:
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
I currently run this in a loop (pure bash plus heavy use of bash builtins to speed things up), but as the list
grows it now takes quite long to process around 562k entries.
Shouldn't it be easy for awk (or maybe sed) to do this? Any help is
really appreciated (I already tried some things in awk but somehow couldn't get it to display what I want).
Thank you!

If the . lines always come before the lines to ignore, this awk should do:
$ awk '{for (i in a) if (index($0,i) == 1) next}/\.$/{a[$0]=1}1' file
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
/\.$/{a[$0]=1} adds lines with trailing dot to an array.
{for (i in a) if (index($0,i) == 1) next} checks whether the current line starts with one of these stored entries and skips further processing if it does (next).
If the file is sorted alphabetically and no subdomains end with a dot, you don't even need an array, as @Corentin Limier suggests:
awk 'a{if (index($0,a) == 1) next}/\.$/{a=$0}1' file
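For completeness, a small usage sketch (the file names here are placeholders, not from the question): sorting the list first guarantees the ordering the one-liner relies on, under the same caveat that no subdomain entries themselves end with a dot.
# hypothetical file names; C-locale sort keeps a parent like "tld.aa." directly ahead of its subdomains
LC_ALL=C sort blacklist.txt | awk 'a{if (index($0,a) == 1) next}/\.$/{a=$0}1' > blacklist.filtered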

Related

remove line in csv file if string found (from another text file) in bash

Due to a power failure issue, I am having to clean up jobs which are run based on text files. So the problem is, I have a text file with strings like so (they are uuids):
out_file.txt (~300k entries)
<some_uuidX>
<some_uuidY>
<some_uuidZ>
...
and a csv like so:
in_file.csv (~500k entries)
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location3/,<some_uuidX>.json.<some_string3>
/path/to/some/location4/,<some_uuidY>.json.<some_string4>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
/path/to/some/location6/,<some_uuidZ>.json.<some_string6>
...
I would like to remove lines from in_file.csv whose uuid matches an entry in out_file.txt.
The end result:
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
...
Since the file sizes are fairly large, I was wondering if there is an efficient way to do it in bash.
Any tips would be great.
Here is a potential grep solution:
grep -vFwf out_file.txt in_file.csv
And a potential awk solution (likely faster):
awk -F"[,.]" 'FNR==NR { a[$1]; next } !($2 in a)' out_file.txt in_file.csv
NB there are caveats to each of these approaches. Although they both appear to be suitable for your intended purpose (as indicated by your comment "the numbers add up correctly"), posting a minimal, reproducible example in future questions is the best way to help us help you.
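As a hedged illustration of one such caveat (made-up data, not from the question): grep -vFwf drops any line that merely contains a listed uuid anywhere, including inside the path, whereas the awk version only compares the field between the comma and the first dot (assuming the path itself contains no commas or dots):
# toy example with invented uuids
printf '%s\n' 'uuid-b' > out_file.txt
printf '%s\n' '/data/uuid-b/archive/,uuid-a.json.x' > in_file.csv
grep -vFwf out_file.txt in_file.csv                                          # drops the line
awk -F"[,.]" 'FNR==NR { a[$1]; next } !($2 in a)' out_file.txt in_file.csv   # keeps the line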

Looking up and extracting a line from a big file matching the lines of another big file

I permitted myself to create a new question as some parameters changed dramatically compared to my first question about optimising my bash script (Optimising my script which lookups into a big compressed file).
In short: I want to look up and extract all the lines where the first column of file 1 (a bam file) matches the first column of a text file (file 2). For bioinformaticians, it's actually extracting the matching read IDs from two files.
File 1 is a binary compressed 130GB file
File 2 is a tsv file of 1 billion lines
Recently a user came up with a very elegant one-liner combining the decompression of the file and the lookup with awk, and it worked very well. With the current size of the files, the lookup has now been running for more than 200 hours (multithreaded).
Does this "problem" have a name in algorithmics?
What could be a good way to tackle this challenge? (If possible with simple tools such as sed, awk, bash...)
Thank you a lot
Edit: Sorry for omitting the code; as it was in the linked question I thought it would be a duplicate. Here is the one-liner used:
#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
Think of this as a long comment rather than an answer. The 'merge sort' method can be summarised as: If two records don't match, advance one record in the file with the smaller record. If they do match then record the match and advance one record in the big file.
In pseudocode, this looks something like:
currentSmall <- readFirstRecord(smallFile)
currentLarge <- readFirstRecord(largeFile)
searching <- true
while (searching)
    if (currentLarge < currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge = currentSmall)
        // Bingo!
        saveMatchData(currentLarge, currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge > currentSmall)
        currentSmall <- readNextRecord(smallFile)
    endif
    if (largeFile.EOF or smallFile.EOF)
        searching <- false
    endif
endwhile
Quite how you translate that into awk or bash is beyond my meagre knowledge of either.
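For illustration only, here is a minimal awk sketch of that merge join, assuming both inputs are tab-separated and already sorted consistently on column 1 (the file name small_sorted.tsv is a placeholder, not from the original post):
# merge_join.awk - rough translation of the pseudocode above
BEGIN {
    FS = OFS = "\t"
    small = "small_sorted.tsv"                    # placeholder: the sorted small file
    if ((getline sline < small) <= 0) exit        # nothing to match against
    split(sline, s, "\t")
}
{
    # advance the small file while its key is behind the current large-file key
    while (s[1] < $1) {
        if ((getline sline < small) <= 0) exit    # small file exhausted: stop
        split(sline, s, "\t")
    }
    if (s[1] == $1) print $0, s[2], "wh_genome"   # keys match: record the hit
}
It could be driven with something like samtools view -@ 2 bamfile.bam | LC_ALL=C sort -k1,1 | LC_ALL=C awk -f merge_join.awk, though sorting inputs of this size is itself a significant cost, so this is a sketch of the idea rather than a drop-in solution.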

Add missing columns to CSV file?

Starting Question
I have a CSV file which is formed this way (variable.csv)
E,F,G,H,I,J
a1,
,b2,b3
c1,,,c4,c5,c6
As you can see, the first and second columns do not have all the commas needed. Here's what I want:
E,F,G,H,I,J
a1,,,,,
,b2,b3,,,
c1,,,c4,c5,c6
With this, now every row has the right number of columns. In other words, I'm looking for a unix command which smartly appends the correct number of commas to the end of each row to make the row have the number of columns that we expect, based off the header.
Here's what I tried, based off of some searching:
awk -F, -v OFS=, 'NF=6' variable.csv
This works in the above case, BUT...
Final Question
...Can we make this command work if the column data itself contains commas, or even newline characters? e.g.
E,F,G,H,I,J
"a1\n",
,b2,"b3,3"
c1,,,c4,c5,c6
to
E,F,G,H,I,J
"a1\n",,,,,
,b2,"b3,3",,,
c1,,,c4,c5,c6
(Apologies if this example's formatting looks malformed due to the way the newline is represented.)
Short answer:
python3 -c 'import fileinput,sys,csv;b=list(csv.reader(fileinput.input()));w=max(len(i)for i in b);csv.writer(sys.stdout,lineterminator="\n").writerows(i+[""]*(w-len(i))for i in b)' variable.csv
The python script may be long, but this is to ensure that all cases are handled. To break it down:
import fileinput,sys,csv
b=list(csv.reader(fileinput.input())) # create a reader obj and parse every row, honouring quoted commas/newlines
w=max(len(i)for i in b) # how many fields in the widest row?
csv.writer(sys.stdout,lineterminator="\n").writerows(i+[""]*(w-len(i))for i in b) # pad each row and re-emit, preserving quoting
BTW, in your starting problem
awk -F, -v OFS=, 'NF<6{$6=""}1' variable.csv
should work. (I think the difference is implementation- or version-related: your NF=6 code works in GNU awk but not in the macOS awk.)
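For reference, applied to the "Starting Question" input this produces the padded rows directly (though like the NF=6 version it does not handle quoted commas or newlines):
$ awk -F, -v OFS=, 'NF<6{$6=""}1' variable.csv
E,F,G,H,I,J
a1,,,,,
,b2,b3,,,
c1,,,c4,c5,c6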

Bash Parsing Large Log Files

I am new to bash, awk, and scripting, so please do help me improve.
I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all fully standardized in any one format. Plus there is a lot of legacy in here, and a lot of junk and garbled text. I wish to check all of these files to find rows with a valid email ID, and if one exists, print the row to a file named after the first character of the email ID. Hence, multiple text files get parsed and organized into files named a-z and 0-9. In case the email address starts with a special character, the row will get written into a file called "_" (underscore). The script also trims the rows to remove whitespace, and replaces single and double quotes (this is an application requirement).
My script works fine. There is no error/bug in it. But it is incredibly slow. My question: is there a more efficient way to achieve this? Parsing 30 GB of logs takes me about 12 hours, which is way too long! Will grep/cut/sed/another tool be any faster?
Sample txt File
!bar#foo.com,address
#john#foo.com;address
john#foo.com;address µÖ
email1#foo.com;username;address
email2#foo.com;username
email3#foo.com,username;address [spaces at the start of the row]
email4#foo.com|username|address [tabs at the start of the row]
My Code:
awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")                 # trim leading/trailing whitespace
    if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+#[[:alnum:]_.-]+\.[[:alnum:]]+$/)
    {
        gsub(/"/, "DQUOTES")                    # replace double quotes (application requirement)
        gsub("\047", "SQUOTES")                 # replace single quotes
        r=gensub("[,|;: \t]+",":",1,$0)         # normalise the first separator to ":"
        a=tolower(substr(r,1,1))                # first character of the email ID
        if (a ~ /^[[:alnum:]]/)
            print r > a                         # route to a file named a-z or 0-9
        else
            print r > "_"                       # special characters go to "_"
    }
    else
        print $0 > "ErrorFile"                  # rows without a valid email ID
}' *.txt
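As an editorial aside, not from the original post: one commonly suggested, hedged speed-up for regex-heavy gawk programs like this is to run them under the byte-oriented C locale, which often makes the character-class matching noticeably cheaper; the program itself stays unchanged (parse_emails.awk is a made-up name for the script above).
# parse_emails.awk is hypothetical: the awk program above saved to a file
LC_ALL=C awk -F'[,|;: \t]+' -f parse_emails.awk *.txt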

gsub issue with awk (gawk)

I need to search a text file for a string, and make a replacement that includes a number that increments with each match.
The string to be "found" could be a single character, or a word, or a phrase.
The replacement expression will not always be the same (as it is in my examples below), but will always include a number (variable) that increments.
For example:
1) I have a test file named "data.txt". The file contains:
Now is the time
for all good men
to come to the
aid of their party.
2) I placed the awk script in a file named "cmd.awk". The file contains:
/f/ {sub ("f","f(" ++j ")")}1
3) I use awk like this:
awk -f cmd.awk data.txt
In this case, the output is as expected:
Now is the time
f(1)or all good men
to come to the
aid of(2) their party.
The problem comes when there is more than one match on a line. For example, if I was searching for the letter "i" like:
/i/ {sub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.
which is wrong because it doesn't include the "i" in "time" or "their".
So, I tried "gsub" instead of "sub" like:
/i/ {gsub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.
Now it makes the replacement for all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
The desired output should be:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
Note: The number won't always begin with "1" so I might use awk like this:
awk -f cmd.awk -v j=26 data.txt
To get the output:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
And just to be clear, the number in the replacement will not always be inside parentheses. And the replacement will not always include the matched string (actually it would be quite rare).
The other problem I am having with this is...
I want to use an awk-variable (not environment variable) for the "search string", so I can specify it on the awk command line.
For example:
1) I placed the awk script in a file named "cmd.awk". The file contains something like:
/??a??/ {gsub (a,a "(" ++j ")")}1
2) I would use awk like this:
awk -f cmd.awk -v a=i data.txt
To get the output:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
The question here is: how do I represent the variable "a" in the /search/ expression?
awk version:
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
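A hedged usage sketch of that one-liner, reusing data.txt and the starting value from the question: the search character is passed as both the input and output field separator, and the counter starts from -v k:
awk -v k=26 '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i data.txt
which prints "Now i(27)s the ti(28)me" and so on, matching the desired output above.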
gensub() sounds ideal here: it allows you to replace the Nth match, so what sounds like a solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work if the replacement does not contain the original text (or worse, contains it multiple times); see below.
So in awk, lacking perl's "s///e" evaluation feature and its stateful regex /g modifier (as used by Steve), the best remaining option is to break the lines into chunks (head, match, tail) and stick them back together again:
BEGIN {
    if (j=="") j=1
    if (a=="") a="f"
}
match($0,a) {
    str=$0; newstr=""
    do {
        newstr=newstr substr(str,1,RSTART-1)   # head
        mm=substr(str,RSTART,RLENGTH)          # extract match
        sub(a,a"("j++")",mm)                   # replace
        newstr=newstr mm
        str=substr(str,RSTART+RLENGTH)         # tail
    } while (match(str,a))
    $0=newstr str
}
{print}
This uses match() as an expression instead of a // pattern so you can use a variable. (You could also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)
You can define j and a on the command line.
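A hedged usage sketch, saving the script above as cmd.awk as in the question; note that this script inserts the starting value itself (j is post-incremented), so to reproduce the 27..30 output shown earlier you would pass j=27 rather than 26:
awk -v a=i -v j=27 -f cmd.awk data.txt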
gawk supports \y, which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word; just take care to add extra escapes from a unix command line (I'm not quite sure what Windows might require or permit).
Limited gensub() version
As referenced above:
match($0,a) {
    idx=1; str=$0
    do {
        prev=str
        str=gensub(a,a"(" j ")",idx++,prev)
    } while (str!=prev && j++)
    $0=str
}
The problems here are:
if you replace substring "i" with substring "k" or "k(1)" then the gensub() index for the next match will be off by 1. You could work around this if you either know that in advance, or work backward through the string instead.
if you replace substring "i" with substring "ii" or "ii(i)" then a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match)
Dealing with both conditions robustly is not worth the code.
I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.
To include a count of the letter i beginning at 26, try:
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt
This could also be a shell var:
var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt
Results:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
To include a count of specific words, add word boundaries (i.e. \b) around the words, try:
perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt
Results:
Now is the(6) time
for all good men
to come to the(7)
aid of their party.
