Bash Parsing Large Log Files

I am new to bash, awk, and scripting, so please do help me improve.
I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all fully standardized in any one format, there is a lot of legacy in here, and a lot of junk and garbled text. I want to check all of these files for rows containing a valid email ID and, where one exists, print the row to a file named after the first character of the email ID. So multiple text files get parsed and organized into files named a-z and 0-9. If the email address starts with a special character, the row gets written into a file called "_" (underscore). The script also trims the rows to remove whitespace, and replaces single and double quotes (this is an application requirement).
My script works fine; there is no error or bug in it. But it is incredibly slow. My question: is there a more efficient way to achieve this? Parsing 30 GB of logs takes me about 12 hours, which is way too much. Would grep/cut/sed/something else be any faster?
Sample txt File
!bar#foo.com,address
#john#foo.com;address
john#foo.com;address µÖ
email1#foo.com;username;address
email2#foo.com;username
email3#foo.com,username;address [spaces at the start of the row]
email4#foo.com|username|address [tabs at the start of the row]
My Code:
awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (NF > 1 && $1 ~ /^[[:alnum:]_.+-]+#[[:alnum:]_.-]+\.[[:alnum:]]+$/) {
        gsub(/"/, "DQUOTES")
        gsub("\047", "SQUOTES")
        r = gensub("[,|;: \t]+", ":", 1, $0)
        a = tolower(substr(r, 1, 1))
        if (a ~ /^[[:alnum:]]/)
            print r > a
        else
            print r > "_"
    }
    else
        print $0 > "ErrorFile"
}' *.txt
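To make the last part of the question concrete: would a cheap pre-filter along these lines be worth trying? This is only an untested idea (candidates.txt is a made-up name): grep in the C locale drops rows that cannot possibly contain a valid email ID before awk ever sees them, and the awk script above would then read candidates.txt instead of *.txt.
# Untested idea: -h suppresses the filename prefix so downstream parsing is unchanged.
# Note: rows filtered out here would no longer reach ErrorFile.
LC_ALL=C grep -hE '^[[:blank:]]*[[:alnum:]_.+-]+#[[:alnum:]_.-]+\.[[:alnum:]]+[,|;:[:blank:]]' *.txt > candidates.txt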

Related

moving files to equivalent folders in bash shell

I'm sorry for the very basic question, but I am frankly extremely new to bash and can't seem to work out the below. Any help would be appreciated.
In my working directory '/test' I have a number of files named:
mm(a 10 digit code)_Pool_1_text.csv
mm(same 10 digit code)_Pool_2_text.csv
mm(same 10 digit code)_Pool_3_text.csv
How can I write a loop that would take the first file and put it in a folder at:
/this/that/Pool_1/
the second file at:
/this/that/Pool_2/
etc.
Thank you :)
Using awk you may not need to create an explicit loop:
awk 'FNR==1 {match(FILENAME,/Pool_[[:digit:]]+/);system("mv " FILENAME " /this/that/" substr(FILENAME, RSTART, RLENGTH) "/")}' mm*_Pool_*_text.csv
The shell glob selects the files (we could use extglob, but I wanted to keep it simple).
awk receives the filenames.
On the first line of each file we match Pool_ followed by digits in FILENAME.
We move the file, using the match to extract the pool directory name.
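For comparison, the same move as an explicit bash loop, just a sketch: it assumes the /this/that/Pool_N/ directories already exist and that the 10-digit code contains no underscores.
# Sketch: move each file into the Pool_N directory named in its filename.
for f in mm*_Pool_*_text.csv; do
    pool=${f#*_}            # strip up to the first "_"       -> Pool_1_text.csv
    pool=${pool%_text.csv}  # strip the trailing "_text.csv"  -> Pool_1
    mv -- "$f" "/this/that/$pool/"
done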

Looking up and extracting a line from a big file matching the lines of another big file

I have permitted myself to create a new question, as some parameters changed dramatically compared to my first question about optimising my bash script (Optimising my script which lookups into a big compressed file).
In short: I want to look up and extract all the lines where the value in the first column of a file (1) (a BAM file) matches the first column of a text file (2). For bioinformaticians, it's actually extracting the matching read IDs from two files.
File 1 is a binary compressed 130 GB file.
File 2 is a TSV file of 1 billion lines.
Recently a user came up with a very elegant one-liner combining the decompression of the file and the lookup with awk, and it worked very well. With the size of the files, though, it has now been looking things up for more than 200 hours (multithreaded).
Does this "problem" have a name in algorithmics?
What could be a good way to tackle this challenge? (If possible with simple solutions such as sed, awk, bash...)
Thank you a lot.
Edit: Sorry for the missing code; as it was in the linked question I thought it would be a duplicate. Here is the one-liner used:
#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
Think of this as a long comment rather than an answer. The 'merge sort' method can be summarised as: If two records don't match, advance one record in the file with the smaller record. If they do match then record the match and advance one record in the big file.
In pseudocode, this looks something like:
currentSmall <- readFirstRecord(smallFile)
currentLarge <- readFirstRecord(largeFile)
searching <- true
while (searching)
    if (currentLarge < currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge = currentSmall)
        // Bingo!
        saveMatchData(currentLarge, currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge > currentSmall)
        currentSmall <- readNextRecord(smallFile)
    endif
    if (largeFile.EOF or smallFile.EOF)
        searching <- false
    endif
endwhile
Quite how you translate that into awk or bash is beyond my meagre knowledge of either.
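For what it's worth, a rough sketch of that merge-join in awk might look like the following. It assumes both inputs are plain text already sorted on their first column with the same collation awk uses for string comparison (e.g. sorted with LC_ALL=C sort); the file names small.tsv and large.tsv are placeholders, and the large input could just as well be piped in from samtools view.
awk '
BEGIN {
    small = "small.tsv"                      # placeholder name for the smaller sorted file
    if ((getline sline < small) <= 0) exit   # nothing to match against
    split(sline, s)                          # s[1] = current small-file key
}
{
    # advance the small file while its key is behind the current large key
    while (s[1] < $1) {
        if ((getline sline < small) <= 0) exit
        split(sline, s)
    }
    if (s[1] == $1)                          # keys match: record the hit
        print $0, sline
    # otherwise s[1] > $1: just read the next large record
}' large.tsv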

awk - skip lines of subdomains if domain already matched

Let's assume there is an already ordered list of domains like:
tld.aa.
tld.aa.do.notshowup.0
tld.aa.do.notshowup.0.1
tld.aa.do.notshowup.0.1.1
tld.aa.do.notshowup.too
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.xxxxx.donotshowup
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
which later acts as a blacklist.
Per a specific requirement, all lines with a trailing '.' indicate that all deeper subdomains of that specific domain should not appear in the blacklist itself; so the desired output of the example above would/should be:
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
I currently run this in a loop (pure bash plus heavy use of bash builtins to speed things up), but as the list grows it now takes quite a long time to process around 562k entries.
Shouldn't it be easy for awk (or maybe sed) to do this? Any help is really appreciated (I already tried some things in awk but somehow couldn't get it to print what I want...).
Thank you!
If the . lines always come before the lines to ignore, this awk should do:
$ awk '{for (i in a) if (index($0,i) == 1) next}/\.$/{a[$0]=1}1' file
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
/\.$/{a[$0]=1} adds lines with a trailing dot to an array.
{for (i in a) if (index($0,i) == 1) next} checks whether the current line starts with one of those stored prefixes and, if so, skips further processing (next).
If the file is sorted alphabetically and no subdomains end with a dot, you don't even need an array, as @Corentin Limier suggests:
awk 'a{if (index($0,a) == 1) next}/\.$/{a=$0}1' file

parse CSV, Group all rows containing string at 5th field, export each group of rows to file with filename <group>_someconstant.csv

Need this in bash.
In a Linux directory, I will have a CSV file. Arbitrarily, this file will have 6 rows.
Main_Export.csv
1,2,3,4,8100_group1,6,7,8
1,2,3,4,8100_group1,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
I need to parse this file's 5th field (first four chars only) and take each row with 8100 (for example) and put those rows in a new file. Same with all other groups that exist, across the entire file.
Each new file can only contain the rows for its group (one file with the rows for 8100, one file for the rows with 3100, etc.)
Each filename needs to have that group# prepended to it.
The first four characters could be any numeric value, so I can't check these against a list - there are like 50 groups, and maintenance can't be done on this if a group # changes.
When parsing the fifth field, I only care about the first four characters
So we'd start with: Main_Export.csv and end up with four files:
Main_Export_$date.csv (unchanged)
8100_filenameconstant_$date.csv
3100_filenameconstant_$date.csv
5400_filenameconstant_$date.csv
I'm not sure of the rules of the site, or whether I have to try this for myself first and then post. I'll come back once I have an idea, but I'm at a total loss. Reading up on awk right now.
If I have understood your problem well, this is very easy...
You can just:
$ awk -F, '{fifth=substr($5, 1, 4) ; print > (fifth "_mysuffix.csv")}' file.csv
or just:
$ awk -F, '{print > (substr($5, 1, 4) "_mysuffix.csv")}' file.csv
And you will get several files like:
$ cat 3100_mysuffix.csv
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
or...
$ cat 5400_mysuffix.csv
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
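If the $date part of the filenames from the question is needed as well, one possibility (just a sketch; "filenameconstant" and the date format are assumptions) is to pass the date in from the shell:
$ awk -F, -v d="$(date +%Y%m%d)" '{print > (substr($5, 1, 4) "_filenameconstant_" d ".csv")}' Main_Export.csv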

Slight error when using awk to remove spaces from a CSV column

I have used the following awk command in my bash script to delete spaces in the 26th column of my CSV:
awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv
Out of 400 rows, I have about 5 random rows that this doesn't work on even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.
EDIT: Here is a sample of the 26th column in original.csv vs. final.csv, respectively:
2212026837 2212026837
2256 41688 6 2256416886
2076113566 2076113566
2009 84517 7 2009845177
2067950476 2067950476
2057 90531 5 2057 90531 5
2085271676 2085271676
2095183426 2095183426
2347366235 2347366235
2200160434 2200160434
2229359595 2229359595
2045373466 2045373466
2053849895 2053849895
2300 81552 3 2300 81552 3
I see two possibilities.
The simplest is that you have some whitespace other than a space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.
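Applied to your command, that change would look something like this:
# Same command, but removing any whitespace character (space, tab, etc.) from field 26.
awk 'BEGIN{FS=OFS="|"} {gsub(/[[:space:]]/,"",$26)}1' original.csv > final.csv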
If that solves your problem, great! You got lucky, move on. :)
The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:
field 1|"field 2 contains some |pipe| characters"|field 3|field 4
If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:
import csv, fileinput as fi, re, sys

writer = csv.writer(sys.stdout, delimiter='|', lineterminator='\n')  # re-quotes fields on output if needed
for row in csv.reader(fi.input(), delimiter='|'):
    row[25] = re.sub(r'\s+', '', row[25])  # fields start at 0 instead of 1
    writer.writerow(row)
Save the above in a file like colfixer.py and run it with python colfixer.py original.csv >final.csv.
(If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
You can use the string function split, and iterate over the corresponding array to reassign the 26th field:
awk 'BEGIN{FS=OFS="|"} {
    n = split($26, a, /[[:space:]]+/)
    $26 = a[1]
    for (i = 2; i <= n; i++)
        $26 = $26 "" a[i]
}1' original.csv > final.csv
