base64 decode while ignoring brackets - bash

I'm trying to decode a file, which is mostly encoded with base64. What I want to do is to decode the following, while still maintaining the [_*_].
example.txt
wq9cXyjjg4QpXy/Crwo=
[_NOTBASE64ED_]
aGkgdGhlcmUK
[_CONSTANT_]
SGVsbG8gV29ybGQhCg==
Sometimes it'll be in this form
aGkgdGhlcmUK[_CONSTANT_]SGVsbG8gV29ybGQhCg==
Desired output
¯\_(ツ)_/¯
[_NOTBASE64ED_]
hi there
[_CONSTANT_]
Hello World!
hi there[_CONSTANT_]Hello World!
Error output
¯\_(ツ)_/¯
4��!:�#�H\�B�8ԓ��[��ܛBbase64: invalid input
What I've tried
base64 -di example.txt
base64 -d example.txt
base64 --wrap=0 -d -i example.txt
I tried to individually base64-encode the [_*_] markers using grep -o, then find-and-replace them through a weird arrangement with arrays, but I couldn't get it to work.
Base64ing it all, then decoding: this results in double-base64ed rows.
The file is significantly downsized!
It was encoded using base64 --wrap=0, a while loop, and an if/else statement.
The [_*_] markers still need to be there after decoding.

I am sure someone has a more clever solution than this, but try this:
#! /bin/bash
MYTMP1=""
function printInlineB64()
{
local lines=($(echo "$1" | sed -e 's/\[/\n[/g' -e 's/\]/]\n/g'))
OUTPUT=""
for line in "${lines[@]}"; do
MYTMP1=$(base64 -d <<< "$line" 2>/dev/null)
if [ "$?" != "0" ]; then
OUTPUT="${OUTPUT}${line}"
else
OUTPUT="${OUTPUT}${MYTMP1}"
fi;
done
echo "$OUTPUT"
}
MYTMP2=""
function printB64Line()
{
local line=$1
# not fully base64 line
if [[ ! "$line" =~ ^[A-Za-z0-9+/=]+$ ]]; then
printInlineB64 "$line"
return
fi;
# likely base64 line
MYTMP2=$(base64 -d <<< "$line" 2>/dev/null)
if [ "$?" != "0" ]; then
echo "$line"
else
echo "$MYTMP2"
fi;
}
FILE=$1
if [ -z "$FILE" ]; then
echo "Please give a file name in argument"
exit 1;
fi;
while read -r line; do
printB64Line "$line"
done < "${FILE}"
and here is output
$ cat example.txt && echo "==========================" && ./base64.sh example.txt
wq9cXyjjg4QpXy/Crwo=
[_NOTBASE64ED_]
aGkgdGhlcmUK
[_CONSTANT_]
SGVsbG8gV29ybGQhCg==
==========================
¯\_(ツ)_/¯
[_NOTBASE64ED_]
hi there
[_CONSTANT_]
Hello World!
$ cat example2.txt && echo "==========================" && ./base64.sh example2.txt
aGkgdGhlcmUK[_CONSTANT_]SGVsbG8gV29ybGQhCg==
==========================
hi there[_CONSTANT_]Hello World!

You need a loop that reads each line and tests whether it's base64 or non-base64, and processes it appropriately.
while read -r line
do
case "$line" in
\[*\]) echo "$line" ;;
*) base64 -d <<< "$line" ;;
esac
done < example.txt
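This handles the line-per-token layout; for the inline form from the question, a sketch that first splits each line at the brackets (reusing the sed trick from the first answer, and assuming GNU sed and that anything outside brackets is valid base64) could look like:
while read -r line; do
    out=""
    while IFS= read -r part; do
        [[ -z $part ]] && continue
        case "$part" in
            \[*\]) out+=$part ;;                      # keep bracketed token as-is
            *)     out+=$(base64 -d <<< "$part") ;;   # decode the base64 run
        esac
    done < <(sed -e 's/\[/\n[/g' -e 's/\]/]\n/g' <<< "$line")
    printf '%s\n' "$out"
done < example.txt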

I would suggest using a language other than sh, but here is a solution using cut. It handles the case where there is more than one [_constant_] in a line.
#!/bin/bash
function decode() {
local data=""
local line=$1
while [[ -n $line ]]; do
data=$data$(echo "$line" | cut -d[ -f1 | base64 -d)
const=$(echo "$line" | cut -d[ -sf2- | cut -d] -sf1)
[[ -n $const ]] && data=$data[$const]
line=$(echo "$line" | cut -d] -sf2-)
done
echo "$data"
}
while read -r line; do
decode "$line"
done < example.txt
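For example, applied to the inline sample from the question (expected output, assuming GNU coreutils base64):
$ decode 'aGkgdGhlcmUK[_CONSTANT_]SGVsbG8gV29ybGQhCg=='
hi there[_CONSTANT_]Hello World!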

If Perl is an option, you can say something like:
perl -MMIME::Base64 -lpe '$_ = join("", grep {/^\[/ || chomp($_ = decode_base64($_)), 1} split(/(?=\[)|(?<=\])/))' example.txt
The code below is equivalent to the above but broken down into steps for explanation purposes:
#!/bin/bash
perl -MMIME::Base64 -lpe '
@ary = split(/(?=\[)|(?<=\])/, $_);
foreach (@ary) {
if (! /^\[/) {
chomp($_ = decode_base64($_));
}
}
$_ = join("", @ary);
' example.txt
The -MMIME::Base64 option loads the base64 codec module.
The -lpe options make Perl behave like AWK, looping over input lines and handling newlines implicitly.
The regular expression (?=\[)|(?<=\]) matches the boundaries between the base64 blocks and the bracketed [...] blocks that must be preserved.
The split function divides the line into blocks at those boundaries and stores them in an array.
The loop then decodes every entry that is not a bracketed block.
Finally, join merges the blocks back into a line, which is printed.
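To see what that split produces on the inline sample, a quick check:
$ perl -E 'say for split /(?=\[)|(?<=\])/, q(aGkgdGhlcmUK[_CONSTANT_]SGVsbG8gV29ybGQhCg==)'
aGkgdGhlcmUK
[_CONSTANT_]
SGVsbG8gV29ybGQhCg==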

Related

Bash - Extract Matching String from GZIP Files Is Running Very Slow

Complete novice in Bash here. I am trying to iterate through 1000 gzip files; maybe GNU parallel is the solution?
#!/bin/bash
ctr=0
echo "file_name,symbol,record_count" > $1
dir="/data/myfolder"
for f in "$dir"/*.gz; do
gunzip -c $f | while read line;
do
str=`echo $line | cut -d"|" -f1`
if [ "$str" == "H" ]; then
if [ $ctr -gt 0 ]; then
echo "$f,$sym,$ctr" >> $1
fi
ctr=0
sym=`echo $line | cut -d"|" -f3`
echo $sym
else
ctr=$((ctr+1))
fi
done
done
Any help to speed up the process will be greatly appreciated!
#!/bin/bash
ctr=0
export ctr
echo "file_name,symbol,record_count" > $1
dir="/data/myfolder"
export dir
doit() {
f="$1"
gunzip -c $f | while read line;
do
str=`echo $line | cut -d"|" -f1`
if [ "$str" == "H" ]; then
if [ $ctr -gt 0 ]; then
echo "$f,$sym,$ctr"
fi
ctr=0
sym=`echo $line | cut -d"|" -f3`
echo $sym >&2
else
ctr=$((ctr+1))
fi
done
}
export -f doit
parallel doit ::: *gz 2>&1 > $1
The Bash while read loop is probably your main bottleneck here. Calling multiple external processes for simple field splitting will exacerbate the problem. Briefly,
while IFS="|" read -r first second third rest; do ...
leverages the shell's built-in field splitting functionality, but you probably want to convert the whole thing to a simple Awk script anyway.
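For reference, that shape applied to the inner loop of the original script (inside the existing for f in ... loop) would look like this; a sketch only, the awk version below is still preferable:
ctr=0
while IFS="|" read -r first _ third _; do
    if [ "$first" = "H" ]; then
        [ "$ctr" -gt 0 ] && echo "$f,$sym,$ctr"
        ctr=0
        sym=$third
        echo "$sym" >&2
    else
        ctr=$((ctr+1))
    fi
done < <(gunzip -c "$f")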
echo "file_name,symbol,record_count" > "$1"
for f in "/data/myfolder"/*.gz; do
gunzip -c "$f" |
awk -F "\|" -v f="$f" -v OFS="," '
$1 == "H" { if(ctr) print f, sym, ctr
ctr=0; sym=$3;
print sym >"/dev/stderr"
next }
{ ++ctr }'
done >>"$1"
This vaguely assumes that printing the lone sym is just for diagnostics. It should hopefully not be hard to see how this can be refactored if this is an incorrect assumption.

Bash - How can I execute a variable

I am reading a file with lines like:
folder=abc
name=xyz
For some lines I would like to set a variable, e.g. name=xyz, corresponding to the line I have read.
Cutting it down, with name=xyz and folder=abc, I have tried:
while read -r line; do
$line
echo $name
done < /etc/testfile.conf
This gives an error message: ./test: line 4: folder=abc: command not found, and so on.
I have tried "$line" and $($line) and it is the same. Is it possible to do what I want?
I have succeeded by doing:
while read -r line; do
if [[ "$line" == 'folder'* ]]; then
folder="$(echo "$line" | cut -d'=' -f 2)"
fi
if [[ "$line" == 'name'* ]]; then
name="$(echo "$line" | cut -d'=' -f 2)"
fi
done < /etc/testfile.conf
but this seems messy
For your sample, declare is the safest option:
while read -r line; do
declare "$line"
done
$ echo "$folder"
abc
$ echo "$name"
xyz
The direct approach is to use eval.
A different approach is to try source or .:
$ echo "$line"
folder=abc
$ . <(echo "$line")
$ echo "$folder"
abc
But probably the best answer is to tackle the problem in a different way.
You can clean up your approach a bit without resorting to eval.
while IFS="=" read -r name value; do
case $name in
folder) folder=$value ;;
name) name=$value ;;
esac
done < /etc/testfile.conf
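A quick sanity check of this loop with the two sample lines (here fed via a here-string instead of the file; note that read reuses name as the key holder, which still works):
$ while IFS="=" read -r name value; do
>   case $name in
>     folder) folder=$value ;;
>     name) name=$value ;;
>   esac
> done <<< $'folder=abc\nname=xyz'
$ echo "$folder" "$name"
abc xyz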
Why not just source the file?
$ . infile ; echo "$name"
xyz

Efficient way to add/ append huge files

Below is a shell script that is written to process a huge file. It reads a fixed-length file line by line, performs substring extraction, and appends the result to another file as a delimited file. It works perfectly, but it is too slow.
array=() # Create array
while IFS='' read -r line || [[ -n "$line" ]] # Read a line
do
coOrdinates="$(echo -e "${line}" | grep POSITION | cut -d'(' -f2 | cut -d')' -f1 | cut -d':' -f1,2)"
if [[ -z "${coOrdinates// }" ]];
then
echo "Not adding"
else
array+=("$coOrdinates")
fi
done < "$1_CTRL.txt"
while read -r line;
do
result='"'
for e in "${array[@]}"
do
SUBSTRING1=`echo "$e" | sed 's/.*://'`
SUBSTRING=`echo "$e" | sed 's/:.*//'`
result1=`perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)"`
result1="$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
result=$result$result1'"'',''"'
done
echo $result >> $1_1.txt
done < "$1.txt"
Earlier, I had used the cut command and changed it as above, but there was no improvement in the time taken.
Can you please suggest what kind of changes can be made to improve the processing time?
Thanks in advance.
Update:
Sample content of the input file :
XLS01G702012 000034444132412342134
Control File :
OPTIONS (DIRECT=TRUE, ERRORS=1000, rows=500000) UNRECOVERABLE
load data
CHARACTERSET 'UTF8'
TRUNCATE
into table icm_rls_clientrel2_hg
trailing nullcols
(
APP_ID POSITION(1:3) "TRIM(:APP_ID)",
RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"
)
Output file:
"LS0","1G702012 0000"
perl:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);    # say is used below
# read the control file
my $ctrl;
{
local $/ = "";
open my $fh, "<", shift @ARGV;
$ctrl = <$fh>;
close $fh;
}
my @positions = ( $ctrl =~ /\((\d+):(\d+)\)/g );
# read the data file
open my $fh, "<", shift @ARGV;
while (<$fh>) {
my @words;
for (my $i = 0; $i < scalar(@positions); $i += 2) {
push @words, substr($_, $positions[$i], $positions[$i+1]);
}
say join ",", map {qq("$_")} @words;
}
close $fh;
perl parse.pl x_CTRL.txt x.txt
"LS0","1G702012 00003"
Different results from what you requested:
In the POSITION(m:n) syntax of the control file, is n a length or an index?
In the data file, are those spaces or tabs?
I suggest, with pure bash and to avoid subshells:
if [[ $line =~ POSITION ]] ; then # grep POSITION
coOrdinates="${line#*(}" # cut -d'(' -f2
coOrdinates="${coOrdinates%)*}" # cut -d')' -f1
coOrdinates="${coOrdinates/:/ }" # cut -d':' -f1,2
if [[ -z "${coOrdinates// }" ]]; then
echo "Not adding"
else
array+=("$coOrdinates")
fi
fi
More efficient, by gniourf_gniourf:
if [[ $line =~ POSITION\(([[:digit:]]+):([[:digit:]]+)\) ]]; then
array+=( "${BASH_REMATCH[*]:1:2}" )
fi
similarly:
SUBSTRING1=${e#*:} # $( echo "$e" | sed 's/.*://' )
SUBSTRING=${e%:*} # $( echo "$e" | sed 's/:.*//' )
# to confirm, I don't know perl substr
result1=${line:$SUBSTRING:$SUBSTRING1} # $( perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)" )
#result1= # "$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
# trim, if necessary?
result1="${result1%${result1##*[^[:space:]]}}" # right
result1="${result1#${result1%%[^[:space:]]*}}" # left
gniourf_gniourf suggests moving the grep out of the loop:
while read ...; do
...
done < <(grep POSITION ...)
For extra efficiency: while read loops are very slow in Bash, so prefiltering as much as possible speeds the process up quite a lot.
Updated Answer
Here is a version where I parse the control file with awk, save the character positions and then use those when parsing the input file:
awk '
/APP_ID/ {
sub(/\).*/,"") # Strip closing parenthesis and all that follows
sub(/^.*\(/,"") # Strip everything up to opening parenthesis
split($0,a,":") # Extract the two character positions separated by colon into array "a"
next
}
/RELATIONSHIP/ {
sub(/\).*/,"") # Strip closing parenthesis and all that follows
sub(/^.*\(/,"") # Strip everything up to opening parenthesis
split($0,b,"[():]") # Extract character positions into array "b"
next
}
FNR==NR{next}
{ f1=substr($0,a[1]+1,a[2]); f2=substr($0,b[1]+1,b[2]); printf("\"%s\",\"%s\"\n",f1,f2)}
' ControlFile InputFile
Original Answer
Not a complete, rigorous answer, but this should give you an idea of how to do the extraction with awk once you have the POSITION parameters from the control file:
awk -v a=2 -v b=3 -v c=5 -v d=21 '{f1=substr($0,a,b); f2=substr($0,c,d); printf("\"%s\",\"%s\"\n",f1,f2)}' InputFile
Sample Output
"LS0","1G702012 00003"
Try running that on your large input file to get an idea of the performance, then tweak the output. Reading the control file is not at all time-critical so don't bother with optimising that.
To avoid the (slow) while loop, you can use cut and paste:
#!/bin/bash
inFile=${1:-checkHugeFile}.in
ctrlFile=${1:-checkHugeFile}_CTRL.txt
outFile=${1:-checkHugeFile}.txt
cat /dev/null > $outFile
typeset -a array # Create array
while read -r line # Read a line
do
coOrdinates="${line#*(}"
coOrdinates="${coOrdinates%%)*}"
[[ -z "${coOrdinates// }" ]] && { echo "Not adding"; continue; }
array+=("$coOrdinates")
done < <(grep POSITION "$ctrlFile" )
echo coOrdinates: "${array[@]}"
for e in "${array[@]}"
do
nr=$((nr+1))
start=${e%:*}
len=${e#*:}
from=$(( start + 1 ))
to=$(( start + len + 1 ))
cut -c$from-$to $inFile > ${outFile}.$nr
done
paste $outFile.* | sed -e 's/^/"/' -e 's/\t/","/' -e 's/$/"/' >${outFile}
rm $outFile.[0-9]

Bash (split) file name comparison fails

In my directory I have files (*fastq.gz.fasta) and directories, whose names contain the filenames (*fastq.gz.fasta-blastdb):
IVC6_Meino.clust.gz.fasta-blastdb
IVC5_Mehiv.clust.gz.fasta-blastdb
....
IVC6_Meino.clust.gz.fasta
IVC5_Mehiv.clust.gz.fasta
....
In a bash script I want to compare the filenames with the directories, using cut on the latter to extract only the filename part. If those two names match I want to do further stuff (for now, echo match or no match respectively).
I have written the following piece of code:
#!/bin/bash
for file in *.fasta
do
for db in *-blastdb
do
echo $file, $db | cut -d '-' -f 1
if [[ $file = "$db | cut -d '-' -f 1" ]]; then
echo "match"
else
echo "no match"
fi
done
done
But it does not detect matches. The output looks like this:
...
IVC6_Meino.clust.gz.fasta, IIIA11_Meova.clust.gz.fasta
no match
IVC6_Meino.clust.gz.fasta, IVC5_Mehiv.clust.gz.fasta
no match
IVC6_Meino.clust.gz.fasta, IVC6_Meino.clust.gz.fasta
no match
The last line should read match: as you can see, the strings look the same.
What am i missing?
You can use parameter expansion to do this more easily:
for file in *.fasta
do
for db in *-blastdb
do
echo "$file", "$db"
if [[ "${file%%.fasta}" = "${db%%.fasta-blastdb}" ]]; then
echo "match"
else
echo "no match"
fi
done
done
If you want to fix yours, the problem is the use of $db | cut -d '-' -f 1. In the echo line it may look like echo prints the result of the pipe, but it doesn't: the entire echo output is piped through cut, and it is cut that prints the trimmed names. Inside [[ $file = "$db | cut -d '-' -f 1" ]], however, no pipeline runs at all; the right-hand side is the literal string formed by the contents of $db followed by the text | cut -d '-' -f 1, so the comparison can never succeed.
You need the $(..) shell construct to capture the output of the pipeline, and you need echo to feed the contents of $db into it. You should also quote "$db" so you do not get word splitting or globbing from the contents of the variable.
Like so:
for file in *.fasta
do
for db in *-blastdb
do
ts=$(echo "$db" | cut -d '-' -f 1)
echo "$file", "$ts"
if [[ "$file" = "$ts" ]]; then
echo "match"
else
echo "no match"
fi
done
done # this works I think -- not tested...
Please be careful with your quoting with Bash and liberally use ShellCheck.
The structure you have is also not the most efficient. You will loop over the *-blastdb glob once for every file matching *.fasta. If you have a lot of files, that could get really slow.
To solve that, you could rewrite this loop with Bash arrays (best if you have Bash 4+) or use awk:
ext1=.fasta
ext2=.fasta-blastdb
awk 'FNR==NR{
s=$0
sub("\\"ext1"$","",s)
seen[s]=$0
next}
{
s=$0
sub("\\"ext2"$","",s)
if (s in seen)
print seen[s], $0
}
' ext1="$ext1" ext2="$ext2" <(for fn in *$ext1; do echo "$fn"; done) <(for fn in *$ext2; do echo "$fn"; done)
Each glob is only expanded once, and awk uses an array to test whether the basenames are the same.
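For completeness, here is a sketch of the Bash 4+ associative-array variant mentioned above (untested against unusual filenames):
declare -A seen
for file in *.fasta; do
    seen["${file%.fasta}"]=$file    # key is the shared basename
done
for db in *-blastdb; do
    base=${db%.fasta-blastdb}
    if [[ -n ${seen[$base]+x} ]]; then
        echo "${seen[$base]}, $db: match"
    fi
done
Like the awk version, each glob is expanded once and the lookup is a hash probe instead of an inner loop.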

bash Shell: lost first element data partially

Using bash shell:
I am trying to read a file line by line, where every line contains two meaningful file names delimited by "``".
File 1: image_config.txt
bbbbb.mp4``thumb/hashdata.gif
bbbbb.mp4``thumb/hashdata2.gif
Shell Script
#!/bin/bash
filename="image_config.txt"
while IFS='' read -r line || [[ -n "$line" ]]; do
IFS='``' read -r -a array <<< "$line"
if [ "$line" = "" ]; then
echo lineempty
else
file=${array[0]}
hash=${array[2]}
echo $file$hash;
output=$(ffmpeg -v warning -ss 2 -t 0.8 -i $file -vf scale=200:-1 -gifflags +transdiff -y $hash);
echo $output;
# echo ${array[0]}${array[1]}${array[2]}
fi;
done < "$filename"
It executes successfully the first time, but when the loop executes the second time,
the variable file loses bbbbb from bbbbb.mp4,
and the following output comes out:
Output :
user#domain [~/public_html/Videos]$ sh imager.sh
bbbbb.mp4thumb/hashdata.gif
.mp4thumb/hashdata2.gif
.mp4: No such file or directory
lineempty
Please check out Bash FAQ 89 - I'm using a loop which runs once per line of input but it only seems to run once; everything after the first line is ignored? which seems to be helpful in your case.
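In short, ffmpeg reads from standard input, so inside the loop it swallows the rest of image_config.txt after the first iteration. A minimal sketch of the usual fix (assuming your ffmpeg build supports -nostdin; otherwise append </dev/null to the command instead):
output=$(ffmpeg -nostdin -v warning -ss 2 -t 0.8 -i "$file" -vf scale=200:-1 -gifflags +transdiff -y "$hash")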
Aside:
There is no point in using the same character twice in IFS.
IFS=\`
is enough.
Check out this:
var='abc``def'
IFS=\`\` read -ra arr <<< "$var"
printf '<%s>\n' "${arr[@]}"
Output:
<abc>
<>
<def>
As you can see, arr[0] is abc, arr[1] is empty and arr[2] is def, and not arr[0] is abc and arr[1] is def as one might expect.
Taken from the IFS wiki of Greycat and Lhunath Bash Guide :
The IFS variable is used in shells (Bourne, POSIX, ksh, bash) as the input field separator (or internal field separator). Essentially, it is a string of special characters which are to be treated as delimiters between words/fields when splitting a line of input.
Here is how you could do it differently, avoiding a read inside the read:
#!/bin/bash
filename="image_config.txt"
while IFS='' read -r line || [[ -n "$line" ]]; do
if [ "$line" = "" ]; then
echo lineempty
else
file=$( echo ${line} | awk -F \` ' { print $1 } ' )
hash=$( echo ${line} | awk -F \` ' { print $3 } ' )
echo $file$hash;
output=$(ffmpeg -v warning -ss 2 -t 0.8 -i $file -vf scale=200:-1 -gifflags +transdiff -y $hash);
echo $output;
fi;
done < "$filename"
