BASH: Padding a series of HEX values based on the longest string - bash

I have this odd condition where I've been given a series of HEX values that represent binary data. The interesting thing is that they are occasionally different lengths, such as:
40000001AA
0000000100
A0000001
000001
20000001B0
40040001B0
I would like to append 0's on the end to make them all the same length based on the longest entry. So, in the example above I have four entires that are 10 characters long, terminated by '\n', and a few short ones (in the actual data, I 200k of entries with about 1k short ones). What I would like to do figure out the longest string in the file, and then go through and pad the short ones; however, I haven't been able to figure it out. Any suggestions would be appreciated.

Using standard two-pass awk:
awk 'NR==FNR{if (len < length()) len=length(); next}
{s = sprintf("%-*s", len, $0); gsub(/ /, "0", s); print s}' file file
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0
Or using gnu wc with awk:
awk -v len="$(wc -L < file)" '
{s = sprintf("%-*s", len, $0); gsub(/ /, "0", s); print s}' file
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0

As you use Bash there is a big chance that you also use other GNU
tools. In such case wc can easily tell you the the length of the
greatest line in the file using -L option. Example:
$ wc -L /tmp/HEX
10 /tmp/HEX
Padding can be done like this:
$ while read i; do echo $(echo "$i"0000000000 | head -c 10); done < /tmp/HEX
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0
A one-liner:
while read i; do eval printf "$i%.s0" {1..$(wc -L /tmp/HEX | cut -d ' ' -f1)} | head -c $(wc -L /tmp/HEX | cut -d ' ' -f1); echo; done < /tmp/HEX

In general to zero-pad a string from either or both sides is (using 5 as the desired field width for example):
$ echo '17' | awk '{printf "%0*s\n", 5, $0}'
00017
$ echo '17' | awk '{printf "%s%0*s\n", $0, 5-length(), ""}'
17000
$ echo '17' | awk '{w=int((5+length())/2); printf "%0*s%0*s\n", w, $0, 5-w, ""}'
01700
$ echo '17' | awk '{w=int((5+length()+1)/2); printf "%0*s%0*s\n", w, $0, 5-w, ""}'
00170
so for your example:
$ awk '{cur=length()} NR==FNR{max=(cur>max?cur:max);next} {printf "%s%0*s\n", $0, max-cur, ""}' file file
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0

Let's suppose you have this values in file:
file=/tmp/hex.txt
Find out length of longest number:
longest=$(wc -L < $file)
Now for each number in file justify it with zeroes
while read number; do
printf "%-${longest}s\n" $number | sed 's/ /0/g'
done < $file
This what will print script to stdout:
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0

Related

using awk to print header name and a substring

i try using this code for printing a header of a gene name and then pulling a substring based on its location but it doesn't work
>output_file
cat input_file | while read row; do
echo $row > temp
geneName=`awk '{print $1}' tmp`
startPos=`awk '{print $2}' tmp`
endPOs=`awk '{print $3}' tmp`
for i in temp; do
echo ">${geneName}" >> genes_fasta ;
echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
done
done
input_file
nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548
fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............
and this is my wrong output file
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
output should look like that:
>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....
You can do this with a single call to awk which will be orders of magnitude more efficient than looping in a shell script and calling awk 4-times per-iteration. Since you have bash, you can simply use command substitution and redirect the contents of fasta to an awk variable and then simply output the heading and the substring containing the beginning through ending characters from your fasta file.
For example:
awk -v fasta=$(<fasta) '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
or using getline within the BEGIN rule:
awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
Example Input Files
Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:
$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79
and the first 129-characters of your example fasta
$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
Example Use/Output
$ awk -v fasta=$(<fasta) '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg
Look thing over and let me know if I understood your question requirements. Also let me know if you have further questions on the solution.
If I'm understanding correctly, how about:
awk 'NR==FNR {fasta = fasta $0; next}
{
printf(">%s %s\n", $1, substr(fasta, $2, $3 - $2 + 1))
}' fasta input_file > genes_fasta
It first reads fasta file and stores the sequence in a variable fasta.
Then it reads input_file line by line, extracts the substring of fasta starting at $2 and of length $3 - $2 + 1. (Note that the 3rd argument to substr function is length, not endpos.)
Hope this helps.
made it work!
this is the script for pulling substrings from a fasta file
cat genes_and_bounderies1 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
done
done

Replace hex char by a different random one (awk possible?)

have a mac address and I need to replace only one hex char (one at a very specific position) by a different random one (it must be different than the original). I have it done in this way using xxd and it works:
#!/bin/bash
mac="00:00:00:00:00:00" #This is a PoC mac address obviously :)
different_mac_digit=$(xxd -p -u -l 100 < /dev/urandom | sed "s/${mac:10:1}//g" | head -c 1)
changed_mac=${mac::10}${different_mac_digit}${mac:11:6}
echo "${changed_mac}" #This echo stuff like 00:00:00:0F:00:00
The problem for my script is that using xxd means another dependency... I want to avoid it (not all Linux have it included by default). I have another workaround for this using hexdump command but using it I'm at the same stage... But my script already has a mandatory awk dependency, so... Can this be done using awk? I need an awk master here :) Thanks.
Something like this may work with seed value from $RANDOM:
mac="00:00:00:00:00:00"
awk -v seed=$RANDOM 'BEGIN{ FS=OFS=":"; srand(seed) } {
s="0"
while ((s = sprintf("%x", rand() * 16)) == substr($4, 2, 1))
$4 = substr($4, 1, 1) s
} 1' <<< "$mac"
00:00:00:03:00:00
Inside while loop we continue until hex digit is not equal to substr($4, 2, 1) which 2nd char of 4th column.
You don't need xxd or hexdump. urandom will also generate nubmers that match the encodings of the digits and letters used to represent hexadecimal numbers, therefore you can just use
old="${mac:10:1}"
different_mac_digit=$(tr -dc 0-9A-F < /dev/urandom | tr -d "$old" | head -c1)
Of course, you can replace your whole script with an awk script too. The following GNU awk script will replace the 11th symbol of each line with a random hexadecimal symbol different from the old one. With <<< macaddress we can feed macaddress to its stdin without having to use echo or something like that.
awk 'BEGIN { srand(); pos=11 } {
old=strtonum("0x" substr($0,pos,1))
new=(old + 1 + int(rand()*15)) % 16
print substr($0,1,pos-1) sprintf("%X",new) substr($0,pos+1)
}' <<< 00:00:00:00:00:00
The trick here is to add a random number between 1 and 15 (both inclusive) to the digit to be modified. If we end up with a number greater than 15 we wrap around using the modulo operator % (16 becomes 0, 17 becomes 1, and so on). That way the resulting digit is guaranteed to be different from the old one.
However, the same approach would be shorter if written completely in bash.
mac="00:00:00:00:00:00"
old="${mac:10:1}"
(( new=(16#"$old" + 1 + RANDOM % 15) % 16 ))
printf %s%X%s\\n "${mac::10}" "$new" "${mac:11}"
"One-liner" version:
mac=00:00:00:00:00:00
printf %s%X%s\\n "${mac::10}" "$(((16#${mac:10:1}+1+RANDOM%15)%16))" "${mac:11}"
bash has printf builtin and a random function (if you trust it):
different_mac_digit() {
new=$1
while [[ $new = $1 ]]; do
new=$( printf "%X" $(( RANDOM%16 )) )
done
echo $new
}
Invoke with the character to be replaced as argument.
Another awk:
$ awk -v n=11 -v s=$RANDOM ' # set n to char # you want to replace
BEGIN { FS=OFS="" }{ # each char is a field
srand(s)
while((r=sprintf("%x",rand()*16))==$n);
$n=r
}1' <<< $mac
Output:
00:00:00:07:00:00
or oneliner:
$ awk -v n=11 -v s=$RANDOM 'BEGIN{FS=OFS=""}{srand(s);while((r=sprintf("%x",rand()*16))==$n);$n=r}1' <<< $mac
$ mac="00:00:00:00:00:00"
$ awk -v m="$mac" -v p=11 'BEGIN{srand(); printf "%s%X%s\n", substr(m,1,p-1), int(rand()*15-1), substr(m,p+1)}'
00:00:00:01:00:00
$ awk -v m="$mac" -v p=11 'BEGIN{srand(); printf "%s%X%s\n", substr(m,1,p-1), int(rand()*15-1), substr(m,p+1)}'
00:00:00:0D:00:00
And to ensure you get a different digit than you started with:
$ awk -v mac="$mac" -v pos=11 'BEGIN {
srand()
new = old = toupper(substr(mac,pos,1))
while (new==old) {
new = sprintf("%X", int(rand()*15-1))
}
print substr(mac,1,pos-1) new substr(mac,pos+1)
}'
00:00:00:0D:00:00

Splitting csv file into multiple files with 2 columns in each file

I am trying to split a file (testfile.csv) that contains the following:
1,2,4,5,6,7,8,9
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
into a file
1,2
a,b
q,w
a,s
z,x
and another file
4,5
c,d
e,r
d,f
c,v
but I cannot seem to do that in awk using an iterative solution.
awk -F, '{print $1, $2}'
awk -F, '{print $3, $4}'
does it for me but I would like a looping solution.
I tried
awk -F, '{ for (i=1;i< NF;i+=2) print $i, $(i+1) }' testfile.csv
but it gives me a single column. It appears that I am iterating over the first row and then moving onto the second row skipping every other element of that specific row.
You can use cut:
$ cut -d, -f1,2 file > file_1
$ cut -d, -f3,4 file > file_2
If you are going to use awk be sure to set the OFS so that the columns remain a CSV file:
$ awk 'BEGIN{FS=OFS=","}
{print $1,$2 >"f1"; print $3,$4 > "f2"}' file
$ cat f1
1,2
a,b
q,w
a,s
z,x
$cat f2
4,5
c,d
e,r
d,f
c,v
Is there a quick and dirty way of renaming the resulting files with the first row and first column (like first file would be 1.csv, second file would be 4.csv:
awk 'BEGIN{FS=OFS=","}
FNR==1 {n1=$1 ".csv"; n2=$3 ".csv"}
{print $1,$2 >n1; print $3,$4 > n2}' file
awk -F, '{ for (i=1; i < NF; i+=2) print $i, $(i+1) > i ".csv"}' tes.csv
works for me. I was trying to get the output in bash which was all jumbled up.
It's do-able in bash, but it will be much slower than awk:
f=testfile.csv
IFS=, read -ra first < <(head -1 "$f")
for ((i = 0; i < (${#first[#]} + 1) / 2; i++)); do
slice_file="${f%.csv}$((i+1)).csv"
cut -d, -f"$((2 * i + 1))-$((2 * (i + 1)))" "$f" > "$slice_file"
done
with sed:
sed -r '
h
s/(.,.),./\1/w file1.txt
g
s/.,.,(.,.),./\1/w file2.txt' file.txt

How can I use bash to split only some elements of a text file?

I'm trying to figure out how to make a .txt file (myGeneFile.txt) of IDs and genes that looks like this:
Probe Set ID Gene Symbol
1007_s_at DDR1 /// MIR4640
1053_at RFC2
117_at HSPA6
121_at PAX8
1255_g_at GUCA1A
1294_at MIR5193 /// UBA7
into this:
DDR1
MIR4640
RFC2
HSPA6
PAX8
GUCA1A
MIR5193
UBA
First I tried doing this:
cat myGeneFile.txt | tail -n +2 | awk '{split($2,a,"///"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5];}' > test.txt
(i.e., I removed the top (header) line of the file, I tried splitting the second line along the delimiter ///, and then printing any genes that might appear)
Then, I tried doing this:
cat myGeneFile.txt | tail -n +2 | awk '{print $2}' | grep -o -E '\w+' > test.txt
(literally listing out all of the words in the second column)
I got the same output in both cases - a long list of just the first gene in each row (e.g. so MIR4640 and UBA7 were mising)
Any ideas?
EDIT: Thanks #CodeGnome for your help. I ended up using that code and modifying it because I discovered that my file had between 1 and 30 different gene names on each row. So, I used:
awk 'NR == 1 {next}
{
sub("///", "")
print $2 }
{ for (i=3; i<=30; i++)
if ($i) {print $i}
}' myGeneFile.txt > test2.txt
#GlenJackson also had a solution that worked really well:
awk 'NR>1 {for (i=2; i<=NF; i++) if ($i != "///") print $i}' file
My awk take:
awk 'NR>1 {for (i=2; i<=NF; i++) if ($i != "///") print $i}' file
or sed
sed '
1d # delete the header
s/[[:blank:]]\+/ /g # squeeze whitespace
s/^[^ ]\+ // # remove the 1st word
s| ///||g # delete all "///" words
s/ /\n/g # replace spaces with newlines
' file
Use Conditional Print Statements Inside an AWK Action
The following gives the desired output by removing unwanted characters with sub(), and then using multiple print statements to create the line breaks. The second print statement is conditional, and only triggers when the third field isn't empty; this avoids creating extraneous empty lines in the output.
$ awk 'NR == 1 {next}
{
sub("///", "")
print $2
if ($3) {print $3}
}' myGeneFile.txt
DDR1
MIR4640
RFC2
HSPA6
PAX8
GUCA1A
MIR5193
UBA7
This will work:
tail -n+2 tmp | sed -E 's/ +/ /' | cut -d' ' -f2- | sed 's_ */// *_\n_'
Here's what is happening:
tail -n+2 Strip off the header
sed -E 's/ +/ /' Condense the whitespace
cut -d' ' -f2- Use cut to select all fields but the first, using a single space as the delimiter
sed 's_ */// *_\n_' Convert all /// (and any surrounding whitespace) to a newline
You don't need the initial cat, it's usually better to simply pass the input file as an argument to the first command. If you want the file name in a place that is easy to change, this is a better option as it avoids the additional process (and I find it easier to change the file if it's at the end):
(tail -n+2 | sed -E 's/ +/ /' | cut -d' ' -f2- | sed 's_ */// *_\n_') < tmp
Given the existing input and the modified requirement (from the comment on Morgen's answer) the following should do what you want (for any number of gene columns).
awk 'NR > 1 {
p=0
for (i = 2; i <= NF; i++) {
if ($i == "///") {
p=1
continue
}
printf "%s%s\n", p?"n":"", $i
}
}' input.txt
Your criteria for selecting which strings to output is not entirely clear, but here's another command that at least produces your expected output:
tail -n +2 myGeneFile.txt | grep -oE '\<[A-Z][A-Z0-9]*\>'
It basically just 1) skips the first line and 2) finds all other words (delimited by non-word characters and/or start/end of line) that consist entirely of uppercase letters or digits, with the first being a letter.

get Nth line in file after parsing another file

I have one of my large file as
foo:43:sdfasd:daasf
bar:51:werrwr:asdfa
qux:34:werdfs:asdfa
foo:234:dfasdf:dasf
qux:345:dsfasd:erwe
...............
here 1st column foo, bar and qux etc. are file names. and 2nd column 43,51, 34 etc. are line numbers. I want to print Nth line(specified by 2nd column) for each file(specified in 1st column).
How can I automate above in unix shell.
Actually above file is generated while compiling and I want to print warning line in code.
-Thanks,
while IFS=: read name line rest
do
head -n $line $name | tail -1
done < input.txt
while IFS=: read file line message; do
echo "$file:$line - $message:"
sed -n "${line}p" "$file"
done <yourfilehere
awk 'NR==4 {print}' yourfilename
or
cat yourfilename | awk 'NR==4 {print}'
The above one will work for 4th line in your file.You can change the number as per your requirement.
Just in awk, but probably worse performance than answers by #kev or #MarkReed.
However it does process each file just once. Requires GNU awk
gawk -F: '
BEGIN {OFS=FS}
{
files[$1] = 1
lines[$1] = lines[$1] " " $2
msgs[$1, $2] = $3
}
END {
for (file in files) {
split(lines[file], l, " ")
n = asort(l)
count = 0
for (i=1; i<=n; i++) {
while (++count <= l[i])
getline line < file
print file, l[i], msgs[file, l[i]]
print line
}
close(file)
}
}
'
This might work for you:
sed 's/^\([^,]*\),\([^,]*\).*/sed -n "\2p" \1/' file |
sort -k4,4 |
sed ':a;$!N;s/^\(.*\)\(".*\)\n.*"\(.*\)\2/\1;\3\2/;ta;P;D' |
sh
sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt
qux 34
-n no output by default,
-r regular expressions (simplifies using the parens)
in line 3 do {...;p} (print in the end)
s ubstitute foobarbaz with foo bar
So to work with the values:
fnUln=$(sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt)
fn=$(echo ${fnUln/ */})
ln=$(echo ${fnUln/* /})
sed -n "${ln}p" "$fn"

Resources