Wrap a single oversize column with awk / bash (pretty print)

I have this table structure (assume that the delimiters are tabs):
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which will easily extend the recommended output width of 80 characters.
03 Etim Last description
What I want is this:
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery
long description which
will easily extend the
recommended output width
of 80 characters.
03 Etim Last description
That means I want to split $3 into an array of strings with a predefined WIDTH, where the first element is appended "normally" to the current line and all subsequent elements get a new line with indentation matching the padding of the first two columns (the padding could also be fixed, if that's easier).
Alternatively, the text in $0 could be split by a GLOBAL_WIDTH (e.g. 80 chars) into a first string and the "rest" -> the first string gets printed "normally" with printf, the rest is split by GLOBAL_WIDTH - (COLPAD1 + COLPAD2) and appended with new lines as above.
I tried running fmt and fold after my awk formatting (which basically just adds headings to the table), but of course they do not respect awk's notion of fields.
How can I achieve this using bash tools and/or awk?

First build a test file (called file.txt):
echo "AA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which will easily extend the recommended output width of 80 characters.
03 Etim Last description" > file.txt
Now the script (called ./split-columns.sh):
#!/bin/bash
FILE=$1
# find the position of the 3rd column (the one starting with 'CCC')
padding=$(head -n1 "$FILE" | grep -aob 'CCC' | cut -d: -f1)
paddingstr=$(printf "%-${padding}s" ' ')
# set the maximum column length
maxcolsize=50
maxlen=$((padding + maxcolsize))
while IFS= read -r line; do
    # split the line only if it exceeds the desired length
    if [[ ${#line} -gt $maxlen ]]; then
        echo "$line" | fmt -s -w"$maxcolsize" - | head -n1
        echo "$line" | fmt -s -w"$maxcolsize" - | tail -n+2 | sed "s/^/$paddingstr/"
    else
        echo "$line"
    fi
done < "$FILE"
Finally, run it with the file as its single argument:
./split-columns.sh file.txt > fixed-width-file.txt
Output will be:
AA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description
which will easily extend the recommended output
width of 80 characters.
03 Etim Last description
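If you'd rather stay entirely inside awk, here is a minimal sketch of the same idea. The 50-character width, the break-at-the-last-blank logic, and deriving the indentation from where column 3 starts in the header are all my assumptions, not part of the script above:

```shell
# build the same sample file
printf '%s\n' \
  'AA BBBB CCC' \
  '01 Item Description here' \
  '02 Meti A very very veeeery long description which will easily extend the recommended output width of 80 characters.' \
  '03 Etim Last description' > file.txt

awk -v width=50 '
NR == 1 { pad = index($0, $3) - 1 }       # indentation = where column 3 starts
{
    line = $0
    while (length(line) > width) {
        brk = width + 1
        while (brk > 1 && substr(line, brk, 1) != " ") brk--   # back up to a blank
        if (brk == 1) break                                    # no blank found: give up
        print substr(line, 1, brk - 1)
        line = sprintf("%" pad "s", "") substr(line, brk + 1)  # indent the remainder
    }
    print line
}' file.txt
```

Unlike the fmt pipeline, this never re-flows the short lines, so the first two columns stay exactly where awk saw them.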

You can try this Perl one-liner:
perl -lpe ' s/(.{20,}?)\s/$1\n\t /g ' file
with the given inputs
$ cat thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which will easily extend the recommended output width of 80 characters.
03 Etim Last description
$ perl -lpe ' s/(.{20,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description
here
02 Meti A very very
veeeery long description
which will easily extend
the recommended output
width of 80 characters.
03 Etim Last description
$
If you want to try a length window of 30/40/50:
$ perl -lpe ' s/(.{30,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery
long description which will easily
extend the recommended output width
of 80 characters.
03 Etim Last description
$ perl -lpe ' s/(.{40,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description
which will easily extend the recommended
output width of 80 characters.
03 Etim Last description
$ perl -lpe ' s/(.{50,}?)\s/$1\n\t /g ' thurse.txt
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which
will easily extend the recommended output width of
80 characters.
03 Etim Last description
$

#!/usr/bin/awk -f
# Read standard input, which should be a file of lines, each line
# containing tab-separated strings. The string values may be very long.
# Columnate the output by wrapping long strings onto multiple lines
# within each field's specified length.
# Arguments are numeric field lengths. If an input line contains more
# values than the number of field lengths supplied, the last field
# length will be re-used.
#
# Invoke like this: wrapcolumns 30 40 40
BEGIN {
    FS = "\t";
    if (ARGC < 2) {
        print "usage: wrapcolumns length1 ... lengthn";
        exit;
    }
    for (i = 1; i < ARGC; i++) {
        fieldlengths[i-1] = ARGV[i];
        ARGV[i] = "";
    }
}
# return a string of n blanks
function blanks(n,    result) {
    result = " ";
    while (length(result) < n) {
        result = result result;
    }
    return substr(result, 1, n);
}
{
    # ARGC - 1 is the length of the fieldlengths array,
    # so ARGC - 2 is the index of its last element (it is zero-origin).
    # If the input line has more fields than the fieldlengths array,
    # use the last element.
    # any nonempty fields left?
    gotanyleft = 1;
    while (gotanyleft == 1) {
        gotanyleft = 0;
        for (i = 1; i <= NF; i++) {
            # length of the current field
            len = (ARGC - 2 < i) ? fieldlengths[ARGC - 2] : fieldlengths[i - 1];
            # print that much of the current field, blank-padded, with a
            # ":::" column separator, then remove that much from the front
            printf "%s", substr($(i) blanks(len), 1, len) ":::";
            $(i) = substr($(i), len + 1);
            if ($(i) != "") {
                gotanyleft = 1;
            }
        }
        print "";
    }
}

A loop-free awk solution:
{m,g}awk -v ______="${WIDTH}" 'BEGIN {
  OFS = ""
  FS = "\t"
  ___ = "\32\23"
  __ = sprintf("\n%*s",
  (_+=_^=_<_)+_^!_+(_+=_____=_+=_+_)+_____,__)
  ____ = sprintf("%*s",______-length(__),"")
  gsub(".",".",____)
  sub("[.].......$","..?.?.?.?.?.?.?.[ ]",____)
  ______ = _
} $!NF = sprintf("%.*s %*s %-*s %-s", _<_,_= $NF,_____,
$2,______, $--NF, substr("",gsub(____,
("&")___,_) * gsub("("(___)")+$","",_),
__ * gsub( (___), (__),_) )_)'
Output:
AAA BBBB CCC
01 Item Description here
02 Meti A very very veeeery long description which
will easily extend the recommended output
width of 80 characters.
03 Etim Last description

Related

How to loop through character in string and still detect null char in Bash

I have this function:
function convert_ascii_string_to_decimal {
ascii=$1
unset converted_result
while IFS="" read -r -n 1 char; do
decimal=$(printf '%d' "'$char")
echo $decimal
converted_result="$converted_result $decimal"
done < <(printf %s "$ascii")
converted_result=$(echo $converted_result | xargs) #strip leading and trailing
}
It is meant to take an ASCII string variable, loop through every character, and concatenate each character's decimal representation into a string. However, this while loop seems to ignore null characters, i.e. characters with ASCII code 0. I want to be able to read every ASCII character there is, including null.
To get all characters of a string as decimal numbers, you can use hexdump to parse the string:
echo -e "hello \x00world" | hexdump -v -e '1/1 "%d "'
104 101 108 108 111 32 0 119 111 114 108 100 10
This also works for parsing a file:
echo '05 04 03 02 01 00 ff' | xxd -r -ps > file
hexdump --no-squeezing --format '1/1 "%d "' file
5 4 3 2 1 0 255
hexdump explanation:
The -v and --no-squeezing options print all bytes (without skipping duplicate bytes).
The -e and --format options allow specifying an output format.
The format 1/1 "%d " means:
Iteration count = 1 (process each byte only once)
Byte count = 1 (apply this format to each byte)
Format = "%d" (convert to decimal)
You can't store the null character in a bash variable, which is what your script attempts with the $char variable.
I suggest using xxd instead of writing your own script:
echo -ne "some ascii text" | xxd -p
If we echo a null character:
$ echo -ne "\0" | xxd -p
00
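If you still want the shape of the original function, a minimal sketch is to let od do the whole conversion so no byte ever has to sit in a bash variable (the function name mirrors the one in the question; od and xargs are standard tools). Note that a literal NUL still cannot be passed in a bash argument, so data actually containing NUL bytes has to be piped into od directly:

```shell
convert_ascii_string_to_decimal() {
    # od emits one unsigned decimal per byte; xargs squeezes the whitespace
    printf '%s' "$1" | od -An -t u1 | xargs
}

convert_ascii_string_to_decimal "hello"   # -> 104 101 108 108 111
```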

Ascii value of Alphanumeric String

Hi, is there any way to get the ASCII values of an alphanumeric string without reading a single character at a time?
For example, if I enter A, the output should be 65.
If I enter Onkar123#, how do I calculate the ASCII values of that string?
I also want the sum of the ASCII values produced by the above string.
Try echo "test" | hexdump -e '16/1 "%02x " "\n"', replacing test with Onkar123# or anything else.
I don't know what kind of output you expect, nor why you care whether the string is processed one character at a time (how would you even know whether a given tool goes one character at a time, and how else could any tool do this anyway?), so I don't know if this is the kind of answer you're looking for, but maybe it will point you in a direction at least:
$ printf '%s' "Onkar123#" | awk -l ordchr -v RS='.{1}' '{print ord(RT)}'
79
110
107
97
114
49
50
51
35
The above uses GNU awk for ord() in the ordchr library.
Based on one of your comments, it sounds like this might be what you're looking for:
$ printf '%s' "Onkar123#" | awk -l ordchr -v RS='.{1}' '{s+=ord(RT)} END{print s+0}'
692
od
There's really no such thing as the ASCII value of a string. There is such a thing as the decimal (or octal, or hexadecimal) value of each ASCII character in a string, though.
Since you don't seem to have hexdump, try the od (octal dump) utility. I don't think I've ever seen a *nix system that didn't have od.
$ echo "Onkar123#" | od -An -t d1
79 110 107 97 114 49 50 51 35 10
I guess endianness might come into play. But od has a --endian argument for that.
awk
It's a lot harder in awk. I think you have to build a lookup table, then look up the decimal code for each character of the input. That means you still have to process one character at a time.
# output-decimal-ascii.awk -- write ASCII decimal codes for input
BEGIN {
# 128 for ASCII; 256 for extended ASCII
for(n = 0; n < 128; n++) {
ascii_table[sprintf("%c",n)] = n
}
}
{
split($0, arr, "")
for (i = 1; i <= length(arr); i++) {
printf("%d ", ascii_table[arr[i]])
}
print ""
}
$ echo "Onkar123#" | awk -f code/awk/output-decimal-ascii.awk
79 110 107 97 114 49 50 51 35
To sum the numbers use:
echo "test" | od -An -t d1 | xargs | sed "s/ /+/g" | bc
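The same sum can also be computed without bc by piping the od output into awk (a sketch; printf is used instead of echo so the trailing newline does not add 10 to the sum):

```shell
printf '%s' "Onkar123#" | od -An -t u1 |
awk '{ for (i = 1; i <= NF; i++) s += $i } END { print s }'   # -> 692
```

This matches the 692 produced by the gawk ordchr answer above, since both sum the bytes of the bare string.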

Shell Script: Arithmetic operation in array

I'm doing this for fun and as part of my learning process in shell scripting.
Let's say I have the initial input A B C.
What I'm trying to do is split the string and convert each character to its decimal value.
A B C = 65 66 67
Then I'll add a number to each decimal value, let's say the number 1.
Now, the decimal values will become = 66 67 68
Finally, I'll convert the decimals back to their characters, which will give B C D.
ubuntu@Ubuntu:~$ cat -n testscript.sh
#!/bin/bash
1 string="ABC"
2
3 echo -e "\nSTRING = $string"
4 echo LENGTH = ${#string}
5
6 # TUKAR STRING KE ARRAY ... word[x]
7 for i in $(seq 0 ${#string})
8 do word[$i]=${string:$i:1}
9 done
10
11 echo -e "\nZero element of array is [ ${word[0]} ]"
12 echo -e "Entire array is [ ${word[@]}] \n"
13
14 # CHAR to DECIMAL
15 for i in $(seq 0 ${#string})
16 do
17 echo -n ${word[$i]}
18 echo -n ${word[$i]} | od -An -tuC
19 chardec[$i]=$(echo -n ${word[$i]} | od -An -tuC)
20 done
21
22 echo -e "\nNEXT, DECIMAL VALUE PLUS ONE"
23 for i in $(seq 0 ${#string})
24 do
25 echo `expr ${chardec[$i]} + 1`
26 done
27
28 echo
This is the output
ubuntu@Ubuntu:~$ ./testscript.sh
STRING = ABC
LENGTH = 3
Zero element of array is [ A ]
Entire array is [ A B C ]
A 65
B 66
C 67
NEXT, DECIMAL VALUE PLUS ONE
66
67
68
1
As you can see in the output, there are 2 problems (or maybe more).
The last for loop processes an additional number. Any idea how to fix this?
NEXT, DECIMAL VALUE PLUS ONE
66
67
68
1
This is the formula to convert a decimal value back to a character. I'm trying to put the last value into another array and then feed it into another loop for this purpose. However, I still have no idea how to do this in a loop based on the previous data.
ubuntu@Ubuntu:~$ printf "\x$(printf %x 65)\n"
A
Please advise
Using bash you can replace all of your code with this code:
for i; do
    dec=$(printf '%d' "'$i")                   # character -> decimal code
    printf "\\$(printf '%03o' $((dec + 1))) "  # code + 1 -> character, via octal escape
done
echo
When you run it as:
./testscript.sh P Q R S
It will print:
Q R S T
awk to the rescue!
It's simpler to do the same in the awk environment:
$ echo "A B C" |
awk 'BEGIN{for(i=33;i<127;i++) o[sprintf("%c",i)]=i}
{for(i=1;i<=NF;i++) printf "%c%s", o[$i]+1, ((i==NF)?ORS:OFS)}'
B C D
seq counts from FIRST to LAST, so if your string length is 3, then seq 0 3 will give you <0,1,2,3>. Your second-to-last loop (lines 16-20) actually runs four iterations, but the last iteration prints nothing.
To printf the ASCII code, insert it inline, like:
printf "\x$(printf %x `expr ${chardec[$i]} + 1`) "
or more readably:
dec=`expr ${chardec[$i]} + 1`
printf "\x$(printf %x $dec)\n"
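Putting the pieces together, here is a minimal POSIX-shell sketch of the whole round trip (string -> decimal -> plus one -> string). The cut call for indexing and the octal escape for the reverse conversion are my choices, not the only way to do it:

```shell
string="ABC"
shifted=""
i=0
while [ "$i" -lt "${#string}" ]; do
    char=$(printf '%s' "$string" | cut -c $((i + 1)))   # i-th character (cut is 1-based)
    dec=$(printf '%d' "'$char")                         # character -> decimal (A -> 65)
    shifted="$shifted$(printf "\\$(printf '%03o' $((dec + 1)))")"   # decimal + 1 -> character
    i=$((i + 1))
done
echo "$shifted"   # -> BCD
```

The octal escape avoids the hex pitfall: arithmetic like $((5a + 1)) fails for characters whose hex code contains a letter, whereas the decimal-then-octal route works for every byte value.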

bash sort and paste columns in alphabetical order

I have a file.txt which has the following columns
id chr pos alleleA alleleB
1 01 1234 CT T
2 02 5678 G A
3 03 8901 T C
4 04 12345 C G
5 05 567890 T A
I am looking for a way of creating a new column so that it looks like chr:pos:alleleA:alleleB.
The problem is that alleleA and alleleB should be sorted based on:
1. alphabetical order
2. whichever of the two columns has more letters on a given line should come first, followed by the other column
In this example, it would look like this:
id chr pos alleleA alleleB newID
1 01 1234 CT T chr1:1234:CT:T
2 02 5678 G A chr2:5678:A:G
3 03 8901 T C chr3:8901:C:T
4 04 12345 C G chr4:12345:C:G
5 05 567890 T A chr5:567890:A:T
I appreciate any help and suggestions. Thanks.
EDIT
Up to now I can modify the chr column so that it looks like "chr:1"...
The alleleA and alleleB columns should be combined so that if either column contains more than one letter, it comes first in the newID column. If both columns contain only one letter, the letters are arranged alphabetically in the newID column.
A gawk solution:
awk 'function custom_sort(i1,v1,i2,v2){ # custom function to compare 2 crucial fields
l1=length(v1); l2=length(v2); # getting length of both fields
if (l1 == l2) {
return (v1 > v2)? 1:-1 # compare characters if field lengths are equal
} else {
return l2 - l1 # otherwise - compare by length (descending)
}
} NR==1 { $0=$0 FS "newID" } # add new column
NR>1 { a[1]=$4; a[2]=$5; asort(a,b,"custom_sort"); # sort the last 2 columns using function `custom_sort`
$(NF+1) = sprintf("chr%s:%s:%s:%s",$1,$3,b[1],b[2])
}1' file.txt | column -t
The output:
id chr pos alleleA alleleB newID
1 01 1234 CT T chr1:1234:CT:T
2 02 5678 G A chr2:5678:A:G
3 03 8901 T C chr3:8901:C:T
4 04 12345 C G chr4:12345:C:G
5 05 567890 T A chr5:567890:A:T
Perl to the rescue:
perl -lane '
if (1 == $.) { print "$_ newID" }
else { print "$_ ", join ":", "chr" . ($F[1] =~ s/^0//r),
$F[2],
sort { length $b <=> length $a
or $a cmp $b
} @F[3,4];
}' -- input.txt
-l removes newlines from input and adds them to print
-n reads the input line by line
-a splits each input line on whitespace into the @F array
$. is the input line number, the condition just prints the header for the first line
s/^0// removes the initial zero from $F[1] (i.e. column 2)
/r returns the result of the substitution
the lengths of the last two columns are compared; if they are the same, string comparison is used.
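If gawk's asort is not available, the same ordering can be done in plain awk by swapping the two alleles by hand. This is a sketch against a sample file I name snps.txt here; the swap condition mirrors the comparator above (longer first, ties broken alphabetically):

```shell
# build a sample input file (same columns as in the question)
printf '%s\n' \
  'id chr pos alleleA alleleB' \
  '1 01 1234 CT T' \
  '2 02 5678 G A' \
  '3 03 8901 T C' \
  '4 04 12345 C G' \
  '5 05 567890 T A' > snps.txt

awk 'NR == 1 { print $0, "newID"; next }
{
    a = $4; b = $5
    # longer allele first; equal lengths fall back to alphabetical order
    if (length(b) > length(a) || (length(b) == length(a) && b < a)) { t = a; a = b; b = t }
    chr = $2; sub(/^0+/, "", chr)          # strip the leading zero from chr
    print $0, "chr" chr ":" $3 ":" a ":" b
}' snps.txt | column -t
```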

bash: cat the first lines of a file & get position

I have a very big file that contains n lines of text (with n < 1000) at the beginning, then an empty line, then lots of untyped binary data.
I would like to extract the first n lines of text, and then somehow extract the exact offset of the binary data.
Extracting the first lines is simple, but how can I get the offset? bash is not encoding-aware, so just counting the number of characters is senseless.
grep has an option -b to output the byte offset.
Example:
$ hexdump -C foo
00000000 66 6f 6f 0a 0a 62 61 72 0a |foo..bar.|
00000009
$ grep -b "^$" foo
4:
$ hexdump -s 5 -C foo
00000005 62 61 72 0a |bar.|
00000009
In the last step I used 5 instead of 4 to skip the newline.
Also works with umlauts (äöü) in the file.
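To then actually split the file at that point, the number from grep -b can be fed straight to tail -c (a sketch; -m1 assumes GNU grep, and the +2 accounts for tail -c being 1-based plus skipping the blank line's own newline):

```shell
printf 'foo\n\nbar\n' > foo                    # text part, blank line, "binary" part
offset=$(grep -b -m1 '^$' foo | cut -d: -f1)   # 0-based byte offset of the blank line
tail -c +$((offset + 2)) foo                   # everything after the blank line
```

Because both grep -b and tail -c count bytes, not characters, this stays correct regardless of the text encoding.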
Use grep to find the empty line
grep -n "^$" your_file | tr -d ':'
Optionally use tail -n 1 if you want the last empty line (that is, if the top part of the file can contain empty lines before the binary stuff starts).
Use head to get the top part of the file.
head -n $num
You might want to use tools like hexdump or od to retrieve binary offsets instead of bash.
Perl can tell you where you are in a file:
pos=$( perl -le '
open $fh, "<", $ARGV[0];
$/ = ""; # read the file in "paragraphs"
$first_paragraph = <$fh>;
print tell($fh)
' filename )
Parenthetically, I was attempting to one-liner this:
pos=$( perl -00 -lne 'if ($. == 2) {print tell(___what?___); exit}' filename )
What is the "current filehandle" variable? I couldn't find it in the docs.
