How to read null-terminated strings in pairs using bash

Let's say I have a command genpairs which generates null-terminated strings:
key1 \0 val1 \0 key2 \0 val2 \0
I want to read the above input into bash variables in pairs. The following doesn't work for me:
genpairs() { #for the demo
printf "%d\0x\0" 1
printf "%d\0y\0" 2
printf "%d\0z\0" 3
}
#the above generates 1 \0 x \0 2 \0 y \0 3 \0 z \0 etc...
while IFS= read -r -d '' key val; do
echo "key:[$key] val:[$val]"
done < <(genpairs)
prints
key:[1] val:[]
key:[x] val:[]
key:[2] val:[]
key:[y] val:[]
key:[3] val:[]
key:[z] val:[]
i.e., read doesn't split the input on the $'\0' into the two variables.
The wanted output:
key:[1] val:[x]
key:[2] val:[y]
key:[3] val:[z]
How to read null-terminated input into multiple variables?
EDIT: I updated the question with a better demo (distinct x, y, z values).
I can solve it as:
n=0
while IFS= read -r -d '' inp; do
if (( n % 2 ))
then
val="$inp"
echo "key:[$key] val:[$val]"
else
key="$inp"
fi
let n++
done < <(genpairs)
This prints the wanted output:
key:[1] val:[x]
key:[2] val:[y]
key:[3] val:[z]
but that looks like a really terrible solution to me...

Just use two read statements. With -d '', the NUL is the record terminator, not a field separator: each read consumes one whole NUL-terminated token, and splitting it between key and val would need a separator in IFS (which you've emptied, and which couldn't hold a NUL anyway). So run read twice per loop iteration:
while IFS= read -r -d '' key && IFS= read -r -d '' val; do
echo "key:[$key] val:[$val]"
done < <(genpairs)
With Bash ≥ 4.4, you can also use mapfile with its -d switch:
while mapfile -n 2 -d '' ary && ((${#ary[@]}>=2)); do
echo "key:[${ary[0]}] val:[${ary[1]}]"
done < <(genpairs)
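If you want the pairs collected rather than just printed, here is a minimal sketch building on the two-read pattern (assumes Bash ≥ 4 for associative arrays; the name pairs is mine):
declare -A pairs
while IFS= read -r -d '' key && IFS= read -r -d '' val; do
  pairs[$key]=$val
done < <(genpairs)
declare -p pairs   # declare -A pairs=([1]="x" [2]="y" [3]="z" ) -- order may vary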

Related

Loop through table and parse multiple arguments to scripts in Bash

I am in a situation similar to this one and having difficulties implementing this kind of solution for my situation.
I have file.tsv formatted as follows:
x y
dog woof
CAT meow
loud_goose honk-honk
duck quack
with a fixed number of columns (but a variable number of rows). I need to loop over those pairs of values, skipping the header row, in a script like the following (pseudocode):
for elements in list; do
./script1 elements[1] elements[2]
./script2 elements[1] elements[2]
done
so that script* can take the arguments from the pair and run with it.
Is there a way to do it in Bash?
I was thinking I could do something like this:
list1={`awk 'NR > 1{print $1}' file.tsv`}
list2={`awk 'NR > 1{print $2}' file.tsv`}
and then to call them in the loop based on their position, but I am not sure on how.
Thanks!
Shell arrays are not multi-dimensional, so a single array element cannot store both arguments for your scripts. However, since you are processing lines from file.tsv, you can iterate over each line, reading both elements at once, like this:
#!/usr/bin/env sh
# Populate tab with a tab character. printf '\t' emits no trailing
# newline, so nothing needs to be stripped afterwards.
tab="$(printf '\t')"
{
# Read first line in dummy variable _ to skip header
read -r _
# Iterate reading tab delimited x and y from each line
while IFS="$tab" read -r x y || [ -n "$x" ]; do
./script1 "$x" "$y"
./script2 "$x" "$y"
done
} < file.tsv # from this file
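If you are running bash rather than plain sh anyway, the tab assignment isn't needed, since bash understands $'\t' directly. A minimal sketch of the same loop under that assumption:
#!/usr/bin/env bash
{
  read -r _   # skip the header line
  while IFS=$'\t' read -r x y || [ -n "$x" ]; do
    ./script1 "$x" "$y"
    ./script2 "$x" "$y"
  done
} < file.tsv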
You could try just a while + read loop with the -a flag and IFS.
#!/usr/bin/env bash
while IFS=$' \t' read -ra line; do
echo ./script1 "${line[0]}" "${line[1]}"
echo ./script2 "${line[0]}" "${line[1]}"
done < <(tail -n +2 file.tsv)
Or without the tail
#!/usr/bin/env bash
skip=0 start=-1
while IFS=$' \t' read -ra line; do
if ((start++ >= skip)); then
echo ./script1 "${line[0]}" "${line[1]}"
echo ./script2 "${line[0]}" "${line[1]}"
fi
done < file.tsv
Remove the echos once you're satisfied with the output.

How to match 0/1 coded values to a key provided in the same file, and rewrite as a line (instead of a list), in bash

I have an input file, over 1,000,000 lines long which looks something like this:
G A 0|0:2,0:2:3:0,3,32
G A 0|1:2,0:2:3:0,3,32
G C 1|1:0,1:1:3:32,3,0
C G 1|1:0,1:1:3:32,3,0
A G 1|0:0,1:1:3:39,3,0
For my purposes, everything after the first : in the third field is irrelevant (but I left it in as it'll affect the code).
The first field defines the values coded as 0 in the third, and the second field defines the values coded as 1
So, for example:
G A 0|0 = G|G
G A 1|0 = A|G
G A 1|1 = A|A
etc.
I first need to decode the third field, and then convert it from a vertical list to a horizontal list of values, with the values before the | on one line, and the values after on a second line.
So the example at the top would look like this:
HAP0 GGCGG
HAP1 GACGA
I've been working in bash, but any other suggestions are welcome. I have a script which does the job - but it's incredibly slow and long-winded and I'm sure there's a better way.
echo "HAP0 " > output.txt
echo "HAP1 " >> output.txt
while IFS=$'\t' read -a array; do
ref=${array[0]}
alt=${array[1]}
data=${array[2]}
IFS=$':' read -a code <<< $data
IFS=$'|' read -a hap <<< ${code[0]}
if [[ "${hap[0]}" -eq 0 ]]; then
sed -i "1s/$/${ref}/" output.txt
elif [[ "${hap[0]}" -eq 1 ]]; then
sed -i "1s/$/${alt}/" output.txt
fi
if [[ "${hap[1]}" -eq 0 ]]; then
sed -i "2s/$/${ref}/" output.txt
elif [[ "${hap[1]}" -eq 1 ]]; then
sed -i "2s/$/${alt}/" output.txt
fi
done < input.txt
Suggestions?
Instead of running sed in a subshell, use parameter expansion.
#!/bin/bash
printf '%s ' HAP0 > tmp0
printf '%s ' HAP1 > tmp1
while read -a cols ; do
indexes=${cols[2]}
indexes=${indexes%%:*}
idx0=${indexes%|*}
idx1=${indexes#*|}
printf '%s' ${cols[idx0]} >> tmp0
printf '%s' ${cols[idx1]} >> tmp1
done < "$1"
cat tmp0
printf '\n'
cat tmp1
printf '\n'
rm tmp0 tmp1
The script creates two temporary files: one accumulates the first output line, the other the second.
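The trick is that the 0/1 haplotype codes double as indexes into the cols array, so no if/else is needed. A quick trace with one input line:
cols=(G A "0|1:2,0:2:3:0,3,32")
indexes=${cols[2]}                     # 0|1:2,0:2:3:0,3,32
indexes=${indexes%%:*}                 # 0|1
idx0=${indexes%|*} idx1=${indexes#*|}  # 0 and 1
echo "${cols[idx0]}|${cols[idx1]}"     # G|A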
Or use Perl for an even faster solution:
#!/usr/bin/perl
use warnings;
use strict;
my @haps;
while (<>) {
my @cols = split /[\s|:]+/, $_, 5;
$haps[$_] .= $cols[ $cols[ $_ + 2 ] ] for 0, 1;
}
print "HAP$_ $haps[$_]\n" for 0, 1;

bash while loop "eats" my space characters

I am trying to parse a huge text file, say 200 MB.
The text file contains some strings:
123
1234
12345
 12345
so my script looked like
while read line ; do
echo "$line"
done <textfile
However, using the above method, my string " 12345" gets truncated to "12345".
I tried using
sed -n "$i"p textfile
but then the throughput dropped from 27 to 0.2 lines per second, which is unacceptable ;-)
Any idea how to solve this?
You want to read the lines without a field separator:
while IFS="" read line; do
echo "$line"
done <<< " 12345"
When you also want to skip interpretation of special characters, use
while IFS="" read -r line; do
echo "$line"
done <<< " 12345"
You can write the IFS without double quotes:
while IFS= read -r line; do
echo "$line"
done <<< " 12345"
This seems to be what you're looking for:
while IFS= read line; do
echo "$line"
done < textfile
The safest method is to use read -r rather than plain read, so that backslash escapes are not interpreted (thanks Walter A):
while IFS= read -r line; do
echo "$line"
done < textfile
OPTION 1:
#!/bin/bash
# read whole file into array
readarray -t aMyArray < <(cat textfile)
# echo each line of the array
# this will preserve spaces
for i in "${aMyArray[#]}"; do echo "$i"; done
readarray -- read lines from standard input
-t -- omit trailing newline character
aMyArray -- name of array to store file in
< <() -- execute command; redirect stdout into array
cat textfile -- file you want to store in variable
for i in "${aMyArray[#]}" -- for every element in aMyArray
"" -- needed to maintain spaces in elements
${ [#]} -- reference all elements in array
do echo "$i"; -- for every iteration of "$i" echo it
"" -- to maintain variable spaces
$i -- equals each element of the array aMyArray as it cycles through
done -- close for loop
OPTION 2:
In order to accommodate your larger file you could do this to help alleviate the work and speed up the processing.
#!/bin/bash
sSearchFile=textfile
sSearchStrings="1|2|3|space"
while IFS= read -r line; do
echo "${line}"
done < <(egrep "${sSearchStrings}" "${sSearchFile}")
This will grep the file (faster) before it cycles it through the while command. Let me know how this works for you. Notice you can add multiple search strings to the $sSearchStrings variable.
OPTION 3:
and an all in one solution to have a text file with your search criteria and everything else combined...
#!/bin/bash
# identify the file containing the search strings
sSearchStringsFile="searchstrings.file"
sSearchStrings=""
while IFS= read -r string; do
# if $sSearchStrings is still empty, start it with $string
[[ -z $sSearchStrings ]] && sSearchStrings="${string}" && continue
# otherwise append "|" and the next $string
sSearchStrings="${sSearchStrings}|${string}"
# read search criteria in from the file
done <"${sSearchStringsFile}"
# identify file to be searched
sSearchFile="text.file"
while IFS= read -r line; do
echo "${line}"
done < <(egrep "${sSearchStrings}" "${sSearchFile}")

ORD and CHR a file in Bash

I built ord and chr functions and they work just fine.
But if I take a file that contains \n, for example:
hello
CHECK THIS HIT
YES
when I ord everything I don't get any new line values. Why is that? I'm writing in Bash.
Here is the code that I am using:
function ord {
ordr="`printf "%d\n" \'$1`"
}
TEXT="`cat $1`"
for (( i=0; i<${#TEXT}; i++ ))
do
ord "${TEXT:$i:1}"
echo "$ordr"
done
Your ord function is really weird. Maybe it would be better to write it as:
function ord {
printf -v ordr "%d" "'$1"
}
Then you would use it as:
TEXT=$(cat "$1")
for (( i=0; i<${#TEXT}; i++ )); do
ord "${TEXT:$i:1}"
printf '%s\n' "$ordr"
done
This still leaves two problems: you won't be able to have null bytes and you won't see trailing newlines. For example (I called your script banana and chmod +x banana):
$ ./banana <(printf 'a\0b\n')
97
98
Two problems show up here: the null byte is removed by Bash in the TEXT=$(cat "$1") step, since a Bash variable can't contain null bytes. Moreover, this step also trims trailing newlines.
A more robust approach would be to use read:
while IFS= read -r -n 1 -d '' char; do
ord "$char"
printf '%s\n' "$ordr"
done < "$1"
With this modification:
$ ./banana <(printf 'a\0b\n')
97
0
98
10
Note that this script depends on your locale. With my locale (LANG="en_US.UTF-8"):
$ ./banana <(printf 'a\0ℂ\n')
97
0
8450
10
whereas:
$ LANG= ./banana <(printf 'a\0ℂ\n')
97
0
226
132
130
10
That's to show you that Bash doesn't read bytes, but characters. So depending on how you want Bash to treat your data, set LANG accordingly.
If your script only does that, it's much simpler to not use an ord function at all:
#!/bin/bash
while IFS= read -r -n 1 -d '' char; do
printf '%d\n' "'$char"
done < "$1"
It's that simple!
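The question also asks about chr, which none of the above shows. A companion sketch in the same style (chrr mirrors the ordr naming above and is made up here; it works for byte values 1-255, since a variable can't hold a NUL):
function chr {
  printf -v chrr "\\$(printf '%03o' "$1")"
}
chr 97; printf '%s\n' "$chrr"   # prints: a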

How to perform a for loop on each character in a string in Bash?

I have a variable like this:
words="这是一条狗。"
I want to make a for loop on each of the characters, one at a time, e.g. first character="这", then character="是", character="一", etc.
The only way I know is to output each character to separate line in a file, then use while read line, but this seems very inefficient.
How can I process each character in a string through a for loop?
You can use a C-style for loop:
foo=string
for (( i=0; i<${#foo}; i++ )); do
echo "${foo:$i:1}"
done
${#foo} expands to the length of foo. ${foo:$i:1} expands to the substring starting at position $i of length 1.
With sed in a dash shell under LANG=en_US.UTF-8, I got the following working right:
$ echo "你好嗎 新年好。全型句號" | sed -e 's/\(.\)/\1\n/g'
你
好
嗎
新
年
好
。
全
型
句
號
and
$ echo "Hello world" | sed -e 's/\(.\)/\1\n/g'
H
e
l
l
o
w
o
r
l
d
Thus, output can be looped with while read ... ; do ... ; done
Edited to translate the sample text into English:
"你好嗎 新年好。全型句號" is zh_TW.UTF-8 encoding for:
"你好嗎" = How are you[ doing]
" " = a normal space character
"新年好" = Happy new year
"。全型空格" = a double-byte-sized full-stop followed by text description
${#var} returns the length of var
${var:pos:N} returns N characters from pos onwards
Examples:
$ words="abc"
$ echo ${words:0:1}
a
$ echo ${words:1:1}
b
$ echo ${words:2:1}
c
so it is easy to iterate.
another way:
$ grep -o . <<< "abc"
a
b
c
or
$ grep -o . <<< "abc" | while read letter; do echo "my letter is $letter" ; done
my letter is a
my letter is b
my letter is c
I'm surprised no one has mentioned the obvious bash solution utilizing only while and read.
while read -n1 character; do
echo "$character"
done < <(echo -n "$words")
Note the use of echo -n to avoid the extraneous newline at the end. printf is another good option and may be more suitable for your particular needs. If you want to ignore whitespace then replace "$words" with "${words// /}".
Another option is fold. Please note however that it should never be fed into a for loop. Rather, use a while loop as follows:
while read char; do
echo "$char"
done < <(fold -w1 <<<"$words")
The primary benefit of using the external fold command (from the coreutils package) is brevity. You can feed its output to another command such as xargs (part of the findutils package) as follows:
fold -w1 <<<"$words" | xargs -I% -- echo %
You'll want to replace the echo command used in the example above with the command you'd like to run against each character. Note that xargs will discard whitespace by default. You can use -d '\n' to disable that behavior.
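For instance, with GNU xargs (the brackets are only there to make preserved whitespace visible):
fold -w1 <<<"$words" | xargs -d '\n' -I% -- echo "[%]"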
Internationalization
I just tested fold with some of the Asian characters and realized it doesn't have Unicode support. So while it is fine for ASCII needs, it won't work for everyone. In that case there are some alternatives.
I'd probably replace fold -w1 with an awk array:
awk 'BEGIN{FS=""} {for (i=1;i<=NF;i++) print $i}'
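Dropped into the same while loop, that might look like this (a sketch; an empty FS splits records into characters in GNU awk):
while read -r char; do
  echo "$char"
done < <(awk 'BEGIN{FS=""} {for (i=1;i<=NF;i++) print $i}' <<<"$words")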
Or the grep command mentioned in another answer:
grep -o .
Performance
FYI, I benchmarked the aforementioned options. The first two were fast, nearly tying, with the fold loop slightly faster than the while loop. Unsurprisingly, xargs was the slowest... 75x slower.
Here is the (abbreviated) test code:
words=$(python -c 'from string import ascii_letters as l; print(l * 100)')
testrunner(){
for test in test_while_loop test_fold_loop test_fold_xargs test_awk_loop test_grep_loop; do
echo "$test"
(time for (( i=1; i<$((${1:-100} + 1)); i++ )); do "$test"; done >/dev/null) 2>&1 | sed '/^$/d'
echo
done
}
testrunner 100
Here are the results:
test_while_loop
real 0m5.821s
user 0m5.322s
sys 0m0.526s
test_fold_loop
real 0m6.051s
user 0m5.260s
sys 0m0.822s
test_fold_xargs
real 7m13.444s
user 0m24.531s
sys 6m44.704s
test_awk_loop
real 0m6.507s
user 0m5.858s
sys 0m0.788s
test_grep_loop
real 0m6.179s
user 0m5.409s
sys 0m0.921s
I believe there is still no ideal solution that would correctly preserve all whitespace characters and is fast enough, so I'll post my answer. Using ${foo:$i:1} works, but is very slow, which is especially noticeable with large strings, as I will show below.
My idea is an expansion of a method proposed by Six, which involves read -n1, with some changes to keep all characters and work correctly for any string:
while IFS='' read -r -d '' -n 1 char; do
# do something with $char
done < <(printf %s "$string")
How it works:
IFS='' - Redefining internal field separator to empty string prevents stripping of spaces and tabs. Doing it on a same line as read means that it will not affect other shell commands.
-r - Means "raw", which prevents read from treating \ at the end of the line as a special line concatenation character.
-d '' - Passing empty string as a delimiter prevents read from stripping newline characters. Actually means that null byte is used as a delimiter. -d '' is equal to -d $'\0'.
-n 1 - Means that one character at a time will be read.
printf %s "$string" - Using printf instead of echo -n is safer, because echo treats -n and -e as options. If you pass "-e" as a string, echo will not print anything.
< <(...) - Passing string to the loop using process substitution. If you use here-strings instead (done <<< "$string"), an extra newline character is appended at the end. Also, passing string through a pipe (printf %s "$string" | while ...) would make the loop run in a subshell, which means all variable operations are local within the loop.
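As a quick sanity check that spaces, tabs and newlines all survive, here is a sketch that prints each character's code instead:
string=$' a\tb\nc '
while IFS='' read -r -d '' -n 1 char; do
  printf '%d ' "'$char"
done < <(printf %s "$string")
echo   # prints: 32 97 9 98 10 99 32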
Now, let's test the performance with a huge string.
I used the following file as a source:
https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt
The following script was called through time command:
#!/bin/bash
# Saving contents of the file into a variable named `string'.
# This is for test purposes only. In real code, you should use
# `done < "filename"' construct if you wish to read from a file.
# Using `string="$(cat makefiles.txt)"' would strip trailing newlines.
IFS='' read -r -d '' string < makefiles.txt
while IFS='' read -r -d '' -n 1 char; do
# remake the string by adding one character at a time
new_string+="$char"
done < <(printf %s "$string")
# confirm that new string is identical to the original
diff -u makefiles.txt <(printf %s "$new_string")
And the result is:
$ time ./test.sh
real 0m1.161s
user 0m1.036s
sys 0m0.116s
As we can see, it is quite fast.
Next, I replaced the loop with one that uses parameter expansion:
for (( i=0 ; i<${#string}; i++ )); do
new_string+="${string:$i:1}"
done
The output shows exactly how bad the performance loss is:
$ time ./test.sh
real 2m38.540s
user 2m34.916s
sys 0m3.576s
The exact numbers may vary between systems, but the overall picture should be similar.
I've only tested this with ASCII strings, but you could do something like:
while test -n "$words"; do
c=${words:0:1} # Get the first character
echo character is "'$c'"
words=${words:1} # trim the first character
done
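Note that this loop consumes $words as it goes; if you still need the original value afterwards, work on a copy, e.g.:
tmp=$words
while test -n "$tmp"; do
  c=${tmp:0:1}
  echo character is "'$c'"
  tmp=${tmp:1}
done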
It is also possible to split the string into a character array using fold and then iterate over this array:
for char in `echo "这是一条狗。" | fold -w1`; do
echo $char
done
The C-style loop in @chepner's answer is in the shell function update_terminal_cwd, and the grep -o . solution is clever, but I was surprised not to see a solution using seq. Here's mine:
read word
for i in $(seq 1 ${#word}); do
echo "${word:i-1:1}"
done
#!/bin/bash
word=$(echo 'Your Message' |fold -w 1)
for letter in ${word} ; do echo "${letter} is a letter"; done
Here is the output:
Y is a letter
o is a letter
u is a letter
r is a letter
M is a letter
e is a letter
s is a letter
s is a letter
a is a letter
g is a letter
e is a letter
To iterate over the ASCII characters of a string in a POSIX-compliant shell, you can avoid external tools by using parameter expansions:
#!/bin/sh
str="Hello World!"
while [ ${#str} -gt 0 ]; do
next=${str#?}
echo "${str%$next}"
str=$next
done
or
str="Hello World!"
while [ -n "$str" ]; do
next=${str#?}
echo "${str%$next}"
str=$next
done
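To see why this works, trace a single step (assuming str=abc):
str=abc
next=${str#?}           # bc -- drop the first character
echo "${str%"$next"}"   # a  -- strip that suffix, leaving the first character
str=$next               # continue with bc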
sed works with Unicode:
IFS=$'\n'
for z in $(sed 's/./&\n/g' <(printf '你好嗎')); do
echo hello: "$z"
done
outputs
hello: 你
hello: 好
hello: 嗎
Another approach, if you don't care about whitespace being ignored:
for char in $(sed -E s/'(.)'/'\1 '/g <<<"$your_string"); do
# Handle $char here
done
Another way is:
Characters="TESTING"
index=1
while [ $index -le ${#Characters} ]
do
echo ${Characters} | cut -c${index}-${index}
index=$(expr $index + 1)
done
fold and while read are great for the job, as shown in some answers here. Unlike those answers, I think it's much more intuitive to pipe in the order of execution:
echo "asdfg" | fold -w 1 | while read c; do
echo -n "$c "
done
Outputs: a s d f g
I share my solution:
read word
for char in $(grep -o . <<<"$word") ; do
echo $char
done
TEXT="hello world"
for i in {1..${#TEXT}}; do
echo ${TEXT[i]}
done
where {1..N} is an inclusive range
${#TEXT} is a number of letters in a string
${TEXT[i]} - you can get char from string like an item from an array
