Bash split stdin by null and pipe to pipeline

I have a stream that is null delimited, with an unknown number of sections. For each delimited section I want to pipe it into another pipeline until the last section has been read, and then terminate.
In practice, each section is very large (~1GB), so I would like to do this without reading each section into memory.
For example, imagine I have the stream created by:
for I in {3..5}; do seq $I; echo -ne '\0'; done
I'll get a stream that looks like:
1
2
3
^@1
2
3
4
^@1
2
3
4
5
^@
When piped through cat -v.
I would like to pipe each section through paste -sd+ | bc, so I get a stream that looks like:
6
10
15
This is simply an example. In actuality the stream is much larger and the pipeline is more complicated, so solutions that don't rely on streams are not feasible.
I've tried something like:
set -eo pipefail
while head -zn1 | head -c-1 | ifne -n false | paste -sd+ | bc; do :; done
but I only get
6
10
If I leave off bc I get
1+2+3
1+2+3+4
1+2+3+4+5
which is basically correct. This leads me to believe that the issue is potentially related to buffering and the way each process is actually interacting with the pipes between them.
Is there some way to fix the way that these commands exchange streams so that I can get the desired output? Or, alternatively, is there a way to accomplish this with other means?
In principle this is related to this question, and I could certainly write a program that reads stdin into a buffer, looks for the null character, and pipes the output to a spawned subprocess, as the accepted answer does for that question. Given the general support of streams and null delimiters in bash, I'm hoping to do something that's a little more "native". In particular, if I want to go this route, I'll have to escape the pipeline (paste -sd+ | bc) in a string instead of just letting the same shell interpret it. There's nothing too inherently bad about this, but it's a little ugly and will require a bunch of somewhat error prone escaping.
Edit
As was pointed out in an answer, head makes no guarantees about how much it buffers. Unless it only buffers a single byte at a time, which would be impractical, this will never work. Thus, it seems like the only options are to read each section into memory, or to write a dedicated program.

The issue with your original code is that head doesn't guarantee that it won't read more than it outputs. Thus, it can consume more than one (NUL-delimited) chunk of input, even if it's emitting only one chunk of output.
read, by contrast, guarantees that it won't consume more than you ask it for.
set -o pipefail
while IFS= read -r -d '' line; do
    bc <<<"${line//$'\n'/+}"
done < <(build_a_stream)
If you want native logic, there's nothing more native than just writing the whole thing in shell.
Calling external tools -- including bc, cut, paste, or others -- involves a fork() penalty. If you're only processing small amounts of data per invocation, the efficiency of the tools is overwhelmed by the cost of starting them.
while read -r -d '' -a numbers; do    # read up to the next NUL into an array
    sum=0                             # initialize an accumulator
    for number in "${numbers[@]}"; do # iterate over that array
        (( sum += number ))           # ...using an arithmetic context for our math
    done
    printf '%s\n' "$sum"
done < <(build_a_stream)
For all of the above, I tested with the following build_a_stream implementation:
build_a_stream() {
    local i j IFS=$'\n'
    local -a numbers
    for ((i=3; i<=5; i++)); do
        numbers=( )
        for ((j=1; j<=i; j++)); do
            numbers+=( "$j" )
        done
        printf '%s\0' "${numbers[*]}"
    done
}

As discussed, the only real solution seemed to be writing a program to do this specifically. I wrote one in Rust called xstream-util. After installing it with cargo install xstream-util, you can pipe the input into
xstream -0 -- bash -c 'paste -sd+ | bc'
to get the desired output
6
10
15
It doesn't avoid having to run the pipeline through bash, so a complicated pipeline still needs escaping. Also, it currently only supports single-byte delimiters.
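If the pipeline is complicated enough for the quoting to hurt, one workaround is to put it in its own small script (a sketch only; sum_section.sh is a hypothetical name) and pass that to xstream instead of a bash -c string:
#!/usr/bin/env bash
# sum_section.sh - the per-section pipeline kept in its own file,
# so nothing has to be escaped on the xstream command line
paste -sd+ | bc
After chmod +x sum_section.sh, the stream can be piped into xstream -0 -- ./sum_section.sh instead.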

Related

variable passing through awk command [duplicate]

This question already has answers here:
How do I set a variable to the output of a command in Bash?
(15 answers)
Closed 1 year ago.
Here's my issue: I have a bunch of fastq.gz files and I need to determine the number of lines in each (this is not the issue), and from that number of lines derive a value that determines a threshold used as a variable further down in the same loop. I browsed but cannot find how to do it. Here's what I have so far:
for file in *R1.fastq*; do
var=echo $(zcat "$file" | $((`wc -l`/400000)))
for i in *Bacter*; do
awk -v var1=$var '{if($2 >= var1) print $0}' ${i} | wc -l >> bacter-filtered.txt
done
done
I get the error message: -bash: 14850508/400000: No such file or directory
Any help would be greatly appreciated!
The problem is in the line
var=echo $(zcat "$file" | $((`wc -l`/400000)))
There are a bunch of shell syntax elements here combined in ways that don't connect up with each other. To keep things straight, I'd recommend splitting it into two separate operations:
lines=$(zcat "$file" | wc -l)
var=$((lines/400000))
(You may also have to do something about the output to bacter-filtered.txt -- it's just going to contain a bunch of numbers, with no identification of which ones come from which files. Also, since it always appends, if you run this twice you'll have the output from both runs stuck together. You might want to replace all those appends with a single > bacter-filtered.txt after the last done, so the whole output just gets stored directly.)
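Putting those pieces together, a minimal sketch of the corrected loop might look like this (the awk filter, the 400000 divisor and the *Bacter* glob are taken from the question; labelling each count with the file names is just a suggestion):
for file in *R1.fastq*; do
    lines=$(zcat "$file" | wc -l)   # count lines in the uncompressed file
    var=$((lines / 400000))         # derive the threshold from the line count
    for i in *Bacter*; do
        # label each count so the output stays identifiable
        printf '%s %s %s\n' "$file" "$i" "$(awk -v var1="$var" '$2 >= var1' "$i" | wc -l)"
    done
done > bacter-filtered.txt          # one redirect instead of appending inside the loop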
What's wrong with the original? Well, let's start with this:
zcat "$file" | $((`wc -l`/400000))
Unless I completely misunderstand, the purpose here is to extract $file (with zcat), count the lines in the result (with wc -l), and divide that by 400000. But since the output of zcat isn't piped directly to wc - it's piped to a complex expression involving wc - it's somewhat ambiguous what should happen, and the behaviour actually differs between shells. In zsh, it does something completely different: it lets wc read from the script's stdin (generally your terminal), divides the result from that by 400000, and then pipes the output of zcat to that ... number?
In bash, it does something closer to what you want: wc actually does read from the output of zcat, so the second part of the pipe essentially turns into:
... | $((14850508/400000))
Now, what I'd expect to happen at this point (and happens in my tests) is that it should evaluate $((14850508/400000)) into 37, giving:
... | 37
which will then try to execute 37 as a command (because it's part of a pipeline, and therefore is supposed to be a command). But for some reason it's apparently not evaluating the division and just trying to execute 14850508/400000 as a command. Which doesn't really work any better or worse than 37, so I guess it doesn't matter much.
So that's where the error is coming from, but there's actually another layer of confusion in the original line. Suppose that internal pipeline was fixed so that it properly output "37" (rather than trying to execute it). The outer structure would then be:
var=echo $(cmdthatprints37)
The $( ) basically means "run the command inside, and substitute its output into the command line here", so that would evaluate to:
var=echo 37
...which, in shell syntax, means "run the command 37 with the variable var set to echo in its environment".
The solution here would be simple. The echo is messing everything up so remove it:
var=$(cmdthatprints37)
...which evaluates to:
var=37
...which is what you want. Except that, as I said above, it'd be better to split it up and do the command bits and the math separately rather than getting them mixed up.
BTW, I'd also recommend some additional double-quoting of shell variables; shellcheck.net will be happy to point out where.

Bash: getting keyboard input with timeout

I have a script that aims to find out which key is pressed. The problem is that I can't get it to react quickly: because of the timeout it either reacts slowly or doesn't react at all.
#!/bin/bash
sudo echo Start
while true
do
file_content=$(sudo timeout 0.5s cat /dev/input/event12 | hexdump)
content_split=$(echo $file_content | tr " " "\n")
word_counter=0
for option in $content_split
do
word_counter=$((word_counter+1))
if [ $word_counter -eq 25 ]
then
case $option in
"0039")echo "<space>";;
"001c")echo "<return>";;
"001e")echo "a";;
"0030")echo "b";;
"002e")echo "c";;
"0020")echo "d";;
"0012")echo "e";;
"0021")echo "f";;
"0022")echo "g";;
"0023")echo "h";;
"0017")echo "i";;
"0024")echo "j";;
"0025")echo "k";;
"0026")echo "l";;
"0032")echo "m";;
"0031")echo "n";;
"0018")echo "o";;
"0019")echo "p";;
"0010")echo "q";;
"0013")echo "r";;
"001f")echo "s";;
"0014")echo "t";;
"0016")echo "u";;
"002f")echo "v";;
"0011")echo "w";;
"002d")echo "x";;
"002c")echo "y";;
"0015")echo "z";;
esac
fi
done
done
Do not run cat under timeout in a loop - that's just an invalid way to look at the problem. No matter how "fast" your program runs, it will always miss some events that way. Overall, the polling approach is simply invalid here.
The Linux philosophy is built around streams that transfer bytes. The always-available streams are stdin, stdout and stderr. They let you pass data from one context to another. The shell's most common operator, |, binds the output of one program to the input of another - and this is the way™ you should work in the shell. The shell's primary use is to "connect" programs together.
So you could do:
# read from the input
sudo cat /dev/input/mouse1 |
# transform the input to hex data one byte at a time so
# we can parse it in **shell**
xxd -c1 -p |
# read one byte and parse it in **shell**
while IFS= read -r line; do
    : parse line
done
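For illustration only, the : parse line placeholder could be filled with a case statement that mirrors the byte-to-key mapping used in the awk sketch below (with the same caveat: every matching byte in the event stream triggers a line of output):
sudo cat /dev/input/mouse1 |
xxd -c1 -p |
while IFS= read -r byte; do
    case $byte in
        "1c") echo "<return>" ;;
        "1e") echo "a" ;;
        "30") echo "b" ;;
        # ... add the remaining key codes here
    esac
done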
but the shell is slow, and while read is very slow. If you want speed (and events from input devices come fast), do not use the shell - use a good programming language: python, perl, ruby, or at least awk; these are common scripting languages. The case $option in construct looks like a mapping from hex values to output strings. I could see:
# save one `cat` process by just calling xxd
sudo xxd -c1 -p /dev/input/mouse1 |
awk 'BEGIN{
    map["1c"]="<return>";
    map["1e"]="a";
    # etc.: add all the other cases to the mapping
}
($0 in map){ print map[$0] }
'

Is there a way for me to simplify these echos? [duplicate]

This question already has answers here:
How do I iterate over a range of numbers defined by variables in Bash?
(20 answers)
Closed 3 years ago.
I am still learning how to shell script, and I have been given a challenge: make it easier to echo "Name1", "Name2" ... "Name15". I'm not too sure where to start; I've had ideas, but I don't want to look silly if I mess it up. Any help?
I haven't actually tried anything just yet; it's all mostly been thought.
#This is what I wrote to start
#!/bin/bash
echo "Name1"
echo "Name2"
echo "Name3"
echo "Name4"
echo "Name5"
echo "Name6"
echo "Name7"
echo "Name8"
echo "Name9"
echo "Name10"
echo "Name11"
echo "Name12"
echo "Name13"
echo "Name14"
echo "Name15"
My expected results are obviously just for it to output "Name1" "Name2" etc. But I'm looking for a more creative way to do it. If possible throw in a few ways to do it so I can learn. Thank you.
The easiest (possibly not the most creative) way to do this is to use printf:
printf "%s\n" name{1..15}
This relies on bash brace expansion {1..15} to have the 15 strings.
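To see the brace expansion by itself, a quick check:
echo Name{1..3}    # expands to: Name1 Name2 Name3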
Use a for loop
for i in {1..15}; do echo "Name$i"; done
A few esoteric solutions, from the least to the most unreasonable:
base64-encoded string:
base64 -d <<<TmFtZTEKTmFtZTIKTmFtZTMKTmFtZTQKTmFtZTUKTmFtZTYKTmFtZTcKTmFtZTgKTmFtZTkKTmFtZTEwCk5hbWUxMQpOYW1lMTIKTmFtZTEzCk5hbWUxNApOYW1lMTUK
The weird string is your expected result encoded in base64, an encoding generally used to represent binary data as text. base64 -d <<< weirdString passes the string as input to the base64 tool and asks it to decode it, which displays your expected result.
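For reference, the encoded string above can be generated the same way (assuming GNU base64; -w0 just disables line wrapping):
printf 'Name%s\n' {1..15} | base64 -w0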
generate an infinite stream of "Name", truncate it, use line numbers:
yes Name | awk 'NR == 16 { exit } { printf("%s%s\n", $0, NR) }'
yes outputs an infinite stream of whatever it's passed as an argument (or y by default, used to automate interactive scripts asking for [y/n] confirmation). The awk command exits once it reaches the 16th line, and otherwise prints its input (provided by yes) followed by the line number. The truncation could just as easily be done with head -15, and I've tried using the nl "number lines" utility or grep -n to number lines, but they always added the line numbers as a prefix, which required an extra re-formatting step.
read random binary data and hope to stumble on all the lines you want to output:
timeout 1d strings /dev/urandom | grep -Eo "Name(1[0-5]|[1-9])" | sort -uV
strings /dev/urandom will extract ASCII sequences from the binary random source /dev/urandom, grep will filter those which respect the format of a line of your expected output, and sort will reorder those lines into the correct order. Since sort needs to have received its whole input before it reorders it, and /dev/urandom won't stop producing data, we use timeout 1d to stop reading from /dev/urandom after a whole day, in the hope that it has sifted through enough random data to find your 15 lines (I'm not sure that's even remotely likely).
use an HTTP client to retrieve this page, extract the bash script you posted and execute it.
my_old_script=$(curl "https://stackoverflow.com/questions/57818680/" | grep "#This is what I wrote to start" -A 18 | tail -n+4)
eval "$my_old_script"
curl is a command line tool that can be used as an HTTP client, grep with its -A 18 parameter will select the "This is what I wrote to start" text and the 18 lines that follow, tail will remove the first 3 lines, and eval will execute your script.
While it will be much more efficient than the previous solution, it's an even less reasonable solution because high-rep users can edit your question to make this solution execute arbitrary code on your computer. Ideally you'd be using an HTML-aware parser rather than basic string manipulation to extract the code, but we're not talking about best practices here...

how to reuse stdout output without saving it to physical disk file

I have a for-loop, like the following:
for inf in $filelist; do
for ((i=0; i<imax; ++i)); do
temp=`<command_1> $inf | <command_2>`
eval set -A array -- $temp
...
done
...
done
Problem is, command_1 is a bit time-consuming and its output is a bit large (900 MB at the highest, depending on how big the input file is). So I modified the script to:
outf="./temp"
for inf in $filelist; do
<command_1> $inf -o $outf
for ((i=0; i<imax; ++i)); do
temp=`cat $outf | <command_2>`
eval set -A array -- $temp
...
done
...
done
There is a little performance improvement, but not as much as I want, probably because disk I/O is a performance bottleneck as well.
Just curious if there is a way to save the stdout output of command_1, so that I could reuse it without saving it to a physical disk file?
don't use pipelines inside nested loops
Based on new comments and another look at the original question, I would strongly recommend against using a pipeline that processes large amounts of data inside a nested loop. Shell pipelines are far from efficient and incur lots of process overhead.
Look at the original problem: this involves looking into the contributions of command_1 and command_2, and seeing whether you could solve it in another way.
That said: here's the original answer:
In the shell there are two ways of storing data: either in a shell variable, or in a file. You might try to store that file in a memory-based filesystem, like /dev/shm on Linux or tmpfs on Solaris.
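For example, a minimal sketch of the memory-backed-file option, keeping the question's <command_1>/<command_2> placeholders; the only change from the second script is where the temporary file lives:
outf=/dev/shm/temp.$$                   # tmpfs-backed path, so no disk I/O
for inf in $filelist; do
    <command_1> $inf -o $outf
    for ((i=0; i<imax; ++i)); do
        temp=`cat $outf | <command_2>`
        eval set -A array -- $temp
        ...
    done
    ...
done
rm -f $outf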
You might also analyse command_1 and command_2 for optimisations. Is there anything in the output of command_1 that's not needed by command_2? Try to put a filter between the two.
Example:
command_1 | awk '{ print $2 }' | command_2
(Assuming command_2 only needs column 2 of command_1's output.)

Handle special characters in bash for...in loop

Suppose I've got a list of files
file1
"file 1"
file2
a for...in loop breaks it up between whitespace, not newlines:
for x in $( ls ); do
echo $x
done
results:
file
1
file1
file2
I want to execute a command on each file. "file" and "1" above are not actual files. How can I do that if the filenames contain things like spaces or commas?
It's a little trickier than I think find -print0 | xargs -0 could handle, because I actually want the command to be something like "convert input/file1.jpg .... output/file1.jpg", so I need to transform the filename in the process.
Actually, Mark's suggestion works fine without even doing anything to the internal field separator. The problem is that running ls in a subshell, whether by backticks or $( ), causes the for loop to be unable to distinguish between spaces in names. Simply using
for f in *
instead of the ls solves the problem.
#!/bin/bash
for f in *
do
echo "$f"
done
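To run a command on each file, as in the convert example from the question, the same glob loop works; here is a sketch (the input/ and output/ directory names are only taken from that example):
#!/bin/bash
for f in input/*.jpg; do
    name=${f##*/}                 # strip the directory part of the path
    convert "$f" "output/$name"
done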
one possible way:
ls -1 | while IFS= read -r x; do
    echo "$x"
done
I know this one is LONG past "answered", and with all due respect to eduffy, I came up with a better way and I thought I'd share it.
What's "wrong" with eduffy's answer isn't that it's wrong, but that it imposes what for me is a painful limitation: there's an implied creation of a subshell when the output of the ls is piped and this means that variables set inside the loop are lost after the loop exits. Thus, if you want to write some more sophisticated code, you have a pain in the buttocks to deal with.
My solution was to take the "readline" function and write a program out of it in which you can specify any specific line number that you may want that results from any given function call. ... As a simple example, starting with eduffy's:
ls_output=$(ls -1)
# The cut extracts just the count from the wc output
declare -i line_count=$(echo "$ls_output" | wc -l | cut -d ' ' -f 1)
declare -i cur_line=1
while [ $cur_line -le $line_count ]; do
    # NONE of the values set in variables inside this loop are trapped in a subshell.
    filename=$(echo "$ls_output" | readline -n $cur_line)
    # Now filename contains one filename from the preceding ls command
    cur_line=cur_line+1
done
Now you have wrapped up all the subshell activity into neat little contained packages and can go about your shell coding without having to worry about the scope of your variable values getting trapped in subshells.
I wrote my version of readline in GNU C; if anyone wants a copy, it's a little big to post here, but maybe we can find a way...
Hope this helps,
RT
