incrementing a number in bash with leading 0 - bash

I am very new to bash scripting and am trying to write some code to parse and manipulate a file that I am working on.
I need to increment and decrement the minute of a time for a bunch of different times in a file. My problem happens when the time is for example 2:04 or 14:00.
File Example:
2:43
2:05
15:00
My current excerpt from my bash script is like this
for x in `cat $1`;
do minute_var=$(echo $x | cut -d: -f2);
incr_min=$(($minute_var + 1 | bc));
echo $incr_min;
done
Current Result:
44
6
1
Required Result:
44
06
01
Any suggestions?

Use printf:
incr_min=$(printf %02d $(($minute_var + 1 )) )
Note that bc is not needed if only integers are involved.
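One caveat with the printf line above: bash arithmetic treats a number with a leading zero as octal, so a minute field of 08 or 09 makes $((...)) fail with a "value too great for base" error. A minimal sketch of a workaround, forcing base 10 with the 10# prefix (the function name next_minute_field is made up for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical helper: increment a two-digit minute field.
# 10#$1 forces base-10 interpretation, so 08 and 09 are not
# rejected as invalid octal numbers.
next_minute_field() {
    printf '%02d\n' "$(( 10#$1 + 1 ))"
}

next_minute_field 05
next_minute_field 08
```

This sketch does not wrap 59 back to 00; that is a separate decision.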

is this ok for your requirement?
kent$ echo "2:43
2:05
15:00"|awk -F: '{$2++;printf "%02d\n", $2}'
44
06
01

while IFS=: read hour min; do
printf "%02d\n" $((10#$min + 1))
done <<END
2:43
2:05
15:00
8:08
0:59
END
44
06
01
09
60
To handle the minute wrapping into the next hour, use a language with time functions, such as gawk or Perl:
awk -F: '{
time = mktime("1970 01 01 " $1 " " $2 " 00")
time += 60
print strftime("%M", time)
}'
perl -MTime::Piece -MTime::Seconds -nle '
    $t = Time::Piece->strptime($_, "%H:%M");
    print +($t + ONE_MINUTE)->strftime("%M");
'
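If pulling in gawk or Perl feels like too much for this, the wrap into the next hour can also be sketched in plain bash arithmetic. This is only an illustration under assumptions: the function name add_minute is invented, the input is expected in H:MM or HH:MM form, and 23:59 is assumed to roll over to 0:00.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add one minute to a time given as H:MM,
# carrying into the hour and wrapping at midnight.
add_minute() {
    local hr min
    IFS=: read -r hr min <<< "$1"
    hr=$(( 10#$hr ))            # 10# avoids the octal trap for 08/09
    min=$(( 10#$min + 1 ))
    if (( min == 60 )); then
        min=0
        hr=$(( (hr + 1) % 24 ))
    fi
    printf '%d:%02d\n' "$hr" "$min"
}

add_minute 2:05
add_minute 0:59
```

Unlike the loops above, this prints the whole time rather than only the minute field, since the hour can change.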

UPDATED #2
There are some problems with your script. First, instead of `cat file` you should use `<file`, or rather $(<file): bash simply opens the file itself, which spares one fork and exec call. Calling cut and bc (and printf) is also unnecessary, as bash has the needed features built in, so you can spare some more forks and execs.
If the input file is large (greater than about 32 KiB), the for-loop line can be too large for bash to process, so I suggest using a while loop instead and reading the file line by line.
I would suggest something like this in pure bash (applying Atle's substring solution):
while IFS=: read hr min; do
incr_min=$((1$min+1)); #Octal problem solved
echo ${incr_min: -2}; #Mind the space before -2!
#or with glennjackman's suggestion to use decimal base
#incr_min=0$((10#$min+1))
#echo ${incr_min: -2};
#or choroba's solution improved to set variable directly
#printf -v incr_min %02d $((10#$min+1))
#echo $incr_min
done <file
Input file
$ cat file
2:43
2:05
15:00
12:07
12:08
12:09
Output:
44
06
01
08
09
10
Maybe printf -v is the simplest, as it puts the result into the variable in a single step.
tripleee raises a good question: what should happen if the result is 60?

Use printf to reformat the output to be zero-padded, 2-wide:
incr_min=$(printf %02d $incr_min)

Here's a solution that
wraps the minutes from 59 to 0
is fully POSIX compliant--no bashisms!
doesn't need a single fork thus is extremely fast
$ cat x
2:43
2:05
2:08
2:09
15:00
15:59
$ while IFS=: read hr min; do
printf '%02d\n' $(((${min#0}+1)%60))
done < x
44
06
09
10
01
00

Try this:
for x in $(<$1); do
printf "%02d\n" $(( (10#${x#*:} + 1) % 60 )); # 10# keeps 08 and 09 from being read as octal
done

Padding with 0, and getting two last characters:
for x in `cat $1`;
do minute_var=$(echo $x | cut -d: -f2);
incr_min=0$(($minute_var + 1 | bc));
echo ${incr_min: -2:2};
done

Related

Bash compute the letter suffix from the split command (i.e. integer into base 26 with letters)

The split command produces by default a file suffix of the form "aa" "ab" ... "by" "bz"...
However in a script, I need to recover this suffix, starting from the file number as an integer (without globbing).
I wrote the following code, but maybe bash wizards here have a more concise solution?
alph="abcdefghijklmnopqrstuvwxyz"
for j in {0..100}; do
# Convert j to the split suffix (aa ab ac ...)
first=$(( j / 26 ))
sec=$(( j % 26 ))
echo "${alph:$first:1}${alph:$sec:1}"
done
Alternatively, I could use bc with the obase variable, but it only outputs one number in case j<26.
bc <<< 'obase=26; 5'
# 05
bc <<< 'obase=26; 31'
# 01 05
Use this Perl one-liner and specify the file numbers (0-indexed) as arguments, for example:
perl -le 'print for ("aa".."zz")[@ARGV]' 0 25 26
Output:
aa
az
ba
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
@ARGV : array of the command-line arguments.
Off the top of my head, relying on 97 being the ASCII code for a:
printf "\x$(printf %x $((97+j/26)))\x$(printf %x $((97+j%26)))\n"
printf "\\$(printf %o $((97+j/26)))\\$(printf %o $((97+j%26)))\n"
awk "BEGIN{ printf \"%c%c\\n\", $((97+j/26)), $((97+j%26))}" <&-
printf %x $((97+j/26)) $((97+j%26)) | xxd -r -p
You could also just write without temporary variables:
echo "${alph:j/26:1}${alph:j%26:1}"
In my use case, I do want to generate the full list
awk should be fast:
awk 'BEGIN{ for (i=0;i<=100;++i) printf "%c%c\n", 97+i/26, 97+i%26}' <&-
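For suffixes longer than two letters (as with split -a 3), the same base-26 idea can be wrapped in a small function. This is a sketch, not something from the answers above: the name split_suffix and its optional length argument are made up, and it assumes a fixed-length suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a 0-based file number into a
# fixed-length, letters-only suffix (aa, ab, ..., az, ba, ...).
# $1 = file number, $2 = suffix length (defaults to 2).
split_suffix() {
    local n=$1 len=${2:-2} alph=abcdefghijklmnopqrstuvwxyz out= i
    for (( i = 0; i < len; i++ )); do
        out=${alph:n%26:1}$out   # prepend the next base-26 digit
        n=$(( n / 26 ))
    done
    printf '%s\n' "$out"
}

split_suffix 0
split_suffix 26
split_suffix 27 3
```

Numbers too large for the requested length silently lose their high digits, so a range check may be worth adding for real use.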

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and clean up the output a little. It seems overly messy, and I'm wondering if the script can be trimmed down a bit.
The input file consists of pairs of lines: names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but will not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Extra Example expected output:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
Another awk solution:
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE
For input where blank lines delimit the records:
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
[[ $line ]] || continue # skip blank lines
[[ -z $name ]] && { name=$line; continue; } # first non-blank line becomes name
number=$line # second one becomes number
if (( number >= 80 && number < 200 )); then
name=${name%%-*} # prune everything after first "-"
printf '%s %s\n' "$name" "$number" # emit our output
fi
name=; number= # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works.
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
perl -nle's/-.*//; $n=<>; print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
With gnu sed
sed -E '
N
/\n[8-9][0-9]$/bA
/\n1[0-9]{2}$/!d
:A
s/([^-]*).*\n([0-9]+$)/\1 \2/
' infile

How to store directory files listing into an array?

I'm trying to store the files listing into an array and then loop through the array again.
Below is what I get when I run ls -ls command from the console.
total 40
36 -rwxrwxr-x 1 amit amit 36720 2012-03-31 12:19 1.txt
4 -rwxrwxr-x 1 amit amit 1318 2012-03-31 14:49 2.txt
The following bash script I've written to store the above data into a bash array.
i=0
ls -ls | while read line
do
array[ $i ]="$line"
(( i++ ))
done
But when I echo $array, I get nothing!
FYI, I run the script this way: ./bashscript.sh
I'd use
files=(*)
And then if you need data about the file, such as size, use the stat command on each file.
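As a rough sketch of the glob approach: the function below collects the entries of a directory into an array and prints their names. The name list_names is invented for illustration; nullglob is turned on so that an empty directory gives an empty array instead of a literal *. Names produced by the glob are not expanded again as long as the array is quoted, so spaces and even a * inside a filename come through intact.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of all entries in a directory.
# $1 = directory (defaults to the current one).
list_names() {
    local dir=${1:-.} path
    local -a files
    shopt -s nullglob            # no match -> empty array, not a literal '*'
    files=("$dir"/*)
    shopt -u nullglob            # note: also clears nullglob if the caller had set it
    for path in "${files[@]}"; do
        printf '%s\n' "${path##*/}"   # strip the leading directory part
    done
}

list_names .
```

Hidden files are skipped, as with any plain * glob.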
Try with:
#! /bin/bash
i=0
while read line
do
array[ $i ]="$line"
(( i++ ))
done < <(ls -ls)
echo ${array[1]}
In your version, the while loop runs in a subshell, so the variables you modify in the loop are not visible outside it.
(Do keep in mind that parsing the output of ls is generally not a good idea at all.)
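The subshell effect is easy to see in isolation. In this small sketch (function names invented for the demonstration), the same counting loop is fed once through a pipe and once through process substitution:

```shell
#!/usr/bin/env bash
# The loop body runs in a subshell because it is part of a pipeline,
# so the increments to n are lost when the pipeline ends.
count_with_pipe() {
    local n=0 line
    printf '%s\n' a b c | while read -r line; do n=$(( n + 1 )); done
    echo "$n"
}

# Here the loop runs in the current shell; only the producer
# command runs in a separate process.
count_with_procsub() {
    local n=0 line
    while read -r line; do n=$(( n + 1 )); done < <(printf '%s\n' a b c)
    echo "$n"
}

count_with_pipe
count_with_procsub
```

With default shell options the first call prints 0 and the second prints 3. (If the lastpipe option is enabled in a non-interactive shell, the piped version behaves differently.)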
Here's a variant that lets you use a regex pattern for the initial filtering; change the regex to get the filtering you desire.
files=($(find -E . -type f -regex "^.*$"))
for item in ${files[*]}
do
printf " %s\n" $item
done
This might work for you:
OIFS=$IFS; IFS=$'\n'; array=($(ls -ls)); IFS=$OIFS; echo "${array[1]}"
Running any shell command inside $(...) will help to store the output in a variable. So using that we can convert the files to array with IFS.
IFS=' ' read -r -a array <<< $(ls /path/to/dir)
You may be tempted to use (*) but what if a directory contains the * character? It's very difficult to handle special characters in filenames correctly.
You can use ls -ls. However, it fails to handle newline characters.
# Store ls -ls as an array
readarray -t files <<< "$(ls -ls)"
for (( i=1; i<${#files[@]}; i++ ))
{
# Convert current line to an array
line=(${files[$i]})
# Get the filename, joining it back together if it contains spaces
fileName=${line[@]:9}
echo $fileName
}
If all you want is the file name, then just use ls:
for fileName in $(ls); do
echo $fileName
done
See this article or this post for more information about some of the difficulties of dealing with special characters in file names.
My two cents
The asker wanted to parse output of ls -ls
Below is what I get when I run ls -ls command from the console.
total 40
36 -rwxrwxr-x 1 amit amit 36720 2012-03-31 12:19 1.txt
4 -rwxrwxr-x 1 amit amit 1318 2012-03-31 14:49 2.txt
But few answers address this parsing operation.
ls's output
Before trying to parse something, we have to ensure the command's output is as consistent, stable and easy to parse as possible.
To ensure the output won't be altered by some alias, you may prefer to specify the full path of the command: /bin/ls.
To avoid variations in the output due to locales, prefix your command with LANG=C LC_ALL=C.
Use the --time-style switch to print UNIX epoch timestamps, which are easier to parse.
Use the -b switch to escape special characters.
So we will prefer
LANG=C LC_ALL=C /bin/ls -lsb --time-style='+%s.%N'
to just
ls -ls
Full bash sample
#!/bin/bash
declare -a bydate=() bysize=() byname=() details=()
declare -i cnt=0 vtotblk=0 totblk
{
read -r _ totblk # ignore 1st line
while read -r blk perm lnk usr grp sze date file;do
byname[cnt]="${file//\\ / }"
details[cnt]="$blk $perm $lnk $usr $grp $sze $date"
bysize[sze]+="$cnt "
bydate[${date/.}]+="$cnt "
cnt+=1 vtotblk+=blk
done
} < <(LANG=C LC_ALL=C /bin/ls -lsb --time-style='+%s.%N')
From there, you could easily sort by dates, sizes or names (names are already sorted by the ls command).
echo "Path '$PWD': Total: $vtotblk, sorted by dates"
for dte in ${!bydate[@]};do
printf -v msec %.3f .${dte: -9}
for idx in ${bydate[dte]};do
read -r blk perm lnk usr grp sze date <<<"${details[idx]}"
printf ' %11d %(%a %d %b %T)T%s %s\n' \
$sze "${date%.*}" ${msec#0} "${byname[idx]}"
done
done
echo "Path '$PWD': Total: $vtotblk, sorted by sizes"
for sze in ${!bysize[@]};do
for idx in ${bysize[sze]};do
read -r blk perm lnk usr grp sze date <<<"${details[idx]}"
printf -v msec %.3f .${date#*.}
printf ' %11d %(%a %d %b %T)T%s %s\n' \
$sze "${date%.*}" ${msec#0} "${byname[idx]}"
done
done
echo "Path '$PWD': Total: $vtotblk, sorted by names"
for((idx=0;idx<cnt;idx++));{
read -r blk perm lnk usr grp sze date <<<"${details[idx]}"
printf -v msec %.3f .${date#*.}
printf ' %11d %(%a %d %b %T)T%s %s\n' \
$sze "${date%.*}" ${msec#0} "${byname[idx]}"
}
(As an aside, you could check whether the total block count printed by ls matches the sum of the per-line block counts:
(( vtotblk == totblk )) ||
echo "WARN: Total blocks: $totblk != Block count: $vtotblk" >&2
Of course, this could be inserted before the first echo "Path...".)
Here is an output sample. (Note: there is a filename with a newline)
Path '/tmp/so': Total: 16, sorted by dates
0 Sun 04 Sep 10:09:18.221 2.txt
247 Mon 05 Sep 09:11:50.322 Filename with\nsp\303\251cials characters
13 Mon 05 Sep 10:12:24.859 1.txt
1313 Mon 05 Sep 11:01:00.855 parseLs.00
1913 Thu 08 Sep 08:20:20.836 parseLs
Path '/tmp/so': Total: 16, sorted by sizes
0 Sun 04 Sep 10:09:18.221 2.txt
13 Mon 05 Sep 10:12:24.859 1.txt
247 Mon 05 Sep 09:11:50.322 Filename with\nsp\303\251cials characters
1313 Mon 05 Sep 11:01:00.855 parseLs.00
1913 Thu 08 Sep 08:20:20.836 parseLs
Path '/tmp/so': Total: 16, sorted by names
13 Mon 05 Sep 10:12:24.859 1.txt
0 Sun 04 Sep 10:09:18.221 2.txt
247 Mon 05 Sep 09:11:50.322 Filename with\nsp\303\251cials characters
1913 Thu 08 Sep 08:20:20.836 parseLs
1313 Mon 05 Sep 11:01:00.855 parseLs.00
And if you want to render the special characters (with care: there could be issues if you don't know who created the contents of the path). But if the folder is yours, you could:
echo "Path '$PWD': Total: $vtotblk, sorted by dates, with special chars"
printf -v spaces '%*s' 37 ''
for dte in ${!bydate[@]};do
printf -v msec %.3f .${dte: -9}
for idx in ${bydate[dte]};do
read -r blk perm lnk usr grp sze date <<<"${details[idx]}"
printf ' %11d %(%a %d %b %T)T%s %b\n' $sze \
"${date%.*}" ${msec#0} "${byname[idx]//\\n/\\n$spaces}"
done
done
Could output:
Path '/tmp/so': Total: 16, sorted by dates, with special chars
0 Sun 04 Sep 10:09:18.221 2.txt
247 Mon 05 Sep 09:11:50.322 Filename with
spécials characters
13 Mon 05 Sep 10:12:24.859 1.txt
1313 Mon 05 Sep 11:01:00.855 parseLs.00
1913 Thu 08 Sep 08:20:20.836 parseLs
Aren't these two lines of code, either using scandir or pulling the directory in on the declaration line, supposed to work?
src_dir="/3T/data/MySQL";
# src_ray=scandir($src_dir);
declare -a src_ray ${src_dir/*.sql}
printf ( $src_ray );
In the conversation over at https://stackoverflow.com/a/9954738/11944425
the behavior can be wrapped into a convenience function which applies some action to entries of the directory as string values.
#!/bin/bash
iterfiles() {
i=0
while read filename
do
files[ $i ]="$filename"
(( i++ ))
done < <( ls -l )
for (( idx=0 ; idx<${#files[@]} ; idx++ ))
do
"$@" "${files[$idx]}" &
wait $!
done
}
where "$@" is the complete list of arguments passed to the function. This lets the function take an arbitrary command, as a partial function of sorts, to operate on each filename:
iterfiles head -n 1 | tee -a header_check.out
When a script needs to iterate over files, returning an array of them is not possible. The workaround is to define the array outside of the function scope (and possibly unset it later) — modifying it inside the function's scope. Then, after the function is called by a script, the array variable becomes available. For instance, the mutation on files demonstrates how this could be done.
declare -a files # or just `files= ` (nothing)
iterfiles() {
# ...
files=...
}
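A variation on the same workaround, sketched under the assumption of bash 4.3 or newer: instead of hard-coding the global name files, the function can receive the name of the caller's array and fill it through a nameref (local -n). The function name collect_files and the _dest variable are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the caller's array with the entries of a directory.
# $1 = name of the array to fill, $2 = directory (defaults to the current one).
# Requires bash 4.3+ for namerefs; the caller's array must not be named _dest.
collect_files() {
    local -n _dest=$1            # _dest becomes an alias for the caller's array
    local dir=${2:-.} path
    _dest=()
    for path in "$dir"/*; do
        [[ -e $path ]] && _dest+=("$path")   # skips the literal pattern when nothing matches
    done
    return 0
}

declare -a found
collect_files found .
echo "${#found[@]} entries"
```

The array still lives in the caller's scope, so nothing is really "returned"; the nameref only removes the need to agree on a global variable name.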
Extending the conversation above, @Jean-BaptistePoittevin pointed out a valuable detail.
#!/bin/bash
# Adding a section to unset certain variable names that
# may already be active in the shell.
unset i
unset files
unset omit
i=0
omit='^([\n]+)$'
while read file
do
files[ $i ]="$file"
(( i++ ))
done < <(ls -l | grep -Pov ${omit} )
Note: This can be tested using echo ${files[0]} or for entry in "${files[@]}"; do ... ; done
Often, the situation requires an absolute path in double quotes, because the file (or one of its ancestor directories) has spaces or unusual characters in its name. find is one answer here. The simplest usage might look like the one above, except that done < <(ls -l ... ) is replaced with:
done < <(find /path/to/directory ! -path /path/to/directory -type d)
When you need absolute paths in double quotes as an iterable collection, it's convenient to use a recipe like the one below. When export is not used, the shell does not update the environment namespace to include it in the find subshell:
#!/bin/bash
export DIRECTORY="$PWD" # For example
declare -a files
i=0
while read -r filename; do
files[ $i ]="$filename"
(( i++ )) # advance the index; otherwise every path overwrites files[0]
done < <(find "$DIRECTORY" ! -path "$DIRECTORY" -type d)
for (( idx=0; idx<${#files[@]}; idx++ )); do
# Make a templated string for macro script generation
quoted_path="\"${files[$idx]}\""
if [[ "$(echo $quoted_path | grep some_substring | wc -c)" != "0" ]]; then
echo "mv $quoted_path /some/other/watched/folder/" >> run_nightly.sh
fi
done
Upon running this, ./run_nightly.sh will be populated with bulk commands to move a quoted path to /some/other/watched/folder/. This kind of scripting pattern will make it possible to supercharge your scripts.
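Wrapping a path in literal double quotes breaks as soon as the path itself contains a double quote, a dollar sign or a backtick. A possibly more robust sketch uses bash's printf %q, which escapes a string so that the shell reads it back as a single word. The function name emit_mv is invented here; the destination folder is the one from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an mv command line with both paths
# escaped by printf %q, so it is safe to append to a generated script.
emit_mv() {
    printf 'mv -- %q %q\n' "$1" "$2"
}

emit_mv '/data/my dir' /some/other/watched/folder/
```

The printed line could then be appended with >> run_nightly.sh in place of the hand-quoted echo.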
You can simply use the for loop below (do not forget the quotes, so that filenames with spaces are handled):
declare -a arr
arr=()
for file in *.txt
do
arr=("${arr[@]}" "$file")
done
Run
for file in "${arr[@]}"
do
echo "<$file>"
done
to test.

redirect stdout to script, so it can be parsed and then sent to stdout

I have a (java) program that prints a line of hex numbers to stdout every 5ish seconds, until the program is terminated by the user.
I would like to redirect that output to a bash script so I could convert each of those hex numbers independently to decimal, then print the parsed line to stdout.
I tried using myProgram | myScript but that did the piping before any lines were printed, then didn't keep listening to stdout. I then tried myProgram > myScript, and that just overwrote the script.
Ideas?
Edit: adding output from the runs (sorry for the poor formatting; I couldn't get it all into the code highlighting, so the middle of the output is not highlighted).
Here is the script
#!/bin/bash
echo $0
echo $#
echo $1
Here is how my program runs when it goes straight to stdout. This would continue forever if I didn't terminate it.
mmmm@mmmm:~/mmmm/mmmm/mmmmm$ java net.tinyos.tools.Listen -comm
serial@/dev/ttyUSB0:micaz
serial@/dev/ttyUSB0:57600: resynchronising
00 FF FF 00 02 04 22 93 00 02 02 C9
00 FF FF 00 03 04 22 93 00 03 03 0E
00 FF FF 00 02 04 22 93 00 03 03 0E
00 FF FF 00 02 04 22 93 00 02 02 C9
^Z
[5]+ Stopped java net.tinyos.tools.Listen -comm
serial@/dev/ttyUSB0:micaz
Here is where I try to pipe it to my script (which I have set to print the number of command-line arguments and the first argument). It just freezes after this...
mmmm@mmmm:~/mmmm/mmmm/mmmmm$ java net.tinyos.tools.Listen -comm serial@/dev/ttyUSB0:micaz | ./parser.sh
./parser.sh
0
serial@/dev/ttyUSB0:57600: resynchronising
Diagnosis
When you use this script like this:
java javaprog | myScript
and myScript contains:
#!/bin/bash
echo $0
echo $#
echo $1
Then the output from the script will be its name (myScript) from the echo $0, the number of arguments it was passed (0) from the echo $#, and the first argument (an empty line is echoed) from the echo $1. The script then exits (successfully). The issue is nothing to do with buffering; it is all to do with the script not reading anything from its standard input. Even a trivial modification would be an improvement:
#!/bin/bash
while read data; do echo $data; done
That's a slower form of cat, except that it normalizes random sequences of spaces and tabs into single spaces, stripping leading and trailing spaces off the line. It would at least demonstrate the script processing the output from the Java program.
Trying awk
To do what you're after, you should probably replace that with an awk program or something similar. This is a first draft, but it stands some chance of working:
awk '{ for (i = 1; i <= NF; i++) { x = "0x" $i + 0; printf(" %d", x); } printf "\n"; }'
This says 'for each line (because there is no pattern before the open brace), for each of the fields 1..NF, convert the field into an explicit hex string by adding the 0x prefix and then adding 0, then print the value as a decimal number' (trusting awk to convert a string such as '0xC9' to a number).
Using Perl
Unfortunately, a little testing shows that this does not work; the problem is getting a value other than 0 for x. So, ... time to fall back on Perl in awk-emulation mode:
$ echo '00 C9 28 13 A0 FF 01' |
> perl -na -e 'for ($i = 0; $i < scalar(@F); $i++) { printf(" %d", hex $F[$i]); }
> printf "\n";'
0 201 40 19 160 255 1
$
That works - it's even fairly easy to understand. The -n option means 'read each line of data and execute the commands in the script on each line (but do not print $_ at the end)'. The -a option combined with either -n (as here) or -p (which is like -n except it prints $_ automatically) means 'automatically split the input into the array @F'. The script then processes each element of @F in each line (rather verbosely), using the hex function to convert the string in $F[$i] to a number and then printing that number with printf(). The verbosity can be reduced (this is Perl: There's More Than One Way To Do It, or TMTOWTDI - tim-toady) with:
$ echo '00 C9 28 13 A0 FF 01' |
> perl -na -e 'foreach my $i (@F) { printf(" %d", hex $i); } printf "\n";'
0 201 40 19 160 255 1
$
Same result, less code. There might be more abbreviated techniques; that's compact enough without being wholly illegible.
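Since the question asked for a bash script, here is a sketch of the same conversion in pure bash, using the 16# prefix of shell arithmetic. The function name hex_to_dec is made up; fields that are not plain hex digits (such as the "resynchronising" status line) are simply skipped, which is an assumption about the desired behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read lines of space-separated hex numbers on
# stdin and print each line with the numbers converted to decimal.
hex_to_dec() {
    local -a fields
    local h out
    while read -r -a fields; do
        out=
        for h in "${fields[@]}"; do
            [[ $h =~ ^[0-9A-Fa-f]+$ ]] || continue   # skip non-hex words
            out+="${out:+ }$(( 16#$h ))"             # 16# = interpret as base 16
        done
        printf '%s\n' "$out"
    done
}

echo '00 C9 28 13 A0 FF 01' | hex_to_dec
```

Saved as a script that ends by calling hex_to_dec, it would be used as myProgram | myScript; a line made only of skipped words comes out as an empty line.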
1. Check if your system has the unbuffer command installed
which unbuffer
(typically systems that are using bash are Linux-based, and have unbuffer available)
2. If yes,
unbuffer myProgram | myScript
edit
As you have shown us your shell script as
#!/bin/bash
echo $0
echo $#
echo $1
Please recall that the values you are echoing, $0, $#, and $1, are bash parameters tied to the command-line arguments (typically options or filenames to process), not to the piped input.
To print the whole line, the number of fields on the line, and the value of the first field, awk is a perfect fit.
Try changing your script to
cat myScript.awk
#!/bin/awk -f
{
print $0
print NF
print $1
}
chmod 755 myScript.awk
Hmm.. Seeing ^Z to stop input tells me you are using Windows or are you using bash under Cygwin?
I hope this helps.
This might be a buffering issue. The GNU Coreutils come with a tool called stdbuf. If it is available on your system, try running:
stdbuf -o0 program | stdbuf -i0 script

How to get only the first ten bytes of a binary file

I am writing a bash script that needs to get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes. These are binary files and will likely have \0's and \n's throughout the first 10 bytes. It seems like most utilities work with ASCII files. What is a good way to achieve this task?
To get the first 10 bytes, as noted already:
head -c 10
To get all but the first 10 bytes (at least with GNU tail):
tail -c+11
head -c 10 does the right thing here.
You can use the dd command to copy an arbitrary number of bytes from a binary file.
dd if=infile of=outfile1 bs=10 count=1
dd if=infile of=outfile2 bs=10 skip=1
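The head and tail commands from the answers above can be combined into one small function. This is only a sketch: the name split_at and its argument order are invented, and it assumes a head that supports -c (not required by POSIX, but present in GNU and BSD).

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the first N bytes of a file to one output
# file and everything after them to another.
# $1 = input file, $2 = N, $3 = output for the header, $4 = output for the rest.
split_at() {
    local infile=$1 n=$2 head_out=$3 tail_out=$4
    head -c "$n" -- "$infile" > "$head_out" &&
    tail -c "+$(( n + 1 ))" -- "$infile" > "$tail_out"   # +K means "start at byte K"
}
```

For the question's case this would be called as split_at infile 10 header.bin body.bin (output names are placeholders). Both commands copy raw bytes, so NUL and newline bytes in the header are not a problem; the file is read twice, though.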
How to split a stream (or a file) under bash
Two answers here!
Reading the SO request:
get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes.
I understand:
How to split a file at a specific point
All the other answers here access the same file twice instead of just splitting it!
Here are my two cents:
The interesting thing about Un*x is that every job can be treated as a filter, so it is easy to split a stream using unbuffered I/O. Most standard Un*x tools (cat, grep, awk, sed, python, perl ...) work as filters.
1. Using head or dd but in a single pass
{ head -c 10 >head_part; cat >tail_part;} <file
This is more efficient, as the file is read only once: the first 10 bytes go to head_part and the rest goes to tail_part.
Note: the second redirection, >tail_part, could be placed outside the whole list ({ ...;}) as well.
You could do the same using dd:
{ dd count=1 bs=10 of=head_part; cat;} <file >tail_part
This remains more efficient than running two dd processes that each open the same file.
...And still use standard block size for the rest of file:
Another sample based on read by line:
Split HTTP (or mail) stream on near empty line (line containing only carriage return: \r):
nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -u '/^\r$/q' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
or, to drop empty last head line:
nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -nu '/^\r$/q;p' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
This will produce two files:
ls -l so_*.raw
-rw-r--r-- 1 root root 307 Apr 25 11:40 so_head.raw
-rw-r--r-- 1 root root 219 Apr 25 11:40 so_body.raw
grep www so_*.raw
so_body.raw:here.
so_head.raw:Location: http://www.google.com/
2. Pure bash way:
If the goal is to obtain the values of the first 10 bytes in a usable bash variable, here is a nice and efficient way:
Because ten bytes are few, the fork to head can be avoided. From Read a file by bytes in BASH:
read8() {
local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
printf -v $_r8_var %02X "'"$_r8_car
}
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < "$infile" >"$outfile"
This will create an array ${first10[@]} containing the hexadecimal values of the first ten bytes of $infile and store the rest of the data into $outfile.
declare -p first10
declare -a first10=([0]="25" [1]="50" [2]="44" [3]="46" [4]="2D" [5]="31" [6]="2E"
[7]="34" [8]="0A" [9]="25")
This was a PDF (%PDF -> 25 50 44 46)... Here's another sample:
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} <<<"Hello world!"
d!
As I didn't redirect the output, the string d! is printed on the terminal.
echo ${first10[@]}
48 65 6C 6C 6F 20 77 6F 72 6C
printf '%b%b%b%b%b%b%b%b%b%b\n' ${first10[@]/#/\\x}
Hello worl
About binary
You said:
These are binary files and will likely have \0's and \n's throughout the first 10 bytes.
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < <(gzip <<<"Hello world!") >/dev/null
echo ${first10[@]}
1F 8B 08 00 00 00 00 00 00 03
( Sample with a \n at bottom of this ;)
As a function
read8() { local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
printf -v $_r8_var %02X "'"$_r8_car ;}
get10() {
local -n result=${1:-first10} # 1st arg is array name
local -i _i
result=()
for ((_i=0;_i<${2:-10};_i++));do # 2nd arg is number of bytes
read8 result[_i] || { unset result[_i] ; return 1 ;}
done
cat
}
Then (here, I use the special character ⛶ to mean that there was no trailing newline):
get10 pdf 4 <$infile >$outfile
printf %b ${pdf[@]/#/\\x}
%PDF⛶
echo $(( $(stat -c %s $infile) - $(stat -c %s $outfile) ))
4
get10 test 8 <<<'Hello World!'
rld!
printf %b ${test[@]/#/\\x}
Hello Wo⛶
get10 test 24 <<<'Hello World!'
printf %b ${test[@]/#/\\x}
Hello World!
( And the last character printed is a \n! ;)
Final binary demo:
get10 test 256 < <(gzip <<<'Hello world!')
printf '%b' ${test[@]/#/\\x} | gunzip
Hello world!
printf " %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s\n" ${test[@]}
1F 8B 08 00 00 00 00 00 00 03 F3 48 CD C9 C9 57
28 CF 2F CA 49 51 E4 02 00 41 E4 A9 B2 0D 00 00
00
Note: this works fine and is very quick as long as the number of bytes to read stays low, even when processing large files. It could be used for file recognition, for example. But for splitting files into larger parts, you have to use split, head, tail and/or dd.

Resources