bash: cat the first lines of a file & get position

I have a very big file that contains n lines of text (with n < 1000) at the beginning, then an empty line, and then lots of untyped binary data.
I would like to extract the first n lines of text, and then somehow extract the exact byte offset of the binary data.
Extracting the first lines is simple, but how can I get the offset? bash is not encoding-aware, so just counting up the number of characters is senseless.

grep has an option -b to output the byte offset.
Example:
$ hexdump -C foo
00000000 66 6f 6f 0a 0a 62 61 72 0a |foo..bar.|
00000009
$ grep -b "^$" foo
4:
$ hexdump -s 5 -C foo
00000005 62 61 72 0a |bar.|
00000009
In the last step I used 5 instead of 4 to skip the newline.
Also works with umlauts (äöü) in the file.
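Building on this, the offset from grep -b can feed head -c and tail -c to actually split the file in two (a sketch; the -m 1 option, which stops at the first match, assumes GNU grep):

```shell
printf 'foo\n\nbar\n' > foo                  # the sample file from above
off=$(grep -b -m 1 '^$' foo | cut -d: -f1)   # byte offset of the empty line (4)
head -c "$off" foo > text_part               # the text block, without the blank line
tail -c +"$((off + 2))" foo > binary_part    # +2 skips past the blank line's "\n"
```

After this, text_part holds "foo\n" and binary_part holds "bar\n".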

Use grep to find the empty line
grep -n "^$" your_file | tr -d ':'
Optionally use tail -n 1 if you want the last empty line (that is, if the top part of the file can contain empty lines before the binary stuff starts).
Use head to get the top part of the file.
head -n $num
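Put together, that could look like the following (a sketch; the file name your_file is a placeholder, and -m 1, which stops grep at the first empty line, is a GNU grep option):

```shell
printf 'line one\nline two\n\nBINARY' > your_file   # sample input
num=$(grep -n -m 1 '^$' your_file | tr -d ':')      # line number of the empty line (3)
head -n "$((num - 1))" your_file                    # prints the text above it
```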

You might want to use tools like hexdump or od to retrieve binary offsets, instead of bash.

Perl can tell you where you are in a file:
pos=$( perl -le '
    open $fh, "<", $ARGV[0];
    $/ = "";                    # read the file in "paragraphs"
    $first_paragraph = <$fh>;
    print tell($fh)
' filename )
Parenthetically, I was attempting to one-liner this:
pos=$( perl -00 -lne 'if ($. == 2) {print tell(___what?___); exit}' filename )
What is the "current filehandle" variable? I couldn't find it in the docs.

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and clean up the output a little. It seems overly messy, and I'm wondering if the script can be trimmed down a bit.
The input file consists of pairs of lines -- names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but will not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Extra Example expected output:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
another awk
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE
for the empty line record delimiter
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
  [[ $line ]] || continue                       # skip blank lines
  [[ -z $name ]] && { name=$line; continue; }   # first non-blank line becomes name
  number=$line                                  # second one becomes number
  if (( number >= 80 && number < 200 )); then
    name=${name%%-*}                            # prune everything after first "-"
    printf '%s %s\n' "$name" "$number"          # emit our output
  fi
  name=; number=                                # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works.
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
perl -nle's/-.*//; chomp($n=<>); print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
With gnu sed
sed -E '
  N
  /\n[8-9][0-9]$/bA
  /\n1[0-9]{2}$/!d
  :A
  s/([^-]*).*\n([0-9]+$)/\1 \2/
' infile
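Any of the one-liners above can be sanity-checked against the sample input; for instance, the first awk version:

```shell
# recreate the example input file
printf '%s\n' \
  al12t5682-heapmemusage-latest.log 38 \
  al12t5683-heapmemusage-latest.log 88 \
  al12t5684-heapmemusage-latest.log 100 \
  al12t5685-heapmemusage-latest.log 0 \
  al12t5686-heapmemusage-latest.log 91 > inputfile.txt
# odd lines: remember the name (split on "-"); even lines: range-check the number
awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p, $1}' inputfile.txt
# -> al12t5683 88
# -> al12t5684 100
# -> al12t5686 91
```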

replace string in text file with random characters

So what I'm trying to do is this: I've been using keybr.com to sharpen my typing skills, and on this site you can "provide your own custom text." I've been taking chapters out of books to type, so it's a little more interesting than just typing groups of letters. Now I want to also insert numbers into the text -- specifically, between each word, something like "393" and random sets smaller and larger than that example.
So I have saved a chapter of a book into a file in my home folder. Now I just need a command to search for spaces, insert a group of numbers, and add a space, so a sentence would look like this: The 293 dog 328 is 102 black. 334 The... etc.
I have looked up Linux commands through search engines, and I've found out how to replace strings in text files with:
sed -i 's/original/new/g' file.txt
and how to generate random numbers with:
$ shuf -i MIN-MAX -n COUNT
I just cannot figure out how to build a one-line command that will put random numbers between each word. I'm still-a-searching, so thanks to anyone who takes the time to read my problem.
Perl to the rescue!
perl -pe 's/ /" " . (100 + int rand 900) . " "/ge' < input.txt > output.txt
-p reads the input line by line, after reading a line, it runs the code and prints the line to the output
s/// is similar to the substitution you know from sed
/g means global, i.e. it substitutes as many times as possible
/e means the replacement part is a code to run. In this case, the code generates a random number (100-999).
Given:
$ echo "$txt"
Here is some random words. Please
insert a number a space between each one.
Here is a simple awk to do that:
$ echo "$txt" | awk '{for (i=1;i<=NF;i++) printf "%s %d ", $i, rand()*100; print ""}'
Here 92 is 59 some 30 random 57 words. 74 Please 78
insert 43 a 33 number 77 a 10 space 78 between 83 each 76 one. 49
And here is roughly the same thing in pure Bash:
while read -r line; do
  for word in $line; do
    printf "%s %s " "$word" "$((1 + RANDOM % 100))"
  done
  echo
done < <(echo "$txt")

N combinations of words, simple bash

File 1:
1F
2F
3F
4F
5f
File 2:
1F
2F
3F
4F
5f
I have a simple script that produces all possible combinations:
#!/bin/bash
for a in $(awk '{print $1}' intf1)
do
  for b in $(awk '{print $1}' intf2)
  do
    echo -e "$a:$b" >> file
  done
done
Output of this code:
1F:1F
1F:2F
1F:3F
1F:4F
2F:1F
etc
But I would like to:
1) completely avoid repetitions, and
2) "select the number" -- that is, the number of words (lines) that will be taken from the second file.
Each two lines in the second file:
1F:2F
1F:3F
2F:3F
2F:4F
3F:4F
3F:5F
4F:5F
Each three lines in second file:
1F:2F
1F:3F
1F:4F
2F:3F
2F:4F
2F:5F
etc..
And etc
If the files could be sorted (and exclude repeating lines), this could work:
printf "%s\n" $(eval "echo {$(sort -u file.txt | paste -sd, -)}:{$(sort -u file2.txt | head -2 | paste -sd, -)}") | sort -u
It uses the bash expansion to generate the combinations, like:
$ echo {a,b}{X,Y,Z}
aX aY aZ bX bY bZ
Also, you must ABSOLUTELY trust the content of the files, because the code uses the dangerous eval.
The argument to head (like head -2) can be used to limit the count of lines taken from file2.txt.
The code produces (by limiting the second file to 2 already sorted lines) the following:
20160702F:20160702F
20160702F:20160714F
20160714F:20160702F
20160714F:20160714F
20160807F:20160702F
20160807F:20160714F
20160819F:20160702F
20160819F:20160714F
20160831F:20160702F
20160831F:20160714F
20160912F:20160702F
20160912F:20160714F
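If eval feels too dangerous, a plain nested loop over arrays can produce the "line i paired with the next k lines" pattern that the "each two lines" example suggests. This is a sketch under that reading of the requirement; it assumes bash 4+ for mapfile:

```shell
#!/bin/bash
# Pair line i of intf1 with the next k lines of intf2 (k=2 reproduces
# the "each two lines" example output from the question).
k=2
mapfile -t a < intf1
mapfile -t b < intf2
for i in "${!a[@]}"; do
  for (( j = i + 1; j <= i + k && j < ${#b[@]}; j++ )); do
    printf '%s:%s\n' "${a[i]}" "${b[j]}"
  done
done
```

With the sample files (1F..5f in both), this prints 1F:2F, 1F:3F, 2F:3F, 2F:4F, 3F:4F, 3F:5f, 4F:5f -- no repetitions and no eval.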

redirect stdout to script, so it can be parsed and then sent to stdout

I have a (java) program that prints a line of hex numbers to stdout every 5ish seconds, until the program is terminated by the user.
I would like to redirect that output to a bash script so I could convert each of those hex numbers independently to decimal, then print the parsed line to stdout.
I tried using myProgram | myScript but that did the piping before any lines were printed, then didn't keep listening to stdout. I then tried myProgram > myScript, and that just overwrote the script.
Ideas?
Edit: adding output from the runs (sorry for the poor formatting; I couldn't get it all into the code highlighting, so the middle of the output is not highlighted).
Here is the script
#!/bin/bash
echo $0
echo $#
echo $1
Here is how my program runs when it goes straight to stdout; this would continue forever if I didn't terminate it.
mmmm@mmmm:~/mmmm/mmmm/mmmmm$ java net.tinyos.tools.Listen -comm
serial@/dev/ttyUSB0:micaz
serial@/dev/ttyUSB0:57600: resynchronising
00 FF FF 00 02 04 22 93 00 02 02 C9
00 FF FF 00 03 04 22 93 00 03 03 0E
00 FF FF 00 02 04 22 93 00 03 03 0E
00 FF FF 00 02 04 22 93 00 02 02 C9
^Z
[5]+ Stopped java net.tinyos.tools.Listen -comm
serial@/dev/ttyUSB0:micaz
Here is where I try to pipe it to my script (which I have set to print the number of command line arguments and the first argument). It just freezes after this...
mmmm@mmmm:~/mmmm/mmmm/mmmmm$ java net.tinyos.tools.Listen -comm serial@/dev/ttyUSB0:micaz | ./parser.sh
./parser.sh
0
serial@/dev/ttyUSB0:57600: resynchronising
Diagnosis
When you use this script like this:
java javaprog | myScript
and myScript contains:
#!/bin/bash
echo $0
echo $#
echo $1
Then the output from the script will be its name (myScript) from the echo $0, the number of arguments it was passed (0) from the echo $#, and the first argument (an empty line is echoed) from the echo $1. The script then exits (successfully). The issue is nothing to do with buffering; it is all to do with the script not reading anything from its standard input. Even a trivial modification would be an improvement:
#!/bin/bash
while read data; do echo $data; done
That's a slower form of cat, except that it normalizes random sequences of spaces and tabs into single spaces, stripping leading and trailing spaces off the line. It would at least demonstrate the script processing the output from the Java program.
Trying awk
To do what you're after, you should probably replace that with an awk program or something similar. This is a first draft, but it stands some chance of working:
awk '{ for (i = 1; i <= NF; i++) { x = "0x" $i + 0; printf(" %d", x) } printf "\n" }'
This says: for each line (because there is no pattern before the open brace), for each of the fields 1..NF, convert the field into an explicit hex string with the 0x prefix (adding 0 to force numeric context), then print the value as a decimal number, trusting awk to convert a string such as '0xC9' to a number.
Using Perl
Unfortunately, a little testing shows that this does not work; the problem is getting a value other than 0 for x. So, ... time to fall back on Perl in awk-emulation mode:
$ echo '00 C9 28 13 A0 FF 01' |
> perl -na -e 'for ($i = 0; $i < scalar(@F); $i++) { printf(" %d", hex $F[$i]); }
> printf "\n";'
0 201 40 19 160 255 1
$
That works - it's even fairly easy to understand. The -n option means 'read each line of data and execute the commands in the script on each line (but do not print $_ at the end)'. The -a option combined with either -n (as here) or -p (which is like -n except it prints $_ automatically) means 'automatically split the input into the array @F'. The script then processes each element of @F in each line (rather verbosely), using the hex function to convert the string in $F[$i] to a number and then printing that number with printf(). The verbosity can be reduced (this is Perl: There's More Than One Way To Do It, or TMTOWTDI - tim-toady) with:
$ echo '00 C9 28 13 A0 FF 01' |
> perl -na -e 'foreach my $i (@F) { printf(" %d", hex $i); } printf "\n";'
0 201 40 19 160 255 1
$
Same result, less code. There might be more abbreviated techniques; that's compact enough without being wholly illegible.
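If GNU awk happens to be available (an assumption about your system; strtonum is a gawk extension, not POSIX awk), the conversion the awk draft stumbled over works directly:

```shell
echo '00 FF FF 00 02 04 22 93 00 02 02 C9' |
gawk '{ for (i = 1; i <= NF; i++)
          # strtonum understands a 0x prefix, so each hex field becomes decimal
          printf "%d%s", strtonum("0x" $i), (i < NF ? " " : "\n") }'
# -> 0 255 255 0 2 4 34 147 0 2 2 201
```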
1. Check if your system has the unbuffer command installed:
which unbuffer
(typically systems that are using bash are Linux-based, and have unbuffer available)
2. If yes:
unbuffer myProgram | myScript
edit
As you have shown us your shell script as
#!/bin/bash
echo $0
echo $#
echo $1
Please recall that the values you are echoing, $0, $#, $1 are positional parameters to bash related to the command line arguments. Typically options or filenames for processing.
To print the whole line, the number of fields on the line, and the value of the first field, awk is a perfect fit for this problem.
Try changing your script to
cat myScript.awk
#!/bin/awk -f
{
  print $0
  print NF
  print $1
}
chmod 755 myScript.awk
Hmm.. seeing ^Z used to stop input makes me wonder: are you on Windows, or using bash under Cygwin?
I hope this helps.
This might be a buffering issue. The GNU Coreutils come with a tool called stdbuf. If it is available on your system, try running:
stdbuf -o0 program | stdbuf -i0 script

How to use "cmp" to compare two binaries and find all the byte offsets where they differ?

I would love some help with a Bash script loop that will show all the differences between two binary files, using just
cmp file1 file2
It only shows the first change. I would like to use cmp because it gives an offset and a line number for each change, but if you think there's a better command, I'm open to it :) Thanks.
I think cmp -l file1 file2 might do what you want. From the manpage:
-l --verbose
Output byte numbers and values of all differing bytes.
The output is a table of the offset, the byte value in file1 and the value in file2 for all differing bytes. It looks like this:
4531 66 63
4532 63 65
4533 64 67
4580 72 40
4581 40 55
[...]
So the first difference is at offset 4531 (decimal, 1-based), where file1's byte value is 66 (octal) and file2's is 63 (octal).
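If you would rather see hex on both sides, a small shell loop can reinterpret cmp -l's decimal/octal output (a sketch; file1 and file2 are placeholders):

```shell
# demo files that differ in their third byte
printf 'abc' > file1
printf 'abd' > file2
cmp -l file1 file2 | while read -r off o1 o2; do
  # cmp -l prints 1-based decimal offsets and octal byte values; the
  # leading "0" makes printf parse the values as octal constants
  printf '%08X %02X %02X\n' "$((off - 1))" "0$o1" "0$o2"
done
# -> 00000002 63 64
```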
Method that works for single byte addition/deletion
diff <(od -An -tx1 -w1 -v file1) \
<(od -An -tx1 -w1 -v file2)
Generate a test case with a single removal of byte 64:
for i in `seq 128`; do printf "%02x" "$i"; done | xxd -r -p > file1
for i in `seq 128`; do if [ "$i" -ne 64 ]; then printf "%02x" $i; fi; done | xxd -r -p > file2
Output:
64d63
< 40
If you also want to see the ASCII version of the character:
bdiff() (
  f() (
    od -An -tx1c -w1 -v "$1" | paste -d '' - -
  )
  diff <(f "$1") <(f "$2")
)
bdiff file1 file2
Output:
64d63
< 40 #
Tested on Ubuntu 16.04.
I prefer od over xxd because:
it is POSIX; xxd is not (it comes with Vim)
it has -An to remove the address column without needing awk.
Command explanation:
-An removes the address column. This is important otherwise all lines would differ after a byte addition / removal.
-w1 puts one byte per line, so that diff can consume it. It is crucial to have one byte per line, or else every line after a deletion would become out of phase and differ. Unfortunately, this is not POSIX, but present in GNU.
-tx1 is the representation you want, change to any possible value, as long as you keep 1 byte per line.
-v prevents the asterisk repetition abbreviation *, which might interfere with the diff.
paste -d '' - - joins every two lines. We need it because the hex and ASCII go into separate adjacent lines. Taken from: Concatenating every other line with the next
We use parentheses () to define bdiff instead of {} to limit the scope of the inner function f; see also: How to define a function inside another function in Bash?
See also:
https://superuser.com/questions/125376/how-do-i-compare-binary-files-in-linux
https://unix.stackexchange.com/questions/59849/diff-binary-files-of-different-sizes
The more efficient workaround I've found is to translate binary files to some form of text using od.
Then any flavour of diff works fine.
