Save next 32 bytes in hex (after search) - bash

I am searching all files on my drive for a given hexadecimal value. Once it is found, I need to copy and save the next 32 bytes after the occurrence (there may be many occurrences in one file).
Right now I'm searching for files like this:
ggrep -obaRUP "\x01\x02\x03\x04" . > outputfile.txt
But this command returns only the file path. Preferably I'd like to use only standard Linux / Mac tools.

With -P (--perl-regexp) you can use the \K escape sequence to clear the matching buffer. Then match .{32} more chars(!):
LANG=C grep -obaRUP "\x01\x02\x03\x04\K.{32,32}" . > output.file
Note:
I'm using LANG=C to enforce a locale that uses a single-byte encoding, not UTF-8. This is to make sure .{32} matches bytes, and does not accidentally match multi-byte Unicode characters(!).
The -P option is only supported by GNU grep (along with a few of the other options used in your example)
You may want to open the output.file in a hex editor to actually see characters. For example hexdump, hd or xxd could be used.
Note, the above command will additionally print the filename and the line number / byte offset of the match. This is implicitly caused by using grep -R (recursive).
To get only the 32 bytes in the output, and nothing else, I suggest to use find:
find . -type f -exec grep -oaUP '\x01\x02\x03\x04\K.{32}' {} \;
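For example, to view the saved output.file in hex (either tool should do; this is just an illustration):
xxd output.file | less
hexdump -C output.file | less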

My test was a little simple, but this worked for me.
$: IFS=: read -r file offset data <<< "$(grep -obaRUP "\x01\x02\x03\x04.{32}" .)"
$: echo "$file # $((offset+4)):[${data#????}]"
./x # 10:[HERE ARE THE THIRTY-TWO BYTES !!]
Rather than do a complicated look-behind, I just grabbed the ^A^B^C^D and the next 32 bytes, and stripped off the leading 4 bytes from the field.
@hek2mgl's \K makes all that unnecessary, though. Use -h to eliminate filenames.
$: grep -obahRUP "\x01\x02\x03\x04\K.{32}" .
10:HERE ARE THE THIRTY-TWO BYTES !!
Take out the -b if you don't want the offset.
$: grep -oahRUP "\x01\x02\x03\x04\K.{32}" .
HERE ARE THE THIRTY-TWO BYTES !!

Related

Shell: How can I posixly determine if a file contains one or more null characters?

There are some great answers that address removing null characters from a file; it seems like sed is probably the most effective way. However, all the other questions I have been able to find are concerned not with finding null characters but with removing them.
There are certain questions that do provide valid solutions; however, I am having difficulty finding a POSIX-compliant solution that does not rely on GNUisms. The two solutions I've seen that work use cat with the -v option and grep with the -P option (neither of which is required to be supported by POSIX).
I make it a habit to delegate as much as possible to the shell, but the shell can't help me here because it is not possible to store a null character in a variable. External tools are the only option, but I can't even find a way with them when I adhere to POSIX-compliant options.
One possible way would be tr and wc:
[ "$(tr -cd '\0' < file | wc -c)" -ge 0 ]
Alternatively, od and grep will allow stopping on the first one without reading the rest of the file:
od -A n -t x1 file | grep -q 00
Use tr -d '\000' to strip nulls into a temporary file, and use wc -c to get the number of characters in that file. If the temporary file doesn't match (cmp -s) the original, the original contained nulls, and the output from wc can be used to compute the number of nulls -- the point of the question.
p.s.: grep -P isn't POSIX either. Nor is the -C option found in POSIX.
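As a rough combined sketch of the tr/wc approach (the helper name has_nuls is my own, not from the answers above):
# Hypothetical POSIX-sh helper: succeeds if the file contains at least one NUL byte.
has_nuls() {
    [ "$(tr -cd '\0' < "$1" | wc -c)" -gt 0 ]
}
if has_nuls somefile.bin; then
    echo "somefile.bin contains NUL characters"
fi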

Recursively find hexadecimal bytes in binary files

I'm using grep within bash shell to find a series of hexadecimal bytes in files:
$ find . -type f -exec grep -ri "\x5B\x27\x21\x3D\xE9" {} \;
The search works fine, although I know there's a limitation for matches in binary files when not using the -a option, where the result is only:
Binary file ./file_with_bytes matches
I would like to get the offset of the matching result, is this possible? I'm open to using another similar tool I'm just not sure what it would be.
There is actually an option in grep that is available to use
-b --byte-offset Print the 0-based byte offset within the input file
A simple example using this option:
$ grep -obarUP "\x01\x02\x03" /bin
prints out both the filename and byte offset of the matched pattern inside a directory
/bin/bash:772067:
/bin/bash:772099:
/bin/bash:772133:
/bin/bash:772608:
/bin/date:56160:
Notice that find is actually not needed, since the -r option already takes care of the recursive file search.
Not at a computer, but use:
od -x yourFile
or
xxd yourFile
to get it dumped in hex with offsets on the left side.
Sometimes your search string may not be found because its bytes do not appear contiguously in the dump but are split across two lines. You can pass the file through twice, though, with the first 4 bytes chopped off the second time, to make sure your string is found intact on one pass or the other. Then add the offset back on, and sort and uniq the offsets.
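A rough sketch of that two-pass idea (the pattern string and the 4-byte shift are only illustrative; xxd's column grouping may require adjusting the pattern):
xxd yourFile > pass1.txt                # full dump
tail -c +5 yourFile | xxd > pass2.txt   # same dump with the first 4 bytes removed
grep -n '5b27 213d' pass1.txt           # search both dumps for the byte pattern
grep -n '5b27 213d' pass2.txt           # add the 4 skipped bytes back to these offsets
# combine the resulting offsets, then: sort -n | uniq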

WC on OSX - Return includes spaces

When I run the word count command in OS X Terminal, like wc -c file.txt, I get the output below, which has spaces padded before the number. Does anyone know why this happens, or how I can prevent it?
   18000 file.txt
I would expect to get:
18000 file.txt
This occurs using bash or bourne shell.
The POSIX standard for wc may be read to imply that there are no leading blanks, but does not say that explicitly. Standards are like that.
This is what it says:
By default, the standard output shall contain an entry for each input file of the form:
"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>
and does not mention the formats for the single-column options such as -c.
A quick check shows me that AIX, OSX, Solaris use a format which specifies the number of digits for the value — to align columns (and differ in the number of digits). HPUX and Linux do not.
So it is just an implementation detail.
I suppose it is a way of getting outputs to line up nicely, and as far as I know there is no option to wc which fine tunes the output format.
You could get rid of them pretty easily by piping through sed 's/^ *//', for example.
There may be an even simpler solution, depending on why you want to get rid of them.
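For example, to capture a clean count in a variable (just one way of doing it):
count=$(wc -c < file.txt | sed 's/^ *//')
echo "$count"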
At least under macOS/bash, wc exhibits the behavior of padding its output with leading whitespace.
It can be avoided using expr:
echo -n "some words" | expr $(wc -c)
>> 10
echo -n "some words" | expr $(wc -w)
>> 2
Note: The -n prevents echoing a newline character which would count as 1 in wc -c
This bugs me every time I write a script that counts lines or characters. I wish that wc were defined not to emit the extra spaces, but it's not, so we're stuck with them.
When I write a script, instead of
nlines=`wc -l $file`
I always say
nlines=`wc -l < $file`
so that wc's output doesn't include the filename, but that doesn't help with the extra spaces. The trick I use next is to add 0 to the number, like this:
nlines=`expr $nlines + 0` # get rid of the leading spaces
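In shells with arithmetic expansion, the same cleanup can be done without expr (a small variation, not from the original answer):
nlines=$(( $(wc -l < "$file") ))   # arithmetic expansion discards the whitespace padding
echo "$nlines"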

Iterate through list of filenames in order they were created in bash

Parsing the output of ls to iterate through a list of files is bad. So how should I go about iterating through a list of files in the order in which they were created? I browsed several questions here on SO and they all seem to be parsing ls.
The embedded link suggests:
Things get more difficult if you wanted some specific sorting that
only ls can do, such as ordering by mtime. If you want the oldest or
newest file in a directory, don't use ls -t | head -1 -- read Bash FAQ
99 instead. If you truly need a list of all the files in a directory
in order by mtime so that you can process them in sequence, switch to
perl, and have your perl program do its own directory opening and
sorting. Then do the processing in the perl program, or -- worst case
scenario -- have the perl program spit out the filenames with NUL
delimiters.
Even better, put the modification time in the filename, in YYYYMMDD
format, so that glob order is also mtime order. Then you don't need ls
or perl or anything. (The vast majority of cases where people want the
oldest or newest file in a directory can be solved just by doing
this.)
Does that mean there is no native way of doing it in bash? I don't have the liberty to modify the filenames to include the time in them. I need to schedule a script in cron that would run every 5 minutes, generate an array containing all the files in a particular directory ordered by their creation time, and perform some actions on the filenames and move them to another location.
The following worked but only because I don't have funny filenames. The files are created by a server so it will never have special characters, spaces, newlines etc.
files=( $(ls -1tr) )
I can write a perl script that would do what I need but I would appreciate if someone can suggest the right way to do it in bash. Portable option would be great but solution using latest GNU utilities will not be a problem either.
sorthelper=();
for file in *; do
    # We need something that can easily be sorted.
    # Here, we use "<date><filename>".
    # Note that this works with any special characters in filenames
    sorthelper+=("$(stat -n -f "%Sm%N" -t "%Y%m%d%H%M%S" -- "$file")"); # Mac OS X only
    # or
    sorthelper+=("$(stat --printf "%Y %n" -- "$file")"); # Linux only
done;
sorted=();
while read -d $'\0' elem; do
    # this strips away the first 14 characters (<date>)
    sorted+=("${elem:14}");
done < <(printf '%s\0' "${sorthelper[@]}" | sort -z)
for file in "${sorted[@]}"; do
    # do your stuff...
    echo "$file";
done;
Other than sort and stat, all commands are actual native Bash commands (builtins)*. If you really want, you can implement your own sort using Bash builtins only, but I see no way of getting rid of stat.
The important parts are read -d $'\0', printf '%s\0' and sort -z. All these commands are used with their null-delimiter options, which means that any filename can be processed safely. Also, the use of double-quotes in "$file" and "${anarray[*]}" is essential.
*Many people feel that the GNU tools are somehow part of Bash, but technically they're not. So, stat and sort are just as non-native as perl.
With all of the cautions and warnings against using ls to parse a directory notwithstanding, we have all found ourselves in this situation. If you do find yourself needing sorted directory input, then about the cleanest use of ls to feed your loop is ls -opts | while read -r name; do ...; done. This will handle spaces in filenames, etc., without requiring a reset of IFS, due to the nature of read itself. Example:
ls -1rt | while read -r fname; do # where '1' is ONE not little 'L'
So do look for cleaner solutions avoiding ls, but if push comes to shove, ls -opts can be used sparingly without the sky falling or dragons plucking your eyes out.
Let me add the disclaimer to keep everyone happy: if you like newlines inside your filenames, then do not use ls to populate a loop. If you do not have newlines inside your filenames, there are no other adverse side effects.
Contra: TLDP Bash Howto Intro:
#!/bin/bash
for i in $( ls ); do
    echo item: $i
done
It appears that SO users do not know what the use of contra means -- please look it up before downvoting.
You can try using the stat command piped into sort:
stat -c '%Y %n' * | sort -t ' ' -nk1 | cut -d ' ' -f2-
Update: To deal with filenames containing newlines we can use the %N format in stat, and instead of cut we can use awk, like this:
LANG=C stat -c '%Y^A%N' *| sort -t '^A' -nk1| awk -F '^A' '{print substr($2,2,length($2)-2)}'
Use of LANG=C is needed to make sure stat uses only single quotes when quoting file names.
^A is the Control-A character, typed by pressing Ctrl-V followed by Ctrl-A.
How about a solution with GNU find + sed + sort?
As long as there are no newlines in the file name, this should work:
find . -type f -printf '%T# %p\n' | sort -k 1nr | sed 's/^[^ ]* //'
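If you then want to loop over the result (still assuming no newlines in the names), something like this should work:
find . -type f -printf '%T@ %p\n' | sort -k 1nr | sed 's/^[^ ]* //' |
while IFS= read -r file; do
    printf 'processing %s\n' "$file"
done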
It may be a little more work to ensure it is installed (it may already be, though), but using zsh instead of bash for this script makes a lot of sense. The filename globbing capabilities are much richer, while still using a sh-like language.
files=( *(oc) )
will create an array whose entries are all the file names in the current directory, but sorted by change time. (Use a capital O instead to reverse the sort order). This will include directories, but you can limit the match to regular files (similar to the -type f predicate to find):
files=( *(.oc) )
find is needed far less often in zsh scripts, because most of its uses are covered by the various glob flags and qualifiers available.
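For instance, processing the zsh array could then look like this (a minimal sketch):
files=( *(.oc) )            # regular files, sorted by change time (capital O reverses)
for f in "${files[@]}"; do
    print -r -- "$f"        # replace with your real processing
done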
I've just found a way to do it with bash and ls (GNU).
Suppose you want to iterate through the filenames sorted by modification time (-t):
while read -r fname; do
    fname=${fname:1:((${#fname}-2))} # remove the leading and trailing "
    fname=${fname//\\\"/\"}          # remove the \ before any embedded "
    fname=$(echo -e "$fname")        # interpret the escaped characters
    file "$fname"                    # replace `file` with anything you want
done < <(ls -At --quoting-style=c)
Explanation
Given some filenames with special characters, this is the ls output:
$ ls -A
filename with spaces .hidden_filename filename?with_a_tab filename?with_a_newline filename_"with_double_quotes"
$ ls -At --quoting-style=c
".hidden_filename" " filename with spaces " "filename_\"with_double_quotes\"" "filename\nwith_a_newline" "filename\twith_a_tab"
So you have to process each filename a little to get the actual one. Recalling:
${fname:1:((${#fname}-2))} # remove the leading and trailing "
# ".hidden_filename" -> .hidden_filename
${fname//\\\"/\"} # removed the \ before any embedded "
# filename_\"with_double_quotes\" -> filename_"with_double_quotes"
$(echo -e "$fname") # interpret the escaped characters
# filename\twith_a_tab -> filename with_a_tab
Example
$ ./script.sh
.hidden_filename: empty
filename with spaces : empty
filename_"with_double_quotes": empty
filename
with_a_newline: empty
filename with_a_tab: empty
As you can see, file (or whatever command you use) handles each filename correctly.
Each file has three timestamps:
Access time: the file was opened and read. Also known as atime.
Modification time: the file was written to. Also known as mtime.
Inode modification time: the file's status was changed, such as the file had a new hard link created, or an existing one removed; or if the file's permissions were chmod-ed, or a few other things. Also known as ctime.
None of them represents the time the file was created; that information is not saved anywhere. At file creation time, all three timestamps are initialized, and then each one gets updated appropriately: when the file is read or written to, when the file's permissions are chmod-ed, or when a hard link is created or destroyed.
So, you can't really list the files according to their file creation time, because the file creation time isn't saved anywhere. The closest match would be the inode modification time.
See the descriptions of the -t, -u, -c, and -r options in the ls(1) man page for more information on how to list files in atime, mtime, or ctime order.
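As a quick illustration of those options (standard ls flags):
ls -lt     # sort by modification time (mtime), newest first
ls -ltu    # sort by access time (atime)
ls -ltc    # sort by inode change time (ctime)
ls -ltr    # add -r to any of the above to list oldest first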
Here's a way using stat with an associative array.
n=0
declare -A arr
for file in *; do
    # modified=$(stat -f "%m" "$file") # For use with BSD/OS X
    modified=$(stat -c "%Y" "$file") # For use with GNU/Linux
    # Ensure stat timestamp is unique
    if [[ $modified == *"${!arr[@]}"* ]]; then
        modified=${modified}.$n
        ((n++))
    fi
    arr[$modified]="$file"
done
files=()
for index in $(IFS=$'\n'; echo "${!arr[*]}" | sort -n); do
    files+=("${arr[$index]}")
done
Since sort sorts lines, $(IFS=$'\n'; echo "${!arr[*]}" | sort -n) ensures the indices of the associative array get sorted by setting the field separator in the subshell to a newline.
The quoting at arr[$modified]="${file}" and files+=("${arr[$index]}") ensures that file names with caveats like a newline are preserved.
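The resulting files array can then be processed in timestamp order, for example (the destination directory is just a placeholder):
for f in "${files[@]}"; do
    # do something with "$f", then move it (destination path is an assumption)
    mv -- "$f" /path/to/destination/
done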

How to print the number of characters in each line of a text file

I would like to print the number of characters in each line of a text file using a Unix command. I know it is simple with PowerShell:
gc abc.txt | % {$_.length}
but I need a Unix command.
Use Awk.
awk '{ print length }' abc.txt
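For example (a quick illustration of the output):
$ printf 'foo\nhello\n' | awk '{ print length }'
3
5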
while IFS= read -r line; do echo ${#line}; done < abc.txt
It is POSIX, so it should work everywhere.
Edit: Added -r as suggested by William.
Edit: Beware of Unicode handling. Bash and zsh, with a correctly set locale, will show the number of code points, but dash will show bytes, so you have to check what your shell does. And then there are many other possible definitions of length in Unicode anyway, so it depends on what you actually want.
Edit: Prefix with IFS= to avoid losing leading and trailing spaces.
Here is an example using xargs:
$ xargs -d '\n' -I% sh -c 'echo % | wc -c' < file
I've tried the other answers listed above, but they are very far from decent solutions when dealing with large files -- especially once a single line's size occupies more than ~1/4 of available RAM.
Both bash and awk slurp the entire line, even though for this problem it's not needed. Bash will error out once a line is too long, even if you have enough memory.
I've implemented an extremely simple, fairly unoptimized python script that when tested with large files (~4 GB per line) doesn't slurp, and is by far a better solution than those given.
If this is time critical code for production, you can rewrite the ideas in C or perform better optimizations on the read call (instead of only reading a single byte at a time), after testing that this is indeed a bottleneck.
Code assumes newline is a linefeed character, which is a good assumption for Unix, but YMMV on Mac OS/Windows. Be sure the file ends with a linefeed to ensure the last line character count isn't overlooked.
from sys import stdin, exit

counter = 0
while True:
    byte = stdin.buffer.read(1)
    counter += 1
    if not byte:
        exit()
    if byte == b'\x0a':
        print(counter - 1)
        counter = 0
Try this:
while read line
do
    echo -e "$line" | wc -m
done < abc.txt
