Recursively find hexadecimal bytes in binary files - bash

I'm using grep within bash shell to find a series of hexadecimal bytes in files:
$ find . -type f -exec grep -ri "\x5B\x27\x21\x3D\xE9" {} \;
The search works fine, although I know that without the -a option matches in binary files are only reported as:
Binary file ./file_with_bytes matches
I would like to get the offset of the matching result; is this possible? I'm open to using another similar tool, I'm just not sure what it would be.

There is actually an option in grep for exactly this:
-b, --byte-offset    Print the 0-based byte offset within the input file
A simple example using this option:
$ grep -obarUP "\x01\x02\x03" /bin
This prints both the filename and the byte offset of each match found under that directory:
/bin/bash:772067:
/bin/bash:772099:
/bin/bash:772133:
/bin/bash:772608:
/bin/date:56160:
Notice that find is actually not needed, since the -r option already takes care of searching files recursively.
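Applied to the bytes from the question, a similar invocation would be (a hedged sketch, assuming GNU grep so that -P, -r, -b and -U are all available):
# print filename and byte offset for every occurrence, searching recursively
grep -obarUP "\x5B\x27\x21\x3D\xE9" .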

Not at a computer, but use:
od -x yourFile
or
xxd yourFile
to get it dumped in hex with offsets on the left side.
Sometimes your search string may not be found because its bytes do not appear contiguously in the dump but are split across two output lines. You can pass the file through twice, though, with the first 4 bytes chopped off on the second pass, to make sure your string is found intact on one pass or the other. Then add the offset back on, and sort and uniq the offsets.
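A rough sketch of that two-pass idea, using the five bytes from the question (hedged: the grep pattern must be spelled the way the dump prints the bytes, and the offset you get is that of the 16-byte dump line containing the match):
# pass 1: one byte per column, hex offsets on the left
od -A x -t x1 yourFile | grep -i '5b 27 21 3d e9'
# pass 2: chop off the first 4 bytes so a string that straddled a line
# boundary in pass 1 shows up intact here; add 4 to any offsets from this pass
tail -c +5 yourFile | od -A x -t x1 | grep -i '5b 27 21 3d e9'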

Related

Save next 32 bytes in hex (after search)

I am searching all files on my drive for a given hexadecimal value, after it is found I need to copy and save the next 32 bytes after the found occurrence (there may be many occurrences in one file).
Right now I'm searching for files like this:
ggrep -obaRUP "\x01\x02\x03\x04" . > outputfile.txt
But this command returns only the file path. Preferably I'd like to use only standard Linux / Mac tools.
With -P (--perl-regexp) you can use the \K escape sequence, which keeps everything matched before it out of the reported match. Then match .{32} more chars(!):
LANG=C grep -obaRUP "\x01\x02\x03\x04\K.{32,32}" . > output.file
Note:
I'm using LANG=C to enforce a locale that uses a single-byte encoding, not UTF-8. This is to make sure .{32} does not accidentally match Unicode chars(!), but bytes instead.
The -P option is only supported by GNU grep (along with a few others used in your example)
You may want to open the output.file in a hex editor to actually see characters. For example hexdump, hd or xxd could be used.
Note that the above command will additionally print the filename (implied when grep searches multiple files, as with -R) and the byte offset of each match (from -b).
To get only the 32 bytes in the output, and nothing else, I suggest using find:
find . -type f -exec grep -oaUP '\x01\x02\x03\x04\K.{32}' {} \;
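If you would rather see those 32 bytes as hex than as raw characters, a hedged option is to pipe the matches through one of the dump tools mentioned above; note that with multiple matches the bytes all run together in one dump:
find . -type f -exec grep -oaUP '\x01\x02\x03\x04\K.{32}' {} \; | xxd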
My test was a little simple, but this worked for me.
$: IFS=: read -r file offset data <<< "$(grep -obaRUP "\x01\x02\x03\x04.{32}" .)"
$: echo "$file # $((offset+4)):[${data#????}]"
./x # 10:[HERE ARE THE THIRTY-TWO BYTES !!]
Rather than do a complicated look-behind, I just grabbed the ^A^B^C^D and the next 32 bytes, and stripped off the leading 4 bytes from the field.
@hek2mgl's \K makes all that unnecessary, though. Use -h to eliminate filenames.
$: grep -obahRUP "\x01\x02\x03\x04\K.{32}" .
10:HERE ARE THE THIRTY-TWO BYTES !!
Take out the -b if you don't want the offset.
$: grep -oahRUP "\x01\x02\x03\x04\K.{32}" .
HERE ARE THE THIRTY-TWO BYTES !!

How can I iterate from a list of source files and locate those files on my disk drive? I'm using FD and RIPGREP

I have a very long list of files stored in a text file (missing-files.txt) that I want to locate on my drive. These files are scattered across different folders on my drive. I want to get whatever closest available match can be found.
missing-files.txt
wp-content/uploads/2019/07/apple.jpg
wp-content/uploads/2019/08/apricots.jpg
wp-content/uploads/2019/10/avocado.jpg
wp-content/uploads/2020/04/banana.jpg
wp-content/uploads/2020/07/blackberries.jpg
wp-content/uploads/2020/08/blackcurrant.jpg
wp-content/uploads/2021/06/blueberries.jpg
wp-content/uploads/2021/01/breadfruit.jpg
wp-content/uploads/2021/02/cantaloupe.jpg
wp-content/uploads/2021/03/carambola.jpg
....
Here's my working bash code:
while read p;
do
file="${p##*/}"
/usr/local/bin/fd "${file}" | /usr/local/bin/rg "${p}" | /usr/bin/head -n 1 >> collected-results.txt
done <missing-files.txt
What's happening in my bash code:
I iterate from my list of files
I use FD (https://github.com/sharkdp/fd) command to locate those files in my drive
I then piped it to RIPGREP (https://github.com/BurntSushi/ripgrep) to filter the results and find the closest match. The match I'm looking for should match the same file and folder structure. I only limit it to one result.
Then finally store it in another text file where I can later evaluate the list for the next step
Where I need help:
Is this the most efficient way to do this? I have over 2,000 files that I need to locate. I'm open to other solutions; this is something I just devised.
For some reason my code broke; it stopped returning results to "collected-results.txt". My guess is that it broke somewhere in the second pipe, right after the FD command. I haven't set up any condition in case it encounters an error or can't find the file, so it's hard for me to determine.
Additional Information:
I'm using Mac, and running on Catalina
Clearly this is not my area of expertise
"Missing" sounds like they do not exist where expected.
What makes you think they would be somewhere else?
If they are, I'd put the filenames in a list.txt file with enough minimal pattern to pick them out of the output of find.
$: cat list.txt
/apple.jpg$
/apricots.jpg$
/avocado.jpg$
/banana.jpg$
/blackberries.jpg$
/blackcurrant.jpg$
/blueberries.jpg$
/breadfruit.jpg$
/cantaloupe.jpg$
/carambola.jpg$
Then search the whole machine, which is gonna take a bit...
$: find / | grep -f list.txt
/tmp/apricots.jpg
/tmp/blackberries.jpg
/tmp/breadfruit.jpg
/tmp/carambola.jpg
Or if you want those longer partial paths,
$: find / | grep -f missing-files.txt
That should show you the actual paths to wherever those files exist IF they do exist on the system.
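On macOS (the question mentions Catalina) a scan starting at / will produce a lot of permission errors; a hedged refinement is to silence them and restrict the search to regular files:
find / -type f 2>/dev/null | grep -f missing-files.txt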
From the way I understand it, you want to find all files that could match the directory structure:
path/to/file
So it should return something like "/full/path/to/file" and "/another/full/path/to/file"
Using a simple find command you can get a list of all files that match these criteria.
Using find you can search your hard disk in a single go with something of the form:
$ find -regex pattern
The idea is now to build pattern, which we can do from the file missing_files.txt. The pattern should look something like .*/\(file1\|file2\|...\|filen\). So we can use the following sed to do so:
$ sed ':a;N;$!ba;s/\n/\|/g' missing_files.txt
So now we can do exactly what you did, but a bit quicker, in the following way:
pattern="$(sed ':a;N;$!ba;s/\n/\|/g' missing_files.txt)"
pattern=".*/\($pattern\)"
find -regex "$pattern" > file_list.txt
In order to find the files, you can now do something like:
grep -F -f missing_files.txt file_list.txt
This will return all the matching cases. If you just want the first match for each missing file, you can use:
awk '(NR==FNR){a[$0]++;next}{for(i in a) if (!(i in b)) if ($0 ~ i) {print; b[i]}}' missing_files.txt file_list.txt
Is this the most efficient way to do this?
I/O is usually the biggest bottleneck. You are running fd once per file, one file at a time. Instead, run a single search that finds all the files in one pass over the disk. In shell you would do:
find . -type f '(' -name "first name" -o -name "other name" -o .... ')'
How can I iterate from a list of source files and locate those files on my disk drive?
Use -path to match the full path. First build the arguments then call find.
findargs=()
# Read bashfaq/001
while IFS= read -r patt; do
# I think */ should match anything in front.
findargs+=(-o -path "*/$patt")
done < <(
# TODO: escape glob better, not tested
# see https://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13
sed 's/[?*[]/\\&/g' missing-files.txt
)
# remove the leading -o
unset 'findargs[0]'
find / -type f '(' "${findargs[@]}" ')'
Topics to research: var=() - bash arrays, < <(...) shell redirection with process substitution and when to use it (bashfaq/024), glob (and see man 7 glob) and man find.

Iterate through list of filenames in order they were created in bash

Parsing the output of ls to iterate through a list of files is bad. So how should I go about iterating through a list of files in the order in which they were first created? I browsed several questions here on SO and they all seem to be parsing ls.
The embedded link suggests:
Things get more difficult if you wanted some specific sorting that
only ls can do, such as ordering by mtime. If you want the oldest or
newest file in a directory, don't use ls -t | head -1 -- read Bash FAQ
99 instead. If you truly need a list of all the files in a directory
in order by mtime so that you can process them in sequence, switch to
perl, and have your perl program do its own directory opening and
sorting. Then do the processing in the perl program, or -- worst case
scenario -- have the perl program spit out the filenames with NUL
delimiters.
Even better, put the modification time in the filename, in YYYYMMDD
format, so that glob order is also mtime order. Then you don't need ls
or perl or anything. (The vast majority of cases where people want the
oldest or newest file in a directory can be solved just by doing
this.)
Does that mean there is no native way of doing it in bash? I don't have the liberty to modify the filenames to include the time in them. I need to schedule a script in cron that would run every 5 minutes, generate an array containing all the files in a particular directory ordered by their creation time, and perform some actions on the filenames and move them to another location.
The following worked but only because I don't have funny filenames. The files are created by a server so it will never have special characters, spaces, newlines etc.
files=( $(ls -1tr) )
I can write a perl script that would do what I need but I would appreciate if someone can suggest the right way to do it in bash. Portable option would be great but solution using latest GNU utilities will not be a problem either.
sorthelper=();
for file in *; do
# We need something that can easily be sorted.
# Here, we use "<date><filename>".
# Note that this works with any special characters in filenames
sorthelper+=("$(stat -n -f "%Sm%N" -t "%Y%m%d%H%M%S" -- "$file")"); # Mac OS X only
# or
sorthelper+=("$(stat --printf "%Y %n" -- "$file")"); # Linux only
done;
sorted=();
while read -d $'\0' elem; do
# this strips away the first 14 characters (<date>)
sorted+=("${elem:14}");
done < <(printf '%s\0' "${sorthelper[@]}" | sort -z)
for file in "${sorted[@]}"; do
# do your stuff...
echo "$file";
done;
Other than sort and stat, all commands are actual native Bash commands (builtins)*. If you really want, you can implement your own sort using Bash builtins only, but I see no way of getting rid of stat.
The important parts are read -d $'\0', printf '%s\0' and sort -z. All these commands are used with their null-delimiter options, which means that any filename can be processed safely. Also, the use of double-quotes in "$file" and "${anarray[*]}" is essential.
*Many people feel that the GNU tools are somehow part of Bash, but technically they're not. So, stat and sort are just as non-native as perl.
All of the cautions and warnings against using ls to parse a directory notwithstanding, we have all found ourselves in this situation. If you do find yourself needing sorted directory input, then about the cleanest use of ls to feed your loop is ls -opts | while read -r name; do ... This will handle spaces in filenames, etc., without requiring a reset of IFS due to the nature of read itself. Example:
ls -1rt | while read -r fname; do # where '1' is ONE not little 'L'
So do look for cleaner solutions avoiding ls, but if push comes to shove, ls -opts can be used sparingly without the sky falling or dragons plucking your eyes out.
Let me add the disclaimer to keep everyone happy: if you like newlines inside your filenames, then do not use ls to populate a loop. If you do not have newlines inside your filenames, there are no other adverse side effects.
Contra: TLDP Bash Howto Intro:
#!/bin/bash
for i in $( ls ); do
echo item: $i
done
It appears that SO users do not know what the use of contra means -- please look it up before downvoting.
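For contrast, a minimal sketch of that same loop written with a glob instead of ls (this sidesteps the parsing problem entirely, though it gives up the ability to sort by time):
#!/bin/bash
for i in *; do
    echo "item: $i"
done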
You can try using the stat command piped to sort:
stat -c '%Y %n' * | sort -t ' ' -nk1 | cut -d ' ' -f2-
Update: to deal with filenames containing newlines we can use the %N format in stat, and instead of cut we can use awk, like this:
LANG=C stat -c '%Y^A%N' *| sort -t '^A' -nk1| awk -F '^A' '{print substr($2,2,length($2)-2)}'
Use of LANG=C is needed to make sure stat uses only single quotes when quoting file names.
^A is the Control-A character, typed by pressing Ctrl-V and then Ctrl-A.
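If typing the literal control character is awkward, a hedged variant lets bash generate it with $'\x01' (same pipeline, same GNU stat assumption):
SEP=$'\x01'   # the ^A separator, without typing Ctrl-V Ctrl-A
LANG=C stat -c "%Y${SEP}%N" * | sort -t "$SEP" -nk1 | awk -F "$SEP" '{print substr($2,2,length($2)-2)}'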
How about a solution with GNU find + sed + sort?
As long as there are no newlines in the file name, this should work:
find . -type f -printf '%T@ %p\n' | sort -k 1nr | sed 's/^[^ ]* //'
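If newlines in file names are a concern, a hedged NUL-delimited variant of the same idea (GNU find and GNU sort assumed) is:
# newest first; each record is "<epoch> <path>" terminated by a NUL byte
find . -type f -printf '%T@ %p\0' | sort -z -k1,1nr |
while IFS= read -r -d '' entry; do
    fname=${entry#* }                     # strip the leading timestamp
    printf 'processing: %q\n' "$fname"    # replace with real work
done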
It may be a little more work to ensure it is installed (it may already be, though), but using zsh instead of bash for this script makes a lot of sense. The filename globbing capabilities are much richer, while still using a sh-like language.
files=( *(oc) )
will create an array whose entries are all the file names in the current directory, but sorted by change time. (Use a capital O instead to reverse the sort order). This will include directories, but you can limit the match to regular files (similar to the -type f predicate to find):
files=( *(.oc) )
find is needed far less often in zsh scripts, because most of its uses are covered by the various glob flags and qualifiers available.
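A hedged usage sketch, staying in zsh: collect regular files in change-time order and loop over them:
files=( *(.oc) )
for f in $files; do
    print -r -- "processing: $f"    # replace with real work
done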
I've just found a way to do it with bash and ls (GNU).
Suppose you want to iterate through the filenames sorted by modification time (-t):
while read -r fname; do
fname=${fname:1:((${#fname}-2))} # remove the leading and trailing "
fname=${fname//\\\"/\"} # removed the \ before any embedded "
fname=$(echo -e "$fname") # interpret the escaped characters
file "$fname" # replace (YOU) `file` with anything
done < <(ls -At --quoting-style=c)
Explanation
Given some filenames with special characters, this is the ls output:
$ ls -A
filename with spaces .hidden_filename filename?with_a_tab filename?with_a_newline filename_"with_double_quotes"
$ ls -At --quoting-style=c
".hidden_filename" " filename with spaces " "filename_\"with_double_quotes\"" "filename\nwith_a_newline" "filename\twith_a_tab"
So you have to process each filename a little to get the actual one. Recalling:
${fname:1:((${#fname}-2))} # remove the leading and trailing "
# ".hidden_filename" -> .hidden_filename
${fname//\\\"/\"} # removed the \ before any embedded "
# filename_\"with_double_quotes\" -> filename_"with_double_quotes"
$(echo -e "$fname") # interpret the escaped characters
# filename\twith_a_tab -> filename with_a_tab
Example
$ ./script.sh
.hidden_filename: empty
filename with spaces : empty
filename_"with_double_quotes": empty
filename
with_a_newline: empty
filename with_a_tab: empty
As seen, file (or the command you want) interprets well each filename.
Each file has three timestamps:
Access time: the file was opened and read. Also known as atime.
Modification time: the file was written to. Also known as mtime.
Inode modification time: the file's status was changed, such as the file had a new hard link created, or an existing one removed; or if the file's permissions were chmod-ed, or a few other things. Also known as ctime.
None of these represents the time the file was created; that information is not saved anywhere. At file creation time, all three timestamps are initialized, and then each one gets updated appropriately when the file is read, or written to, or when the file's permissions are chmoded, or a hard link is created or destroyed.
So, you can't really list the files according to their file creation time, because the file creation time isn't saved anywhere. The closest match would be the inode modification time.
See the descriptions of the -t, -u, -c, and -r options in the ls(1) man page for more information on how to list files in atime, mtime, or ctime order.
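Concretely, the ls invocations those options give you (hedged, but these flags are POSIX) are:
ls -lt     # sort by modification time (mtime), newest first
ls -ltu    # sort by access time (atime) instead
ls -ltc    # sort by inode change time (ctime) instead
ls -ltr    # -r reverses the order (here: oldest mtime first)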
Here's a way using stat with an associative array.
n=0
declare -A arr
for file in *; do
# modified=$(stat -f "%m" "$file") # For use with BSD/OS X
modified=$(stat -c "%Y" "$file") # For use with GNU/Linux
# Ensure stat timestamp is unique
if [[ $modified == *"${!arr[@]}"* ]]; then
modified=${modified}.$n
((n++))
fi
arr[$modified]="$file"
done
files=()
for index in $(IFS=$'\n'; echo "${!arr[*]}" | sort -n); do
files+=("${arr[$index]}")
done
Since sort sorts lines, $(IFS=$'\n'; echo "${!arr[*]}" | sort -n) ensures the indices of the associative array get sorted by setting the field separator in the subshell to a newline.
The quoting at arr[$modified]="${file}" and files+=("${arr[$index]}") ensures that file names with caveats like a newline are preserved.

Remove files based on integer.name

Ok what I am trying to do is very specific. I need some code that will remove files from a directory based on integer.name.
The files in the directory are listed like this
441.TERM (the # is actually a PID so it'll be random)
442.TERM
No matter what, I always want to keep the first .TERM file and remove any .TERM files after that, as no more than one should ever be created by my script, but it does happen sometimes due to some issues with the system I am scripting on. I only want it to affect my 000.TERM files; any other files it finds in the directory can stay. So if the directory contains any .TERM files with an integer higher than the first one found, then remove the .TERM files with the higher integers.
PS: .TERM is not an extension, just in case there is any confusion.
find /your/path -name "*.TERM" | sort -t. -k1 -n | tail -n +2 | xargs -r rm
Let's break it down:
find /your/path -name "*.TERM" will output a list of all .TERM files.
You could also use ls /your/path/*.TERM, but you may find the output unpredictable. (Example: your implementation may have -F on by default, which would cause every socket to end in a = in the list.)
sort sorts them by the first field (-k1) using a period as a separator (-t.). -n guarantees a numeric sort (such that 5 comes before 06).
tail -n +2 skips the first line and returns the rest
xargs rm sends every output line to an rm command, removing them. -r skips running the rm if there's no output piped in, but is listed as a GNU extension.
The script as above is fairly robust for your needs, but may fail if you have so many files in the directory that they don't fit on one command line, and might get you into trouble if any matching filenames somehow contain a newline.
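If those caveats matter (very many files, or names that might contain newlines), a hedged NUL-delimited translation of the same pipeline is possible with GNU tools (sort -z, tail -z, and xargs -0 are GNU extensions):
find /your/path -name "*.TERM" -print0 | sort -z -t. -k1 -n | tail -z -n +2 | xargs -0 -r rm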

Count files matching a pattern with GREP

I am on a Windows server, and have installed grep for Windows. I need to count the number of file names that match (or do not match) a specific pattern. I don't really need all the filenames listed out, I just need a total count of how many matched. The tree structure that I will be searching is fairly large, so I'd like to conserve as much processing as possible.
I'm not very familiar with grep, but it looks like I can use the -l option to search for file names matching a given pattern. So, for example, I could use
$grep -l -r this *.doc*
to search for all MS Word files in the current folder and all child folders. This would then return to me a listing of all those files. I don't want the listing, I just want a count of how many it found. Is this possible with grep... or another tool?
thanks!
On Linux you would use
grep -l -r this .doc | wc -l
to get the number of printed lines.
Note that passing .doc as the target of -r does not actually search all Word files; for that you would use --include "*doc" and search from . instead.
And if you do not have wc, you can use grep again, to count the number of matches:
grep -l -r --include "*doc" this . | grep -c .
