Find filenames that contain a hex value? - bash

I would like to fix a bad encoding in the names of thousands of files. The error is always the same: an unknown character that should be replaced with a French é.
$ find . -type f | grep 127427
./documents/1778_commande_127427_accus�_de_r�ception.pdf
$ find . -type f | grep 127427 | hexdump -C
00000000 2e 2f 64 6f 63 75 6d 65 6e 74 73 2f 31 37 37 38 |./documents/1778|
00000010 5f 63 6f 6d 6d 61 6e 64 65 5f 31 32 37 34 32 37 |_commande_127427|
00000020 5f 61 63 63 75 73 ef bf bd 5f 64 65 5f 72 ef bf |_accus..._de_r..|
00000030 bd 63 65 70 74 69 6f 6e 2e 70 64 66 0a |.ception.pdf.|
0000003d
So I am looking for ef bf bd, which does not look like a Unicode character. Unfortunately, searching for 0xef does not work:
$ find . -type f | grep -P '\xef'
(nothing)
Any clues?
Next I am planning to do something like:
$ find . -type f | grep <magic-here> | xargs -n1 -I{} sh -c 'mv "{}" $(echo "{}" | sed s/<magic-here>/é/) '

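ef bf bd is the UTF-8 encoding of U+FFFD, the Unicode replacement character, so it is valid Unicode. You can match it as a fixed byte string, like this: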
echo $'\x2e\x2f\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x2f\x31\x37\x37\x38\x5f\x63\x6f\x6d\x6d\x61\x6e\x64\x65\x5f\x31\x32\x37\x34\x32\x37\x5f\x61\x63\x63\x75\x73\xef\xbf\xbd\x5f\x64\x65\x5f\x72\xef\xbf\xbd\x63\x65\x70\x74\x69\x6f\x6e\x2e\x70\x64\x66\x0a'\
| grep -Fa $'\xef\xbf\xbd'
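-a treats binary input as text. -F performs a fixed-string search instead of a regular-expression match. $'...' is bash's ANSI-C quoting, which turns \xef\xbf\xbd into the three raw bytes. grep -P '\xef' probably finds nothing because, in a UTF-8 locale, \xef there refers to the character U+00EF rather than to the raw byte 0xef.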
To replace the sequence inside the files' contents, the find command should look like this:
find ... -exec sed $'s/\xef\xbf\xbd/é/g' {} +
When you are sure that it works, use -i, this will change files in place:
find ... -exec sed -i $'s/\xef\xbf\xbd/é/g' {} +
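The sed commands above rewrite file contents, while the question is about file names. A minimal bash sketch for renaming, assuming every ef bf bd in a name really stands for é; fix_names is a hypothetical helper name, and it is worth trying on a copy of the data first:

```shell
#!/usr/bin/env bash
# Sketch: rename files whose names contain the bytes EF BF BD (U+FFFD),
# replacing each occurrence with "é".
bad=$'\xef\xbf\xbd'

fix_names() {
    # $1 is the directory to scan (an assumed argument for this sketch).
    find "$1" -depth -type f -name "*${bad}*" -print0 |
    while IFS= read -r -d '' path; do
        dir=${path%/*}              # directory part, left untouched
        base=${path##*/}            # file name part, the only part rewritten
        new=${base//"$bad"/é}       # replace every occurrence in the name
        mv -n -- "$path" "$dir/$new"   # -n: never overwrite an existing file
    done
}
```

Usage would be something like fix_names ./documents. Only the last path component is changed, so directories whose names contain the sequence are not handled by this sketch.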

Related

bash substitution after glob not working?

I have run into strange behaviour with bash string substitution.
I expected the same substitution on $r1 and $var to yield exactly the same result, since both strings seem to have the same value.
But that is not the case, and I can't understand what I am missing.
Maybe it is because of the glob? I just don't know. I am not a pure IT person, so maybe it will be obvious to you.
(A Repl.it link is at the bottom.)
mkdir -p T21805
touch T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
r1=T21805/*R1*
echo $r1;
echo ${r1%%_S1*z}
var=T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
echo ${var%%_S1*z}
echo $r1| hexdump -C
echo $var | hexdump -C
Output:
echo $r1
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
echo ${r1%%_S1*z}
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
echo ${var%%_S1*z}
T21805/T21805_SI-GA-D8-BH25N7DSXY
echo $r1| hexdump -C
00000000 54 32 31 38 30 35 2f 54 32 31 38 30 35 5f 53 49 |T21805/T21805_SI|
00000010 2d 47 41 2d 44 38 2d 42 48 32 35 4e 37 44 53 58 |-GA-D8-BH25N7DSX|
00000020 59 5f 53 31 5f 4c 30 30 31 5f 52 31 5f 30 30 31 |Y_S1_L001_R1_001|
00000030 2e 66 61 73 74 71 2e 67 7a 0a |.fastq.gz.|
0000003a
echo $var | hexdump -C
00000000 54 32 31 38 30 35 2f 54 32 31 38 30 35 5f 53 49 |T21805/T21805_SI|
00000010 2d 47 41 2d 44 38 2d 42 48 32 35 4e 37 44 53 58 |-GA-D8-BH25N7DSX|
00000020 59 5f 53 31 5f 4c 30 30 31 5f 52 31 5f 30 30 31 |Y_S1_L001_R1_001|
00000030 2e 66 61 73 74 71 2e 67 7a 0a |.fastq.gz.|
0000003a
Repl.it
I am interested in understanding why this does not work; I can already get my desired output using sed, for example.
Glob expansion doesn't happen at assignment time.
$ mkdir -p T21805
$ touch T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
$ touch T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_002.fastq.gz
$ r1=T21805/*R1*
$ printf '%s\n' "$r1"
T21805/*R1*
$ printf '%s\n' $r1
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_002.fastq.gz
It happens after the unquoted $r1 has been expanded. When you write ${r1%%_S1*z}, the value of r1 is still the literal pattern T21805/*R1*, which does not contain _S1, so nothing is removed; only after the expansion is there a file name with _S1 in it that you could match against.
If you set an array, the assignment rules are different. The glob expands before the assignment, and so you can do your filtering on each element of the array.
$ r1=( T21805/*R1* )
$ printf '%s\n' "${r1[@]}"
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_002.fastq.gz
$ printf '%s\n' "${r1[@]%%_S1*z}"
T21805/T21805_SI-GA-D8-BH25N7DSXY
T21805/T21805_SI-GA-D8-BH25N7DSXY
I ran it after set -xv to see the contents of r1.
$ r1=T21805/*R1*
+ r1='T21805/*R1*'
$ var=T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
+ var=T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
The value of r1 in ${r1%%_S1*z} is T21805/*R1*.
That value does not contain anything matching _S1*z.
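A minimal sketch of the array approach for the single-file case, assuming exactly one file matches the glob; demo_dir and matches are names invented for this example:

```shell
#!/usr/bin/env bash
# Sketch: expand the glob first, then strip the suffix from the real file name.
demo_dir=$(mktemp -d)            # hypothetical scratch directory
mkdir -p "$demo_dir/T21805"
touch "$demo_dir/T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz"

matches=( "$demo_dir"/T21805/*R1* )   # the glob expands here, at array assignment
r1=${matches[0]}                      # first match, now a literal path
prefix=${r1%%_S1*z}                   # longest suffix matching _S1*z is removed

printf '%s\n' "$prefix"
```

Because r1 now holds an actual path rather than the pattern, the suffix removal behaves the same way it does for $var in the question.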

Append to a string in bash

I'm trying to get a download URL using curl and awk and want to append something to that URL afterwards.
Here is a snippet of my code:
IMAGE=$(curl -I -s https://downloads.raspberrypi.org/raspbian_lite_latest | awk '/Location/ {print $2}')
CHECKSUM="$IMAGE.sha256"
echo $IMAGE
echo $CHECKSUM
What I get is that the suffix somehow overwrites the beginning of the URL:
https://downloads.raspberrypi.org/raspbian_lite/images/raspbian_lite-2018-11-15/2018-11-13-raspbian-stretch-lite.zip
.sha256/downloads.raspberrypi.org/raspbian_lite/images/raspbian_lite-2018-11-15/2018-11-13-raspbian-stretch-lite.zip
I'm a bit helpless, because the following works as expected:
A="https""://abc.org/a_b/a.zip" # looks weird, but full URLs are not allowed here
B="$A.sha256"
echo $B
What am I doing wrong?
When you hexdump your string, you see that it uses Windows line endings (with a carriage return):
echo $IMAGE | hexdump -C
00000000 68 74 74 70 73 3a 2f 2f 64 6f 77 6e 6c 6f 61 64 |https://download|
00000010 73 2e 72 61 73 70 62 65 72 72 79 70 69 2e 6f 72 |s.raspberrypi.or|
00000020 67 2f 72 61 73 70 62 69 61 6e 5f 6c 69 74 65 2f |g/raspbian_lite/|
00000030 69 6d 61 67 65 73 2f 72 61 73 70 62 69 61 6e 5f |images/raspbian_|
00000040 6c 69 74 65 2d 32 30 31 38 2d 31 31 2d 31 35 2f |lite-2018-11-15/|
00000050 32 30 31 38 2d 31 31 2d 31 33 2d 72 61 73 70 62 |2018-11-13-raspb|
00000060 69 61 6e 2d 73 74 72 65 74 63 68 2d 6c 69 74 65 |ian-stretch-lite|
00000070 2e 7a 69 70 0d 0a |.zip..|
00000076
To fix that, use
IMAGE=$(curl -I -s https://downloads.raspberrypi.org/raspbian_lite_latest | awk '/Location/ {print $2}' | tr -d "\r")
The problem apparently is that your $IMAGE ends in a trailing '\r' (carriage return). So you have actually appended ".sha256" as you expected, giving "something\r.sha256", which when echoed means: something, cursor back to the beginning of the line, .sha256. Long story short, strip that '\r'. E.g.:
IMAGE=$(curl -I -s https://downloads.raspberrypi.org/raspbian_lite_latest | awk '/Location/ {sub(/\r$/, "", $2); print $2}')
Since you are using bash, you can use substring replacement, i.e. remove the \r from the IMAGE variable:
$ CHECKSUM="${IMAGE/$'\r'/}.sha256"
$ echo $CHECKSUM
https://downloads.raspberrypi.org/raspbian_lite/images/raspbian_lite-2018-11-15/2018-11-13-raspbian-stretch-lite.zip.sha256
or prepare for it in the awk part by setting the record separator RS:
... | awk -v RS="\r?\n" '/Location/ {print $2}'
Tested with gawk, mawk and original-awk. Surprisingly busybox awk removed it by itself:
$ echo -e \\r | busybox awk '{print $1}' | hexdump -C
00000000 0a |.|
but for example:
$ echo -e \\r | gawk '{print $1}' | hexdump -C
00000000 0d 0a |..|
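The effect can be reproduced without a network call. A minimal sketch that simulates a CRLF-terminated header line instead of calling curl; the example.org URL and the variable names are placeholders, not the real download location:

```shell
#!/usr/bin/env bash
# Sketch: simulate one CRLF-terminated header line (placeholder URL).
header=$'Location: https://example.org/a.zip\r'

image=$(printf '%s\n' "$header" | awk '/Location/ {print $2}')
clean=${image%$'\r'}          # drop one trailing carriage return, if present
checksum="$clean.sha256"

printf '%s\n' "$checksum"     # prints https://example.org/a.zip.sha256
```

The ${image%$'\r'} form only removes a carriage return at the very end of the value, so it is harmless when the awk in use has already dropped it.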

New to awk and sed, How could I improve this? Multiple sed and awk commands

This is the script I've constructed
It takes a list of files according to the extension supplied as an argument.
It then removes everything before the pattern 00000000: in those files.
The pattern 00000000: is preceded by the string <pre>, so it then removes those first five characters.
The script then removes the last three lines of the file.
The script then outputs only the hexdump data of the file.
The script runs xxd to convert the hexdump to a file.jpg.
if [[ $# -eq 0 ]] ; then
echo 'Run script as ./hexconv ext'
exit 0
fi
for file in *.$1
do
filename=$(basename $file)
extension="${filename##*.}"
filename="${filename%.*}"
sed -n '/00000000:/,$p' $file | sed '1s/^.....//' | head -n -3 | awk '{print $2" "$3" "$4" "$5" "$6" "$7" "$8" "$9" "$10" "$11" "$12" "$13" "$14" "$15" "$16" "$17}' | xxd -p -r > $filename.jpg
done
It works as I want it to, but I suspect there are things that could be improved; alas, I am a novice in the use of awk and sed.
Excerpt from file
<th>response-head:</th>
<td>HTTP/1.1 200 OK
Date: Sun, 15 Dec 2013 04:27:04 GMT
Server: PWS/8.0.18
X-Px: ms h0-s34.p6-lhr ( h0-s35.p6-lhr), ht-d h0-s35.p6-lhr.cdngp.net
Etag: "4556354-9fbf8-4e40387aadfc0"
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0, max-age=0
Accept-Ranges: bytes
Content-Length: 654328
Content-Type: image/jpeg
Last-Modified: Thu, 15 Aug 2013 21:55:19 GMT
Pragma: no-cache
</td>
</tr>
</table>
<hr/>
<pre>00000000: ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 ......JFIF.....H
00000010: 00 48 00 00 ff e1 00 18 45 78 69 66 00 00 49 49 .H......Exif..II
00000020: 2a 00 08 00 00 00 00 00 00 00 00 00 00 00 ff ed *...............
00000030: 00 48 50 68 74 73 68 70 20 33 2e 30 00 .HPhotoshop 3.0.
00000040: 38 42 49 4d 04 04 00 00 00 00 00 1c 01 5a 00 8BIM..........Z.
00000050: 03 1b 25 47 1c 02 00 00 02 00 02 00 38 42 49 4d ..%G........8BIM
00000060: 04 25 00 00 00 00 00 10 fc e1 89 c8 b7 c9 78 .%.............x
00000070: 34 62 34 07 58 77 eb ff e1 03 a5 68 74 74 70 /4b4.Xw.....http
00000080: 3a 6e 73 2e 61 64 62 65 2e 63 6d ://ns.adobe.com/
00000090: 78 61 70 31 2e 30 00 3c 78 70 61 63 6b xap/1.0/.<?xpack
000000a0: 65 74 20 62 65 67 69 6e 3d 22 ef bb bf 22 20 69 et begin="..." i
000000b0: 64 3d 22 57 35 4d 30 4d 70 43 65 68 69 48 7a 72 d="W5M0MpCehiHzr
000000c0: 65 53 7a 4e 54 63 7a 6b 63 39 64 22 3e 20 3c eSzNTczkc9d"?> <
000000d0: 78 3a 78 6d 70 6d 65 74 61 20 78 6d 6c 6e 73 3a x:xmpmeta xmlns:
000000e0: 78 3d 22 61 64 62 65 3a 6e 73 3a 6d 65 74 61 x="adobe:ns:meta
000000f0: 22 20 78 3a 78 6d 70 74 6b 3d 22 41 64 62 /" x:xmptk="Adob
00000100: 65 20 58 4d 50 20 43 72 65 20 35 2e 30 2d 63 e XMP Core 5.0-c
00000110: 30 36 31 20 36 34 2e 31 34 30 39 34 39 2c 20 32 061 64.140949, 2
00000120: 30 31 30 31 32 30 37 2d 31 30 3a 35 37 3a 010/12/07-10:57:
Although @CodeGnome is right and this might belong on Code Review SE, here you go anyway:
Slightly more efficient to combine the multiple sed commands into one, for example:
sed -n -e 's/^<pre>//' -e '/00000000:/,$p'
I decided to retract this part, as I'm not all that sure it's any better or clearer. Your version is fine, except that s/^<pre>// is better than s/^.....//.
Use exit 1 when checking the number of arguments to signal an error
What is for file in *. there? Iterate for all files ending with a dot? Typo?
Unless you're 100% sure the filenames will never contain spaces, you should quote them, but don't quote where you don't need, for example:
filename=$(basename "$file") # need to quote
extension=${filename##*.} # no need,
filename=${filename%.*} # no need
sed ... "$file" # need to quote
... | xxd > "$filename".jpg # need to quote
The last awk could be shorter and less error prone as a loop:
... | awk '{printf "%s", $2; for (i=3; i<=17; ++i) printf " %s", $i; print ""}'
It seems you want to learn. You might be interested in this other answer too: What are the rules to write robust shell scripts?
The error message should be sent to stderr, should not hard-code the name of the script in case you rename it later, and should exit with a nonzero value.
if (( ! $# )); then
echo >&2 "Run script as '$0' \$extension"
exit 1
fi
If you're going to put the then on the same line as the if, then you should put the do on the same line as the for, too, for consistency:
for file in *.$1; do
Using file for the full name and filename for the basename is confusing variable name choice. I would use basename for the variable, to match the operation. And you need to quote the parameter expansion:
basename=$(basename "$file")
But you don't need to quote the right hand side of an assignment:
extension=${basename##*.}
The part of a filename without the extension is sometimes called the root (in vi and csh :-modifiers, you get it with :r)... using that name would be less confusing than changing an existing variable and reusing it:
root=${basename%.*}
As far as the actual pipeline, I would reorder it to put the head before the awk, since the sed and the head are all about what lines to print out and should be grouped together before the awk which modifies those selected lines. I would also use a loop and printf to make the awk a little more wieldy:
sed -n '/0\{8\}:/,$p' "$file" |
head -n -3 |
awk '{ printf "%s", $2; for (f=3;f<=17;++f) { printf " %s", $f }; print "" }' |
xxd -p -r > "$root.jpg"
done
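Put together, the suggestions above might look like the following sketch. It assumes GNU head (for head -n -3), that every dump line carries a full 16 bytes, and that xxd is installed; extract_hex and convert_all are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch combining the suggestions above.
extract_hex() {
    # Print only the 16 hex byte columns of the dump embedded in "$1",
    # starting at offset 00000000 and dropping the last three lines.
    sed -n '/00000000:/,$p' "$1" |
        head -n -3 |
        awk '{ printf "%s", $2; for (f = 3; f <= 17; ++f) printf " %s", $f; print "" }'
}

convert_all() {
    # $1 is the extension of the dump files; output goes to <root>.jpg.
    local file root
    for file in *."$1"; do
        [ -e "$file" ] || continue      # skip the literal pattern when nothing matches
        root=${file%.*}
        extract_hex "$file" | xxd -p -r > "$root.jpg"
    done
}
```

Splitting the text extraction from the xxd step makes it possible to inspect the hex columns on their own before any binary file is written.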

Append to end of each line with sed on Windows?

I am using sed on Windows (the GNU port).
I execute:
$> sed "s/$/./" < /data.txt
And get:
.ne
.wo
.hree
But I expect:
one.
two.
three.
The following works though I don't think it should. The way I read it is "replace the last character of the line with a period." I'm afraid it won't work consistently when used elsewhere. The intent isn't to replace the last character with a period but to append a period.
$> sed "s/.$/./" < /data.txt
I am not sure if the file encoding or something specific to windows is causing the issues I'm having or if it's just lack of experience with sed. Ideas?
hexdump -C sheds some light:
$ sed 's/$/./' < t.dos | hexdump -C
00000000 6f 6e 65 0d 2e 0a 74 77 6f 0d 2e 0a 74 68 72 65 |one...two...thre|
00000010 65 0d 2e 0a |e...|
00000014
There, 2e is the dot, the 0d before it is the carriage return (\r), and after that comes the newline (\n). In other words, sed treats \n alone as the end of the line rather than \r\n together, so \r is still part of the line; sed puts the dot after it and then adds back the newline as usual.
I think this does what you want, but it's not exactly pretty:
$ sed 's/.$/.\r/' < t.dos | hexdump -C
00000000 6f 6e 65 2e 0d 0a 74 77 6f 2e 0d 0a 74 68 72 65 |one...two...thre|
00000010 65 2e 0d 0a |e...|
00000014
The above is not so good, because it will only work if the input is in dos format, otherwise it will break the file. A better solution might be to first strip any \r and add them back manually later, like this:
$ tr -d '\r' < t.dos | sed -e 's/$/.\r/' | hexdump -C
00000000 6f 6e 65 2e 0d 0a 74 77 6f 2e 0d 0a 74 68 72 65 |one...two...thre|
00000010 65 2e 0d 0a |e...|
00000014
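Another option is to leave the line endings as they are and insert the dot before an optional carriage return. A minimal sketch, assuming GNU sed (which understands \r in a regular expression); append_dot is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch (GNU sed assumed): put "." before an optional trailing CR.
append_dot() {
    # & is whatever \r?$ matched: the CR on DOS lines, nothing on Unix lines.
    sed -E 's/\r?$/.&/'
}

# One DOS-style line and one Unix-style line; od -c makes the CR visible.
printf 'one\r\ntwo\n' | append_dot | od -c
```

Since the matched text is put back unchanged, DOS input stays DOS and Unix input stays Unix.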

How to split on NULs in shell [duplicate]

This question already has answers here:
Capturing output of find . -print0 into a bash array
I am using zsh as a shell.
I would like to execute the unix find command and put the result into a shell array variable, something like:
FILES=($(find . -name '*.bak'))
so that I can iterate over the values with something like
for F in "$FILES[@]"; do echo "<<$F>>"; done
However, my filenames contain spaces at least, and perhaps other funky characters, so the above doesn't work. What does work is:
IFS=$(echo -n -e "\0"); FILES=($(find . -name '*.bak' -print0)); unset IFS
but that's fugly. This is already a bit beyond my comfort limit with zsh syntax, so I'm hoping someone can point me to some basic feature that I never knew about but should.
I tend to use read for that. A quick Google search showed me that zsh also seems to support it:
find . -name '*.bak' | while IFS= read -r file; do echo "<<$file>>"; done
That doesn't split on zero bytes, but it works with file names containing whitespace other than newlines. If the file names go at the very end of the command to be executed, you can use xargs, which also works with newlines in file names:
find . -name '*.bak' -print0 | xargs -0 cp -t /tmp/dst
copies all files found into the directory /tmp/dst. The downside of the xargs approach is, of course, that you don't have the file names in a variable, so it is not always applicable.
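For comparison, in bash the read loop can be made NUL-delimited so that the names do end up in an array. This is a bash sketch, not zsh; whether zsh's read accepts -d '' in the same way is an assumption that has not been checked here. collect_bak is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: collect NUL-delimited find output into an array (bash syntax).
collect_bak() {
    # Fills the global array FILES with every *.bak path under "$1".
    FILES=()
    local f
    while IFS= read -r -d '' f; do
        FILES+=( "$f" )
    done < <(find "$1" -name '*.bak' -print0)
}
```

After collect_bak . the names can be iterated with for F in "${FILES[@]}"; do echo "<<$F>>"; done. The process substitution keeps the loop in the current shell, so FILES is still set when the loop ends.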
The only way I figured out is using eval:
(zyx:~/tmp) % F="$(find . -maxdepth 1 -name '* *' -print0)"
(zyx:~/tmp) % echo $F | hexdump -C
00000000 2e 2f 20 10 30 00 2e 2f d0 96 d1 83 d1 80 d0 bd |./ .0../........|
00000010 d0 b0 d0 bb 20 c2 ab d0 a1 d0 b0 d0 bc d0 b8 d0 |.... ...........|
00000020 b7 d0 b4 d0 b0 d1 82 c2 bb 2e d0 9c d0 b8 d1 82 |................|
00000030 d1 8e d0 b3 d0 b8 d0 bd d0 b0 20 d0 9e d0 bb d1 |.......... .....|
00000040 8c d0 b3 d0 b0 2e 20 d0 93 d1 80 d0 b0 d0 bd d0 |...... .........|
00000050 b8 20 d0 be d1 82 d1 80 d0 b0 d0 b6 d0 b5 d0 bd |. ..............|
00000060 d0 b8 d0 b9 2e 68 74 6d 6c 00 2e 2f 0a 20 0a 6e |.....html../. .n|
00000070 00 2e 2f 20 7b 5b 5d 7d 20 28 29 21 26 7c 00 2e |../ {[]} ()!&|..|
00000080 2f 74 65 73 74 32 20 2e 00 2e 2f 74 65 73 74 33 |/test2 .../test3|
00000090 20 2e 00 2e 2f 74 74 74 0a 74 74 0a 0a 74 20 00 | .../ttt.tt..t .|
000000a0 2e 2f 74 65 73 74 20 2e 00 2e 2f 74 74 0a 74 74 |./test .../tt.tt|
000000b0 0a 74 20 00 2e 2f 74 74 5c 20 74 0a 74 74 74 00 |.t ../tt\ t.ttt.|
000000c0 2e 2f 0a 20 0a 0a 00 2e 2f 7b 5c 5b 5d 7d 20 28 |./. ..../{\[]} (|
000000d0 29 21 26 7c 00 0a |)!&|..|
000000d6
(zyx:~/tmp) % echo $F[1]
.
(zyx:~/tmp) % eval 'F=( ${(s.'$'\0''.)F} )'
(zyx:~/tmp) % echo $F[1]
./ 0
${(s.\0.)...} and ${(s.$'\0'.)...} do not work.
You can use a function:
function SplitAt0()
{
local -r VAR=$1
shift
local -a CMD
set -A CMD $@
local -r CMDRESULT="$($CMD)"
eval "$VAR="'( ${(s.'$'\0''.)CMDRESULT} )'
}
Usage: SplitAt0 varname command [arguments]
I should have used ${(ps.\0.)F}, not ${(s.\0.)F}:
% F=${(ps.\0.)"$(find . -maxdepth 1 -name '* *' -print0)"}
% echo $F[1]
./ 0
Alternatively, since you're using zsh, you can use zsh's extended globbing syntax to find files rather than using the find command. As far as I know, all the functionality of find is present in the globbing syntax, and it handles filenames with whitespaces properly. See the zshexpn(1) manpage for more info. If you use zsh on a fulltime basis, it's well worth learning the syntax.
I've tried
F=(*)
and it handles even the files with newlines in the filename.

Resources