Script which will move non-ASCII files - bash

I need help. I should write a script which will move all non-ASCII files from one directory to another. I got this code, but I don't know why it is not working.
#!/bin/bash
for file in "/home/osboxes/Parkhom"/*
do
    if [ -eq "$( echo "$(file $file)" | grep -nP '[\x80-\xFF]' )" ];
    then
        if test -e "$1"; then
            mv $file $1
        fi
    fi
done
exit 0

It's not clear which one you are after, but:
• To test if the variable $file contains a non-ASCII character, you can do:
if [[ $file == *[^[:ascii:]]* ]]; then
• To test if the file $file contains a non-ASCII character, you can do:
if grep -qP '[^[:ascii:]]' "$file"; then
So for example your code would look like:
for file in "/some/path"/*; do
    if grep -qP '[^[:ascii:]]' "$file"; then
        test -d "$1" && mv "$file" "$1"
    fi
done

The first problem is that your first if statement has an invalid test clause. The -eq operator of [ is a binary integer comparison: it needs one argument before and one after, and your left-hand argument is missing.
The second problem is that I think the echo is redundant.
The third problem is that the file command always produces plain ASCII output, but you're searching that output for non-ASCII bytes, which you'll never see.
Using file is pretty smart for this application, although there are two ways you can go on it: file says a variety of things, and what you're interested in are data and ASCII, but not all files that don't identify as data are ASCII, and not all files that don't identify as ASCII are data. You might be better off going with the original idea of using grep, unless you need to support Unicode files. Your grep looks a bit strange to me, so I don't know what your environment is, but I might try this:
#!/bin/bash
for file in "/home/osboxes/Parkhom"/*
do
    if grep -qP '[\x80-\xFF]' "$file"; then
        [ -e "$1" ] && mv "$file" "$1"
    fi
done
The -q option means be quiet: only return an exit status, don't show the matches. (It might be -s in your grep.) The exit status is tested directly by the if statement (no need for [ or test). The && in the next line is just a quick way of saying: if the left-hand side succeeds, then execute the right-hand side. You could also write this as an if statement if you find that clearer. [ is a synonym for test. Personally, since $1 is a directory and doesn't change, I'd check it once at the beginning of the script instead of on each file; it would be faster.
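A minimal sketch with that up-front check, assuming $1 must name an existing directory:

#!/bin/bash
# validate the destination once, before the loop
[ -d "$1" ] || { echo "usage: $0 destination-directory" >&2; exit 1; }
for file in "/home/osboxes/Parkhom"/*
do
    # move any file containing a byte outside the ASCII range
    if grep -qP '[\x80-\xFF]' "$file"; then
        mv "$file" "$1"
    fi
done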

If you mean you want to know whether something is not a plain text file, then you can use the file command, which returns information about the type of a file.
[[ ! $( file -b "$file" ) =~ (^| )text($| ) ]]
The -b simply tells it not to bother returning the filename.
The returned value will be something like:
ASCII text
HTML document text
POSIX shell script text executable
PNG image data, 21 x 34, 8-bit/color RGBA, non-interlaced
gzip compressed data, from Unix, last modified: Mon Oct 31 14:29:59 2016
The regular expression checks whether the returned file information includes the word "text", which is present for all plain-text file types.
You can instead filter for specific file types like "ASCII text" if that is all you need.
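Putting it together, a sketch of a loop that moves everything not identified as text (source and destination paths hypothetical):

for file in "/some/path"/*; do
    # move anything whose file(1) description lacks the word "text"
    if [[ ! $( file -b "$file" ) =~ (^| )text($| ) ]]; then
        mv -- "$file" "/some/destination"
    fi
done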

Related

Bash script MV is disappearing files

I've written a script to go through all the files in the directory the script is located in, identify if a file name contains a certain string and then modify the filename. When I run this script, the files that are supposed to be modified are disappearing. It appears my usage of the mv command is incorrect and the files are likely going to an unknown directory.
#!/bin/bash
string_contains="dummy_axial_y_position"
string_dontwant="dummy_axial_y_position_time"
file_extension=".csv"
for FILE in *
do
    if [[ "$FILE" == *"$string_contains"* ]]; then
        if [[ "$FILE" != *"$string_dontwant"* ]]; then
            filename= echo $FILE | head -c 15
            combined_name="$filename$file_extension"
            echo $combined_name
            mv $FILE $combined_name
            echo $FILE
        fi
    fi
done
I've done my best to go through the possible errors I've made with the mv command, but I haven't had any success so far.
There are a couple of problems and several places where your script can be improved.
filename= echo $FILE | head -c 15
This pipeline runs echo $FILE with the variable filename set to the null string in its environment. That value is visible only to the echo command; the variable is not set in the current shell, and echo does not care about it anyway.
You probably want to capture the output of echo $FILE | head -c 15 into the variable filename, but this is not the way to do it.
You need to use command substitution for this purpose:
filename=$(echo $FILE | head -c 15)
head -c 15 outputs only the first 15 bytes of the input (they can span multiple lines, but that does not happen here). head is not the most appropriate tool for this; use cut -c-15 instead.
But for what you need (extract the first 15 characters of the value stored in the variable $FILE), there is a much simpler way; use a form of parameter expansion called "substring expansion":
filename=${FILE:0:15}
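For example (filename hypothetical):

FILE="dummy_axial_y_position_01.csv"
echo "${FILE:0:15}"    # prints: dummy_axial_y_p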
mv $FILE $combined_name
Before running mv, the variables $FILE and $combined_name are expanded (this is called "parameter expansion"). This means that the variables are replaced by their values.
For example, if the value of FILE is abc def and the value of combined_name is mnp opq, the line above becomes:
mv abc def mnp opq
The mv command receives 4 arguments and it attempts to move the files denoted by the first three arguments into the directory denoted by the fourth argument (and it probably fails).
In order to keep the values of the variables as single words (if they contain spaces), always enclose them in double quotes. The correct command is:
mv "$FILE" "$combined_name"
This way, in the example above, the command becomes:
mv "abc def" "mnp opq"
... and mv is invoked with two arguments: abc def and mnp opq.
combined_name="$filename$file_extension"
There isn't any problem in this line. The quotes are simply not needed.
The variables filename and file_extension are expanded (replaced by their values), but word splitting is not applied to assignments. The value resulting from the expansion is assigned to the variable combined_name, even if it contains spaces or other word-separator characters (spaces, tabs, newlines).
The quotes are also not needed here because the values do not contain spaces or other characters that are special on the command line. They must be quoted if they contain such characters.
string_contains="dummy_axial_y_position"
string_dontwant="dummy_axial_y_position_time"
file_extension=".csv"
It is not incorrect to quote the values, though.
for FILE in *
do
if [[ "$FILE" == *"$string_contains"* ]];then
if [[ "$FILE" != *"$string_dontwant"* ]]; then
This is also not wrong but it is inefficient.
You can use the expression from the if condition directly in the for statement (and get rid of the if statement):
for FILE in *"$string_contains"*; do
if [[ "$FILE" != *"$string_dontwant"* ]]; then
...
If you have read and understood the above (and some of the linked documentation), you will be able to figure out for yourself where your files were moved :-)
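For reference, a sketch of the loop with all of the above applied:

#!/bin/bash
string_contains="dummy_axial_y_position"
string_dontwant="dummy_axial_y_position_time"
file_extension=".csv"
for FILE in *"$string_contains"*; do
    if [[ "$FILE" != *"$string_dontwant"* ]]; then
        filename=${FILE:0:15}                    # first 15 characters of the name
        mv -- "$FILE" "$filename$file_extension"
    fi
done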

Rename every unicode file of a linux directory in ascii

I am trying to rename all Unicode file names to ASCII.
I wanted to do something like this :
for file in `ls | egrep -v ^[a-z0-9\._-]+$`; do mv "$file" $(echo "$file" | slugify); done
But it doesn't work yet.
First, the regexp ^[a-z0-9\._-]+$ doesn't seem to be enough.
Second, slugify also transforms the extension of the file, so I have to cut the extension off first and put it back afterwards.
Any idea of a way to do that?
First thing first, don't parse the output of ls. That is, in general, a bad idea, especially if you're expecting files that have any sort of strange characters in their names.
Assuming slugify does what you want with filenames in general, try:
for file in * ; do
    if [ -f "$file" ] ; then
        ext=${file##*.}
        name=${file%.*}
        new_name=$(echo "$name" | slugify)
        if [[ $name != "$new_name" ]] ; then
            # echo makes this a dry run; remove it once the output looks right
            echo mv -v "$name.$ext" "$new_name.$ext"
        fi
    fi
done
Warning: this will fail if you have files without an extension (it'll double-up the filename). See this other answer by Doctor J if you need to handle that.
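If you'd rather not follow the link, here is a sketch of one way to handle extensionless names, assuming slugify reads stdin as above:

for file in * ; do
    if [ -f "$file" ] ; then
        if [[ $file == *.* ]] ; then
            ext=".${file##*.}"
            name=${file%.*}
        else
            ext=""               # no extension: slugify the whole name
            name=$file
        fi
        new_name=$(echo "$name" | slugify)
        if [[ $name != "$new_name" ]] ; then
            echo mv -v "$name$ext" "$new_name$ext"
        fi
    fi
done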

In a small script to monitor a folder for new files, the script seems to be finding the wrong files

I'm using this script to monitor the downloads folder for new .bin files being created. However, it doesn't seem to be working. If I remove the grep, I can make it copy any file created in the Downloads folder, but with the grep it's not working. I suspect the problem is how I'm trying to compare the two values, but I'm really not sure what to do.
#!/bin/sh
downloadDir="$HOME/Downloads/"
mbedDir="/media/mbed"
inotifywait -m --format %f -e create $downloadDir -q | \
while read line; do
    if [ $(ls $downloadDir -a1 | grep '[^.].*bin' | head -1) == $line ]; then
        cp "$downloadDir/$line" "$mbedDir/$line"
    fi
done
The ls $downloadDir -a1 | grep '[^.].*bin' | head -1 is the wrong way to go about this. To see why, suppose you had files named a.txt and b.bin in the download directory, and then c.bin was added. inotifywait would print c.bin; ls would print a.txt\nb.bin\nc.bin (with actual newlines, not \n); grep would thin that down to b.bin\nc.bin; and head would remove all but the first line, leaving b.bin, which would not match c.bin. You need to check $line itself to see if it ends in .bin, not scan a directory listing. I'll give you three ways to do this:
First option, use grep to check $line, not the listing:
if echo "$line" | grep -q '[.]bin$'; then
Note that I'm using the -q option to suppress grep's output, instead simply letting the if command check its exit status (success if it found a match, failure if not). Also, the RE is anchored to the end of the line, and the period is in brackets so it'll only match an actual period (normally, . in a regular expression matches any single character). \.bin$ would also work here.
Second option, use the shell's ability to edit variable contents to see if $line ends in .bin:
if [ "${line%.bin}" != "$line" ]; then
The "${line%.bin}" part gives the value of $line with .bin trimmed from the end if it's there. If that's not the same as $line itself, then $line must have ended with .bin.
Third option, use bash's [[ ]] expression to do pattern matching directly:
if [[ "$line" == *.bin ]]; then
This is (IMHO) the simplest and clearest of the bunch, but it only works in bash (i.e. you must start the script with #!/bin/bash).
Other notes: to avoid some possible issues with whitespace and backslashes in filenames, use while IFS= read -r line; do and follow #shellter's recommendation about double-quotes religiously.
Also, I'm not very familiar with inotifywait, but AIUI its -e create option will notify you when the file is created, not when its contents are fully written out. Depending on the timing, you may wind up copying partially-written files.
Finally, you don't have any checking for duplicate filenames. What should happen if you download a file named foo.bin, it gets copied, you delete the original, and then download a different file named foo.bin? As the script is now, it'll silently overwrite the first foo.bin. If this isn't what you want, you should add something like:
if [ ! -e "$mbedDir/$line" ]; then
    cp "$downloadDir/$line" "$mbedDir/$line"
elif ! cmp -s "$downloadDir/$line" "$mbedDir/$line"; then
    echo "Eeek, a duplicate filename!" >&2
    # or possibly something more constructive than that...
fi
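Putting the third option and the read fixes together, a sketch of the rewritten loop (same paths as the question):

#!/bin/bash
downloadDir="$HOME/Downloads/"
mbedDir="/media/mbed"
inotifywait -m --format %f -e create "$downloadDir" -q | \
while IFS= read -r line; do
    # copy only files whose names end in .bin
    if [[ "$line" == *.bin ]]; then
        cp "$downloadDir/$line" "$mbedDir/$line"
    fi
done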

Renaming a file extension without specifying

I am creating a bash shell script that will rename a file's extension without having to specify the old extension. If I enter "change foo *" in the Terminal on Linux, it should change every file's extension to foo.
So let's say I've got four files: "file1.txt", "file2.txt.txt", "file3.txt.txt.txt" and "file4."
When I run the command, the files should look like this: "file1.foo", "file2.txt.foo", "file3.txt.txt.foo" and "file4.foo"
Can someone look at my code and correct it? I would also appreciate it if someone could implement this for me.
#!/bin/bash
shift
ext=$1
for file in "$@"
do
    cut=`echo $FILE | sed -n '/^[a-Z0-9]*\./p'`
    if test "${cut}X" == 'X'; then
        new="$file.$ext"
    else
        new=`echo $file | sed "s/\(.*\)\..*/\1.$ext/"`
    fi
    mv $file $new
done
exit
Always use double quotes around variable substitutions, e.g. echo "$FILE" and not echo $FILE. Without double quotes, the shell performs word splitting and expands glob characters (\[*?) in the value of the variable. (There are cases where you don't need the quotes, and sometimes you do want word splitting, but that's for a future lesson.)
I'm not sure what you're trying to do with sed, but whatever it is, I'm sure it's doable in the shell.
To check if $FILE contains a dot: case "$FILE" in *.*) echo yes;; *) echo no;; esac
To strip the last extension from $FILE: ${FILE%.*}. For example, if $FILE is file1.txt.foo, this produces file1.txt. More generally, ${FILE%SOME_PATTERN} expands to $FILE with the shortest suffix matching SOME_PATTERN stripped off. If there is no matching suffix, it expands to $FILE unchanged. The variant ${FILE%%SOME_PATTERN} strips the longest suffix. Similarly, ${FILE#SOME_PATTERN} and ${FILE##SOME_PATTERN} strip a prefix.
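For example:

FILE="archive.tar.gz"
echo "${FILE%.*}"     # archive.tar   (shortest matching suffix stripped)
echo "${FILE%%.*}"    # archive       (longest matching suffix stripped)
echo "${FILE#*.}"     # tar.gz        (shortest matching prefix stripped)
echo "${FILE##*.}"    # gz            (longest matching prefix stripped)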
test "${TEMP}X" == 'X' is weird. This looks like a misremembered trick from the old days. The normal way of writing this is [ "$TEMP" = "" ] or [ -z "$TEMP" ]. Using == instead of = is a bash extension. There used to be buggy shells that might parse the command incorrectly if $TEMP looked like an operator, but these have gone the way of the dinosaur, and even then, the X needs to be at the beginning, because the problematic operators begin with a -: [ "X$TEMP" == "X" ].
If a file name begins with a -, mv will think it's an option. Use -- to say “that's it, no more options, whatever follows is an operand”: mv -- "$FILE" "$NEW_FILE".
This is very minor, but a common (not universal) convention is to use capital letters for environment variables and lowercase letters for internal script variables.
Since you're using only standard shell features, you can start the script with #!/bin/sh (but #!/bin/bash works too, of course).
exit at the end of the script is useless.
Applying all of these, here's the resulting script.
#!/bin/sh
ext="$1"; shift
for file in "$@"; do
    base="${file%.*}"
    mv -- "$file" "$base.$ext"
done
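With the script saved as change and made executable, it is invoked as in the question, e.g.:

./change foo *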
Not exactly what you are asking about, but have a look at the perl rename utility. Very powerful! man rename is a good start.
Use: for file in *.gif; do mv "$file" "${file%.gif}.jpg"; done
Or see How to rename multiple files
For me this worked:
for FILE in *                      # glob; don't parse the output of ls
do
    NEW_FILE=${FILE%.*}            # name with the last extension stripped
    NEW_FILE=${NEW_FILE}${EXT}     # EXT is assumed to be set beforehand
done
I just want to point out NEW_FILE=${FILE%.*}.
Here NEW_FILE gets the file name without its last extension; you can use it as you want.
I tested in bash with uname -a = "Linux 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux"

How do I determine file encoding in OS X?

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.
Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "#" by the file listing:
-rw-r--r--# 1 me users 2021 Feb 11 18:05 my_file.tex
(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)
I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.
Using the -I (that's a capital i) option on the file command seems to show the file encoding.
file -I {filename}
In Mac OS X the command file -I (capital i) will give you the proper character set so long as the file you are testing contains characters outside of the basic ASCII range.
For instance, if you go into Terminal and use vi to create a file, e.g. vi test.txt,
then insert some characters including an accented character (try ALT-e followed by e),
then save the file.
Then type file -I test.txt and you should get a result like this:
test.txt: text/plain; charset=utf-8
The # means that the file has extended file attributes associated with it. You can query them using the getxattr() function.
There's no definite way to detect the encoding of a file. Read this answer, it explains why.
There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.
vim -c 'execute "silent !echo " . &fileencoding | q' {filename}
aliased somewhere in my bash configuration as
alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
so I just type
vic {filename}
On my vanilla OSX Yosemite, it yields more precise results than "file -I":
$ file -I pdfs/udocument0.pdf
pdfs/udocument0.pdf: application/pdf; charset=binary
$ vic pdfs/udocument0.pdf
latin1
$
$ file -I pdfs/t0.pdf
pdfs/t0.pdf: application/pdf; charset=us-ascii
$ vic pdfs/t0.pdf
utf-8
You can also convert from one encoding to another using the following command:
iconv -f original_charset -t new_charset originalfile > newfile
e.g.
iconv -f utf-16le -t utf-8 file1.txt > file2.txt
Just use:
file -I <filename>
That's it.
Using the file command with the --mime-encoding option (e.g. file --mime-encoding some_file.txt) instead of the -I option works on OS X and has the added benefit of omitting the mime type, "text/plain", which you probably don't care about.
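For example, with the file from the question (assuming it really is UTF-8 as TextMate claims):

$ file --mime-encoding my_file.tex
my_file.tex: utf-8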
Classic 8-bit LaTeX is very restricted in which UTF8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.
Since you don't give a specific example, it's hard to know exactly where the problem is — whether you're attempting to use a glyph that your font doesn't have or whether you're not using the correct font encoding in the first place.
Here's a minimal example showing how a few UTF8 characters can be used in a LaTeX document:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\begin{document}
‘Héllø—thêrè.’
\end{document}
You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.
The # sign means the file has extended attributes. xattr file shows what attributes it has, xattr -l file shows the attribute values too (which can be large sometimes — try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).
Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of file using a series of algorithms and magic numbers. It's fairly useful but don't rely on it providing concrete or reliable information.
A Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.
Synalyze It! allows you to compare text or bytes in all encodings the ICU library offers. Using that feature, you usually see immediately which code page makes sense for your data.
You can try loading the file into a Firefox window, then go to View - Character Encoding. There should be a check mark next to the file's encoding type.
I implemented the bash script below, it works for me.
It first tries to iconv from the encoding returned by file --mime-encoding to utf-8.
If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.
If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.
Tested on Darwin 15.6.0.
#!/bin/bash
if [[ $# -lt 1 ]]
then
    echo "ERROR: need one input argument: file of which the encoding is to be detected."
    exit 3
fi
if [ ! -e "$1" ]
then
    echo "ERROR: cannot find file '$1'"
    exit 3
fi
if [[ $# -ge 2 ]]
then
    MAX_DIFF_LINES=$2
else
    MAX_DIFF_LINES=10
fi
# try the easy way
ENCOD=$(file --mime-encoding "$1" | awk '{print $2}')
# check if this encoding is valid
iconv -f "$ENCOD" -t utf-8 "$1" &> /dev/null
if [ $? -eq 0 ]
then
    echo "$ENCOD"
    exit 0
fi
# hard way: need the user to visually check the difference between the original and re-encoded files
for i in $(iconv -l | awk '{print $1}')
do
    SINK="$1.$i.$RANDOM"
    iconv -f "$i" -t utf-8 "$1" 2> /dev/null > "$SINK"
    if [ $? -eq 0 ]
    then
        DIFF=$(diff "$1" "$SINK")
        if [ ! -z "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le "$MAX_DIFF_LINES" ]
        then
            echo "===== $i ====="
            echo "$DIFF"
            echo "Does that make sense [N/y]"
            read ANSWER
            if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
            then
                echo "$i"
                exit 0
            fi
        fi
    fi
    # clean up re-encoded file
    rm -f "$SINK"
done
echo "None of the encodings worked. You're stuck."
exit 3
Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:
% UTF-8 stuff
\usepackage[notipa]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
Now, I've switched over to XeTeX from the TeXlive 2008 package (here), it is even more simple:
% UTF-8 stuff
\usepackage{fontspec}
\usepackage{xunicode}
As for detection of a file's encoding, you could play with file(1) (but it is rather limited) but like someone else said, it is difficult.
A brute-force way to check the encoding might just be to inspect the file in a hex editor or similar (or write a program to check). Look at the binary data in the file. The UTF-8 format is fairly easy to recognize: all ASCII characters are single bytes with values below 128 (0x80).
Multibyte sequences follow the leading-byte pattern shown in the wiki article.
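For reference, the UTF-8 byte patterns are:

0xxxxxxx                               1-byte sequence (plain ASCII)
110xxxxx 10xxxxxx                      2-byte sequence
1110xxxx 10xxxxxx 10xxxxxx             3-byte sequence
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    4-byte sequence

Every continuation byte starts with the bits 10, so bytes in the range 0x80-0xBF only ever appear inside a multibyte sequence.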
If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.
