Script for changing text encoding in Xcode 4

Script for changing text encoding in Xcode 4 - xcode

I have a set of .txt files saved in ISO Latin 1.
Before I put them in my application I have to convert them to UTF-8
Is it possible to create a script for this task in Xcode 4 ?

You can set up a shell script build phase that invokes iconv. Your 'from' encoding would be CSISOLATIN1 and your 'to' encoding would be UTF-8.

Thanks!
I made a shell script and it works very well :
/bin/sh
TXTPATH="${SRCROOT}/assets/Pages"
TXT_EXT='*.txt'
for f in $(find $TXTPATH -name $TXT_EXT); do
iconv -f iso-8859-1 -t UTF-8 "$f" > "${f%.*}.utf8"
done

Related

How to use an output filename that is only different in the extension?

I'm trying to create an Automator service that converts a MTS file to MP4 from the macOS Finder. To do so I need to setup a little bash script, but I don't know how to use input filename (for example, "file.MTS") and then generate a file called "file.mp4" with HandBrakeCLI.
I'm doing something wrong when I'm trying to assign the filename without the extension to a variable and then using it, but I don't know what is the problem:
for if in "$#"
do
dest=`"$(basename "$if" | sed 's/\(.*\)\..*/\1/')"`
/Applications/HandBrakeCLI -i "$if" -o "$dest".mp4 --preset="Fast 1080p30"
done

You don't need sed; basename already knows how to strip a given extension.
for if in "$#"; do
dest=$(basename "$if" .MTS).mp4
/Applications/HandBrakeCLI -i "$if" -o "$dest" --preset="Fast 1080p30"
done
As mentioned in a comment, the backticks are unnecessary and incorrect.
For example,
$ basename /path/to/foo.txt .txt
foo
If you don't know the actual extension ahead of time, parameter expansion is sufficient.
dest=$(basename "$if")
dest=${dest%.*} # Strip at most one extension
or
dest=${dest%%.*} # Strip *all* extensions

rake notes hits utf-8 error

running rake notes is hitting an invalid byte sequence in UTF-8 but even with trace turned on, does not point to any offending file, just items in the ruby version and railties 4.2.4 directories.
Manually removing all the notes and stashing them does not change the behaviour. Is there any way to determine where this character is stopping the process?

The following run within the directory of choice
find . -type f | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {}" > utf8_fail
will generate a file of all failing items. Then iterate through all culprits ending in .builder, .rb, .erb, .haml and .slim to find the guilty party/ies
fwiw, mine was a nasty alias created by OS X that bloated to 1.1 MB

How can I be sure of the file encoding?

I have a PHP file that I created with VIM, but I'm not sure which is its encoding.
When I use the terminal and check the encoding with the command file -bi foo (My operating system is Ubuntu 11.04) it gives me the next result:
text/html; charset=us-ascii
But, when I open the file with gedit it says its encoding is UTF-8.
Which one is correct? I want the file to be encoded in UTF-8.
My guess is that there's no BOM in the file and that the command file -bi reads the file and doesn't find any UTF-8 characters, so it assumes that it's ascii, but in reality it's encoded in UTF-8.

$ file --mime my.txt
my.txt: text/plain; charset=iso-8859-1

Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.
That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might be declaring it us-ascii if there are non-ASCII characters but they are beyond the initial segment of the file. On the other hand, gedit might say that the file is UTF-8 even if it's ASCII because UTF-8 is gedit's preferred character encoding and it intends to save the file with UTF-8 if you were to add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
Now to your question:
Run this command:
tr -d \\000-\\177 < your-file | wc -c
If the output says "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8) End of story.
Run this command
iconv -f utf-8 -t ucs-4 < your-file >/dev/null
If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).
If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.

(on Linux)
$ chardet <filename>
it also delivers the confidence level [0-1] of the output.

Based on #Celada answer and the #Arthur Zennig, I have created this simple script:
#/bin/bash
if [ "$#" -lt 1 ]
then
echo "Usage: utf8-check filename"
exit 1
fi
chardet $1
countchars="$(tr -d \\000-\\177 < $1 | wc -c)"
if [ $countchars -eq 0 ]
then
echo "Ascii";
exit 0
fi
{
iconv -f utf-8 -t ucs-4 < $1 >/dev/null
echo "UTF-8"
} || {
echo "not UTF-8 or corrupted"
}

shell script to change .pdf files to .png

Not sure how to ask a question about a previous posted question so if I should just add to the previous question somehow feel free to tell me.
Anyway, I posted a question yesterday
shell script to change .pdf files to .png Mac OS 10.3
sips does work and if I do it on the command line one file at a time it works but my for loop isn't working
heres what I got
for pdf in *{pdf,PDF} ; do
sips -s format png --out "${pdf%%.*}.png" "$pdf"
done
and its saying
Warning: *{pdf, not a valid file - skipping
Error 4: no file was specified
Try 'sips --help' for help using this tool
thanks

Looks fine to me. Are you sure you are using bash to execute this script and not /bin/sh?
Make sure your first line is:
#! /bin/bash
Try echoing the files and see if it works:
for pdf in *{pdf,PDF} ; do
echo "$pdf"
done

If your shell is bash you can do this
shopt -s nullglob
This changes the behavior of bash when no globs match. Normally if you say *pdf and there are no files ending in "pdf" it will return a literal asterisk instead. Setting nullglob makes bash do what you would expect and return nothing in such a case.
Alternately, and more robustly, you could do it this way
for pdf in *[pP][dD][fF] ; do
sips -s format png --out "${pdf%%.*}.png" "$pdf"
done
Which should work without nullglob being set and in all shells that support parameter substitution with this syntax. Note that this is still not robust on case sensitive filesystems due to a risk of name collision if you have two PDF files whose names differ only due to the case of the extension. To handle this case properly you could do
for pdf in *[pP][dD][fF] ; do
sips -s format png --out "${pdf%%.*}.$(tr pPdDfF pPnNgG <<<"${pdf#*.}")" "$pdf"
done
This should be sufficiently robust.
EDIT: Updated to correct incorrect $pdf expansion in the extension.

UPDATED2
#!/bin/bash
# Patters patching no files are expanded into null string
# this will allow us to make this script work when no files
# exist in this directory with this extension
shopt -s nullglob
# Consider only the lowercase 'pdf' extensions and make them lowercase 'png'
for b in *.pdf
do
c="$b"
b="${b/.pdf/}"
convert"$c" "$b.png"
done
# Consider only the uppercase 'pdf' extensions and make them uppercase 'png'
for b in *.PDF
do
c="$b"
b="${b/.PDF/}"
convert "$c" "$b.PNG"
done
Note that the convert program is a part of the ImageMagick program.

How do I determine file encoding in OS X?

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.
Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "#" by the file listing:
-rw-r--r--# 1 me users 2021 Feb 11 18:05 my_file.tex
(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)
I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.

Using the -I (that's a capital i) option on the file command seems to show the file encoding.
file -I {filename}

In Mac OS X the command file -I (capital i) will give you the proper character set so long as the file you are testing contains characters outside of the basic ASCII range.
For instance if you go into Terminal and use vi to create a file eg. vi test.txt
then insert some characters and include an accented character (try ALT-e followed by e)
then save the file.
They type file -I text.txt and you should get a result like this:
test.txt: text/plain; charset=utf-8

The # means that the file has extended file attributes associated with it. You can query them using the getxattr() function.
There's no definite way to detect the encoding of a file. Read this answer, it explains why.
There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.

vim -c 'execute "silent !echo " . &fileencoding | q' {filename}
aliased somewhere in my bash configuration as
alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
so I just type
vic {filename}
On my vanilla OSX Yosemite, it yields more precise results than "file -I":
$ file -I pdfs/udocument0.pdf
pdfs/udocument0.pdf: application/pdf; charset=binary
$ vic pdfs/udocument0.pdf
latin1
$
$ file -I pdfs/t0.pdf
pdfs/t0.pdf: application/pdf; charset=us-ascii
$ vic pdfs/t0.pdf
utf-8

You can also convert from one file type to another using the following command :
iconv -f original_charset -t new_charset originalfile > newfile
e.g.
iconv -f utf-16le -t utf-8 file1.txt > file2.txt

Just use:
file -I <filename>
That's it.

Using file command with the --mime-encoding option (e.g. file --mime-encoding some_file.txt) instead of the -I option works on OS X and has the added benefit of omitting the mime type, "text/plain", which you probably don't care about.

Classic 8-bit LaTeX is very restricted in which UTF8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.
Since you don't give a specific example, it's hard to know exactly where the problem is — whether you're attempting to use a glyph that your font doesn't have or whether you're not using the correct font encoding in the first place.
Here's a minimal example showing how a few UTF8 characters can be used in a LaTeX document:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\begin{document}
‘Héllø—thêrè.’
\end{document}
You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.

The # sign means the file has extended attributes. xattr file shows what attributes it has, xattr -l file shows the attribute values too (which can be large sometimes — try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).

Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of file using a series of algorithms and magic numbers. It's fairly useful but don't rely on it providing concrete or reliable information.
A Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.

Synalyze It! allows to compare text or bytes in all encodings the ICU library offers. Using that feature you usually see immediately which code page makes sense for your data.

You can try loading the file into a firefox window then go to View - Character Encoding. There should be a check mark next to the file's encoding type.

I implemented the bash script below, it works for me.
It first tries to iconv from the encoding returned by file --mime-encoding to utf-8.
If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.
If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.
Tested on Darwin 15.6.0.
#!/bin/bash
if [[ $# -lt 1 ]]
then
echo "ERROR: need one input argument: file of which the enconding is to be detected."
exit 3
fi
if [ ! -e "$1" ]
then
echo "ERROR: cannot find file '$1'"
exit 3
fi
if [[ $# -ge 2 ]]
then
MAX_DIFF_LINES=$2
else
MAX_DIFF_LINES=10
fi
#try the easy way
ENCOD=$(file --mime-encoding $1 | awk '{print $2}')
#check if this enconding is valid
iconv -f $ENCOD -t utf-8 $1 &> /dev/null
if [ $? -eq 0 ]
then
echo $ENCOD
exit 0
fi
#hard way, need the user to visually check the difference between the original and re-encoded files
for i in $(iconv -l | awk '{print $1}')
do
SINK=$1.$i.$RANDOM
iconv -f $i -t utf-8 $1 2> /dev/null > $SINK
if [ $? -eq 0 ]
then
DIFF=$(diff $1 $SINK)
if [ ! -z "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le $MAX_DIFF_LINES ]
then
echo "===== $i ====="
echo "$DIFF"
echo "Does that make sense [N/y]"
read $ANSWER
if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
then
echo $i
exit 0
fi
fi
fi
#clean up re-encoded file
rm -f $SINK
done
echo "None of the encondings worked. You're stuck."
exit 3

Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:
% UTF-8 stuff
\usepackage[notipa]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
Now, I've switched over to XeTeX from the TeXlive 2008 package (here), it is even more simple:
% UTF-8 stuff
\usepackage{fontspec}
\usepackage{xunicode}
As for detection of a file's encoding, you could play with file(1) (but it is rather limited) but like someone else said, it is difficult.

A brute-force way to check the encoding might just be to check the file in a hex editor or similar. (or write a program to check) Look at the binary data in the file. The UTF-8 format is fairly easy to recognize. All ASCII characters are single bytes with values below 128 (0x80)
Multibyte sequences follow the pattern shown in the wiki article
If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Script for changing text encoding in Xcode 4 - xcode

I have a set of .txt files saved in ISO Latin 1. Before I put them in my application I have to convert them to UTF-8 Is it possible to create a script for this task in Xcode 4 ?

You can set up a shell script build phase that invokes iconv. Your 'from' encoding would be CSISOLATIN1 and your 'to' encoding would be UTF-8.

Thanks! I made a shell script and it works very well : /bin/sh TXTPATH="${SRCROOT}/assets/Pages" TXT_EXT='.txt' for f in $(find $TXTPATH -name $TXT_EXT); do iconv -f iso-8859-1 -t UTF-8 "$f" > "${f%.}.utf8" done

Related

How to use an output filename that is only different in the extension?

rake notes hits utf-8 error

How can I be sure of the file encoding?

shell script to change .pdf files to .png

How do I determine file encoding in OS X?

Categories

Resources

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Script for changing text encoding in Xcode 4 - xcode

I have a set of .txt files saved in ISO Latin 1. Before I put them in my application I have to convert them to UTF-8 Is it possible to create a script for this task in Xcode 4 ?

You can set up a shell script build phase that invokes iconv. Your 'from' encoding would be CSISOLATIN1 and your 'to' encoding would be UTF-8.

Thanks! I made a shell script and it works very well : /bin/sh TXTPATH="${SRCROOT}/assets/Pages" TXT_EXT='*.txt' for f in $(find $TXTPATH -name $TXT_EXT); do iconv -f iso-8859-1 -t UTF-8 "$f" > "${f%.*}.utf8" done

Related

How to use an output filename that is only different in the extension?

rake notes hits utf-8 error

How can I be sure of the file encoding?

shell script to change .pdf files to .png

How do I determine file encoding in OS X?

Categories

Resources

Thanks! I made a shell script and it works very well : /bin/sh TXTPATH="${SRCROOT}/assets/Pages" TXT_EXT='.txt' for f in $(find $TXTPATH -name $TXT_EXT); do iconv -f iso-8859-1 -t UTF-8 "$f" > "${f%.}.utf8" done