Transcoding with Gradle

My build.gradle file generates UNIX and Windows launcher scripts for my Java project from templates.
Templates are UTF-8 encoded and the generated scripts are UTF-8 too. That's not a problem on Linux, where UTF-8 support is ubiquitous, but Windows has some issues displaying non-Latin-1 characters in the cmd.exe terminal window. After reading Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10), I came to the conclusion that converting the generated UTF-8 script to cp1250 (in my case) would save me lots of trouble when displaying Hungarian text. However, I couldn't figure out how to convert a UTF-8 file to another code page (I looked at copy, but didn't find a way to specify the output encoding).

Simply use FileUtils from Apache Commons IO in your build file.
import org.apache.commons.io.FileUtils
buildscript {
repositories {
mavenCentral()
}
dependencies {
classpath("commons-io:commons-io:2.8.0")
}
}
And then, in the relevant part of the script, where the launcher scripts are generated:
File f = file('/path/to/windows-launcher')
// Reading the content as UTF-8
String content = FileUtils.readFileToString(f, 'UTF-8')
// Rewriting the file as cp1250
FileUtils.write(f, content, "cp1250")
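For anyone who wants to try the same "read as UTF-8, write as cp1250" step outside the build, here is a minimal Python sketch. The function name transcode and its defaults are made up for illustration; the strict error mode is my assumption, chosen so that a character missing from the target code page fails loudly instead of being silently replaced.

```python
from pathlib import Path


def transcode(path, src="utf-8", dst="cp1250"):
    """Rewrite a text file in place from one encoding to another."""
    p = Path(path)
    # Work on bytes so that line endings (e.g. CRLF in a .bat file) stay untouched.
    text = p.read_bytes().decode(src)
    # strict: a character that does not exist in dst raises UnicodeEncodeError.
    p.write_bytes(text.encode(dst, errors="strict"))
```

It would be called once per generated Windows launcher, for example transcode("launcher.bat") with a hypothetical file name.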

Related

Visual Studio 2019 does not properly convert UTF-8 strings to UTF-16 strings in source files without BOM

I have the following source file (encoded in UTF-8 without BOM, displayed fine in the Source Code Editor):
#include <Windows.h>
int main()
{
MessageBoxW(0, L"Umlaute ÄÖÜ, 🙂", nullptr, 0);
return 0;
}
When running the program, the special characters (Umlaute and Emoji) are messed up in the Message Box.
However, if I save the source file manually as "UTF-8 with BOM", Visual Studio will properly convert the string to UTF-16 and when running the program, the special characters are displayed in the Message Box. But it would be annoying to convert every single file to UTF-8 with BOM. (Also, I think GCC for example does not like BOM?)
Why is Visual Studio messing up my string, if there is no BOM in the source file? The Auto-detect UTF-8 encoding without signature option is already enabled.
I tested the same source with MinGW-w64 and don't have the issue, regardless of whether there is a BOM or not.
Use the /utf-8 compiler switch. The MS compiler assumes a legacy ANSI encoding (Windows-1252 on US and Western European versions of Windows) if no BOM is found in the source file.
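To see what "assumes a legacy ANSI encoding" does to the string, the misreading can be reproduced in a few lines of Python. This only illustrates decoding UTF-8 bytes as Windows-1252; it is not a model of the compiler itself, and the helper name misread_as is made up.

```python
def misread_as(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong code page."""
    return text.encode("utf-8").decode(wrong_encoding)


source_text = "Umlaute ÄÖÜ, 🙂"
garbled = misread_as(source_text)
# Every non-ASCII character turns into two to four unrelated characters.
print(len(source_text), len(garbled))
print(garbled)
```

The damage is reversible for this particular string (encode the garbled text as cp1252 and decode it as UTF-8), but it is easier to give the compiler the right encoding up front.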

How to automate search and replace for i18n.properties files with WebStorm

For SAPUI5 there are i18n.properties files.
For the German language I need to replace the special German chars with the unicode codes.
# AE = \u00C4, ae = \u00E4
# OE = \u00D6, oe = \u00F6
# UE = \u00DC, ue = \u00FC
# SZ = \u00DF
How can I automate this search and replace with WebStorm?
You could just use WebStorm's 'Replace in Path' (CMD+SHIFT+R on Mac) on your i18n folder. IntelliJ IDEA has better editing support for .properties files, though (since they come from Java).
It would also be easy to do this with a Node script, a bash script, a gulp task, or whatever you prefer.
Btw: Is this really needed? Having all .properties files in UTF-8 should just do the trick. Afaik only Tomcat got confused by that, since by the Java spec these files are ISO-8859-1 by definition. As long as you are deploying to a platform that accepts them as UTF-8, there shouldn't be an issue.
BR
Chris
PS: That code looks really familiar ;D
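If you go the script route, the replacement does not have to be a fixed list of seven characters. A minimal Python sketch, assuming you want every non-ASCII character written as a \uXXXX escape (the function name escape_properties is made up):

```python
def escape_properties(text):
    """Replace every non-ASCII character with \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Characters outside the BMP become a surrogate pair, i.e. two escapes.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


print(escape_properties("Größe"))
```

Run over each line of an i18n.properties file, this covers the umlauts and ß from the question as well as anything else that turns up later.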

UTF-8 i18n file

I'm trying to add a Chinese localisation to a scaffolded Yesod site. I have a zh.msg message file saved as UTF-8 format using Notepad in Windows, but when I run cabal install in the project directory, I get this:
Handler\Home.hs:15:11:
Not in scope: data constructor `MsgHello'
Perhaps you meant `Msg<stderr>: hPutChar: invalid argument (invalid character)
The line in question is where I render my homepage:
$(widgetFile "homepage")
I changed both message files to be Unicode formatted instead of UTF-8, and get this message instead:
Foundation.hs:1:1:
Exception when trying to run compile-time code:
Cannot decode byte '\xff': Data.Text.Encoding.Fusion.streamUtf8: invalid UTF-8 stream
So I guess UTF-8 is the way to go... somehow.
(I'm using Notepad because I haven't set up gVim to render Unicode characters. It's apparently a bit of a feat.)
When I went to commit my changes I discovered the issue. The diff for my English file looked like this:
-Hello: Hello
+<U+FEFF>Hello: Hello
I guess Notepad added the character, and it was working its way into the Haskell code. I solved it using vim according to this answer.
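Both error messages can be reproduced at the byte level. A small Python sketch, assuming that Notepad's "UTF-8" writes a BOM and that its "Unicode" option means UTF-16 little-endian with a BOM; the sample line is taken from the diff above:

```python
import codecs

line = "Hello: Hello\n"

# Notepad "UTF-8": the three BOM bytes come first.
utf8_with_bom = codecs.BOM_UTF8 + line.encode("utf-8")
kept = utf8_with_bom.decode("utf-8")          # BOM survives as U+FEFF
dropped = utf8_with_bom.decode("utf-8-sig")   # BOM removed while decoding
print(repr(kept[0]), repr(dropped[0]))

# Notepad "Unicode": the file starts with FF FE, and FF is never valid in UTF-8.
utf16_file = codecs.BOM_UTF16_LE + line.encode("utf-16-le")
try:
    utf16_file.decode("utf-8")
    utf16_rejected = False
except UnicodeDecodeError as exc:
    utf16_rejected = True
    print("cannot decode byte", hex(utf16_file[exc.start]))
```

So the first error comes from an invisible U+FEFF glued to the front of the first message key, and the second from a strict UTF-8 decoder hitting the UTF-16 signature.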

Disable encoding checking in java gradle project

I want to migrate one of our Java projects from Ant to Gradle. This project has a lot of source code written by several programmers. The problem is that some of the files are encoded in ANSI and some in UTF-8 (this generates compile errors). I know that I can set the encoding using compileJava.options.encoding = 'UTF-8', but this will not work (not all files are encoded in UTF-8). Is it possible to disable encoding checking (I don't want to change the encoding of all files)?
This is not an issue with Gradle but with javac. However, you can solve this issue running a one-time groovy script in your gradle build as described below.
Normally you'd only need to add following line to your build.gradle file:
compileJava.options.encoding = 'UTF-8'
However, some text editors when saving files to UTF-8 will generate a byte order mark (BOM) header at the beginning of the text files.
And javac does not understand the BOM, not even when you compile with encoding="UTF-8" option so you're probably getting an error such as this:
> javac -encoding UTF8 Test.java
Test.java:1: error: illegal character: \65279
?class Test {
You need to strip the BOM from your source files or convert your source file to another encoding. Notepad++ for example can convert the file encoding from one to another.
For many source files, you can write a simple Groovy/Gradle task that opens each source file and removes the BOM prefix from the first line if one is found.
Add this to your build.gradle and run gradle convertSource:
// The << shortcut was removed in Gradle 5; doLast works in old and new versions
task convertSource {
    doLast {
        // convert source files in the main source set to normalized text
        sourceSets.main.java.each { file ->
            // read first "raw" line via BufferedReader (keeps a BOM if present)
            def r = new BufferedReader(new FileReader(file))
            String s = r.readLine()
            r.close()
            // get entire file normalized (Groovy drops the BOM when reading)
            String text = file.text
            // get first "normalized" line
            String normalizedLine = new StringReader(text).readLine()
            if (s != normalizedLine) {
                println "rename: $file"
                File target = new File(file.getParentFile(), file.getName() + '.bak')
                if (target.exists()) {
                    println "skipped, backup already exists: $target"
                } else if (file.renameTo(target)) {
                    // write UTF-8 explicitly so the platform default charset is not used
                    file.setText(text, 'UTF-8')
                } else {
                    println "failed to rename: $file"
                }
            }
        }
    }
} // end task
The convertSource task enumerates all source files, reads the first "raw" line from each, then reads the normalized text and compares the two first lines. If they differ, it saves a backup of the original as a .bak file and writes the normalized text in its place. You only need to run convertSource once; afterwards you can delete the .bak backups, and the compile should work without encoding errors.
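If you would rather not run the conversion inside the build, the same one-time cleanup can be done on raw bytes. Below is a minimal Python sketch under my own assumptions: the function name strip_boms, the *.java pattern and the .bak suffix are illustrative choices, and only the UTF-8 BOM (EF BB BF, which javac reports as \65279, i.e. U+FEFF in decimal) is handled.

```python
import codecs
from pathlib import Path


def strip_boms(root, pattern="*.java"):
    """Remove a leading UTF-8 BOM from matching files, keeping a .bak copy."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        data = path.read_bytes()
        if not data.startswith(codecs.BOM_UTF8):
            continue  # nothing to strip
        backup = path.with_name(path.name + ".bak")
        if backup.exists():
            continue  # never overwrite an earlier backup
        backup.write_bytes(data)
        path.write_bytes(data[len(codecs.BOM_UTF8):])
        changed.append(path)
    return changed
```

Because it compares bytes instead of decoded lines, files without a BOM are never rewritten, whatever encoding they are in.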

Get encoding of a file in Windows

This isn't really a programming question, but is there a command-line or Windows tool (Windows 7) to get the current encoding of a text file? Sure, I can write a little C# app, but I wanted to know if there is something already built in.
Open up your file using regular old vanilla Notepad that comes with Windows.
It will show you the encoding of the file when you click "Save As...".
Whatever the default-selected encoding is, that is what your current encoding is for the file.
If it is UTF-8, you can change it to ANSI and click save to change the encoding (or vice versa).
I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a onetime export, so Notepad fit the bill for me.
FYI: From my understanding I think "Unicode" (as listed in Notepad) is a misnomer for UTF-16.
More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicode
If you have "git" or "Cygwin" on your Windows Machine, then go to the folder where your file is present and execute the command:
file *
This will give you the encoding details of all the files in that folder.
The (Linux) command-line tool 'file' is available on Windows via GnuWin32:
http://gnuwin32.sourceforge.net/packages/file.htm
If you have git installed, it's located in C:\Program Files\git\usr\bin.
Example:
C:\Users\SH\Downloads\SquareRoot>file *
_UpgradeReport_Files; directory
Debug; directory
duration.h; ASCII C++ program text, with CRLF line terminators
ipch; directory
main.cpp; ASCII C program text, with CRLF line terminators
Precision.txt; ASCII text, with CRLF line terminators
Release; directory
Speed.txt; ASCII text, with CRLF line terminators
SquareRoot.sdf; data
SquareRoot.sln; UTF-8 Unicode (with BOM) text, with CRLF line terminators
SquareRoot.sln.docstates.suo; PCX ver. 2.5 image data
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary info
SquareRoot.vcproj; XML document text
SquareRoot.vcxproj; XML document text
SquareRoot.vcxproj.filters; XML document text
SquareRoot.vcxproj.user; XML document text
squarerootmethods.h; ASCII C program text, with CRLF line terminators
UpgradeLog.XML; XML document text
C:\Users\SH\Downloads\SquareRoot>file --mime-encoding *
_UpgradeReport_Files; binary
Debug; binary
duration.h; us-ascii
ipch; binary
main.cpp; us-ascii
Precision.txt; us-ascii
Release; binary
Speed.txt; us-ascii
SquareRoot.sdf; binary
SquareRoot.sln; utf-8
SquareRoot.sln.docstates.suo; binary
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary infobinary
SquareRoot.vcproj; us-ascii
SquareRoot.vcxproj; utf-8
SquareRoot.vcxproj.filters; utf-8
SquareRoot.vcxproj.user; utf-8
squarerootmethods.h; us-ascii
UpgradeLog.XML; us-ascii
Another tool that I found useful: https://archive.codeplex.com/?p=encodingchecker
EXE can be found here
Install git (on Windows you have to use the git bash console). Type:
file --mime-encoding *
for all files in the current directory, or
file --mime-encoding */*
for the files in the immediate subdirectories
Here's my take how to detect the Unicode family of text encodings via BOM. The accuracy of this method is low, as this method only works on text files (specifically Unicode files), and defaults to ascii when no BOM is present (like most text editors, the default would be UTF8 if you want to match the HTTP/web ecosystem).
Update 2018: I no longer recommend this method. I recommend using file.exe from GIT or *nix tools as recommended by @Sybren, and I show how to do that via PowerShell in a later answer.
# from https://gist.github.com/zommarin/1480974
function Get-FileEncoding($Path) {
$bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)
if(!$bytes) { return 'utf8' }
switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
'^efbbbf' { return 'utf8' }
'^2b2f76' { return 'utf7' }
'^fffe' { return 'unicode' }
'^feff' { return 'bigendianunicode' }
'^0000feff' { return 'utf32' }
default { return 'ascii' }
}
}
dir ~\Documents\WindowsPowershell -File |
select Name,@{Name='Encoding';Expression={Get-FileEncoding $_.FullName}} |
ft -AutoSize
Recommendation: This can work reasonably well if the dir, ls, or Get-ChildItem only checks known text files, and when you're only looking for "bad encodings" from a known list of tools (e.g. SQL Management Studio defaults to UTF16, which broke GIT auto-cr-lf for Windows, which was the default for many years).
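The same BOM lookup is easy to express in Python, which also makes one ordering detail visible: the UTF-32 LE signature (FF FE 00 00) begins with the UTF-16 LE signature (FF FE), so the longer one has to be tested first. This is only a sketch; the name sniff_bom is made up, and returning None instead of guessing 'ascii' is my own choice.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read only the first four bytes of a file and look for a BOM."""
    with open(path, "rb") as fh:
        return sniff_bom(fh.read(4))
```

A None result just means "no signature", which is the honest answer for plain ASCII, BOM-less UTF-8 and every legacy code page alike.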
A simple solution might be opening the file in Firefox.
Drag and drop the file into firefox
Press Ctrl+I to open the page info
and the text encoding will appear on the "Page Info" window.
Note: If the file is not in txt format, just rename it to txt and try again.
P.S. For more info see this article.
I wrote the #4 answer (at time of writing). But lately I have git installed on all my computers, so now I use @Sybren's solution. Here is a new answer that makes that solution handy from powershell (without putting all of git/usr/bin in the PATH, which is too much clutter for me).
Add this to your profile.ps1:
$global:gitbin = 'C:\Program Files\Git\usr\bin'
Set-Alias file.exe $gitbin\file.exe
And used like: file.exe --mime-encoding *. You must include .exe in the command for PS alias to work.
But if you don't customize your PowerShell profile.ps1 I suggest you start with mine: https://gist.github.com/yzorg/8215221/8e38fd722a3dfc526bbe4668d1f3b08eb7c08be0
and save it to ~\Documents\WindowsPowerShell. It's safe to use on a computer without git, but will write warnings when git is not found.
The .exe in the command is also how I use C:\WINDOWS\system32\where.exe from powershell; and many other OS CLI commands that are "hidden by default" by powershell, *shrug*.
You can simply check this by opening git bash in the file's location and running the command file -i file_name
example
user filesData
$ file -i data.csv
data.csv: text/csv; charset=utf-8
Some C code here for reliable ASCII, BOM, and UTF-8 detection: https://unicodebook.readthedocs.io/guess_encoding.html
Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM,
UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document.
For all other encodings, you have to trust heuristics based on statistics.
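The "reliable" part for ASCII and UTF-8 boils down to strict validation: either every byte fits the rules or the guess is rejected. A minimal Python sketch of that idea (the function name classify and the "unknown" label are made up, and this is not the C code from the link):

```python
def classify(data):
    """Return 'ascii', 'utf-8' or 'unknown' for a byte string."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict decoding rejects stray continuation bytes, overlong forms, etc.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit code page or binary data: only heuristics can go further.
        return "unknown"


print(classify("tükör".encode("utf-8")))
print(classify("tükör".encode("cp1250")))
```

Text in a legacy code page such as cp1250 almost never happens to be valid UTF-8, which is why this check is dependable in one direction only.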
EDIT:
A PowerShell version of a C# answer from: Effective way to find any file's Encoding. Only works with signatures (BOMs).
# get-encoding.ps1
param([Parameter(ValueFromPipeline=$True)] $filename)
begin {
# set the .NET current directory
[Environment]::CurrentDirectory = (pwd).path
}
process {
$reader = [System.IO.StreamReader]::new($filename,
[System.Text.Encoding]::default,$true)
$peek = $reader.Peek()
$encoding = $reader.currentencoding
$reader.close()
[pscustomobject]@{Name=split-path $filename -leaf
BodyName=$encoding.BodyName
EncodingName=$encoding.EncodingName}
}
.\get-encoding chinese8.txt
Name         BodyName EncodingName
----         -------- ------------
chinese8.txt utf-8    Unicode (UTF-8)
get-childitem -file | .\get-encoding
Looking for a Node.js/npm solution? Try encoding-checker:
npm install -g encoding-checker
Usage
Usage: encoding-checker [-p pattern] [-i encoding] [-v]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--pattern, -p, -d [default: "*"]
--ignore-encoding, -i [default: ""]
--verbose, -v [default: false]
Examples
Get encoding of all files in current directory:
encoding-checker
Return encoding of all md files in current directory:
encoding-checker -p "*.md"
Get encoding of all files in current directory and its subfolders (will take quite some time for huge folders; seemingly unresponsive):
encoding-checker -p "**"
For more examples, refer to the npm documentation or the official repository.
Similar to the solution listed above with Notepad, you can also open the file in Visual Studio, if you're using that. In Visual Studio, you can select "File > Advanced Save Options..."
The "Encoding:" combo box will tell you specifically which encoding is currently being used for the file. It has a lot more text encodings listed in there than Notepad does, so it's useful when dealing with various files from around the world and whatever else.
Just like Notepad, you can also change the encoding from the list of options there, and then save the file after hitting "OK". You can also select the encoding you want through the "Save with Encoding..." option in the Save As dialog (by clicking the arrow next to the Save button).
The only way that I have found to do this is VIM or Notepad++.
EncodingChecker
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.
