Encoding conversion for large file

Encoding conversion for large file - utf-8

I am faced with a large (~ 18 GB) file, exported from SQL Server as a Unicode text file, which means its encoding is UTF-16 (little endian). The file is now stored in a computer running Linux, but I have not figured out a way to convert it to UTF-8.
At first I tried using iconv, but the file is too large for that. My next approach was using split and converting the files one by one, but that didn't work either - there were a lot of errors during the conversions.
So, any ideas on how to convert this to UTF-8? Any help will be much appreciated.

Since you're using SQL Server, I assume your platform is Windows. In the simplest case you can write a quick and dirty .NET application that reads the source line by line and writes the converted file as it goes. Something like this:
using System;
using System.IO;
using System.Text;
namespace UTFConv {
class Program {
static void Main(string[] args) {
try {
Encoding encSrc = Encoding.Unicode;
Encoding encDst = Encoding.UTF8;
uint lines = 0;
using (StreamReader src = new StreamReader(args[0], encSrc)) {
using (StreamWriter dest = new StreamWriter(args[1], false, encDst)) {
string ln;
while ((ln = src.ReadLine()) != null) {
lines++;
dest.WriteLine(ln);
}
}
}
Console.WriteLine("Converted {0} lines", lines);
} catch (Exception x) {
Console.WriteLine("Problem converting the file: {0}", x.Message);
}
}
}
}
Just open Visual Studio, start a new C# Console Application project, paste this code in there, compile, and run it from the command line. The first argument is your source file, the second argument is your destination file. Should work.
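For readers who want to stay on the Linux machine, the same streaming idea can be sketched in Python (used here only for illustration; the answer above is C#). The function name utf16le_to_utf8 and the 1 MiB chunk size are assumptions, not something from the original post. The sketch reads fixed-size chunks through an incremental decoder, so a chunk boundary that falls in the middle of a 2-byte code unit or a surrogate pair does not raise an error. Such boundaries are one plausible reason why a byte-oriented split followed by per-piece conversion fails.

```python
import codecs


def utf16le_to_utf8(src_path, dst_path, chunk_size=1 << 20):
    """Stream-convert a UTF-16 LE file to UTF-8 without loading it into memory."""
    # The incremental decoder keeps a trailing odd byte (or the first half of
    # a surrogate pair) in its buffer until the next chunk arrives.
    decoder = codecs.getincrementaldecoder("utf-16-le")()
    bom_checked = False
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            text = decoder.decode(chunk)
            if not bom_checked and text:
                # Drop a leading byte order mark so it does not end up in the output.
                if text[0] == "\ufeff":
                    text = text[1:]
                bom_checked = True
            dst.write(text.encode("utf-8"))
        # final=True raises UnicodeDecodeError if the file ends mid-character.
        dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Memory use stays bounded by the chunk size, regardless of how long the lines are or how large the file is.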

Related

.NET Core 2 + System.Data.OracleClient. Chinese characters doesn't work

I'm using .NET Core 2 with the System.Data.OracleClient package published some weeks ago here: https://www.nuget.org/packages/System.Data.OracleClient/
I can read numbers, dates and normal English characters. But not Chinese. Probably a lot of other non-western characters.
Here's a sample program to illustrate the error:
using System;
using System.Text;
using System.Diagnostics;
using System.IO;
using System.Data.OracleClient;
namespace OracleConnector
{
class Program
{
static void Main()
{
TestString();
return;
}
private static void TestString()
{
string connStr = "Data Source = XE; User ID = testuser; Password = secret";
using (OracleConnection conn = new OracleConnection(connStr))
{
conn.Open();
var cmd = conn.CreateCommand();
cmd.CommandText = "select 'some text in English language' as a, '储物组合带门/抽屉, 白色 卡维肯, 因维肯 白蜡木贴面' as b from dual";
var reader = cmd.ExecuteReader();
reader.Read();
string sEnglish = reader.GetString(0);
string sChinese = reader.GetString(1);
Trace.WriteLine("English from db: " + sEnglish);
Trace.WriteLine("Chinese from db: " + sChinese);
Trace.WriteLine("Chinese from the code: 储物组合带门 / 抽屉, 白色 卡维肯, 因维肯 白蜡木贴面");
}
}
}
}
It outputs this:
English from db: some text in English languageဂ
Chinese from db: ¿¿¿¿¿¿/¿¿, ¿¿ ¿¿¿, ¿¿¿ ¿¿¿¿¿e
Chinese from the code: 储物组合带门 / 抽屉, 白色 卡维肯, 因维肯 白蜡木贴面
As you can see, Chinese characters embedded in the code work, but not when they come from the database. Also, the last character of the English text is garbage. I've also tried the corresponding Mono NuGet package, with the same result.
Does anyone have a clue how to fix this?
Edit: I tried adding Unicode=True to the connection string, but Chinese characters still don't work.

This is a problem with the System.Data.OracleClient DLL. I am having the same problem where 2, 3, or even 4-byte Unicode characters are getting tacked to the end of my strings.
Switching to Mono.Data.OracleClientCore helped slightly, but I still got some odd characters at the end of some strings (Unicode backspace and backslash).
I just tried the following library, and it seems to work for my needs (so far):
https://github.com/ericmend/oracleClientCore-2.0
You will need to re-compile for Windows (change to #define OCI_WINDOWS in OciCalls.cs). Will update this answer if I find that it doesn't continue to work.
Still, I think that we'll have to wait for Oracle to release their .NET Core supported solution for any sort of production ready library.

Please try
Environment.SetEnvironmentVariable("NLS_LANG", ".UTF8");
before creating the connection object.
The System.Data.OracleClient implementations use external Oracle libraries, which assume (at least on Windows) the ANSI character set.
Setting the NLS_LANG environment variable tells the Oracle libraries that you want UTF-8 encoding.
Many more details are on the NLS_LANG FAQ page:
http://www.oracle.com/technetwork/database/database-technologies/globalization/nls-lang-099431.html

Append ";Unicode=True" to the connection string and call Environment.SetEnvironmentVariable("NLS_LANG", ".UTF8"); before creating the connection:
string app_conn = "DATA SOURCE=hostname.company.org:1521/servicename.company.org;PASSWORD=XYZ;USER ID=ABC;Unicode=True";
Environment.SetEnvironmentVariable("NLS_LANG", ".UTF8");
using (DbConnection conn = create_connection(app_conn))
{
//...
}

JRecord - Formatting file transferred from Mainframe

I am trying to display a mainframe file in an Eclipse RCP application using the JRecord library. I already have the COBOL copybook as a text file.
To accomplish that:
1) I transfer the file from the mainframe to my desktop through the Apache Commons Net FTPClient API.
2) Now I have a text file.
3) I remove the newline and carriage return characters.
4) Then I read it via a CobolIoProvider and convert it into an ArrayList of type AbstractLine.
But I have offset issues because of some special characters.
Here are the issues:
When I don't perform step #3, there are offset issues right from record 1. Hence I included step #3.
Even when I perform step #3, the first few thousand records seem to be formatted (or read) correctly by the AbstractLineReader, until it encounters a special character (not sure, but that's my assumption).
Code snippet:
ArrayList<AbstractLine> lines = new ArrayList<AbstractLine>();
InputStream copyStream;
InputStream fis;
try {
copyStream = new FileInputStream(new File(copybookfile));
String filec = FileUtils.readFileToString(new File(datafile));
System.out.println("initial len: "+filec.length());
filec=filec.replaceAll("\r", "");
filec=filec.replaceAll("\n", "");
System.out.println("initial len: "+filec.length());
fis= new ByteArrayInputStream(filec.getBytes());
CobolIoProvider ioProvider = CobolIoProvider.getInstance();
AbstractLineReader reader = ioProvider.newIOBuilder(copyStream, "REQUEST",
Convert.FMT_MAINFRAME).newReader(fis);
AbstractLine line;
while ((line = reader.read()) != null) {
lines.add(line);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
What am I missing here? Is there additional preprocessing that I need to do for a file transferred from the mainframe?

If it is a Text File (no binary data) with \r\n line delimiters try:
ArrayList<AbstractLine> lines = new ArrayList<AbstractLine>();
InputStream copyStream;
InputStream fis;
try {
copyStream = new FileInputStream(new File(copybookfile));
AbstractLineReader reader = CobolIoProvider.getInstance()
.newIOBuilder(copyStream, "REQUEST", ICopybookDialects.FMT_MAINFRAME)
.setFileOrganization(Constants.IO_STANDARD_TEXT_FILE)
.newReader(datafile);
AbstractLine line;
while ((line = reader.read()) != null) {
lines.add(line);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
Note: setFileOrganization tells JRecord what type of file it is. So .setFileOrganization(Constants.IO_STANDARD_TEXT_FILE) tells JRecord it is a text file with \n or \r\n end-of-line markers. Here is a Description of FileOrganisation in JRecord.
The special characters worry me though: if there is a \n in the data, it will be treated as an end-of-line. You may need to do a binary transfer and keep the RDW (Record Descriptor Word) if it is a VB file.
If the file contains binary data, you will need to:
do a binary transfer (with RDW if it is a VB file)
use the appropriate File-Organisation
specify EBCDIC (.setFont("cp037") tells JRecord it is US EBCDIC)
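To make the RDW point concrete, here is a minimal sketch in Python (used only for illustration; it is not JRecord code). It assumes the usual mainframe layout, where each record starts with a 4-byte RDW whose first two bytes hold the big-endian record length, counting the RDW itself. The function name read_rdw_records is hypothetical. Because record boundaries come from the length field, bytes inside the data that happen to look like \n or \r are harmless.

```python
import struct


def read_rdw_records(path):
    """Yield the payload bytes of each record in a binary VB file that kept its RDWs."""
    with open(path, "rb") as f:
        while True:
            rdw = f.read(4)
            if not rdw:
                return  # clean end of file
            if len(rdw) < 4:
                raise ValueError("truncated RDW at end of file")
            # Assumption: bytes 0-1 are the big-endian record length and the
            # length includes the 4-byte RDW; bytes 2-3 are not used here.
            (length,) = struct.unpack(">H", rdw[:2])
            if length < 4:
                raise ValueError("invalid RDW length: %d" % length)
            payload = f.read(length - 4)
            if len(payload) != length - 4:
                raise ValueError("truncated record")
            yield payload
```

Each payload is still raw EBCDIC (and possibly binary) data. Text fields could then be decoded with payload.decode("cp037"), while packed or binary fields need the copybook to interpret.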
I will add a second answer for Generating Code using the RecordEditor
If you are absolutely sure all the records are the same length, you can use the low-level routines to do the reading; see the ReadAqtrans.java program in https://sourceforge.net/p/jrecord/discussion/678634/thread/4b00fed4/
Basically you would do:
ICobolIOBuilder iobuilder = CobolIoProvider.getInstance()
.newIOBuilder("copybookFileName", ICopybookDialects.FMT_MAINFRAME)
.setFont("CP037")
.setFileOrganization(Constants.IO_FIXED_LENGTH);
LayoutDetail layout = iobuilder.getLayout();
FixedLengthByteReader br
= new FixedLengthByteReader(layout.getMaximumRecordLength() + 2);
br.open("...");
byte[] bytes;
while ((bytes = br.read()) != null) {
lines.add(iobuilder.newLine(bytes));
}
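The fixed-length idea can also be shown without JRecord. The following Python sketch (illustration only; the function name and parameters are hypothetical) reads a binary file in slices of a known record length and decodes each slice as US EBCDIC with Python's cp037 codec. Decoding the whole record as text is only valid when the record holds character data; packed-decimal or binary fields would need the copybook layout instead.

```python
def read_fixed_length_records(path, record_length, encoding="cp037"):
    """Yield each fixed-length record of a binary file, decoded as text."""
    with open(path, "rb") as f:
        while True:
            record = f.read(record_length)
            if not record:
                return  # clean end of file
            if len(record) != record_length:
                # A short final slice usually means the record length is wrong
                # or the transfer altered the data (e.g. a text-mode transfer).
                raise ValueError("file size is not a multiple of the record length")
            yield record.decode(encoding)
```

No newline handling is involved at all, which is why this approach requires a binary transfer and an exact record length.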

Future Reference / Binary File
If the file does contain Binary Data, you really need to do a binary transfer. You may find the RecordEditor useful.
RecordEditor 0.98 has a JRecord code generation function. The advantages of using the RecordEditor Generate function are:
The RecordEditor will try to work out the appropriate file attributes by looking at the file.
You can try out various attributes (left-hand pane) and see what the file looks like with those attributes (right-hand side).
When happy, hit the Generate button and the RecordEditor will generate JRecord code. There are several code templates available:
Standard - generates basic JRecord code (with a field name class)
lineWrapper - generates a "wrapper" class with the Cobol fields represented as get/set methods
RecordEditor Generate
In the RecordEditor select Generate >>> Java~JRecord code for Cobol
Generate Screen
Enter the Cobol CopyBook / Sample file and adjust the attributes as needed
Code Template
Next you can select the Code Template
Generated Code
Finally the RecordEditor will generate JRecord code based on the Attributes entered.

Find absolute java.exe path programmatically from java code

If I have a Java jar or class file that is launched by the user (assuming the Java path is set in the environment variables), how can I figure out, from within the code, the absolute path of the java.exe/javaw.exe that launched it?
On Ubuntu we can run % which java and it shows the path.
However, on Windows, if I check System.getenv(), there may be multiple paths, e.g. for an old and a new version. If I run java -version from the command line, it does not show the path.
Can you tell me either through pure java or command line on windows how is it possible to find out the location of javaw.exe?

String javaHome = System.getProperty("java.home");
Can you tell me either through pure Java ... on windows how is it possible to find out the location of javaw.exe?
E.G.
import java.io.File;
class JavawLocation {
public static void main(String[] args) {
String javaHome = System.getProperty("java.home");
File f = new File(javaHome);
f = new File(f, "bin");
f = new File(f, "javaw.exe");
System.out.println(f + " exists: " + f.exists());
}
}
Output
C:\Program Files (x86)\Java\jdk1.6.0_29\jre\bin\javaw.exe exists: true
Press any key to continue . . .
And yes, I am confident that will work in a JRE.

On Windows, the java.library.path System Property begins with the path to the bin directory containing whichever java.exe was used to run your jar file.
This makes sense, because on Windows the first place any executable looks for DLL files is the directory containing the executable itself. So naturally, when the JVM runs, the first place it looks for DLLs is the directory containing java.exe.
You can acquire the path to java.exe as follows:
final String javaLibraryPath = System.getProperty("java.library.path");
final File javaExeFile = new File(
javaLibraryPath.substring(0, javaLibraryPath.indexOf(';')) + "\\java.exe"
);
final String javaExePath =
javaExeFile.exists() ? javaExeFile.getAbsolutePath() : "java";
This code is Windows-specific: I hard-coded the path separator (;) and the file separator (\). I also put in a fallback to just "java" in case the library path trick somehow doesn't work.
I have tested this with Java 6 and 7 on Windows 7. I tried a 32-bit and 64-bit version of Java.

Here's a slightly more generalised solution that I came up with. Maybe useful:
private static String javaExe()
{
final String JAVA_HOME = System.getProperty("java.home");
final File BIN = new File(JAVA_HOME, "bin");
File exe = new File(BIN, "java");
if (!exe.exists())
{
// We might be on Windows, which needs an exe extension
exe = new File(BIN, "java.exe");
}
if (exe.exists())
{
return exe.getAbsolutePath();
}
try
{
// Just try invoking java from the system path; this of course
// assumes "java[.exe]" is /actually/ Java
final String NAKED_JAVA = "java";
new ProcessBuilder(NAKED_JAVA).start();
return NAKED_JAVA;
}
catch (IOException e)
{
return null;
}
}

An issue with using System.getProperty("java.home") is that it is not always the Java executable that the jar is running on. If you want that, you can use System.getProperty("sun.boot.library.path"); from there you can find "java", "java.exe", "javaw", or "javaw.exe". However, there is still an issue with this: Java will run just fine if the executable has been renamed, and the layout around the actual Java executable differs between JREs and JDKs, so there is not much of a way to find the Java executable if it has been renamed. Unless someone else has a method, of course, in which case, can you share? :)
(Also, I have seen some people suggest using the first entry of System.getProperty("java.library.path"). Note that this might not work if the user or launcher has manually set the library path, which is not too uncommon.)

A compilation of all the above methods:
static String getJavaPath(){
String tmp1 = System.getProperty("java.home") + "\\bin\\java.exe";
String tmp2 = System.getProperty("sun.boot.library.path") + "\\java.exe";
// java.library.path is a ;-separated list on Windows, so use only its first entry
String tmp3 = System.getProperty("java.library.path").split(";")[0] + "\\java.exe";
if(new File(tmp1).exists()) {
return tmp1;
}else if(new File(tmp2).exists()){
return tmp2;
}else if(new File(tmp3).exists()) {
return tmp3;
}else{
String[] paths = System.getenv("PATH").split(";");
for(String path:paths){
if(new File(path + "\\java.exe").exists()){
return path + "\\java.exe";
}
}
}
return "";
}

Error Appending to IsolatedStorageFile

I am having some problems with isolated storage. I am trying to append to a file, but when I use the code below, I get an error about invalid arguments on this line:
IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append,
FileMode.OpenOrCreate, myStore))
I think it has something to do with FileMode.Append. I am trying to append to the file rather than create a new one.
// Obtain the virtual store for the application.
IsolatedStorageFile myStore = IsolatedStorageFile.GetUserStoreForApplication();
// Create a new folder and call it "Folder".
myStore.CreateDirectory("Folder");
// Specify the file path and options.
using (var isoFileStream = new IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append, FileMode.OpenOrCreate, myStore))
{
//Write the data
using (var isoFileWriter = new StreamWriter(isoFileStream))
{
isoFileWriter.WriteLine(textBox1.Text);
isoFileWriter.WriteLine(textBox2.Text);
isoFileWriter.WriteLine(textBox3.Text);
}
}

There is no overload that takes two FileModes. It should be
IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append,
FileAccess.Write, myStore));
Important thing to note about FileMode.Append is:
[FileMode.Append] Opens the file if it exists and seeks to the end of the file, or
creates a new file. Append can only be used in conjunction with Write.
Attempting to seek to a position before the end of the file will throw
an IOException and any attempt to read fails and throws an
NotSupportedException.
which is why FileAccess.Write is used.

It looks like you have FileMode.Append, FileMode.OpenOrCreate. That is two file modes. The argument after the path should be a FileMode and the one after that should be a FileAccess.
That should fix your problem.

Save all files in Visual Studio project as UTF-8

I wonder if it's possible to save all files in a Visual Studio 2008 project into a specific character encoding. I got a solution with mixed encodings and I want to make them all the same (UTF-8 with signature).
I know how to save single files, but how about all files in a project?

Since you're already in Visual Studio, why not just simply write the code?
foreach (var f in new DirectoryInfo(@"...").GetFiles("*.cs", SearchOption.AllDirectories)) {
string s = File.ReadAllText(f.FullName);
File.WriteAllText (f.FullName, s, Encoding.UTF8);
}
Only three lines of code! I'm sure you can write this in less than a minute :-)

This may be of some help.
link removed due to original reference being defaced by spam site.
Short version: edit one file, select File -> Advanced Save Options. Instead of changing UTF-8 to Ascii, change it to UTF-8. Edit: Make sure you select the option that says no byte-order-marker (BOM)
Set code page & hit ok. It seems to persist just past the current file.

In case you need to do this in PowerShell, here is my little move:
Function Write-Utf8([string] $path, [string] $filter='*.*')
{
[IO.SearchOption] $option = [IO.SearchOption]::AllDirectories;
[String[]] $files = [IO.Directory]::GetFiles((Get-Item $path).FullName, $filter, $option);
foreach($file in $files)
{
"Writing $file...";
[String]$s = [IO.File]::ReadAllText($file);
[IO.File]::WriteAllText($file, $s, [Text.Encoding]::UTF8);
}
}

I would convert the files programmatically (outside VS), e.g. using a Python script:
import glob, codecs
for f in glob.glob("*.py"):
    # Pass the loop variable f, not the literal string "f"
    data = open(f, "rb").read()
    if data.startswith(codecs.BOM_UTF8):
        # Already UTF-8
        continue
    # else assume ANSI code page
    data = data.decode("mbcs")
    data = codecs.BOM_UTF8 + data.encode("utf-8")
    open(f, "wb").write(data)
This assumes all files not in "UTF-8 with signature" are in the ANSI code page, which is apparently what VS 2008 assumes as well. If you know that some files have yet other encodings, you would have to specify what these encodings are.

Using C#:
1) Create a new ConsoleApplication, then install Mozilla Universal Charset Detector
2) Run code:
static void Main(string[] args)
{
const string targetEncoding = "utf-8";
foreach (var f in new DirectoryInfo(@"<your project's path>").GetFiles("*.cs", SearchOption.AllDirectories))
{
var fileEnc = GetEncoding(f.FullName);
if (fileEnc != null && !string.Equals(fileEnc, targetEncoding, StringComparison.OrdinalIgnoreCase))
{
var str = File.ReadAllText(f.FullName, Encoding.GetEncoding(fileEnc));
File.WriteAllText(f.FullName, str, Encoding.GetEncoding(targetEncoding));
}
}
Console.WriteLine("Done.");
Console.ReadKey();
}
private static string GetEncoding(string filename)
{
using (var fs = File.OpenRead(filename))
{
var cdet = new Ude.CharsetDetector();
cdet.Feed(fs);
cdet.DataEnd();
if (cdet.Charset != null)
Console.WriteLine("Charset: {0}, confidence: {1} : " + filename, cdet.Charset, cdet.Confidence);
else
Console.WriteLine("Detection failed: " + filename);
return cdet.Charset;
}
}

I have created a function, written in ASP.NET, to change the encoding of files.
I searched a lot, and I also used some ideas and code from this page. Thank you.
Here is the function:
Function ChangeFileEncoding(pPathFolder As String, pExtension As String, pDirOption As IO.SearchOption) As Integer
Dim Counter As Integer
Dim s As String
Dim reader As IO.StreamReader
Dim gEnc As Text.Encoding
Dim direc As IO.DirectoryInfo = New IO.DirectoryInfo(pPathFolder)
For Each fi As IO.FileInfo In direc.GetFiles(pExtension, pDirOption)
s = ""
reader = New IO.StreamReader(fi.FullName, Text.Encoding.Default, True)
s = reader.ReadToEnd
gEnc = reader.CurrentEncoding
reader.Close()
If (gEnc.EncodingName <> Text.Encoding.UTF8.EncodingName) Then
s = IO.File.ReadAllText(fi.FullName, gEnc)
IO.File.WriteAllText(fi.FullName, s, System.Text.Encoding.UTF8)
Counter += 1
Response.Write("<br>Saved #" & Counter & ": " & fi.FullName & " - <i>Encoding was: " & gEnc.EncodingName & "</i>")
End If
Next
Return Counter
End Function
It can be placed in an .aspx file and then called like:
ChangeFileEncoding("C:\temp\test", "*.ascx", IO.SearchOption.TopDirectoryOnly)

if you are using TFS with VS :
http://msdn.microsoft.com/en-us/library/1yft8zkw(v=vs.100).aspx
Example :
tf checkout -r -type:utf-8 src/*.aspx

Thanks for your solutions, this code has worked for me :
Dim s As String = ""
Dim direc As DirectoryInfo = New DirectoryInfo("Your Directory path")
For Each fi As FileInfo In direc.GetFiles("*.vb", SearchOption.AllDirectories)
s = File.ReadAllText(fi.FullName, System.Text.Encoding.Default)
File.WriteAllText(fi.FullName, s, System.Text.Encoding.Unicode)
Next

If you want to avoid this type of error, use the following code:
foreach (var f in new DirectoryInfo(@"....").GetFiles("*.cs", SearchOption.AllDirectories))
{
string s = File.ReadAllText(f.FullName, Encoding.GetEncoding(1252));
File.WriteAllText(f.FullName, s, Encoding.UTF8);
}
Encoding number 1252 is the default Windows encoding used by Visual Studio to save your files.
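Why the source encoding passed to the read call matters can be shown in a few lines. The sketch below uses Python purely for illustration (the answers here are C#), and the two function names are hypothetical. It shows what happens when UTF-8 bytes are read as Windows-1252, and that the damage is reversible as long as nothing has been lost in between.

```python
def misread_utf8_as_cp1252(text):
    """Simulate a tool that reads UTF-8 bytes as if they were Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_cp1252_mojibake(garbled):
    """Undo the misreading: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


# misread_utf8_as_cp1252("café") returns "cafÃ©":
# the two UTF-8 bytes of "é" are shown as two separate Windows-1252 characters.
# repair_cp1252_mojibake("cafÃ©") returns "café".
```

A few byte values are undefined in Python's cp1252 codec, so this round trip can raise an error on some inputs. For files that contain only ASCII characters, both encodings produce identical bytes and no damage occurs.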

Convert from UTF-8-BOM to UTF-8
Building on rasx's answer, here is a PowerShell function that assumes your current files are already encoded in UTF-8 (but maybe with BOM) and converts them to UTF-8 without BOM, therefore preserving existing Unicode characters.
Function Write-Utf8([string] $path, [string] $filter='*')
{
[IO.SearchOption] $option = [IO.SearchOption]::AllDirectories;
[String[]] $files = [IO.Directory]::GetFiles((Get-Item $path).FullName, $filter, $option);
foreach($file in $files)
{
"Writing $file...";
[String]$s = [IO.File]::ReadAllText($file, [Text.Encoding]::UTF8);
[Text.Encoding]$e = New-Object -TypeName Text.UTF8Encoding -ArgumentList ($false);
[IO.File]::WriteAllText($file, $s, $e);
}
}
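A similar BOM-stripping pass can be sketched in Python for anyone working outside PowerShell. This is an assumption-laden illustration, not a port of the function above: the name strip_utf8_bom is hypothetical, and the sketch works on raw bytes, so it only removes the three-byte UTF-8 signature and never re-encodes file contents. Files without the signature are left untouched.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(root, pattern="*"):
    """Remove a leading UTF-8 BOM from matching files under root; return changed paths."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if data.startswith(codecs.BOM_UTF8):
            # Write back everything after the 3-byte signature, byte for byte.
            path.write_bytes(data[len(codecs.BOM_UTF8):])
            changed.append(path)
    return changed
```

For example, strip_utf8_bom("src", "*.cs") would process every .cs file below a directory named src. Working at the byte level means a file in some other encoding cannot be corrupted by a wrong decode step.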

I experienced encoding problems after converting a solution from VS2008 to VS2015. After the conversion, all project files were encoded as ANSI but contained UTF-8 content, and VS2015 recognized them as ANSI files. I tried many conversion tactics, but only this solution worked.
Encoding encoding = Encoding.Default;
String original = String.Empty;
foreach (var f in new DirectoryInfo(path).GetFiles("*.cs", SearchOption.AllDirectories))
{
using (StreamReader sr = new StreamReader(f.FullName, Encoding.Default))
{
original = sr.ReadToEnd();
encoding = sr.CurrentEncoding;
sr.Close();
}
if (encoding == Encoding.UTF8)
continue;
byte[] encBytes = encoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
var utf8Text = Encoding.UTF8.GetString(utf8Bytes);
File.WriteAllText(f.FullName, utf8Text, Encoding.UTF8);
}

The item is removed from the menu in Visual Studio 2017.
You can still access the functionality through File-> Save As -> then clicking the down arrow on the Save button and clicking "Save With Encoding...".
You can also add it back to the File menu through Tools->Customize->Commands if you want to.

I'm only offering this suggestion in case there's no way to automatically do this in Visual Studio (I'm not even sure this would work):
Create a class in your project named 足の不自由なハッキング (or some other unicode text that will force Visual Studio to encode as UTF-8).
Add "using MyProject.足の不自由なハッキング;" to the top of each file. You should be able to do it on everything by doing a global replace of "using System.Text;" with "using System.Text;using MyProject.足の不自由なハッキング;".
Save everything. You may get a long string of "Do you want to save X.cs using UTF-8?" messages or something.
