I have an issue reading records from a flat file because of the ^M special character. When a ^M character is found in the data, it is treated as a new line, and because of this my data is completely messed up in the target system.
I am using FlatFileItemReader to read the file. I can't change the source file. Is there any way to handle this issue?
sample File:
1|2234|3|stu ID|secutiry||rak
1|2243|4|srch|ffh
hhy||kum
1|2234|3|stu ID|secutiry||rak
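Pass the FlatFileItemReader a BufferedReaderFactory that returns a subclass of BufferedReader in which readLine() has been overridden so that it doesn't treat '\r' as a line terminator. (The readLine(boolean) overload inside BufferedReader is package-private, so it is the public readLine() that has to be overridden.)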
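The underlying idea, treating only '\n' as the record terminator so that an embedded '\r' stays inside the field, can be illustrated outside of Spring Batch. The following is a minimal Python sketch of that idea, not the Java BufferedReaderFactory itself; the function name read_records and the sample bytes are made up for illustration.

```python
import io

def read_records(raw: bytes, encoding: str = "utf-8") -> list:
    """Split raw file content into records on LF only.

    newline="\n" makes TextIOWrapper end lines only at LF and
    leave any CR untouched inside the record.
    """
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="\n")
    return [line.rstrip("\n") for line in stream]

# Hypothetical sample: the second record has a stray CR inside a field.
sample = b"1|2234|3|stu ID|secutiry||rak\n1|2243|4|srch|ffh\rhhy||kum\n"
records = read_records(sample)
```

With this input the reader yields two records rather than three, and the CR is still present in the second one, where it can be stripped or replaced as needed.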
Related
My installation needs to check the result of a command run from cmd.exe. To do that, I redirect the output of the command to a text file and then try to read the file, as follows:
// send command to cmd to execute and redirect the result to a text file
// try to read the file
szDir = "D:\\";
szFileName = "MyFile.txt";
if Is(FILEEXISTS, szDir ^ szFileName) then
    listID = ListCreate(STRINGLIST);
    if listID != LIST_NULL then
        if OpenFileMode(FILE_MODE_NORMAL) = 0 then
            if OpenFile(nFileHandle, szDir, szFileName) = 0 then
                // I run into problem here
                while (GetLine(nFileHandle, szCurLine) = 0 )
                    ListAddString(listID, szCurLine, AFTER);
                endwhile;
                CloseFile(nFileHandle);
            endif;
        endif;
    endif;
endif;
The problem is that right after the command is executed and its result is redirected to MyFile.txt, I can set the open file mode and open the file, but I cannot read any text into my list. ListReadFromFile() does not help either. If I open the file, edit it and save it manually, my script works.
After debugging, I figured out that GetLine() returns an error code (-1), which means either that the file pointer is at the end of the file or that some other error occurred. However, FILE_MODE_NORMAL opens the file as read-only and SETS THE FILE POINTER AT THE BEGINNING OF THE FILE.
What did I possibly do wrong? Is this something to do with read/write access of the file? I tried this command without result:
icacls D:\MyFile.txt /grant Administrator:(R,W)
I am using InstallShield 2018 and Windows 10 64-bit, by the way. Your help is much appreciated.
EDIT 1: I suspected the encoding and tried a few things:
1. After running "wslconfig /l", the content of MyFile.txt opened in Notepad++ shows no encoding, but still appears normal and readable. I tried to convert the content to UTF-8 but it did not work.
2. If I add something to the file (echo This line is appended >> MyFile.txt), the encoding changes to UTF-8, but the content from step 1 changes as well. A NUL (\0) is added between every character and even replaces the newline character. Maybe this is why GetLine() failed to read the file.
Workaround: after step 1, I run "find "my_desired_content" MyFile.txt" > TempFile.txt and read TempFile.txt (which is encoded in UTF-8).
My ultimate goal is to check whether "my_desired_content" appears in the result of "wslconfig /l", so this is fine. However, what I don't understand is why MyFile.txt and TempFile.txt, which are both created by cmd commands, are encoded differently.
The problem is due to the contents of the file. Assuming this is the file generated by your linked question, you can examine its contents in a hex editor to find out the following facts:
Its contents are encoded in UTF-16 (LE) without a BOM
Its newlines are encoded as CR or CR CR instead of CR LF
I thought the newlines would be more important than the text encoding, but it turns out I had it backwards. If I change each of these things independently, GetLine seems to function correctly for either CR, CR CR, or CR LF, but only handles UTF-16 when the BOM is present. (That is, in a hex editor, the file starts with FF FE 57 00 instead of 57 00 for a file starting with the character W.)
I'm at a bit of a loss for the best way to address this. If you're up for a challenge, you could read the file with FILE_MODE_BINARYREADONLY, and can use your extra knowledge about what should be in the file to ensure you interpret its encoding correctly. Note that for most of UTF-16, you can create a single code unit by combining two bytes in the following manner:
szResult[i] = (nHigh << 8) + nLow;
where nHigh and nLow are probably values like szBuffer[2*i + 1] and szBuffer[2*i], assuming you filled a STRING szBuffer by calling ReadBytes.
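To make the byte-pair arithmetic concrete, here is a small Python sketch of the same combination step (InstallScript itself is not shown). The function name decode_utf16le_units is made up for illustration, and the sketch assumes every character fits in a single code unit, so surrogate pairs are not handled.

```python
def decode_utf16le_units(buffer: bytes) -> str:
    """Combine byte pairs (low byte first) into UTF-16 code units.

    Assumes only Basic Multilingual Plane characters; surrogate
    pairs are not handled in this sketch.
    """
    chars = []
    for i in range(len(buffer) // 2):
        low = buffer[2 * i]        # first byte of the pair
        high = buffer[2 * i + 1]   # second byte of the pair
        chars.append(chr((high << 8) + low))
    return "".join(chars)

# "W" then "S" in UTF-16 LE without a BOM: 57 00 53 00
raw = bytes([0x57, 0x00, 0x53, 0x00])
text = decode_utf16le_units(raw)
```

For this input the hand-rolled loop agrees with Python's built-in raw.decode("utf-16-le"), which is a convenient way to check the byte order.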
Other unproven ideas include editing it in binary to ensure the BOM (FF FE) is present, figuring out ways to ensure the file is originally created with the BOM, figuring out ways to create it in an alternate encoding, finding another command you can invoke to "fix" the file, or lodging a request with the vendor (my employer) and hoping the development team changes something to better handle this case.
Here's an easier workaround. If you can safely assume that the command will append UTF-16 characters without a signature, you can append this output to a file that has just a signature. How do you get such a file?
You could create a file with just the BOM in your development environment, and add it to your Support Files. If you need to use it multiple times, copy it around first.
You could create it with code. Just call the following (error checking omitted for clarity):
OpenFileMode(FILE_MODE_APPEND_UNICODE);
CreateFile(nFileHandle, szDir, szFileName);
CloseFile(nFileHandle);
and if szDir ^ szFileName didn't exist, it will now be a file with just the UTF-16 signature.
Assuming this file is called sig.txt, you can then invoke the command
wslconfig /l >> sig.txt to write to that file. Note the doubled >> for append. The resulting file will include the Unicode signature you created ahead of time, plus the Unicode data output from wslconfig, and GetLine should interpret things correctly.
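The effect of the signature file can be checked with a short Python sketch. It only simulates the two steps: the appended bytes are a stand-in for the command's BOM-less UTF-16 LE output, and the temporary path is arbitrary.

```python
import codecs
import os
import tempfile

# Stand-in for the BOM-less UTF-16 LE text that the command would append.
command_output = "my_desired_content\r\n".encode("utf-16-le")

path = os.path.join(tempfile.mkdtemp(), "sig.txt")

# Step 1: create a file that contains only the signature (FF FE).
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE)

# Step 2: append the BOM-less data, as ">> sig.txt" would.
with open(path, "ab") as f:
    f.write(command_output)

# A BOM-aware reader can now pick the byte order from the signature.
# newline="" keeps the original line endings untranslated.
with open(path, "r", encoding="utf-16", newline="") as f:
    content = f.read()
```

The "utf-16" codec consumes the signature, so content holds just the appended text.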
The biggest problem here is that this hardcodes around the behavior of wslconfig, and that behavior may change at any point. This is why Christopher alludes to recommending an API, and I agree completely. In the meantime, you could try to make this more robust by invoking it in a cmd /U (but my understanding of what that does or guarantees is fuzzy at best), or by trying the original way first and then with the BOM.
This whole WSL thing is pretty new. I don't see any APIs for it, but rather than screen-scraping command output you might want to look at this registry key:
HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Lxss
It seems to have the list of installed distros that come from the store. Coming from the store probably explains why this is HKCU and not HKLM.
A brave new world.... sigh.
This is a common issue I have and my solution is a bit brash. So I'm looking for a quick fix and explanation of the problem.
The problem is that when I save a spreadsheet in Excel (Mac 2011) as a tab-delimited file, it seems to work perfectly fine, until I try to parse the file line by line using Perl. For some reason it slurps the whole document in as one line.
My brutish solution is to open the file in a web browser and copy and paste the information into the tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline in the end of the file before doing this fix and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
~Thanks!~
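The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a CarriageReturn+LineFeed (CRLF), *NIX systems use only a LineFeed (LF), and classic Mac OS used only a CarriageReturn (CR). Excel for Mac 2011 writes the CR-only form, which is why a reader that splits on LF sees the whole file as a single line.
These characters can be represented in RegEx as \r\n, \n or \r (respectively).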
Sometimes, to hash through a text file, you need to parse New Line characters. Try this for DOS-to-UNIX in perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed:
$ sed 's/^M$//' input.txt > output.txt
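For comparison, the same normalization can be sketched in Python on raw bytes, covering the CR-only case as well as CRLF. The function name normalize_newlines and the sample data are made up for illustration.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF (DOS) and lone CR (classic Mac) line endings to LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Hypothetical two-row, tab-delimited samples.
mac_style = b"a\tb\rc\td\r"
dos_style = b"a\tb\r\nc\td\r\n"
unix_style = b"a\tb\nc\td\n"
```

All three samples normalize to the same LF-terminated bytes, so the result can be parsed line by line regardless of where the file came from.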
Found a pretty simple solution to this. Copy data from Excel to clipboard, paste it into a google spreadsheet. Download google spreadsheet file as a 'tab-separated values .tsv'. This gets around the problem and you have tab delimiters with an end of line for each line.
Yet another solution ...
for a tab-delimited file, save the document as a Windows Formatted Text (.txt) file type
for a comma-separated file, save the document as a Windows Comma Separated (.csv) file type
Perl has a useful regex pattern \R which will match any common line ending. It actually matches any vertical whitespace -- the same as \v -- or the CR LF combination, so it's the same as \r\n|\v
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which will give you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead).
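The slurp-and-split approach translates to other languages too. Below is a rough Python equivalent, assuming the whole file fits in memory; the pattern spells out the same alternatives as \R, and the names LINE_BREAK and split_records are made up for illustration.

```python
import re

# Same alternatives as Perl's \R: CR LF as a unit, or any single
# vertical whitespace character (LF, VT, FF, CR, NEL, LS, PS).
LINE_BREAK = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_records(text: str) -> list:
    """Split text into records on any common line ending."""
    records = LINE_BREAK.split(text)
    # Perl's split drops trailing empty fields; mimic that here.
    while records and records[-1] == "":
        records.pop()
    return records

# Hypothetical input mixing DOS, classic Mac and Unix line endings.
mixed = "a\tb\r\nc\td\re\tf\n"
rows = split_records(mixed)
```

Listing \r\n before the character class matters: otherwise a CR LF pair would be split twice and produce an empty record.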
Another option is the PerlIO::eol module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are
Once you have loaded the module with use PerlIO::eol you can use it in an open statement
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform
I have some 100,000+ files with partially mangled data, mixed text+binary files (a single file of jpg image data with http headers), where some header fields have dos style ^M^J line termination, and some only unix style ^J. When vim opens a file like this, it treats it as unix format. So all header lines where there is no ^M, one needs to be added. But this has proven to be very tough.
:1,11s/Cache-Control:.*\zs^M\{0,}$/^M/
doesn't work, and I've tried all kinds of variations of that, even using \=printf("%s","^M") as the substitution string. But the result is always a new empty line in the file.
The ONLY way I'm able to add a ^M with a command at all is
:exe "normal A\<c-q>\<c-m>\<Esc>"
OK, so one way would be to first remove any existing ^M and then add it with the previous command. But is there a more elegant, one-command solution?
(So that there would be no more misunderstandings, here's a short example of such a file:
HTTP/1.1 200 OK
Server: Apache/2.2.3
(more lines...)
Cache-Control: public, max-age=214748
(more lines...)
ÿØÿá Exif II* ÿì
)
Edit/solution: regarding 100,000+ files, here's a version (regarding missing ^M only on cache-control lines) that only matches if ^M is missing (as not all files are mangled, this will save large amounts of time together with "update!"):
:1,11s/^Cache-Control:.\{-}\zs\(^M*$\)\(^M\)\#<!/\^M/i
A single command might look like :v/^M/s/$/\^M/. This uses <C-v><C-m>, which is to say... it inserts a literal ^M character that's escaped with a backslash.
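For a batch of this size, a script outside of Vim is another option. The following Python sketch is an alternative technique, not the Vim approach above: it works on raw bytes, appends a CR to any of the first few lines that lacks one, and leaves the rest of the file untouched. The function name add_missing_cr is made up, and the default of 11 header lines is taken from the :1,11 range in the question; it is an assumption that must hold for each file, since binary data inside that range would be altered.

```python
def add_missing_cr(data: bytes, header_lines: int = 11) -> bytes:
    """Make the first `header_lines` lines end in CR LF.

    Everything after those lines (the binary payload) is returned
    byte for byte as it was.
    """
    parts = data.split(b"\n", header_lines)
    # All parts except the last were terminated by LF in the input.
    head, rest = parts[:-1], parts[-1]
    fixed = [line if line.endswith(b"\r") else line + b"\r" for line in head]
    return b"\n".join(fixed + [rest])

# Hypothetical short file: one header line is missing its CR.
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Cache-Control: public, max-age=214748\n"
          b"\r\n"
          b"\xff\xd8\xff\xe1\nExif")
repaired = add_missing_cr(sample, header_lines=3)
```

Comparing the result with the input before writing gives the same saving as "update!": files that are not mangled come back unchanged and can be skipped.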
I frequently deal with UTF-16LE files created on Windows, which have \r\n line endings. There is no problem converting the file to UTF-8 by using:
File.new(filepath, 'r:utf-16le:utf-8')
But this of course does not get rid of the \r. The way I currently get rid of them is with
str.gsub("\r", "")
But it would be nice to take care of it while reading the file in. String#encode has :cr_newline, :crlf_newline, and :universal_newline options which convert all newlines to a desired kind of newline. Is there a way to apply these or similar options while reading in a file?
The method IO#gets takes an optional argument that allows you to pass a string to define how to separate the lines:
file = File.new(filepath, 'r:utf-16le:utf-8')
while (line = file.gets("\r\n"))
  line.chomp!("\r\n")  # gets keeps the separator, so strip it here
  ...
end
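As a point of comparison only, this is how the same task (decode UTF-16LE and drop the \r while reading) looks in Python, where newline translation is part of the text layer. It is an illustration of the idea in another language, not a Ruby API; the sample bytes are made up.

```python
import io

# Stand-in for a UTF-16LE file with Windows line endings.
raw = "first\r\nsecond\r\n".encode("utf-16-le")

# newline=None enables universal newlines: \r\n, \r and \n in the
# decoded text are all delivered to the caller as \n.
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16-le", newline=None)
lines = [line.rstrip("\n") for line in stream]
```

The translation happens after decoding, so it is not confused by the 0x0D and 0x0A bytes being interleaved with zero bytes in the UTF-16 data.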
I'd be happy with ZipInputStream taking indecent liberties with the line endings that are stored in a file if it would at least get them right for the platform I'm storing the file on. Unfortunately, when I pull a text file (.txt, .cpp, etc.) out of a zip, the \n (0x0A) gets replaced with a \r\n (0x0D 0x0A) and, as you can imagine, this is causing me a great deal of trouble.
Is there a flag I can set to tell it either to avoid changing the line endings altogether or to use one of my choosing?
Thanks.
(I've checked the zip file, my creation of it, etc. I've extracted it using other zip tools and verified that it is archived properly. I've stepped through my project with rdebug and seen that the ZipInputStream call to read() is returning \r\n for line endings.)
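If you have an open(filename) or open(filename,"r") call in your code, try replacing it with open(filename,"rb"). On Windows, text mode translates line endings, whereas binary mode passes the bytes through unchanged.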