Character/byte count and file size in Windows properties differ - filesize

I have a txt file that was generated by a PHP script. The character count is shown correctly as 3999 bytes/characters when I check it through my script.
When I checked the same content by copying and pasting it into MS Word, it also showed 3999 characters (with spaces).
However, when I look at the Windows file properties of the same txt file, the size is shown as 4.17 KB (4,278 bytes).
I am just wondering what could be the reason for such a big difference. If someone can clarify this, it would be great.
Thanks in advance.
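One quick way to see both numbers for the same file is to read it once as text and once as raw bytes, for example in PowerShell (a sketch; the path is hypothetical):
$path  = 'C:\temp\output.txt'
$text  = [System.IO.File]::ReadAllText($path)    # characters, after decoding
$bytes = [System.IO.File]::ReadAllBytes($path)   # raw bytes on disk
"Characters: $($text.Length)   Bytes on disk: $($bytes.Length)"
Multi-byte UTF-8 characters and a UTF-8 BOM add bytes without adding characters, and each CR+LF line break is two bytes even though counters like Word's "characters (with spaces)" typically don't count line breaks at all.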

Related

Copy / paste on Windows 10 corrupts data by encoding first few characters

I'm currently having an issue with copy / pasting on Windows corrupting data by encoding the first few characters.
For instance, let's say I have a file test.txt with content:
This is the content of a test file with some words and some letters. This is a second sentence with words.
If I copy / paste this file, the content ends up being:
ÃJ8 J#K‘ÍÖg•Ã‘ÍÖj j j ת ome letters. This is a second sentence with words.
The result is always the same.
I tried on both my hard drives as well as using Powershell command Copy-Item:
Copy-Item -Path "C:\Users\<me>\test.txt" -Destination "C:\Users\<me>\test-new.txt"
And it gives exactly the same result. It seems that this applies to all kinds of files, since copy / pasting some other files (JPG, RAR) results in an error when opening them.
I didn't have this issue before; it apparently started occurring a few days ago. Do you have any idea what could be causing this? Or how to fix it?
Thanks a lot!
The code page should match your locale, so check it as well as the registry.
For example, on my machine (UK locale) the code page is 850 (Western Europe):
>chcp
Active code page: 850
Also check this registry key:
Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
The value of the OEMCP key in my case is 65001 (which is UTF-8).
There in the registry you can change it permanently should the need arise (this requires a restart).
Also look at the encoding shown in the bottom status bar of your code editor (for example, VS Code): in my experience it sometimes switches to an encoding with a BOM without my intervention, and that is when weird characters appear. Change it back to UTF-8. Checking the registry, the code page, and the status bar in your editor solves the problem in most cases.
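As a sketch of the same checks from a PowerShell prompt (OEMCP is the OEM/console code page mentioned above; ACP is the ANSI code page stored under the same key):
chcp
Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' | Select-Object OEMCP, ACP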

Save special characters to a CSV that can be opened both on PC (Excel) and Mac (Numbers)

I have a script (that I run on a Mac) that writes degrees C (the unit for temperature in Celsius) to a CSV file. I want this file to be viewable in both Excel and Numbers. The problem is that it opens fine in Numbers but shows weird characters in Excel (on Windows; I haven't tested Excel on Mac).
I tried both ℃ (the single Unicode character) and °C (a degree sign followed by a C). In Excel the character comes out garbled either way.
I'm pretty sure the CSV file is UTF-8 encoded, so I don't know what causes the issue.
Here's something else I noticed: if I save the file as .txt instead of .csv and open it in Excel, an import wizard shows up. If I leave everything at the defaults and choose 'Finish', the symbol does show up correctly. But that's not ideal, because my users won't be able to double-click the file to open it.
What is the best way to have the special character display in both programs using the same file?
An answer in this post resolved my issue:
Is it possible to force Excel recognize UTF-8 CSV files automatically?
I had to add \uFEFF (a UTF-8 byte-order mark) to the very beginning of my CSV file.
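If you generate the file yourself, here is a minimal sketch of writing a CSV with that BOM, expressed in PowerShell only for illustration (the path and column names are made up; the ℃ character is produced from its code point so the script file's own encoding doesn't matter):
$degC = [char]0x2103                                        # U+2103 DEGREE CELSIUS
$rows = "Label,Temperature`r`nOutside,23.5 $degC`r`n"
$utf8WithBom = New-Object System.Text.UTF8Encoding($true)   # $true = emit the BOM (\uFEFF)
[System.IO.File]::WriteAllText('C:\temp\readings.csv', $rows, $utf8WithBom)
With the BOM in place, Excel detects the file as UTF-8 when it is double-clicked.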

Include pictures while converting ps1 to exe with PowerGUI

I use the PowerGUI editor to convert a ps1 file to an exe file. The reason is that I don't want people to see my source code. The script includes its own little GUI with a picture on it. My problem is that after converting the script to an exe, the picture is only shown if it still exists at a specific path. If I delete or move the picture from that path, it won't be shown when the exe starts.
How can I include the picture in the exe? I want to end up with only one file...
One way you could do this is by converting your image into a Base64 string, using the following:
[convert]::ToBase64String((get-content C:\YourPicture.jpg -encoding byte)) > C:\YourString.txt
With the string produced in the text file "C:\YourString.txt", you can copy and paste it into your code and load it into a picture object on the form like so:
$imageBytes = [System.Convert]::FromBase64String('
iVBORw0KGgoAAAANSUhEUgAAAfQAAACMCAYAAACK0FuSAAAABGdBTUEAALGPC/xhBQAAAAlwSFlz
AAAXEQAAFxEByibzPwAAAAd0SU1FB98DFA8VLc5RQx4AANRpSURBVHhe7J0FmBTX0oZ7F1jc3d3d
nU+dOtXw4MGDZfft25fmyJEjPu+//76Xqzke8pCHPOQhD3lIQGxxcGSapvHdd9+FF3hHEdjGFHgn
....... Many more lines of string .......
OqelFQzDMAzD/CZoztADbUwhUm0JERoXCNfEQYhyPAQryiBIUQSBiiTwl1sb3skwDMMwzG+KWLVe
jTEBH6U7JvJ0CJQXoaGPQMWBm5W+Va27k14MwzDMfwCA/wfUstOLO+nBIAAAAABJRU5Jggg==')
$logo.Image = [System.Drawing.Image]::FromStream((New-Object System.IO.MemoryStream(,$imageBytes)))
Doing this means that your image is stored within the code itself and doesn't need to be loaded from a file on disk.
Note: Make sure that the picture's size on disk is as small as you can get it, as producing the string can take some time and can turn out thousands of lines long. I would recommend only using a picture that is less than 75 kilobytes in size. You can do it with a larger one, but it will take a long time to process.
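For reference, here is a minimal self-contained sketch of the whole idea (assumptions: Windows Forms is available and C:\YourPicture.jpg exists; in the packaged exe you would paste the pre-generated string instead of calling ToBase64String at runtime):
Add-Type -AssemblyName System.Drawing, System.Windows.Forms

# One-off step on your own machine: produce the Base64 text to embed in the script.
$base64 = [System.Convert]::ToBase64String([System.IO.File]::ReadAllBytes('C:\YourPicture.jpg'))

# At runtime: rebuild the image from the embedded string; no file on disk is needed.
$bytes  = [System.Convert]::FromBase64String($base64)
$stream = New-Object System.IO.MemoryStream(,$bytes)

$logo = New-Object System.Windows.Forms.PictureBox
$logo.SizeMode = 'AutoSize'
$logo.Image = [System.Drawing.Image]::FromStream($stream)

$form = New-Object System.Windows.Forms.Form
$form.Controls.Add($logo)
[void]$form.ShowDialog()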

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).
If step 2 works, you may already be halfway done. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If step 2 doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, slightly more inconvenient methods available too, but this might already do the job:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, only pages 1-10 (as a proof of concept, to see if it works at all). To extract every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
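For example, a full-document run might look like this (the output file name is just a placeholder):
pdftotext.exe -layout -eol dos -enc UTF-8 -nopgbrk Monster-PDF-sanitized.pdf Monster-PDF-sanitized.txt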
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a custom encoding vector), you should ask a new question describing the details of your findings so far. Then you'll need to resort to bigger calibres to shoot down the problem.
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure whether any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine which parts are rows and which are columns; that depends on the PDF itself.
The Java libraries are the most robust, and they may do more than just extract text. So I would look into JRuby with iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?

Truncated files when copying CSV files using FileSystemObject

I am helping my son write a program to format files to load into another system. I have done this before with no trouble. Now I have a 13 KB comma-delimited text file, and I am copying it using FSO to another file with a .csv extension. For some reason the new file always stops at the same place, about 6 records from the end of the original file. I thought it might be something to do with the record after the line where it stopped, so I moved that record elsewhere in the file. No change: it stopped at the same place. Then I moved the records from above the point where it stopped. Still the same problem. It stops at 13 KB and leaves off about 6 records. The only thing I can think of is file size, but it is below the limit of the VB CopyFile. I have imported the original file into Excel with no problem. I have renamed the file and opened it in Excel with no problem. Please give me an idea of where to go next.
I've heard of this happening before with FSO, but I haven't heard of a solution (or a cause, for that matter). If you're using VB.NET, you can use the My.Computer.FileSystem.CopyFile method instead of FSO. If you're using VB6, you can also copy a file this way, although it's not very elegant:
Dim s As String

' Read the whole source file into a string as raw bytes
Open sourcename For Binary As #1
s = String(LOF(1), " ")
Get #1, , s
Close #1

' Write the string back out to the destination file
Open destname For Binary As #1
Put #1, , s
Close #1
