I have a question: how can I restore a PDF file if all I have is its ASCII output?
Example:
%PDF-1.3
%���������
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
x�ѽ
�0�ݧ8O�����[�AAqp� �jK|{S�"�f�2���[�
�(M#���#�FFIw�=*��?J4'�P�y^TP`�Q�
+�i�E�8ψ�g���º��(6�֭,���s0�T��ZL�~�e�.EA��`J�f��<��M�
[...]
0000120481 00000 n
0000122448 00000 n
trailer
<</Size 94 /Root 57 0 R /Prev 116103 /Info 1 0 R>>
startxref
122488
%%EOF
That's the beginning and end of the output I have, and I need to restore it to a readable form. I tried a few things, but had no luck.
It is impossible; the information was lost.
You can't represent arbitrary binary data as printable text using ASCII encoding at a one-byte-to-one-character ratio.
The ASCII table contains many non-printable characters that can be suppressed (or mangled) when the PDF's binary contents are converted to text, destroying the original data.
Quoted-Printable encoding and Base64 encoding are more suitable for such an application.
Check this out: Binary-to-text_encoding
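As a quick illustration, here is a minimal Ruby sketch (the file names are hypothetical) of why a binary-to-text encoding such as Base64 works where a raw one-byte-per-character dump does not: the encoded form is pure printable ASCII, and it decodes back byte-for-byte.

require 'base64'

data = File.binread('original.pdf')        # arbitrary bytes, including non-printables
text = Base64.strict_encode64(data)        # pure printable ASCII, about 4/3 the size
File.binwrite('restored.pdf', Base64.strict_decode64(text))
# restored.pdf is byte-identical to original.pdf; a raw textual dump of the
# bytes, as in the question, cannot be reversed this way.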
My Tcl source files are in UTF-8. Tclhttpd would not send national characters properly, so I modified it a bit. However, I also send binary content such as JPEG images, and sometimes binary chunks are present in my otherwise UTF-8 HTML. I have difficulty calculating a Content-Length that matches exactly what the browser receives (otherwise trailing characters clobber the next request's headers, or the browser keeps waiting 30 seconds per request until a timeout).
In other words: can I find out how many bytes puts $socket actually wrote into the socket?
I have discovered a particular 11-byte sequence that messes up counting:
proc dump3 string {
    binary scan $string c* c
    binary scan $string H* hex
    return [sdump $string]\n$c\n$hex
};# dump3
proc Httpd_ReturnData {sock type content {code 200} {close 0}} {
    global Httpd
    upvar #0 Httpd$sock data
    #...skip non-pertinent code...
    set content \x4f\x4e\xc2\x00\x03\xff\xff\x80\x00\x3c\x2f
    #content=ONÂÿÿ�</
    #79 78 -62 0 3 -1 -1 -128 0 60 47
    #4f4ec20003ffff80003c2f
    puts content=[dump3 $content]
    puts utf8=[dump3 [encoding convertto utf-8 $content]]
    if {[catch {
        puts "string length=[string length $content] type=$type"
        puts "stringblength=[string bytelength $content]"
        set len [string length $content]
        if [string match -nocase *utf-8* $type] {
            fconfigure $sock -encoding utf-8
            set len [string bytelength $content]
        }
        puts "len=$len fcon=[fconfigure $sock]"
        HttpdRespondHeader $sock $type $close $len $code
        HttpdSetCookie $sock
        puts $sock ""
        if {$data(proto) != "HEAD"} {
            ##fconfigure $sock -translation binary -blocking $Httpd(sockblock)
            ##native: -translation {auto crlf}
            fconfigure $sock -translation lf -blocking $Httpd(sockblock)
            puts -nonewline $sock $content
        }
        Httpd_SockClose $sock $close
    } err]} {
        HttpdCloseFinal $sock $err
    }
}
The output on the console is:
content=ONÂÿÿ�</
79 78 -62 0 3 -1 -1 -128 0 60 47
4f4ec20003ffff80003c2f
utf8=ON�ÿÿ�</
79 78 -61 -126 0 3 -61 -65 -61 -65 -62 -128 0 60 47
4f4ec3820003c3bfc3bfc280003c2f
string length=11 type=text/html;charset=utf-8
stringblength=17
len=17 fcon=-blocking 0 -buffering full -buffersize 16384 -encoding utf-8 -eofchar {{} {}} -translation {auto crlf} -peername {128.0.0.71 128.0.0.71 55305} -sockname {128.0.0.8 gen 8016}
HttpdRespondHeader 17
The resulting Content-Length: 17 is too large, so the browser keeps waiting. If only I could know beforehand how many bytes puts will produce from my string, the rest would be easy. Is there a way?
For data going over HTTP, the content length should be the number of bytes in the data as observed on the wire. When working with Httpd_ReturnData you need to ensure that you provide it the binary data to transfer; it does not handle encoding the data for you.
Sending binary data with a correct length is actually easy:
set binaryData [...]
Httpd_ReturnData $sock "application/octet-stream" $binaryData
# There are many other binary encodings; that's just the most universal one
# Choose the right one for your application, of course
To send text data with a length, you need to do a little more work with encoding convertto:
set textData [...]
Httpd_ReturnData $sock "text/plain; charset=utf-8" \
[encoding convertto utf-8 $textData]
# Similarly, text/plain is a decent fallback here too
(Yes, if you choose a different encoding then you should mention that in both places. You probably ought to use UTF-8 for all text content in this day and age.)
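The key point is that once the data has been through encoding convertto, it is a byte string, so string length counts bytes. A minimal sketch (the sample string is illustrative):

set textData "naïve résumé"                        ;# 12 characters, 3 of them non-ASCII
set bytes    [encoding convertto utf-8 $textData]  ;# now a binary (byte) string
puts [string length $textData]                     ;# prints 12 (characters)
puts [string length $bytes]                        ;# prints 15 (bytes); this is the Content-Length

Then write $bytes to a socket configured with -translation binary so that no further transformation happens on output.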
If you can pull the data from a file, you should do so; Httpd_ReturnFile is more efficient than Httpd_ReturnData as it can move the data using efficient data transfer techniques. If sending a text file, you need to be careful to describe the encoding of the file correctly. By far the easiest way to do that is by convention, such as deciding that all text files on your system are UTF-8...
You should virtually never use string bytelength, as that reports in units that are one of Tcl's internal-only encodings (a lightly-denormalized almost-UTF-8). The measure it returns is only correct when you're doing something very weird like generating C code that needs to know buffer sizes that contain strings that will be fed into Tcl's implementation, which is very much not what you're doing (I've only done that sort of thing once in more than 20 years of using Tcl; I've never heard of another legitimate use). I believe it is deprecated precisely because it has a bunch of subtle bugs in how it is used by all too many people.
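You can see the mismatch in the trace above: the 11-character test string converts to 15 bytes of real UTF-8 (4f4ec3820003c3bfc3bfc280003c2f), but string bytelength reports 17 because Tcl's internal encoding stores each \x00 as two bytes (C0 80, the "lightly-denormalized" part). Announcing Content-Length: 17 while sending 15 bytes is exactly why the browser waits for 2 bytes that never arrive.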
I think I'm exceeding the EPS maximum line limit:
I'm generating an EPS programmatically that consists of a grid of pictures.
My EPS has this structure:
%!PS-Adobe-3.0 EPSF-3.0
.
.
%%BeginProlog
%%EndProlog
%%Page: 1 1
%%Begin Raster Image. Index: 0
save
449 2576 translate
0 rotate
-282 -304 translate
[1 0 0 1 0 0] concat
0 0 translate
[1 0 0 1 0 0] concat
0 0 translate
userdict begin
DisplayImage
0 0
564 608
12
564 608
0
0
FBDBB9FBDCBCFDDBBAFFD8B2FFD7A9FED4A1FCD29CFDD09EFED0A2FFD0A6FFCDA3FFCBA0FFCBA0...
EED79CEBD09CEDD19EEED2A1EFD3A3F0D4A5F0D4A6F0D4A7F1D4A4F3D4A0F3D49F
end
restore
%%End Raster Image
%%Begin Raster Image. Index: 1
.
.
end
restore
%%End Raster Image
%%Begin Raster Image. Index: 2
etc
So the thing is: if I write up to 4 images to the EPS, everything works fine, but when I try to write a 5th, the EPS won't open in any EPS viewer, including Adobe Illustrator ("the operation cannot complete because of an unknown error").
I tried using different images to make sure the particular images were OK and got the same result: as long as I write 4 images (a 105,825-line file) everything works, but with 5 (a 132,253-line file) it fails.
Is it possible that I'm exceeding a maximum line limit for EPS?
These are the files in question, if you'd like to analyze them:
the one that works -> https://files.fm/u/bfn2d32m and the one that doesn't ->
https://files.fm/u/4gbybr3y
There's no 'line limit' in PostScript or EPS, so you can't be hitting that.
When I run your file through Ghostscript, it throws an error: /undefined in yImage (I'd suggest you debug PostScript using a proper PostScript interpreter, not Adobe Illustrator).
This suggests to me that one of your images declares more data than you have supplied, so the interpreter runs off the end of the data, consuming parts of the program, until it has read sufficient bytes from currentfile to satisfy the request. At that point it starts processing the file as PostScript again, but the file pointer now points at the 'yImage' of the next 'DisplayImage'. Since you haven't defined a 'yImage' key, this naturally gives you an 'undefined' error.
From your description, the culprit is likely the 4th image, since adding the 5th triggers the error. Note that if your program terminates without supplying enough data (so the interpreter reaches EOF), the data supplied so far will still be drawn. So your 4th image might look correct even when it isn't, provided it isn't followed by any further program code.
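As a quick sanity check on the data length (an assumption on my part: a 564x608 DirectClass image with no compression and 3-byte hex color packets, as in the DisplayImage prolog below), the interpreter expects 564 × 608 × 3 = 1,028,736 bytes of pixel data, i.e. 2,057,472 hex digits. If the data you emit for an image falls short of that, the interpreter reads on into the following %%Begin Raster Image section, exactly as described above.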
A style note: PostScript is a stack-based language, so normally one would proceed by pushing values onto the stack and reading them from there, instead of executing the 'token' operator.
So your input would be more like:
0 0
564 608
12
564 608
0
0
DisplayImage
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
And the DisplayImage code would be:
/DisplayImage
{
  %
  % Display a DirectClass or PseudoClass image.
  %
  % Parameters:
  %   x & y translation.
  %   x & y scale.
  %   label pointsize.
  %   image label.
  %   image columns & rows.
  %   class: 0-DirectClass or 1-PseudoClass.
  %   compression: 0-none or 1-RunlengthEncoded.
  %   hex color packets.
  %
  gsave
  /buffer 512 string def
  /byte 1 string def
  /color_packet 3 string def
  /pixels 768 string def
  /compression exch def
  /class exch def
  /rows exch def
  /columns exch def
  /pointsize exch def
  scale
  translate
This avoids having to use the token operator at all for the scale and translate operations, for instance.
I'm trying to print the first 5 lines from a set of large (>500 MB) CSV files into small header files in order to inspect the content more easily.
I'm using Ruby to do this, but each line comes out padded with extra Chinese characters, like this:
week_num type ID location total_qty A_qty B_qty count㌀㐀ऀ猀漀爀琀愀戀氀攀ऀ㤀㜀ऀ䐀䔀开伀渀氀礀ऀ㔀㐀㜀㈀ ㌀ऀ㔀㐀㜀㈀ ㌀ऀ ऀ㤀㈀㔀㌀ഀ
44 small 14 A 907859 907859 0 550360㐀ऀ猀漀爀琀愀戀氀攀ऀ㐀㈀ऀ䐀䔀开伀渀氀礀ऀ㌀ ㈀㜀㐀ऀ㌀ ㈀
The first few lines of input file are like so:
week_num type ID location total_qty A_qty B_qty count
34 small 197 A 547203 547203 0 91253
44 small 14 A 907859 907859 0 550360
41 small 421 A 302174 302174 0 18198
The strange characters appear to be Line 1 and Line 3 of the data.
Here's my Ruby code:
num_lines = ARGV[0]
fh = File.open(file_in, "r")
fw = File.open(file_out, "w")
until (line = fh.gets).nil? or num_lines == 0
  fw.puts line if outflag
  num_lines = num_lines - 1
end
Any idea what's going on, and what I can do to simply stop at the line-end character?
Looking at the input/output files in hex (a useful suggestion by user1934428):
Input file: each character is two bytes.
Output file: note the NULL (00) between each single-byte character...
Ruby version 1.9.1
The problem is an encoding mismatch, which happens because the encoding is not explicitly specified in the read and write parts of the code. Read the input CSV as a binary file ("rb") with UTF-16LE encoding, and write the output in the same format.
num_lines = Integer(ARGV[0])   # convert from string, or the == 0 test never fires
# ****** Specifying the right encodings <<<< this is the key
fh = File.open(file_in, "rb:utf-16le")
fw = File.open(file_out, "wb:utf-16le")
until (line = fh.gets).nil? or num_lines == 0
  fw.puts line
  num_lines = num_lines - 1
end
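If you'd rather have the extracted header readable in an ordinary UTF-8 editor, a variant sketch (assuming the input really is UTF-16LE; file_in and file_out as above) transcodes on the fly using Ruby 1.9's external:internal encoding syntax:

num_lines = Integer(ARGV[0])
File.open(file_in, "rb:utf-16le:utf-8") do |fh|  # read UTF-16LE, yield lines as UTF-8
  File.open(file_out, "w:utf-8") do |fw|
    fh.each_line do |line|
      break if num_lines <= 0
      fw.puts line
      num_lines -= 1
    end
  end
end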
Useful references:
Working with encodings in Ruby 1.9
CSV encodings
Determining the encoding of a CSV file
0186 (hex) is the Unicode code point. Where do 198 and 134 come from? And how can I go the other way around, from these byte values back to Unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack, which yields yet another seemingly arbitrary number. Where does that come from? Is it even correct? From the Ruby 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
You can find your answers here: Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186).
Note that 0x186 (hexadecimal) === 390 (decimal).
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format — 8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
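To go the other way around, a small sketch (Ruby 1.9+ syntax for force_encoding; pack('U') behaves the same way in 1.8.7):

bytes = [198, 134]
s1 = bytes.pack('C*').force_encoding('UTF-8')  # UTF-8 bytes -> string, => "Ɔ"
s2 = [0x186].pack('U')                         # code point -> UTF-8 string, => "Ɔ"
s2.unpack('U*')                                # back to code points, => [390]

(In Ruby 1.8, indexing a string with an integer returns a byte, which is why c[0][0] and c[0][1] gave you the two UTF-8 bytes 198 = 0xC6 and 134 = 0x86.)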
The gist of my question is this:
How can I display Unicode characters in Matlab's GUI (OS X) so that they are properly rendered?
Details:
I have a table of strings stored in a file, and some of these strings contain UTF-8-encoded Unicode characters. I have tried many different ways (too many to list here) to display the contents of this file in the MATLAB GUI, without success. For example:
>> fid = fopen('/Users/kj/mytable.txt', 'r', 'n', 'UTF-8');
>> [x, x, x, enc] = fopen(fid); enc
enc =
UTF-8
>> tbl = textscan(fid, '%s', 35, 'delimiter', ',');
>> tbl{1}{1}
ans =
ÎÎÎÎÎΠΣΦΩαβγδεζηθικλμνξÏÏÏÏÏÏÏÏÏÏ
>>
As it happens, if I paste the string directly into the MATLAB GUI, the pasted string is displayed properly, which shows that the GUI is not fundamentally incapable of displaying these characters; but once MATLAB reads it in, it no longer displays it correctly. For example:
>> pasted = 'ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω'
pasted =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>>
Thanks!
I present below my findings after doing some digging... Consider these test files:
a.txt
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
b.txt
தமிழ்
First, we read the files:
% open the file in binary mode, and read a list of bytes
fid = fopen('a.txt', 'rb');
b = fread(fid, '*uint8')';   % read bytes as a row vector
fclose(fid);

% decode the byte sequence as a Unicode string
str = native2unicode(b, 'UTF-8');
If you try to print the string, you get a bunch of nonsense:
>> str
str =
Nonetheless, str does hold the correct string. We can check the Unicode code point of each character, which, as you can see, are outside the ASCII range (the last two are the non-printable CR-LF line endings):
>> double(str)
ans =
Columns 1 through 13
915 916 920 923 926 928 931 934 937 945 946 947 948
Columns 14 through 26
949 950 951 952 953 954 955 956 957 958 960 961 962
Columns 27 through 35
963 964 965 966 967 968 969 13 10
Unfortunately, MATLAB seems unable to display this Unicode string in a GUI on its own. For example, all these fail:
figure
text(0.1, 0.5, str, 'FontName','Arial Unicode MS')
title(str)
xlabel(str)
One trick I found is to use the embedded Java capability:
% Java Swing
label = javax.swing.JLabel();
label.setFont( java.awt.Font('Arial Unicode MS',java.awt.Font.PLAIN, 30) );
label.setText(str);
f = javax.swing.JFrame('frame');
f.getContentPane().add(label);
f.pack();
f.setVisible(true);
As I was preparing to write the above, I found an alternative solution: we can use the undocumented DefaultCharacterSet feature to set the charset to UTF-8 (on my machine it is ISO-8859-1 by default):
feature('DefaultCharacterSet','UTF-8');
Now with a proper font (you can change the font used in the Command Window from Preferences > Font), we can print the string in the prompt (note that DISP is still incapable of printing Unicode):
>> str
str =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>> disp(str)
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπÏςστυφχψω
And to display it in a GUI, UICONTROL should work (under the hood, I think it is really a Java Swing component):
uicontrol('Style','text', 'String',str, ...
'Units','normalized', 'Position',[0 0 1 1], ...
'FontName','Arial Unicode MS', 'FontSize',30)
Unfortunately, TEXT, TITLE, XLABEL, etc. still show garbage.
As a side note: It is difficult to work with m-file sources containing Unicode characters in the MATLAB editor. I was using Notepad++, with files encoded as UTF-8 without BOM.