PDF does not use utf-8 string encoding like Go

I am working with the unipdf library (https://github.com/unidoc/unipdf) for Go to process PDF files. I use the 'SetReason' method to set the signing reason of my PDF file:
func (_aggg *PdfSignature )SetReason (reason string ){_aggg .Reason =_gb .MakeString (reason )};
This turns the Cyrillic text into unreadable symbols.
The original text is: "русский > Request Id = 12, Task Id = 145"
Cyrillic text in the main content of the PDF file is displayed correctly. The problem appears only in the 'Signatures' ('Подписи') panel.
The library source has a note about this (see 'NOTE'):
// MakeString creates an PdfObjectString from a string.
// NOTE: **PDF does not use utf-8 string encoding like Go so `s` will often not be a utf-8 encoded
// string.**
func MakeString(s string) *PdfObjectString { _aaad := PdfObjectString{_gcae: s}; return &_aaad }
I want the 'reason' in my PDF file to show readable Cyrillic text. Is there a solution for this?

It should work if you use core.MakeEncodedString
https://apidocs.unidoc.io/unipdf/latest/github.com/unidoc/unipdf/v3/core/#MakeEncodedString
signature.Reason = core.MakeEncodedString("русский > Request Id = 12, Task Id = 145", true)
func MakeEncodedString(s string, utf16BE bool) *PdfObjectString
MakeEncodedString creates a PdfObjectString with encoded content, which can be either UTF-16BE or PDFDocEncoding depending on whether utf16BE is true or false respectively.
This will store the reason in UTF-16BE which is appropriate for this text.
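To illustrate what "stored in UTF-16BE" means at the byte level, here is a minimal sketch using only the Go standard library. It is not UniPDF's implementation; the helper name encodeUTF16BEWithBOM is hypothetical. It assumes the usual PDF convention that a UTF-16BE text string starts with the byte order mark FE FF.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodeUTF16BEWithBOM is a hypothetical helper (not part of UniPDF).
// It converts a Go (UTF-8) string into UTF-16BE bytes prefixed with
// the byte order mark 0xFE 0xFF.
func encodeUTF16BEWithBOM(s string) []byte {
	units := utf16.Encode([]rune(s)) // UTF-16 code units, surrogate pairs included
	out := make([]byte, 0, 2+2*len(units))
	out = append(out, 0xFE, 0xFF) // byte order mark
	for _, u := range units {
		out = append(out, byte(u>>8), byte(u)) // high byte first (big-endian)
	}
	return out
}

func main() {
	// Each Cyrillic letter becomes two bytes after the BOM.
	fmt.Printf("% X\n", encodeUTF16BEWithBOM("русский"))
}
```

Passing the raw Go string instead hands the viewer UTF-8 bytes that it interprets with a single-byte encoding, which would explain the unreadable symbols.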
Disclosure: I am the original developer of UniPDF.

Related

Get TIFF tag value (including non-ASCII characters) from TIFF images in Java 11

I am trying to read different tag values (such as tags 259 (Compression), 33432 (Copyright), 306 (DateTime), 315 (Artist), etc.) from a TIFF image in Java. Can anyone suggest the best way to get those values in Java 11?
I tried to get those values using tiffinfo commands (like "tiffinfo -c myfile.tif"), but I did not find any specific command in tiffinfo (libtiff) or any Java library that gives me the specific tag values (e.g. DateTime) of a TIFF image.
Update:
As haraldK suggested, I tried ImageIO as follows:
try (ImageInputStream input = ImageIO.createImageInputStream(tiffFile)) {
    ImageReader reader = ImageIO.getImageReaders(input).next(); // TODO: Handle reader not found
    reader.setInput(input);
    IIOMetadata metadata = reader.getImageMetadata(0);
    TIFFDirectory ifd = TIFFDirectory.createFromMetadata(metadata);
    TIFFField dateTime = ifd.getTIFFField(306);
    String dateString = dateTime.getAsString(0);
}
But it does not give the exact value of the tag. For non-ASCII values (ö, ü, ä, etc.), question marks replace the real characters.
Can anyone tell me how to get the exact value (including non-ASCII characters) of the tag from TIFFField?

You can use standard ImageIO to read the TIFF image metadata and get the requested values from it, using some support classes that are available in the JDK from Java 9 onward:
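import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.plugins.tiff.TIFFDirectory;
import javax.imageio.plugins.tiff.TIFFField;
import javax.imageio.stream.ImageInputStream;

try (ImageInputStream input = ImageIO.createImageInputStream(tiffFile)) {
    ImageReader reader = ImageIO.getImageReaders(input).next(); // TODO: Handle reader not found
    reader.setInput(input);
    IIOMetadata metadata = reader.getImageMetadata(0); // 0 is the index of first image
    TIFFDirectory ifd = TIFFDirectory.createFromMetadata(metadata);
    TIFFField dateTime = ifd.getTIFFField(306); // Yes, that's 3 F's...
    String dateString = dateTime.getAsString(0); // TIFF dates are strings...
}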
tiffFile must be a valid (existing, readable) java.io.File, java.io.RandomAccessFile or java.io.InputStream (or other supported input, this is plugin-based, really). If not, input will be null, and the code will fail.
You can use a similar but much more verbose version that works in older versions of Java, as long as you have a TIFF plugin:
try (ImageInputStream input = ImageIO.createImageInputStream(tiffFile)) {
    ImageReader reader = ImageIO.getImageReaders(input).next(); // TODO: Handle reader not found
    reader.setInput(input);
    IIOMetadata metadata = reader.getImageMetadata(0); // 0 is the index of first image
    // Get "native" TIFF metadata for first IFD (getAsTree returns a Node, so a cast is needed)
    IIOMetadataNode root = (IIOMetadataNode) metadata.getAsTree("com_sun_media_imageio_plugins_tiff_image_1.0");
    // Cast to Element, because Node itself has no getElementsByTagName method
    Element ifd = (Element) root.getFirstChild();
    NodeList fields = ifd.getElementsByTagName("TIFFField"); // Yes, that's 3 F's...
    for (int i = 0; i < fields.getLength(); i++) {
        Element field = (Element) fields.item(i);
        if ("306".equals(field.getAttribute("number"))) {
            // This is your DateTime (306) tag,
            // now do something with it 😀
            // ...
        }
    }
}
Hardly elegant code, though... The Java 9+ approach is much cleaner and easier to reason about.

What is the default Content Encoding for jmeter HTTP Request sampler?

I am looking at the Content Encoding field in the HTTP Request sampler. Don't confuse this with the HTTP Content-Type header.
By default the value in the Content Encoding field is empty. What does empty mean? What is the default content encoding for jmeter HTTPRequest? Is it ASCII or ANSI or UTF-8?
This guide only mentions that it is not a required field.

Dmitri's answer points to code related to the encoding of query strings. That led me to look at the PostWriter class, which creates the actual body of the request. If the sampler does not provide a content encoding, ISO-8859-1 is used to encode the body:
public static final String ENCODING = StandardCharsets.ISO_8859_1.name();
...
String contentEncoding = sampler.getContentEncoding();
if(contentEncoding == null || contentEncoding.length() == 0) {
contentEncoding = ENCODING;
}

To explain why I downvoted Dmitri T's answer:
I had issues with Spanish accents and characters and spent hours trying to figure them out, because I assumed, as that answer claims, that leaving the field blank is equivalent to UTF-8. After explicitly setting it to UTF-8, the issues with Spanish characters went away.
SO, DO NOT LEAVE IT BLANK IF YOU NEED UTF-8.
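A small sketch of why the two encodings matter for a request body with accented characters. It is written in Go rather than against JMeter, and encodeLatin1 is a hypothetical helper. The same text produces different bytes, so a server that decodes the body as UTF-8 will not recover "ñ" from an ISO-8859-1 body.

```go
package main

import "fmt"

// encodeLatin1 is a hypothetical helper that converts s to ISO-8859-1.
// Code points above U+00FF have no Latin-1 representation and become '?'.
func encodeLatin1(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			out = append(out, '?')
			continue
		}
		out = append(out, byte(r)) // Latin-1 byte value equals the code point
	}
	return out
}

func main() {
	body := "año"
	fmt.Printf("UTF-8:      % X\n", []byte(body))       // ñ is two bytes: C3 B1
	fmt.Printf("ISO-8859-1: % X\n", encodeLatin1(body)) // ñ is one byte: F1
}
```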

As per JMeter 5.2
Looking into HTTPSamplerBase.java:1136
// Check if the sampler has a specified content encoding
if (JOrphanUtils.isBlank(lContentEncoding)) {
// We use the encoding which should be used according to the HTTP spec, which is UTF-8
lContentEncoding = EncoderCache.URL_ARGUMENT_ENCODING;
}
Looking into EncoderCache:31
/** The encoding which should be usd for URLs, according to HTTP specification */
public static final String URL_ARGUMENT_ENCODING = StandardCharsets.UTF_8.name();
So leaving the field blank is equivalent to setting it to UTF-8.

Is there an API for com.apple.TextEncoding?

When you save an NSString (or Swift.String) using a method like this, it writes the xattr "com.apple.TextEncoding". When you load it back with one of the corresponding methods, it checks this xattr and uses that as the default encoding.
Is there any API to determine the encoding of a file, according to this xattr, without having to load the contents of the file?
I know it's not that hard to parse "IANA name, semicolon, CFStringEncoding uint32, (optional other stuff)", but I'd rather avoid it if there's a built-in way.

If I understand your question correctly, you're asking for a way to read the value of the "com.apple.TextEncoding" extended file attribute. This is possible via the API declared in <sys/xattr.h>.
Here's a post that extends URL with extended-attribute capabilities:
Write extend file attributes swift example
Example usage:
func getTextEncodingAttribute(for url: URL) -> String? {
    do {
        let data = try url.extendedAttribute(forName: "com.apple.TextEncoding")
        return String(data: data, encoding: .utf8)
    } catch _ {
    }
    return nil
}
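Once the raw attribute string is available, splitting it is straightforward. The following Go sketch parses the "IANA name, semicolon, CFStringEncoding uint32, (optional other stuff)" layout described in the question. The type and function names are hypothetical, and the sample value "utf-8;134217984" is an assumed example of what the attribute holds for a UTF-8 file.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// textEncoding holds the two documented fields of the attribute value.
type textEncoding struct {
	IANAName         string
	CFStringEncoding uint32
}

// parseTextEncoding is a hypothetical helper. It splits the value on ';',
// reads the first field as the IANA name and the second as a decimal
// uint32, and ignores any further fields.
func parseTextEncoding(value string) (textEncoding, error) {
	parts := strings.Split(value, ";")
	if len(parts) < 2 {
		return textEncoding{}, fmt.Errorf("unexpected attribute value %q", value)
	}
	n, err := strconv.ParseUint(strings.TrimSpace(parts[1]), 10, 32)
	if err != nil {
		return textEncoding{}, fmt.Errorf("bad CFStringEncoding in %q: %w", value, err)
	}
	return textEncoding{
		IANAName:         strings.TrimSpace(parts[0]),
		CFStringEncoding: uint32(n),
	}, nil
}

func main() {
	enc, err := parseTextEncoding("utf-8;134217984") // assumed sample value
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s 0x%08X\n", enc.IANAName, enc.CFStringEncoding)
}
```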

How I can correctly read txt file in windows phone?

I am trying to read a txt file with the following content:
I am using this method to read the txt file:
public string ReadFileContents()
{
    // This verse is loaded for the first time, so fill it from the text file
    var resourceStream = Application.GetResourceStream(new Uri("Files/info.txt", UriKind.Relative));
    if (resourceStream != null)
    {
        Stream myFileStream = resourceStream.Stream;
        if (myFileStream.CanRead)
        {
            StreamReader myStreamReader = new StreamReader(myFileStream);
            // Read the content here
            return myStreamReader.ReadToEnd();
        }
    }
    return string.Empty;
}
This method returns the following string with wrong symbols:
How can I read the txt file correctly?

How are you showing the text? Maybe it's the printing code rather than the reading code.
Also, is the BOM correct on the file? I believe the first 3 bytes specify the encoding type. Are they correct for this encoding?
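To check the BOM programmatically instead of by eye, a sketch like the following can help. It is written in Go for illustration, and detectBOM is a hypothetical helper. Note that only the UTF-8 BOM is three bytes long; the UTF-16 BOMs are two bytes, and UTF-32 is left out of this sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM is a hypothetical helper. It reports the encoding named by a
// leading byte order mark and the BOM length in bytes, or ("", 0) when no
// known BOM is present. UTF-32 BOMs are deliberately not handled here.
func detectBOM(data []byte) (string, int) {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8", 3
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "UTF-16BE", 2
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "UTF-16LE", 2
	}
	return "", 0
}

func main() {
	sample := []byte{0xEF, 0xBB, 0xBF, 'h', 'i'}
	name, n := detectBOM(sample)
	fmt.Println(name, n)
}
```

A file with no BOM at all is common for ANSI-encoded text, in which case the reader has to be told the encoding explicitly.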

Wrong symbols: reading an ANSI-encoded text file on Windows Phone makes umlauts, special characters, etc. look wrong, because you have to use UTF-8 on Windows Phone.

Read image IPTC data

I'm having some trouble reading the IPTC data of some images. I want to do this because my client already has all the keywords in the IPTC data and doesn't want to re-enter them on the site.
So I created this simple script to read them out:
$size = getimagesize($image, $info);
if(isset($info['APP13'])) {
$iptc = iptcparse($info['APP13']);
print '<pre>';
var_dump($iptc['2#025']);
print '</pre>';
}
This works perfectly in most cases, but it's having trouble with some images.
Notice: Undefined index: 2#025
Yet I can clearly see the keywords in Photoshop.
Are there any decent small libraries that could read the keywords in every image? Or am I doing something wrong here?

I've seen a lot of weird IPTC problems. It could be that you have two APP13 segments. I have noticed that, for some reason, some JPEGs have multiple IPTC blocks. This is possibly caused by using several photo-editing programs or by some manual file manipulation.
It could be that PHP is trying to read the empty APP13 or even embedded "thumbnail metadata".
It could also be a problem with segment lengths: APP13 and 8BIM have length marker bytes that might hold wrong values.
Try a hex editor and check the file "manually".

I have found that IPTC is almost always embedded as XML using the XMP format, and is often not in the APP13 slot. You can sometimes get the IPTC info by using iptcparse($info['APP1']), but the most reliable way to get it without a third-party library is to simply search the image file for the relevant XML string (I got this from another answer, but I haven't been able to find it again, otherwise I would link to it):
The XML for the keywords always has the form "<dc:subject>...<rdf:Seq><rdf:li>Keyword 1</rdf:li><rdf:li>Keyword 2</rdf:li>...<rdf:li>Keyword N</rdf:li></rdf:Seq>...</dc:subject>"
So you can just get the file as a string using file_get_contents(get_attached_file($attachment_id)), use strpos() to find each opening (<rdf:li>) and closing (</rdf:li>) XML tag, and grab the keyword between them using substr().
The following snippet works for all JPEGs I have tested it on. It fills the array $keys with IPTC tags taken from an image on WordPress with ID $attachment_id:
$content = file_get_contents(get_attached_file($attachment_id));
// Initialize empty array to store keywords (stays empty if nothing is found)
$keys = array();
// Look for xmp data: xml tag "dc:subject" is where keywords are stored
$xmp_tag_pos = strpos($content, '<dc:subject>');
// Only proceed if able to find dc:subject tag.
// strpos() returns false on failure, so test before adding the tag length.
if ($xmp_tag_pos !== false) {
    $xmp_data_start = $xmp_tag_pos + 12;
    $xmp_data_end = strpos($content, '</dc:subject>');
    $xmp_data_length = $xmp_data_end - $xmp_data_start;
    $xmp_data = substr($content, $xmp_data_start, $xmp_data_length);
    // Look for tag "rdf:Seq" where individual keywords are listed
    $seq_tag_pos = strpos($xmp_data, '<rdf:Seq>');
    // Only proceed if able to find rdf:Seq tag
    if ($seq_tag_pos !== false) {
        $key_data_start = $seq_tag_pos + 9;
        $key_data_end = strpos($xmp_data, '</rdf:Seq>');
        $key_data_length = $key_data_end - $key_data_start;
        $key_data = substr($xmp_data, $key_data_start, $key_data_length);
        // $ctr will track position of each <rdf:li> tag, starting with first
        $ctr = strpos($key_data, '<rdf:li>');
        // While loop stores each keyword and searches for next xml keyword tag.
        // Strict comparison keeps a tag at position 0 from being treated as "not found".
        while ($ctr !== false && $ctr < $key_data_length) {
            // Skip past the tag to get the keyword itself
            $key_begin = $ctr + 8;
            // Keyword ends where closing tag begins
            $key_end = strpos($key_data, '</rdf:li>', $key_begin);
            // Make sure keyword has a closing tag
            if ($key_end === false) break;
            // Make sure keyword is not too long (not sure what WP can handle)
            $key_length = $key_end - $key_begin;
            $key_length = (100 < $key_length ? 100 : $key_length);
            // Add keyword to keyword array
            array_push($keys, substr($key_data, $key_begin, $key_length));
            // Find next keyword open tag
            $ctr = strpos($key_data, '<rdf:li>', $key_end);
        }
    }
}
I have this implemented in a plugin to put IPTC keywords into WP's "Description" field, which you can find here.
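For comparison, the same string-search idea can be sketched in Go with only the standard library. This is an illustration, not a port of the plugin: extractKeywords is a hypothetical helper, it collects every <rdf:li> inside the first <dc:subject> element without checking the container tag, and it assumes the tags appear exactly as written (no attributes, no namespace variations).

```go
package main

import (
	"fmt"
	"strings"
)

// extractKeywords is a hypothetical helper. It returns the text of every
// <rdf:li> element found inside the first <dc:subject> element of content,
// or nil when no such element exists.
func extractKeywords(content string) []string {
	const (
		subjectOpen  = "<dc:subject>"
		subjectClose = "</dc:subject>"
		itemOpen     = "<rdf:li>"
		itemClose    = "</rdf:li>"
	)
	start := strings.Index(content, subjectOpen)
	if start < 0 {
		return nil
	}
	start += len(subjectOpen)
	end := strings.Index(content[start:], subjectClose)
	if end < 0 {
		return nil
	}
	block := content[start : start+end]

	var keys []string
	for {
		i := strings.Index(block, itemOpen)
		if i < 0 {
			break
		}
		block = block[i+len(itemOpen):]
		j := strings.Index(block, itemClose)
		if j < 0 {
			break // opening tag without a closing tag
		}
		keys = append(keys, block[:j])
		block = block[j+len(itemClose):]
	}
	return keys
}

func main() {
	xmp := "<dc:subject><rdf:Seq><rdf:li>sunset</rdf:li><rdf:li>beach</rdf:li></rdf:Seq></dc:subject>"
	fmt.Println(extractKeywords(xmp))
}
```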

ExifTool is very robust if you can shell out to that (from PHP it looks like?)
