I have a method in our software that pulls text from a PDF, whether the PDF is a scan or generated text.
I usually try the GetTextFromPage() method first. If it doesn't return text, then I move on to OCR'ing the page.
I have a particular 6-page PDF with the first three pages being a scanned document and the last two being a form.
On this PDF I'm getting an error that I can't figure out how to resolve.
'StandardEncoding' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
Parameter name: name
at System.Globalization.EncodingTable.internalGetCodePageFromName(String name)
at System.Globalization.EncodingTable.GetCodePageFromName(String name)
at iText.IO.Util.IanaEncodings.GetEncodingEncoding(String name)
at iText.IO.Util.EncodingUtil.ConvertToBytes(Char[] chars, String encoding)
at iText.IO.Font.PdfEncodings.ConvertToBytes(String text, String encoding)
at iText.IO.Font.FontEncoding.FillNamedEncoding()
at iText.IO.Font.FontEncoding.CreateFontEncoding(String baseEncoding)
at iText.Kernel.Font.PdfType1Font..ctor(PdfDictionary fontDictionary)
at iText.Kernel.Font.PdfFontFactory.CreateFont(PdfDictionary fontDictionary)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.GetFont(PdfDictionary fontDict)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.SetTextFontOperator.Invoke(PdfCanvasProcessor processor, PdfLiteral operator, IList`1 operands)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.InvokeOperator(PdfLiteral operator, IList`1 operands)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessContent(Byte[] contentBytes, PdfResources resources)
at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy, IDictionary`2 additionalContentOperators)
at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page)
at EFR.OCR.OCR.ExtractTextFromPDF(FileInfo fileInfo, Int32 StartingPage, Int32 NumberOfPages) in P:\Cloud\Dropbox\EF Recovery\OCRTest\EFR.OCR\OCR.vb:line 113
I've processed many PDFs through my code, some text, some scans, some mixed together. Some had forms... This is the first time that I've had this error.
Here's a snippet of my code...
Using reader As New iText.Kernel.Pdf.PdfReader(fileInfo.FullName)
    reader.SetUnethicalReading(True)
    Using sourceDoc As New iText.Kernel.Pdf.PdfDocument(reader)
        If NumberOfPages = 0 Then NumberOfPages = sourceDoc.GetNumberOfPages
        For i As Integer = StartingPage To StartingPage + NumberOfPages - 1
            Dim pageText As String = ""
            Try
                pageText = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(sourceDoc.GetPage(i))
            Catch ex As Exception
                OCRLog.Log($"Error attempting to extract text from page {i}. {ex.ToString}")
            End Try
            If pageText = "" Then
                'extract this page via OCR
                Dim results As OCRResults = ExtractTextFromPDFImagePage(fileInfo.FullName, i)
                pageText = results.Text
                pageItems.Add(New OCRResults.PagesClass(results.Accuracy, True, pageText))
            Else
                pageItems.Add(New OCRResults.PagesClass(100, False, pageText))
            End If
            stringBuilder.Append(pageText)
        Next
        Return New OCRResults(stringBuilder.ToString, pageItems)
    End Using
End Using
Any ideas?
There is an error in the PDF, just as the error text indicates: "'StandardEncoding' is not a supported encoding name."
The fonts on the page you shared use the name StandardEncoding in their Encoding entries. That is not a valid name here: according to the specification ISO 32000-1, the only valid values are MacRomanEncoding, MacExpertEncoding, and WinAnsiEncoding; see Table 111 (Entries in a Type 1 font dictionary) and Table 114 (Entries in an encoding dictionary).
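For reference, a conformant encoding dictionary would use one of those names, for example:

/Encoding << /Type /Encoding /BaseEncoding /WinAnsiEncoding >>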
Adobe Preflight also complains about these names when checking for syntax errors:
An unexpected value is associated with the key
Key: BaseEncoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Encoding
Cos ID: 38
Traversal Path: ->Pages->Kids->[0]->Resources->Font->WARSP->Encoding
An unexpected value is associated with the key
Key: Encoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Font.FontType1
Cos ID: 27
Traversal Path: ->Pages->Kids->[0]->Resources->Font->Arial,Bold
An unexpected value is associated with the key
Key: BaseEncoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Encoding
Cos ID: 22
Traversal Path: ->Pages->Kids->[0]->Resources->Font->Arial->Encoding
An unexpected value is associated with the key
Key: BaseEncoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Encoding
Cos ID: 19
Traversal Path: ->Pages->Kids->[0]->Resources->Font->ARROW->Encoding
(Excerpt from a preflight report for your shared PDF)
In spite of StandardEncoding not being a valid name here, the PDF specification does define a "Standard Encoding"; see Annex D of ISO 32000-1. Most likely your document attempts to refer to that encoding at the locations outlined above.
If you need to extract text from the document in question, therefore, you may want to follow the recommendation of the error message:
For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
The Encoding class here is the one in System.Text.
To extract the text from your PDF, therefore, it should suffice to implement an EncodingProvider that for the name StandardEncoding provides an Encoding instance according to the information from the STD column of the table in Annex D.2 – Latin Character Set and Encodings – of ISO 32000-1.
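As a minimal sketch of such a provider in VB.NET (not a drop-in fix: it maps the name StandardEncoding to Windows-1252, which only approximates Annex D; a faithful implementation would build a custom Encoding from the STD column of that table):

Imports System.Text

Public Class StandardEncodingProvider
    Inherits EncodingProvider

    ' Called by Encoding.GetEncoding(name) once the provider is registered.
    Public Overrides Function GetEncoding(name As String) As Encoding
        If String.Equals(name, "StandardEncoding", StringComparison.OrdinalIgnoreCase) Then
            ' Assumption: Windows-1252 as a stand-in for PDF Standard Encoding.
            ' (On .NET Core you would also need the System.Text.Encoding.CodePages package.)
            Return Encoding.GetEncoding(1252)
        End If
        Return Nothing ' defer to the other registered providers
    End Function

    Public Overrides Function GetEncoding(codepage As Integer) As Encoding
        Return Nothing
    End Function
End Class

Register it once, before the first GetTextFromPage call:

Encoding.RegisterProvider(New StandardEncodingProvider())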
I am working on an IPP-FAX client.
I am using the ippserver stuff from: https://github.com/istopwg/ippsample.git
I have a small configuration that provides two printers.
However, when I use the fax-job.test from the library, my client receives the phone number as expected:
IPP_DESTINATION_URIS='{destination-uri=tel:4055551212},{destination-uri=ipp://11.22.33.44/ipp/print print-quality=high media=na_letter_8.5x11in}'
But when I use the same IPP device as a fax from Apple's printer menus, the pages are sent, but the destination-uri is not sent (not included in the IPP transmission).
I am using the following in the service:
[root@Comp ~/Printing/ippsample3]# perl -ne 's/#.*//; print unless /^\s*$/' t2/print/faxout.conf
MAKE "thilo"
MODEL "(GPL Ghostscript)"
DeviceURI ipp://Comp.local/ipp/print
Attr textWithoutLanguage printer-device-id "MFG:XSimulated;MDL:Fax;CMD:URF;URF:W8,SRGB24,CP255,PQ4,RS200-300-600,V1.4;MINSIZE:1x5in;MAXSIZE:8.5x14in;TEST-MARGINS:0 0 0 0;TEST-NO-PNG:1;TEST-NO-PDF:1;TEST-FAX:1;"
Command /root/Printing/ippsample3/hell.sh
ATTR keyword urf-supported "W8","SRGB24","ADOBERGB24-48","DM3","CP255","OFU0","IS1-4-5-7","IFU0","MT1-2-3-7-8-9-10-11-12","OB9","PQ3-4-5","RS300-600","V1.4"
ATTR keyword job-creation-attributes-supported "copies","confirmation-sheet-print","cover-sheet-info","destination-uris","media","media-col","multiple-document-handling","number-of-retries","page-ranges","print-quality","printer-resolution","retry-interval","retry-time-out"
ATTR uriScheme destination-uri-schemes-supported "tel"
ATTR boolean ipp-attribute-fidelity true
ATTR boolean confirmation-sheet-print-default false
ATTR integer number-of-retries-default 1
ATTR integer retry-interval-default 15
ATTR keyword cover-sheet-info-supported "date-time","from-name","subject","to-name","message"
ATTR no-value cover-sheet-info-default
ATTR rangeOfInteger number-of-retries-supported 0-1
ATTR rangeOfInteger retry-interval-supported 15-60
ATTR uri printer-icons "http://Comp.local:8632/icons/fax.png","http://Comp.local:8632/icons/large/fax.png"
ATTR uri printer-more-info "http://Comp.local:8632/"
ATTR uri printer-supply-info-uri "http://Comp.local:8632/"
ATTR uri printer-uuid "urn:uuid:3f63711e-bcc3-3570-707e-cc14008da4b6"
ATTR keyword uri-authentication-supported "none","none"
ATTR keyword uri-security-supported "tls","tls"
ATTR uri printer-geo-location "geo:37.33182,122.03118"
ATTR uri device-uri "urf:///1+1"
From reading http://ftp.pwg.org/pub/pwg/candidates/cs-ippfaxout10-20140618-5100.15.pdf, I understand that destination-uris is mandatory in the job descriptor.
Either I fail to teach the IPP server to require it from the client, or I fail to configure the client correctly.
From: https://github.com/michaelrsweet/libcups/raw/f06f42779f98073e2ba782a7a73ebf54636b60d0/examples/fax-job.test
GROUP job-attributes-tag
ATTR collection destination-uris {
MEMBER uri destination-uri tel:4055551212
},{
MEMBER uri destination-uri ipp://11.22.33.44/ipp/print
MEMBER enum print-quality 5
MEMBER keyword media na_letter_8.5x11in
}
Any hints on how the service should be configured so that the Apple printer also sends this scheme?
It seems that once the faxout URI in the server sources is changed to use '/ipp/faxout' instead of the (I think also standards-compliant) '/ipp/print/faxout', Apple does send the destination-uri,
and the script gets the environment variable: IPP_DESTINATION_URIS='{destination-uri=tel:1234567890123456}
as expected.
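(A side note while testing: the same test file can be exercised against the server directly with ipptool; a sketch, where the document filename is a placeholder and the path reflects the changed '/ipp/faxout' resource:

ipptool -tv -f testfile ipp://Comp.local/ipp/faxout fax-job.test

That makes it easy to compare what ipptool sends with what Apple's client sends.)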
I am having so much trouble getting this syntax to translate (Angular 13.0.02).
My two resources are:
https://angular.io/api/localize/init/$localize
https://lokalise.com/blog/angular-i18n/
As per the Angular docs:
Naming placeholders
If the template literal string contains expressions, then the expressions will be automatically associated with placeholder names for you.
For example:
$localize `Hi ${name}! There are ${items.length} items.`;
will generate a message-source of Hi {$PH}! There are {$PH_1} items.
And providing meaning, description, and ID:
$localize`:meaning|description@@id:source message text`;
$localize`:meaning|:source message text`;
$localize`:description:source message text`;
$localize`:@@id:source message text`;
This example from lokalise.com works:
const company = "Google";
const created_by = $localize`Created by ${company}`;
in my XLIFF translation file:
<trans-unit id="3990133897753911565" datatype="html">
<source>Created by <x id="PH"/></source>
<target>Creado por... <x id="PH"/></target>
</trans-unit>
This DOESN'T WORK:
When I try to reproduce the same syntax with another i18n term, it only pulls the English phrase, not the Spanish one.
const company = "Google";
const createdByCompany = $localize`Created by this person ${company}`;
<trans-unit id="spanishTest123" datatype="html">
<source>Created by this person <x id="PH"/></source>
<target>Creado por esta persona <x id="PH"/></target>
</trans-unit>
FYI: for the example that does work, if I REMOVE id="3990133897753911565", then it does NOT pull that translation. So clearly this id makes it happen - yet in my 2nd example I cannot get it to work.
*** UPDATE ***
Using the Angular extract tool produces the XLF file in the required XML format (it parses all the i18n tags in your HTML templates, and the $localize calls in your component code). Run it in your app's root dir as follows: ng extract-i18n --output-path src/locale - then check the messages.xlf file in the locale folder.
So, as per the docs, the "prepending it with a colon" syntax did work - https://angular.io/api/localize/init/$localize
const msg = $localize`:Password Reset Modal|Min num of chars@@passwordNumChars:Must be at least ${setting.SettingValue}:minLen: characters long.`;
Notice how I updated the trans-unit "id" attrib in the xlf - i.e. my custom ID is "passwordNumChars".
<trans-unit id="passwordNumChars" datatype="html">
<source>Must be at least <x id="minLen" equiv-text="setting.SettingValue"/> characters long.</source>
<target>Debe contener al menos <x id="minLen" equiv-text="setting.SettingValue"/> caracteres.</target>
<note priority="1" from="meaning">password edit modal</note>
</trans-unit>
One final note: if you have the $localize function set up in your ts code but can't figure out the xlf format, you can run ng extract-i18n --output-path src/locale from a cmd line to generate the appropriate xlf file.
Then just copy/paste the section you need into your locale file, and perhaps into whatever translation software you're using as the source of truth (i.e. poedit.com to store all i18n terms).
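One more hedged note: for the target translations to be picked up at all, the locale file has to be registered in angular.json; a minimal sketch, where the project name my-app and the file path are assumptions:

{
  "projects": {
    "my-app": {
      "i18n": {
        "sourceLocale": "en-US",
        "locales": {
          "es": "src/locale/messages.es.xlf"
        }
      }
    }
  }
}

Building with ng build --localize then produces a copy of the app per configured locale.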
I'm trying to inspect a CSV file, and no findings are being returned (I'm using the EMAIL_ADDRESS info type, and the addresses I'm using come up with positive hits here: https://cloud.google.com/dlp/demo/#!/). I'm sending the CSV file into inspect_content with a byte_item as follows:
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
In looking at the supported file types, it looks like CSV/TSV files are inspected via Structured Parsing.
For CSV/TSV, does that mean one can't just send in the file and needs to use the table attribute instead of byte_item, as per https://cloud.google.com/dlp/docs/inspecting-structured-text?
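If so, I'm guessing the shape would be roughly this (my sketch based on that page; the column names are invented for my test file):

item: {
  table: {
    headers: [{ name: "col1" }, { name: "col2" }, { name: "col3" }],
    rows: [
      { values: [{ string_value: "dylans@gmail.com" },
                 { string_value: "anotehu" },
                 { string_value: "steve@example.com" }] }
    ]
  }
}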
What about XLSX files, for example? They're an unspecified file type, so I tried a configuration like the one below, but it still returned no findings:
byte_item: {
type: :BYTES_TYPE_UNSPECIFIED,
data: File.open('/xxxxx/dlptest.xlsx', 'rb').read
}
I'm able to do inspection and redaction with images and text fine, but having a bit of a problem with other file types. Any ideas/suggestions welcome! Thanks!
Edit: The contents of the CSV in question:
$ cat ~/Downloads/dlptest.csv
dylans@gmail.com,anotehu,steve@example.com
blah blah,anoteuh,
aonteuh,
$ file ~/Downloads/dlptest.csv
~/Downloads/dlptest.csv: ASCII text, with CRLF line terminators
The full request:
parent = "projects/xxxxxxxx/global"
inspect_config = {
info_types: [{name: "EMAIL_ADDRESS"}],
min_likelihood: :POSSIBLE,
limits: { max_findings_per_request: 0 },
include_quote: true
}
request = {
parent: parent,
inspect_config: inspect_config,
item: {
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
}
}
dlp = Google::Cloud::Dlp.dlp_service
response = dlp.inspect_content(request)
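(For reference, findings, when present, would be read off the response like so; a sketch:

response.result.findings.each { |finding| puts finding.quote }

Nothing comes back for this request.)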
The CSV file I was testing with was something I created using Google Sheets and exported as a CSV; however, the file showed locally as "text/plain; charset=us-ascii". I downloaded a CSV off the internet and it had a MIME type of "text/csv; charset=utf-8". This is the one that worked. So it looks like my issue was specifically due to the file being of an incorrect MIME type.
xlsx is not yet supported. Coming soon. (Maybe that part of the question should be split out from the CSV debugging issue.)
Say I want to sign a cert with an arbitrary or deprecated extension (nsCertType, for example): https://www.openssl.org/docs/manmaster/man5/x509v3_config.html
I believe I'm supposed to add the arbitrary extension as part of the certificate as per below, but how / where do you discover the ASN.1 object identifier? I've read more documentation than I care to admit today and am still stumped.
tmpl := &x509.Certificate{
    SerialNumber: big.NewInt(time.Now().Unix() * 1000),
    Subject:      pkix.Name{CommonName: "edgeproxy", Organization: []string{"edgeproxy"}},
    NotBefore:    now,
    NotAfter:     now.Add(caMaxAge),
    ExtraExtensions: []pkix.Extension{
        {
            Id:       asn1.ObjectIdentifier{}, // what goes here?
            Critical: false,
            Value:    []byte("sslCA"),
        },
    },
    ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageEmailProtection, x509.ExtKeyUsageTimeStamping, x509.ExtKeyUsageMicrosoftCommercialCodeSigning, x509.ExtKeyUsageMicrosoftServerGatedCrypto, x509.ExtKeyUsageNetscapeServerGatedCrypto},
    KeyUsage:              x509.KeyUsageCRLSign | x509.KeyUsageCertSign,
    IsCA:                  true,
    BasicConstraintsValid: true,
}
In Python I would do this, but I don't know how to port it to Go (which is what I'm using at the end of the day):
OpenSSL.crypto.X509Extension(
b"nsCertType",
False,
b"sslCA"
),
The Go sources at https://golang.org/src/encoding/asn1/asn1.go define:
// An ObjectIdentifier represents an ASN.1 OBJECT IDENTIFIER.
type ObjectIdentifier []int
So the object identifier (OID for short) is an array of integers. The asn1 package has functions to parse them, like parseObjectIdentifier.
This is the structure you need to put in the Id: attribute.
But now you need to find out the OID you want.
While difficult to read, the OpenSSL source code can show you the OIDs of many things in the X.400/X.500/X.509 world, or at least those known by OpenSSL.
If you go to https://github.com/openssl/openssl/blob/1aec7716c1c5fccf605a46252a46ea468e684454/crypto/objects/obj_dat.h and search for nsCertType, you get:
{"nsCertType", "Netscape Cert Type", NID_netscape_cert_type, 9, &so[407]},
so is defined previously in the same file, and if you jump to its 407th item you see:
0x60,0x86,0x48,0x01,0x86,0xF8,0x42,0x01,0x01, /* [ 407] OBJ_netscape_cert_type */
and doing a final search on OBJ_netscape_cert_type in the same file gives:
71, /* OBJ_netscape_cert_type 2 16 840 1 113730 1 1 */
which means the corresponding OID is 2.16.840.1.113730.1.1
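Plugged into the question's template, that gives something like the sketch below (using encoding/asn1 and crypto/x509/pkix). Note that the extension value must itself be DER: nsCertType is a BIT STRING, so it is encoded with asn1.Marshal rather than passed as raw text, and the "bit 5 = SSL CA" mapping follows Netscape's legacy flag definition (an assumption worth double-checking for your use case):

// nsCertType: bit 5 ("SSL CA") set; BitLength 6 marks bits 0-5 as meaningful.
nsCertType := asn1.BitString{Bytes: []byte{0x04}, BitLength: 6}
value, err := asn1.Marshal(nsCertType) // DER-encode the BIT STRING
if err != nil {
    panic(err)
}
ext := pkix.Extension{
    Id:       asn1.ObjectIdentifier{2, 16, 840, 1, 113730, 1, 1}, // nsCertType
    Critical: false,
    Value:    value,
}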
Or you can decode yourself the above list of integers that describes this OID (see "How does ASN.1 encode an object identifier?" for details):
- First, 0x60 is 96 in decimal, i.e. 2*40 + 16, which means the OID starts with 2.16.
- Each following component is in "base 128" form: if the most significant bit is 1, combine the 7 least significant bits of all following bytes until you reach one whose most significant bit is 0.
- 0x86 is 10000110 in binary, so it has to go with 0x48 (01001000 in binary); together they give 00001101001000 in binary, i.e. 840 in decimal.
- 0x01 is less than 128, so it is itself: 1.
- 0x86 is again 10000110 in binary, but this time it pairs with both 0xF8 (11111000 in binary) and 0x42 (01000010 in binary, where we stop since the first bit is 0), giving 000011011110001000010 in binary altogether, i.e. 113730 in decimal.
- The two last 0x01 bytes are themselves: 1 and 1.
So we again get 2.16.840.1.113730.1.1.
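As a quick sanity check, you can DER-encode the OID in Go and compare with the bytes from obj_dat.h (asn1.Marshal adds the 0x06 tag and a length byte in front of the content bytes):

der, _ := asn1.Marshal(asn1.ObjectIdentifier{2, 16, 840, 1, 113730, 1, 1})
fmt.Printf("% x\n", der) // prints: 06 09 60 86 48 01 86 f8 42 01 01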
You can double-check it in an online OID browser, for example:
http://oid-info.com/cgi-bin/display?oid=2.16.840.1.113730.1.1&action=display
that gives the following description for it:
Netscape certificate type (a Rec. ITU-T X.509 v3 certificate extension
used to identify whether the certificate subject is a Secure Sockets
Layer (SSL) client, an SSL server or a Certificate Authority (CA))
You can then even browse various arcs, like the Netscape one, or others, to find other OIDs.
You also get the full ASN.1 notation:
{joint-iso-itu-t(2) country(16) us(840) organization(1) netscape(113730) cert-ext(1) cert-type(1)}
The Language Analysis framework is deprecated, and it's not even available in 64-bit. The documentation says to use CFStringTokenizer, but the tokenizer doesn't provide the functionality available in the Language Analysis framework.
What is the replacement for the morpheme analysis APIs that the Language Analysis framework provided?
EDIT:
Though Pantong's reply helped, it doesn't work in all cases, e.g. for words with 3-4 kanji characters it returns incorrect results. (By incorrect I mean it's not the same as what the Language Analysis framework API returned for the same string.)
a) 現人神 is converted to Latin as 'gen ren shen' and to hiragana as 'げんじんしん', whereas it should be 'Arahitogami' in Latin and 'あらひとがみ' in hiragana.
b) 安本丹 is converted to Latin as 'an ben dan' and to hiragana as 'やすもとまこと', whereas it should be 'Yasumoto makoto' in Latin and 'あんぽんたん' in hiragana.
One feature the deprecated morpheme analysis APIs had is getting ruby text for Japanese/Chinese text. If you're asking about the replacement for that particular feature, then the following code is an example. However, I don't know the replacement for the other features in the morpheme analysis APIs.
CFStringRef testString = CFSTR("のちに検知されたトークンの範囲用として使用");
CFLocaleRef locale = CFLocaleCreate(kCFAllocatorDefault, CFSTR("Japanese"));
CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                                         testString,
                                                         CFRangeMake(0, CFStringGetLength(testString)),
                                                         kCFStringTokenizerUnitWordBoundary,
                                                         locale);
do
{
    if (CFStringTokenizerAdvanceToNextToken(tokenizer) == kCFStringTokenizerTokenNone) {
        break;
    }
    CFStringRef originalToken = CFStringCreateWithSubstring(kCFAllocatorDefault,
                                                            testString,
                                                            CFStringTokenizerGetCurrentTokenRange(tokenizer));
    // Get the Latin transcription of the Japanese token
    CFMutableStringRef convertedToken = (CFMutableStringRef)CFStringTokenizerCopyCurrentTokenAttribute(tokenizer,
                                                            kCFStringTokenizerAttributeLatinTranscription);
    NSLog(@"token: %@ -> latin: %@", originalToken, convertedToken);
    // Convert the Latin transcription to kana
    CFStringTransform(convertedToken, NULL, kCFStringTransformLatinHiragana, false);
    NSLog(@"token: %@ -> hiragana: %@", originalToken, convertedToken);
    CFRelease(originalToken);   // balance the Create/Copy calls
    CFRelease(convertedToken);
}
while (true);
CFRelease(tokenizer);
CFRelease(locale);